Finding start of compressed data for items in a zip with zip4j

I'm trying to find the start of the compressed data for each zip entry using zip4j. It is a great library for this because it returns the local header offset, which Java's ZipFile does not. However, I'm wondering whether there is a more reliable way than what I'm doing below to get the start of the compressed data. Thanks in advance.

offset = header.getOffsetLocalHeader();
offset += 30; //add fixed file header
offset += header.getFileNameLength(); // add filename field length
offset += header.getExtraFieldLength(); //add extra field length

//not quite the right number, sometimes have to add 4
//seems to be some header data that is outside the extra field value 
offset += 4;

Edit: Here is a sample zip: https://alexa-public.s3.amazonaws.com/test.zip

The code below decompresses each item properly but won't work without the +4.

        String path = "/Users/test/Desktop/zip test/test.zip";
        List<FileHeader> fileHeaders = new ZipFile(path).getFileHeaders();
        for (FileHeader header : fileHeaders) {
            long offset = 30 + header.getOffsetLocalHeader() + header.getFileNameLength() + header.getExtraFieldLength();
            //fudge factor!
            offset += 4;

            RandomAccessFile f = new RandomAccessFile(path, "r");
            byte[] buffer = new byte[(int) header.getCompressedSize()];
            f.seek(offset);

            f.read(buffer, 0, (int) header.getCompressedSize());
            f.close();

            Inflater inf = new Inflater(true);
            inf.setInput(buffer);
            byte[] inflatedContent = new byte[(int) header.getUncompressedSize()];
            inf.inflate(inflatedContent);
            inf.end();
            FileOutputStream fos = new FileOutputStream(new File("/Users/test/Desktop/" + header.getFileName()));
            fos.write(inflatedContent);
            fos.close();
        }

Solution

The reason you have to add 4 to the offset in your example is that the extra data field in the central directory record for this entry (the file header) has a different size than the one in the local file header. The zip specification allows the extra data field to have different sizes in the central directory and the local header. In fact, the extra data field in question, the Extended Timestamp extra field (signature 0x5455), is officially defined with different lengths in the two places.

Extended Timestamp extra field (signature 0x5455)

Local-header version:

| Value         | Size          | Description                           |
| ------------- |---------------|---------------------------------------|
| 0x5455        | Short         | tag for this extra block type ("UT")  |
| TSize         | Short         | total data size for this block        |
| Flags         | Byte          | info bits                             |
| (ModTime)     | Long          | time of last modification (UTC/GMT)   |
| (AcTime)      | Long          | time of last access (UTC/GMT)         |
| (CrTime)      | Long          | time of original creation (UTC/GMT)   |


Central-header version:

| Value         | Size          | Description                           |
| ------------- |---------------|---------------------------------------|
| 0x5455        | Short         | tag for this extra block type ("UT")  |
| TSize         | Short         | total data size for this block        |
| Flags         | Byte          | info bits                             |
| (ModTime)     | Long          | time of last modification (UTC/GMT)   |

In the sample zip file you attached, the tool that created the archive wrote 4 more bytes for this extra field in the local header than in the central directory.
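To make the 4-byte difference concrete, here is a minimal JDK-only sketch that builds two Extended Timestamp blocks and walks them the way an extra field area is laid out (tag, size, data). The class and method names (`ExtraFieldWalker`, `blockSizes`, `timestampBlock`) are hypothetical, and the bytes are made up for illustration rather than taken from test.zip. It assumes a local block carrying modification and access times and a central block carrying only the modification time.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtraFieldWalker {

    /** Walks an extra field area and maps each block's tag to its data size. */
    public static Map<Integer, Integer> blockSizes(byte[] extra) {
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> sizes = new LinkedHashMap<>();
        while (buf.remaining() >= 4) {
            int tag = buf.getShort() & 0xFFFF;   // e.g. 0x5455 ("UT")
            int size = buf.getShort() & 0xFFFF;  // TSize: data bytes that follow
            if (size > buf.remaining()) {
                break;                           // truncated block, stop walking
            }
            sizes.put(tag, size);
            buf.position(buf.position() + size); // jump to the next block
        }
        return sizes;
    }

    /** Builds an illustrative 0x5455 block: flags byte plus N 4-byte timestamps (left as zeros). */
    public static byte[] timestampBlock(int flags, int timestampCount) {
        int dataSize = 1 + 4 * timestampCount;
        ByteBuffer buf = ByteBuffer.allocate(4 + dataSize).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) 0x5455);
        buf.putShort((short) dataSize);
        buf.put((byte) flags);
        return buf.array();
    }

    public static void main(String[] args) {
        // Flags 0x03: modification and access time are present in the local header.
        byte[] local = timestampBlock(0x03, 2);
        // The central version keeps the same flags but stores only the modification time.
        byte[] central = timestampBlock(0x03, 1);

        System.out.println("local:   " + local.length + " bytes, TSize " + blockSizes(local).get(0x5455));
        System.out.println("central: " + central.length + " bytes, TSize " + blockSizes(central).get(0x5455));
        System.out.println("difference: " + (local.length - central.length)); // 4
    }
}
```

Under these assumptions the local block is 13 bytes and the central block is 9, which matches the size of the fudge factor in the question.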

Relying on the extra field length from the central directory to reach the start of the data is therefore error-prone. A more reliable approach is to read the extra field length from the local header. I have modified your code slightly so that it uses the extra field length from the local header, rather than the one from the central directory, to locate the start of the data.

import net.lingala.zip4j.ZipFile;
import net.lingala.zip4j.model.FileHeader;
import net.lingala.zip4j.util.RawIO;
import org.junit.Test;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class ZipTest {

  private static final int OFFSET_TO_EXTRA_FIELD_LENGTH_SIZE = 28;

  private RawIO rawIO = new RawIO();

  @Test
  public void testExtractWithDataOffset() throws IOException, DataFormatException {
    String basePath = "/Users/slingala/Downloads/test/";
    String path = basePath + "test.zip";
    List<FileHeader> fileHeaders = new ZipFile(path).getFileHeaders();

    for (FileHeader header : fileHeaders) {
      RandomAccessFile f = new RandomAccessFile(path, "r");
      byte[] buffer = new byte[(int) header.getCompressedSize()];
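      // Seek to the extra field length inside this entry's local header
      // (28 bytes past the start of the local header, not 28 bytes into the file).
      f.seek(header.getOffsetLocalHeader() + OFFSET_TO_EXTRA_FIELD_LENGTH_SIZE);
      int extraFieldLength = rawIO.readShortLittleEndian(f);

      // The file pointer is now at the end of the 30-byte fixed header;
      // skip the file name and the local extra field to reach the data.
      f.skipBytes(header.getFileNameLength() + extraFieldLength);

      // readFully keeps reading until the whole buffer is filled.
      f.readFully(buffer);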
      f.close();

      Inflater inf = new Inflater(true);
      inf.setInput(buffer);
      byte[] inflatedContent = new byte[(int) header.getUncompressedSize()];
      inf.inflate(inflatedContent);
      inf.end();

      FileOutputStream fos = new FileOutputStream(new File(basePath + header.getFileName()));
      fos.write(inflatedContent);
      fos.close();
    }
  }
}
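The same local-header calculation can also be sketched with only the JDK, which may help if you want to check the offsets independently of zip4j. The class name `LocalHeaderOffsets` and its `dataStart` methods are hypothetical helpers, not part of zip4j. The sketch assumes the entry is not encrypted and reads both the name length (offset 26) and the extra field length (offset 28) from the local header itself.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LocalHeaderOffsets {

    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\3\4" read as little-endian
    static final int FIXED_HEADER_SIZE = 30;

    /** Computes the data start from the 30 fixed bytes of a local file header. */
    public static long dataStart(byte[] fixedHeader, long localHeaderOffset) {
        ByteBuffer buf = ByteBuffer.wrap(fixedHeader).order(ByteOrder.LITTLE_ENDIAN);
        if (fixedHeader.length < FIXED_HEADER_SIZE || buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("No local file header at offset " + localHeaderOffset);
        }
        int nameLength = buf.getShort(26) & 0xFFFF;  // file name length
        int extraLength = buf.getShort(28) & 0xFFFF; // extra field length (local version)
        return localHeaderOffset + FIXED_HEADER_SIZE + nameLength + extraLength;
    }

    /** Reads the fixed header from the file and delegates to the byte[] overload. */
    public static long dataStart(RandomAccessFile file, long localHeaderOffset) throws IOException {
        byte[] fixedHeader = new byte[FIXED_HEADER_SIZE];
        file.seek(localHeaderOffset);
        file.readFully(fixedHeader);
        return dataStart(fixedHeader, localHeaderOffset);
    }

    public static void main(String[] args) {
        // Synthetic header for illustration: 8-byte name, 13-byte local extra field.
        ByteBuffer header = ByteBuffer.allocate(FIXED_HEADER_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        header.putInt(0, LOCAL_HEADER_SIGNATURE);
        header.putShort(26, (short) 8);
        header.putShort(28, (short) 13);
        System.out.println(dataStart(header.array(), 100L)); // 100 + 30 + 8 + 13 = 151
    }
}
```

In the loop above, the `RandomAccessFile` overload would be called with `header.getOffsetLocalHeader()` as the second argument, and the returned value passed to `f.seek(...)`.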

On a side note, I wonder why you want to read the raw data and run the Inflater yourself. With zip4j you can extract all entries with ZipFile.extractAll(), or you can read each entry as a stream with ZipFile.getInputStream(). A skeleton example:

ZipFile zipFile = new ZipFile("filename.zip");
FileHeader fileHeader = zipFile.getFileHeader("entry_name_in_zip.txt");
InputStream inputStream = zipFile.getInputStream(fileHeader);

Once you have the input stream, you can read the content and write it to any output stream. This way you can extract each entry in the zip file without having to deal with the Inflater yourself.
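As a rough sketch of that last step, the stream can be written to a file with `Files.copy` from the JDK. The `EntryCopy` class and `copyToFile` method are hypothetical names, and the demo uses a `ByteArrayInputStream` as a stand-in for the stream returned by `zipFile.getInputStream(fileHeader)` so that it runs without zip4j on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class EntryCopy {

    /** Copies the whole stream to target, closes the stream, and returns the byte count. */
    public static long copyToFile(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for zipFile.getInputStream(fileHeader).
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("entry", ".txt");
        System.out.println(copyToFile(in, target)); // 5
        Files.delete(target);
    }
}
```

With zip4j, the first argument would be the `inputStream` from the skeleton above; decompression happens inside that stream, so no offsets are needed.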