java xml split vtd-xml

Exception when splitting a big XML file into small chunks with VTD-XML


I am developing a small program that splits a very big XML file (more than 2 GB) into small chunks.

After looking at several libraries, I chose VTD-XML (using VTDGenHuge for big files) and started writing a small test. However, I run into a problem when I read the bytes of a fragment from the file.

I get the offset and length with:

            long [] l = vn.getElementFragment();

Then I extract both values from the result:

            int offset = (int) (l[0] >> 64);
            int len = new Integer("" + l[1]);

Finally, I try to read that segment of bytes so that I can write it to another file:

            b = new byte[len];
            fis.read(b, offset, len); // <===== this line throws the exception

But I get a java.lang.IndexOutOfBoundsException.

Also, when I give the byte array a fixed size (new byte[400], for example), the program finishes without an error, but the output file is corrupted.

My code:

    File fo = new File("\\path\\post_people.xml");
    FileOutputStream fos = new FileOutputStream(fo);

    int count = 0;

    File f = new File("\\path\\people.xml");
    FileInputStream fis = new FileInputStream(f);
    byte[] b;

    VTDGenHuge vg = new VTDGenHuge();
    if (vg.parseFile("\\path\\people.xml", false, VTDGenHuge.MEM_MAPPED)){

        VTDNavHuge vn = vg.getNav();

        AutoPilotHuge ap = new AutoPilotHuge();
        ap.bind(vn);
        ap.selectXPath("/people/person"); // another condition could be added here

        while (ap.evalXPath() != -1) {
            long [] l = vn.getElementFragment();
            int offset = (int) (l[0] >> 64);
            int len = new Integer("" + l[1]);
            b = new byte[len];
            fis.read(b, offset, len); // <===== this is the problem line

            fos.write(b); // writing the fragment out into other file

            count++;

            if (count == 3) { //this is just a test
                break;
            }

        }

    }

A sample of the XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<people>
    <person>
        <name>Nombre 0</name>
        <lastName>ApPaterno 1</lastName>
        <birthdate>2017-11-10T10:20:44.926-05:00</birthdate>
        <age>0</age>
        <address>
            <streetType>Tipo Calle 0</streetType>
            <streetName>Nombre de Calle 0</streetName>
            <number>0</number>
        </address>
    </person>
    <person>
        <name>Nombre 1</name>
        <lastName>ApPaterno 1</lastName>
        <birthdate>2017-11-10T10:20:44.926-05:00</birthdate>
        <age>1</age>
        <address>
            <streetType>Tipo Calle 1</streetType>
            <streetName>Nombre de Calle 1</streetName>
            <number>1</number>
        </address>
    </person>
</people>

Please, can you help me, guys?

UPDATE and SOLUTION:

In the end, the code fragment I had to change was the following:

long [] l = vn.getElementFragment();
int offset = (int) (l[0] >> 64);
int len = new Integer("" + l[1]);
b = new byte[len];

fis.getChannel().position(0); //must return to position 0
fis.skip(offset); //must move to offset position
fis.read(b, 0, len);
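
Two caveats about the fragment above are worth keeping in mind. In Java the shift distance on a long is taken modulo 64, so `l[0] >> 64` leaves the value unchanged, and the cast to int will truncate offsets beyond Integer.MAX_VALUE, which matters for files over 2 GB. In addition, both `skip` and `read` are allowed to process fewer bytes than requested. A minimal sketch of an alternative follows, assuming `l[0]` and `l[1]` are plain byte counts into the file (as they appear to be for the UTF-8 sample above). The class name `FragmentReader` and the method `readFragment` are hypothetical helpers, not part of VTD-XML.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FragmentReader {

    /**
     * Reads exactly len bytes starting at the absolute file position offset.
     * Assumption: offset and len are byte counts into the file.
     */
    static byte[] readFragment(RandomAccessFile raf, long offset, int len) throws IOException {
        byte[] b = new byte[len];
        raf.seek(offset);  // absolute position, so no reset to 0 is needed between fragments
        raf.readFully(b);  // keeps reading until b is full; throws EOFException if the file ends early
        return b;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("people", ".xml");
        try {
            String xml = "<people><person>0</person><person>1</person></people>";
            Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                // Offsets and lengths are hard-coded here; in the real program
                // they would come from l[0] and l[1].
                byte[] first = readFragment(raf, 8L, 18);
                byte[] second = readFragment(raf, 26L, 18);
                System.out.println(new String(first, StandardCharsets.UTF_8));
                System.out.println(new String(second, StandardCharsets.UTF_8));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

In the loop from the question, the offset would be passed as `l[0]` without the cast to int, and only the length would be narrowed, for example with `Math.toIntExact(l[1])`, which throws instead of silently overflowing.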

Solution

  • As you've pointed out, the main issue in your code is in the read from the input stream:

    int offset = (int) (l[0] >> 64);
    int len = new Integer("" + l[1]);
    b = new byte[len];
    fis.read(b, offset, len);
    

    According to the JavaDoc of InputStream.read(byte[] b, int off, int len):

    The first byte read is stored into element b[off], the next one into b[off+1], and so on.

    This means that either your buffer has to be of length offset + len, which leaves bytes 0 to offset - 1 as 0, or you skip the first offset bytes of the input stream and read len bytes into the buffer, filling it from position 0 onwards.

    If you replace the above code with

    int offset = (int) (l[0] >> 64);
    int len = new Integer("" + l[1]);
    b = new byte[len];
    fis.skip(offset);
    fis.read(b, 0, len);
    

    the buffer should be filled with the bytes of the string representation of

    <person>
        <name>Nombre 0</name>
        <lastName>ApPaterno 1</lastName>
        <birthdate>2017-11-10T10:20:44.926-05:00</birthdate>
        <age>0</age>
        <address>
            <streetType>Tipo Calle 0</streetType>
            <streetName>Nombre de Calle 0</streetName>
            <number>0</number>
        </address>
    </person>
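
To make the difference between the buffer offset and the stream position concrete, here is a small self-contained sketch. It uses a ByteArrayInputStream in place of the real file so that it runs anywhere; the class name `ReadOffsetDemo` and both method names are made up for illustration. The first method repeats the original call and reports whether it throws, the second skips in the stream and then fills the buffer from index 0.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

final class ReadOffsetDemo {

    /** Repeats the original call: offset is (wrongly) used as an index into the buffer. */
    static boolean readIntoBufferAtOffsetThrows(byte[] data, int offset, int len) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        byte[] b = new byte[len];
        try {
            in.read(b, offset, len); // needs offset + len <= b.length, which fails for offset > 0
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    /** Skips offset bytes in the stream, then fills the buffer from index 0. */
    static String skipThenRead(byte[] data, int offset, int len) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        in.skip(offset);            // moves the stream position, not the buffer index
        byte[] b = new byte[len];
        int n = in.read(b, 0, len); // number of bytes actually read, or -1 at end of stream
        return new String(b, 0, Math.max(n, 0), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "<a><b>x</b></a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readIntoBufferAtOffsetThrows(data, 3, 8));
        System.out.println(skipThenRead(data, 3, 8));
    }
}
```

With an in-memory stream, `skip` and `read` process everything that is available in one call. A FileInputStream gives no such guarantee, so with a real file the return values of both calls should be checked, or the calls repeated in a loop until the requested number of bytes has been handled.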