I am developing a small program that splits a very large XML file (more than 2 GB) into small chunks.
After researching several libraries, I chose VTD-XML (using VTDGenHuge for large files) and started writing a small test. However, I run into a problem when I read a segment of bytes from the file.
I get the offset and length with:
long [] l = vn.getElementFragment();
Then I extract the values from the result:
int offset = (int) (l[0] >> 64);
int len = new Integer("" + l[1]);
Finally, I try to read that segment of bytes so I can write it to another file:
b = new byte[len];
fis.read(b, offset, len); // <===== this line throws the exception
But I get a java.lang.IndexOutOfBoundsException.
Also, when I give the byte array a fixed size (new byte[400], for example), the program finishes without an error, but the output file is corrupted.
My code:
File fo = new File("\\path\\post_people.xml");
FileOutputStream fos = new FileOutputStream(fo);
int count = 0;
File f = new File("\\path\\people.xml");
FileInputStream fis = new FileInputStream(f);
byte[] b;
VTDGenHuge vg = new VTDGenHuge();
if (vg.parseFile("\\path\\people.xml", false, VTDGenHuge.MEM_MAPPED)) {
    VTDNavHuge vn = vg.getNav();
    AutoPilotHuge ap = new AutoPilotHuge();
    ap.bind(vn);
    ap.selectXPath("/people/person"); // another condition could be added here
    while (ap.evalXPath() != -1) {
        long[] l = vn.getElementFragment();
        int offset = (int) (l[0] >> 64);
        int len = new Integer("" + l[1]);
        b = new byte[len];
        fis.read(b, offset, len); // <===== this is the problem line
        fos.write(b); // write the fragment out to the other file
        count++;
        if (count == 3) { // this is just a test
            break;
        }
    }
}
A sample of the XML file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<people>
    <person>
        <name>Nombre 0</name>
        <lastName>ApPaterno 1</lastName>
        <birthdate>2017-11-10T10:20:44.926-05:00</birthdate>
        <age>0</age>
        <address>
            <streetType>Tipo Calle 0</streetType>
            <streetName>Nombre de Calle 0</streetName>
            <number>0</number>
        </address>
    </person>
    <person>
        <name>Nombre 1</name>
        <lastName>ApPaterno 1</lastName>
        <birthdate>2017-11-10T10:20:44.926-05:00</birthdate>
        <age>1</age>
        <address>
            <streetType>Tipo Calle 1</streetType>
            <streetName>Nombre de Calle 1</streetName>
            <number>1</number>
        </address>
    </person>
</people>
Could you please help me?
UPDATE and SOLUTION:
In the end, the fragment of code I had to change was the following:
long [] l = vn.getElementFragment();
int offset = (int) (l[0] >> 64);
int len = new Integer("" + l[1]);
b = new byte[len];
fis.getChannel().position(0); // rewind to position 0 first
fis.skip(offset); // then advance to the fragment's offset
fis.read(b, 0, len);
As you've pointed out, the main issue in your code lies in how you read from the input stream:
int offset = (int) (l[0] >> 64);
int len = new Integer("" + l[1]);
b = new byte[len];
fis.read(b, offset, len);
According to the JavaDoc of InputStream.read(byte[] b, int off, int len):
The first byte read is stored into element b[off], the next one into b[off+1], and so on.
This means that either your buffer has to be of length offset + len, which leaves the bytes at indices 0 to offset - 1 as 0, or you skip the first offset bytes of the input stream and then read len bytes into the buffer, filling it from position 0 onwards.
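To make the meaning of the off parameter concrete, here is a minimal sketch that uses a ByteArrayInputStream in place of the file. The class and method names (ReadOffsetDemo, wrongReadThrows, skipThenRead) and the sample data are made up for illustration; only the read and skip calls are the ones discussed above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class ReadOffsetDemo {

    // Mirrors the failing call: 'offset' is treated as an index into the
    // destination buffer b, so offset + len must not exceed b.length.
    static boolean wrongReadThrows(byte[] source, int offset, int len) {
        ByteArrayInputStream in = new ByteArrayInputStream(source);
        byte[] b = new byte[len];
        try {
            in.read(b, offset, len);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    // Skips 'offset' bytes of the source first, then fills b from index 0.
    static String skipThenRead(byte[] source, int offset, int len) {
        ByteArrayInputStream in = new ByteArrayInputStream(source);
        in.skip(offset);
        byte[] b = new byte[len];
        int n = in.read(b, 0, len);
        // n is -1 if the stream was already exhausted
        return new String(b, 0, Math.max(n, 0), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] source = "0123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(wrongReadThrows(source, 3, 4)); // prints "true"
        System.out.println(skipThenRead(source, 3, 4));    // prints "3456"
    }
}
```

With a buffer of len bytes, any offset greater than 0 makes offset + len exceed the buffer size, which is why the original call fails for every fragment that does not start at the beginning of the file.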
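If you replace the above code with the following (keeping in mind that skip() is relative to the stream's current position, so on repeated reads the stream has to be rewound first)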
int offset = (int) (l[0] >> 64);
int len = new Integer("" + l[1]);
b = new byte[len];
fis.skip(offset);
fis.read(b, 0, len);
the buffer should be filled with the bytes of the textual representation of the first person element:
<person>
    <name>Nombre 0</name>
    <lastName>ApPaterno 1</lastName>
    <birthdate>2017-11-10T10:20:44.926-05:00</birthdate>
    <age>0</age>
    <address>
        <streetType>Tipo Calle 0</streetType>
        <streetName>Nombre de Calle 0</streetName>
        <number>0</number>
    </address>
</person>
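Two further details may matter for a file larger than 2 GB. In Java the shift distance for a long is taken modulo 64, so l[0] >> 64 evaluates to l[0] itself, and casting that value to int cannot represent offsets beyond Integer.MAX_VALUE. In addition, skip() and read() are allowed to process fewer bytes than requested. The following is only a sketch of one possible alternative, assuming (as the code above does) that l[0] is the byte offset and l[1] the byte length of the fragment. It swaps the FileInputStream for a RandomAccessFile, which can seek to an absolute long position. The class FragmentCopier, its methods, and the sample data are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FragmentCopier {

    // Copies 'length' bytes starting at absolute file position 'offset' to 'out'.
    static void copyFragment(RandomAccessFile in, long offset, long length, OutputStream out)
            throws IOException {
        byte[] buf = new byte[8192];
        in.seek(offset); // absolute position, independent of earlier reads
        long remaining = length;
        while (remaining > 0) {
            int chunk = (int) Math.min(buf.length, remaining);
            in.readFully(buf, 0, chunk); // throws EOFException if the file ends early
            out.write(buf, 0, chunk);
            remaining -= chunk;
        }
    }

    // Self-contained demo on a temporary file with made-up content.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("fragment", ".xml");
            try {
                Files.write(tmp, "<a><b>1</b><b>2</b></a>".getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                    copyFragment(raf, 3L, 8L, out); // stands in for l[0] and l[1]
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "<b>1</b>"
    }
}
```

In the loop from the question, a call such as copyFragment(raf, l[0], l[1], fos) would take the place of the skip and read pair. Because seek() is absolute, no rewind is needed between fragments.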