I'm implementing a program where the user selects files in Angular, which are then sent to a Node.js backend, where they are read and parsed into an array of objects.
I went with JavaScript because it was the fastest way, for me, to implement this program and test it out.
If you have any advice about another language, I'll consider it for my next implementation.
So far my XML files range from 16 MB (smallest) to 110 MB (largest).
I have to be able to parse up to 3000 files that vary between those sizes.
All of these files have the same structure and tags, in almost the same order.
Example XML:
<start_xml_tag>
<property1>something</property1>
.
.
<property10>somethingElse</property10>
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
... 1 million rows later
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
... 2 million rows later
</start_xml_tag>
The expected result is an array of objects, where each object looks like this:
{ codice: 123, nome: 'andrea', stato: 'italia' }
In addition, the user can set a filter: if, for example, the field stato must equal 'italia', the resulting array should only contain the objects with stato = 'italia'.
I will provide my code as soon as I can; I'm not at home until tomorrow, so I don't have it right here.
Can you help me, or tell me where my thinking goes wrong?
Thanks in advance!
So far I achieved this by loading the files as text, with smaller numbers of files (up to 400 or even 600).
Then I switched to a streaming SAX parser to read the files in chunks, but it was taking forever to finish: the parsing itself looked correct, but after 9 hours it was still running, so I'm not sure if I made some mistake.
The chunk size was set to 100000 bytes; any higher and I would get a "maximum call stack size exceeded" error.
I'm now trying an XPath approach to see if I can target all 'Offerte' children with an expression, but I'm new to this and I suspect it won't be efficient either.
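As an aside on that "maximum call stack size exceeded" error: it most likely comes from `String.fromCharCode.apply(null, bytes)`, which spreads every byte as a separate function argument. Decoding with `TextDecoder` avoids that limit entirely and also handles multi-byte UTF-8 characters that get split across chunk boundaries. A minimal sketch (`decodeInChunks` is an illustrative name, not part of my code):

```javascript
// Decode a large byte buffer into text in chunks, without
// String.fromCharCode.apply, which passes every byte as a separate
// call argument and can overflow the call stack on big slices.
function decodeInChunks(byteArray, chunkSize = 100000) {
  const decoder = new TextDecoder('utf-8');
  let text = '';
  for (let offset = 0; offset < byteArray.length; offset += chunkSize) {
    const chunk = byteArray.subarray(offset, offset + chunkSize);
    // stream: true keeps a partial multi-byte sequence buffered
    // until the next chunk arrives, instead of corrupting it.
    text += decoder.decode(chunk, { stream: true });
  }
  text += decoder.decode(); // flush any trailing partial sequence
  return text;
}
```

`TextDecoder` is available both in browsers and as a global in Node.js, so the same helper works on either side.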
import { Injectable } from '@angular/core';
import { HttpClient } from '@angular/common/http';
import * as sax from 'sax';
import { targetTags, TargetTags, startTagName, checkUnitReferenceNo, checkStatusCd } from 'src/models/util';

const BATCH_SIZE = 100;

@Injectable({
  providedIn: 'root'
})
export class BatchXmlService {
  constructor(private http: HttpClient) { }

  private result: TargetTags[] = [];

  async processFiles(files: FileList, unitReferenceNoFilter: any, statusCdFilters: any) {
    let result = [];
    for (let i = 0; i < files.length; i += BATCH_SIZE) {
      const batch = Array.from(files).slice(i, i + BATCH_SIZE);
      result.push(...await this.processBatch(batch, unitReferenceNoFilter, statusCdFilters));
    }
    return result;
  }

  async processBatch(batch: File[], unitReferenceNoFilter: any, statusCdFilters: any) {
    const promises = batch.map(file => this.parseFile(file, unitReferenceNoFilter, statusCdFilters));
    const parsedData = await Promise.all(promises);
    return parsedData.flat();
  }
  // Parse a single XML file
  async parseFile(file: File, unitReferenceNoFilter: any, statusCdFilters: any): Promise<any[]> {
    const strict = true;
    return new Promise((resolve, reject) => {
      const saxStream = sax.createStream(strict);
      saxStream.on('error', (error) => {
        reject(error);
      });

      const parsedData: any[] = []; // Array to store parsed data
      let lastTag = "";
      let inTargetSection = false; // Flag to track if we're within the target section
      let skipParsing = false;

      saxStream.on('opentag', (node) => {
        if (node.name === startTagName) {
          inTargetSection = true;
          parsedData.push({});
        } else if (inTargetSection && !skipParsing) {
          lastTag = node.name;
          if (targetTags.includes(node.name)) {
            // Store tag name as key and content as value in the current offer
            const currentOffer = parsedData[parsedData.length - 1];
            if (currentOffer) {
              currentOffer[node.name] = "";
            }
          }
        }
      });

      saxStream.on('text', (text) => {
        // Only text content within target tags matters
        if (text.trim() === "") return;
        if (inTargetSection && lastTag.length > 0 && targetTags.includes(lastTag) && !skipParsing) {
          const currentOffer = parsedData[parsedData.length - 1];
          if (
            (lastTag === "UNIT_REFERENCE_NO" && !checkUnitReferenceNo(unitReferenceNoFilter, text.trim())) ||
            (lastTag === "STATUS_CD" && !checkStatusCd(statusCdFilters, text.trim()))
          ) {
            // Entry rejected by a filter: drop it and skip the rest of this section
            skipParsing = true;
            parsedData.pop();
            return;
          }
          currentOffer[lastTag] = text.trim();
        }
      });

      saxStream.on('closetag', (nodeName) => {
        if (nodeName === startTagName) {
          inTargetSection = false;
          skipParsing = false;
        }
      });

      saxStream.on('end', () => {
        resolve(parsedData);
      });
      const reader = new FileReader();
      reader.readAsArrayBuffer(file);
      reader.onload = () => {
        const arrayBuffer = reader.result as ArrayBuffer;
        const byteArray = new Uint8Array(arrayBuffer); // Convert to Uint8Array
        let remainingData = byteArray;
        while (remainingData.length > 0) {
          // Read in chunks of 100000 bytes; larger chunks overflow the call
          // stack because fromCharCode.apply spreads every byte as an argument
          const chunkSize = Math.min(remainingData.length, 100000);
          saxStream.write(String.fromCharCode.apply(null, remainingData.slice(0, chunkSize)));
          remainingData = remainingData.slice(chunkSize);
        }
        saxStream.end(); // Call end after processing all data
      };
      reader.onerror = (error) => {
        reject(error); // Handle file read errors
      };
    });
  }
}
<form [formGroup]="form" class="d-flex flex-column justify-content-around w-100" (ngSubmit)="generateExcel()" style="height: 300px">
  <div class="d-flex flex-column">
    <label for="Filtro_UNIT_REFERENCE_NO">Filtro per UNIT_REFERENCE_NO <sup style="color: red">*</sup></label>
    <div class="d-flex justify-content-between w-50">
      <select class="form-control" formControlName="UNIT_REFERENCE_NO" (change)="setUnitReferenceNoFilter()">
        <option *ngFor="let item of unitReferenceNoOptions" [value]="item.value">{{item.label}}</option>
      </select>
    </div>
  </div>
  <div class="d-flex flex-column">
    <label for="Filtro_STATUS_CD">Filtro per STATUS_CD</label>
    <div class="d-flex justify-content-between w-50">
      <ng-select [items]="statusCdOptions" class="w-100"
                 [multiple]="true"
                 placeholder="Se nessuna scelta e' selezionata allora si estraggono tutte"
                 formControlName="STATUS_CD"
                 bindLabel="label"
                 bindValue="value">
      </ng-select>
    </div>
  </div>
  <input #myInput (click)="checkFiltersInserted($event)" formControlName="files" type="file" class="form-control mx-1 file-upload bg-secondary-subtle w-50" (change)="onUpload($event)" required webkitdirectory multiple />
  <button class="btn btn-primary w-25" type="submit" [disabled]="!isUnitReferenceNoFiltered || loadingData">
    <span *ngIf="loadingData" class="spinner-border spinner-border-sm" role="status" aria-hidden="true"></span>
    Genera Excel
  </button>
</form>
And here is the onUpload function in my app.ts:
async onUpload(event: any) {
  const files = event.target.files;
  this.loadingData = true;
  try {
    this.result = await this.batchXmlService.processFiles(files, this.form.controls['UNIT_REFERENCE_NO'].value, this.form.controls['STATUS_CD'].value);
  } catch (error) {
    // errors are currently swallowed
  } finally {
    this.loadingData = false; // stop the spinner whether parsing succeeded or failed
  }
}
I tried to clean up the code as best as I could; I had some more commented-out code in the mix from previous tests.
I'm aware that this runs in Angular and not Node.js; I'm now trying to write this same logic in Node.js to delegate the work to the backend.
I also forgot to mention that the second filter is an array of choices: in the XML example above, the field 'nome' can have different values, all contained in the multiselect option list.
This filter works this way: if the list contains ['andrea', 'francesco', 'brian'],
it keeps only the objects whose 'nome' matches one of these three strings.
Accepted XML example:
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
Discarded XML example:
<Offerte>
<codice>123</codice>
<nome>john</nome>
<stato>italia</stato>
</Offerte>
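The two filters described above can be sketched as a single predicate in plain JavaScript (`matchesFilters` is an illustrative name, not an actual helper from my code; an empty multiselect list accepts everything, matching the placeholder text in the form):

```javascript
// Illustrative predicate combining the two filters:
// - filterNome: a list of accepted values; an empty list accepts all
// - filterStato: a single value that `stato` must equal
function matchesFilters(obj, filterNome, filterStato) {
  const nomeOk = filterNome.length === 0 || filterNome.includes(obj.nome);
  const statoOk = !filterStato || obj.stato === filterStato;
  return nomeOk && statoOk;
}

const offers = [
  { codice: '123', nome: 'andrea', stato: 'italia' },
  { codice: '123', nome: 'john', stato: 'italia' },
];
const accepted = offers.filter(o =>
  matchesFilters(o, ['andrea', 'francesco', 'brian'], 'italia'));
// accepted keeps only the 'andrea' object
```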
I was able to solve my problem following what jdweng suggested with his PowerShell script; I then went back to my first parsing solution, which looks similar to his.
First, I decided to read the files from a folder directly in Node.js instead of transferring them from the front end to the back end, because packaging and sending them would take very long.
Then I tried to read the files whole, but they were too big and the process would run out of memory, so I implemented read-by-chunk logic.
So, for anyone struggling with something that looks like my case, this is what I did:
Getting the files from the folder and reading them one by one:
async function getExcelRows(filterNome, filterStato) {
  let result = [];
  const fileNames = await readFolder();
  for (let fileName of fileNames) {
    const response = await readAndParseFile(folderPath + fileName, filterNome, filterStato);
    result.push(...response);
  }
  return JSON.stringify(result);
}
Implementation for reading the folder:
async function readFolder() {
  const files = await fsp.readdir(folderPath);
  return files.filter(file => file.endsWith('.xml'));
}
Implementation of reading files in chunks:
async function readAndParseFile(filePath, filterNome, filterStato) {
  return new Promise((resolve, reject) => {
    let keeper = {};
    let isToErase = true;
    let result = [];
    const re = new RegExp("<" + objTag + ">", 'g');
    const stream = fs.createReadStream(filePath, { encoding: 'utf-8' });
    stream.on('data', async (chunk) => {
      // Pause so the next chunk cannot interleave with the awaited parse below
      stream.pause();
      if (isToErase) {
        keeper = {};
        keeper.remain = chunk;
      } else {
        if (keeper.remain.indexOf("<" + objTag + ">") !== -1)
          testFileOfferteNumber--;
        keeper.remain += chunk;
      }
      // match() returns null when there is no occurrence, so guard it
      testFileOfferteNumber += (keeper.remain.match(re) || []).length;
      if (keeper.remain.indexOf("<" + objTag + ">") !== -1) {
        keeper = await parseFile(keeper.remain, filterNome, filterStato);
        if (keeper?.result.length > 0) {
          result.push(...keeper.result);
          keeper.result = [];
        }
        // Keep the leftover tail only if the parse left something incomplete
        isToErase = keeper?.remain === "";
      }
      stream.resume();
    });
    stream.on('end', () => {
      resolve(result);
    });
    stream.on('error', (error) => {
      reject(error);
    });
  });
}
The testFileOfferteNumber variable was only there to check that I was actually processing all the tags without errors, nothing more than that.
So here we read the file in chunks; if after parsing I still have some incomplete content left, I return it back to this function and set isToErase to false, so that the next chunk gets appended to my current remaining content. Otherwise I just take the new chunk and process it.
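Stripped of the test counters, the carry-over pattern described above can be sketched like this (a simplified illustration, not my exact code; names are invented):

```javascript
// Sketch of the chunk carry-over: parse what is complete, keep the
// unfinished tail, and prepend it to the next incoming chunk.
function makeChunkAccumulator(parseComplete) {
  let remain = '';
  return {
    push(chunk) {
      remain += chunk;
      const { parsed, rest } = parseComplete(remain);
      remain = rest; // incomplete tail waits for the next chunk
      return parsed;
    },
  };
}

// Example parser: extract complete <Offerte>...</Offerte> blocks and
// return whatever trailing text is still incomplete.
function parseOfferte(text) {
  const parsed = [];
  const close = '</Offerte>';
  let idx;
  while ((idx = text.indexOf(close)) !== -1) {
    parsed.push(text.slice(0, idx + close.length).trim());
    text = text.slice(idx + close.length);
  }
  return { parsed, rest: text };
}

const acc = makeChunkAccumulator(parseOfferte);
acc.push('<Offerte><nome>an');            // returns [] (block incomplete)
acc.push('drea</nome></Offerte>');        // returns the completed block
```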
Implementation of the parsing logic:
// objTag = "Offerte"
// filterNome = ["andrea"]
// filterStato = "italia"
function parseFile(source, filterNome, filterStato) {
  return new Promise((resolve, reject) => {
    try {
      const endTagLen = ("</" + objTag + ">").length;
      let result = [];
      while (source.indexOf("<" + objTag + ">") !== -1) {
        // No closing tag yet: hand the incomplete tail back to the reader
        if (source.indexOf("</" + objTag + ">") === -1)
          return resolve({ result: result, remain: source });
        testResultOfferteNumber++;
        // Apply the filters first, so rejected entries are skipped
        // without building an object for them
        if (
          !checkNomeFilter(filterNome, checkField(source, 'nome') ? getField(source, "nome") : "") ||
          !checkStato(filterStato, checkField(source, 'stato') ? getField(source, "stato") : "")
        ) {
          source = source.slice(source.indexOf("</" + objTag + ">") + endTagLen);
          continue;
        }
        const obj = {
          codice: checkField(source, 'codice') ? getField(source, "codice") : "",
          nome: checkField(source, 'nome') ? getField(source, "nome") : "",
          stato: checkField(source, 'stato') ? getField(source, "stato") : "",
        };
        source = source.slice(source.indexOf("</" + objTag + ">") + endTagLen);
        result.push(obj);
      }
      resolve({ result: result, remain: source });
    } catch (e) {
      reject(e);
    }
  });
}
testResultOfferteNumber is the variable compared against the previous one to check that all tags have been processed correctly.
So here I loop through all the "Offerte" tags for as long as there are any, and inside this while I check whether there is a closing tag; if there isn't, I return everything that's left to the reading function, which will append the next chunk of text to it so that this content can be checked afterwards.
Then I apply my filters to discard data that isn't accepted before parsing it, and because I know what the XML structure is, I simply extract all three tags I need at once.
After this parsing I slice the already-parsed part of the file away and restart the loop with the content of the next <Offerte>...</Offerte>.
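The checkField/getField helpers aren't shown in the snippets above; a minimal sketch of what they could look like, using the same indexOf-based approach (hypothetical versions, not my exact code):

```javascript
// Hypothetical helpers: check for and extract the text between the
// first <tag>...</tag> pair in the source string.
function checkField(source, tag) {
  return source.indexOf('<' + tag + '>') !== -1 &&
         source.indexOf('</' + tag + '>') !== -1;
}

function getField(source, tag) {
  const open = '<' + tag + '>';
  const start = source.indexOf(open) + open.length;
  const end = source.indexOf('</' + tag + '>');
  return source.slice(start, end);
}
```

For example, `getField('<Offerte><nome>andrea</nome></Offerte>', 'nome')` returns `'andrea'`.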
I left the chunk size at the default, which worked perfectly for me: 3000 files were processed in about 20-30 minutes with no crashes.
For my case I needed to return the stringified result so I could create an Excel file from it.
I hope this solution helps someone struggling like me. Thanks to everyone who helped me with this problem, especially jdweng.