node.js, xlsx, node-streams

Is there a good way to read a 600K-row Excel file (180 MB) in Next.js/Node.js?


In my Next.js application the user uploads an Excel sheet and I have to parse it, that is, go through it and extract data only from specific columns. My application worked perfectly with this logic, but only for small Excel files. As soon as I started using really big Excel sheets, it started failing.

This code was working for small sheets:

import { NextRequest, NextResponse } from 'next/server';
import * as XLSX from 'xlsx';
import * as fs from 'fs';

// let SheetJS use Node's fs so writeFileXLSX can write to disk
XLSX.set_fs(fs);

export const POST = async (req: NextRequest) => {
    try {
        const formData = await req.formData();
        const file = formData.getAll('files')[0] as File;
        const name = formData.get('name') as string;

        // read the uploaded file into a workbook
        const arrayBuffer = await file.arrayBuffer();
        const workbook = XLSX.read(arrayBuffer, { type: 'array', cellDates: true });

        // save a copy of the workbook to disk
        XLSX.writeFileXLSX(workbook, `./downloads/${file.name}`, {
            cellDates: true,
        });

        // convert the first sheet to JSON
        const data = XLSX.utils.sheet_to_json(workbook.Sheets[workbook.SheetNames[0]]);
        
        return NextResponse.json(data);
    } catch (error) {
        if (error instanceof Error) {
            console.error(error.message);
            return NextResponse.json(error.message, { status: 500 });
        }
    }
}

But the above code now returns empty data. So I searched a bit and found that I need to use streams and chunks to handle such large data. Then I modified my code like this:

import { NextRequest, NextResponse } from 'next/server';
import ExcelJS from 'exceljs';
import * as fs from 'fs';
export const POST = async (req: NextRequest) => {
    try {
        const formData = await req.formData();
        const file = formData.getAll('files')[0] as File;
        const name = formData.get('Name') as string;

        if (!file) {
            throw new Error('File is missing');
        }
        if (!name) {
            throw new Error('Name is missing');
        }
        const path = `./downloads/${file.name}`;
        const arrayBuffer = await file.arrayBuffer();
        const buffer = Buffer.from(arrayBuffer);

        fs.writeFileSync(path, buffer);

        const readStream = fs.createReadStream(path);
        const workbook = new ExcelJS.Workbook();
        await workbook.xlsx.read(readStream);
        const worksheet = workbook.getWorksheet(1);
        const chunkSize = 1024 * 1024; // number of rows per chunk
        let rows: any[] = [];
        const chunkedData: any[][] = [];
        if (!worksheet) {
            return;
        }
        worksheet.eachRow({ includeEmpty: true }, (row, rowNumber) => {
            rows.push(row.values);
            if (rows.length === chunkSize) {
                chunkedData.push(rows);
                rows = [];
            }
        });
        if (rows.length > 0) {
            chunkedData.push(rows);
        }
        return NextResponse.json({ success: true, data: chunkedData });
    } catch (error) {
        console.error('Error:', error);
        if (error instanceof Error) {
            return NextResponse.json({ success: false, error: error.message });
        }
    }
};

But this still doesn't work, and now I'm getting this error: RangeError: Invalid string length. I also tried increasing the RAM limit for Node in my Next.js dev script, "dev": "node --max-old-space-size=8192 ./node_modules/next/dist/bin/next dev", but the Node process still stops at around 2048 MB of RAM usage and throws the error above.

I don't know whether I'm doing something wrong, my approach is totally wrong, or what.

System specs: Windows 11, 16 GB RAM, Ryzen 5 PRO 6650U


Solution

  • Introduction

    You need to handle large Excel files in a way that avoids loading the entire file into memory at once. This can be achieved by using streams to read the file in chunks. First, you need to create a readable stream from the file buffer. This will allow you to process the file piece by piece.

    Short example (not the full fix)

    Here's an example of creating the stream:

    import { Readable } from 'stream';

    // Build a Node.js readable stream from the uploaded file's buffer
    const arrayBuffer = await file.arrayBuffer();
    const buffer = Buffer.from(arrayBuffer);

    const readableStream = new Readable({ read: () => {} });
    readableStream.push(buffer);
    readableStream.push(null);
    // Process the readableStream using ExcelJS
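
    Alternatively, Node's built-in Readable.from can wrap the buffer in a single call (a Buffer passed to Readable.from is emitted as one chunk rather than iterated byte by byte):

    import { Readable } from 'stream';

    const readableStream = Readable.from(buffer);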
    

    The problem is that reading the entire file into memory can cause memory overload, especially with large files. To avoid this, you should use ExcelJS's streaming capabilities to read the workbook. This approach helps to keep the memory usage low by processing the file incrementally. Here's how to read the workbook as a stream:

    const workbookReader = new ExcelJS.stream.xlsx.WorkbookReader(readableStream, {
        sharedStrings: 'cache', // resolve shared strings so cell values are populated
        worksheets: 'emit',
    });
    

    You need to process the data in chunks to efficiently manage memory usage. By setting a chunk size, you can handle large datasets without exhausting system resources. As you read each row, you can accumulate the rows into chunks and process them once the chunk size is reached. Here’s an example of processing the data in chunks:

    const chunkSize = 1024; // rows per chunk
    const chunkedData: any[][] = [];
    let rows: any[] = [];

    // WorkbookReader is an async iterable: worksheets and their rows arrive incrementally
    for await (const worksheet of workbookReader) {
        for await (const row of worksheet) {
            rows.push(row.values);
            if (rows.length === chunkSize) {
                chunkedData.push(rows);
                rows = [];
            }
        }
    }
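
    Since the original goal is to pull data only from specific columns, it is cheaper to trim each row as it streams by rather than keeping the full row.values array. Below is a small sketch of a helper for that, assuming streamed rows expose getCell the same way regular rows do; the column indices (1 and 3) and the field names are placeholders for illustration, not values from the question:

    import type { Row } from 'exceljs';

    // Hypothetical helper: keep only the columns of interest from a streamed row.
    // ExcelJS cells are addressed by 1-based column index via row.getCell(n).
    const pickColumns = (row: Row) => ({
        id: row.getCell(1).value,     // placeholder: first column of interest
        amount: row.getCell(3).value, // placeholder: another column of interest
    });

    // Inside the loop above, push the trimmed object instead of the whole row:
    // rows.push(pickColumns(row));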
    

    Finally, you need to return the processed data in your response. This ensures that your application handles large files without crashing or running into memory issues. Here’s how you can send the processed data back to the client:

    if (rows.length > 0) {
        chunkedData.push(rows);
    }
    return NextResponse.json({ success: true, data: chunkedData });
    

    By following these steps, you can efficiently handle large Excel files in your application, avoiding memory overload and ensuring smooth performance.
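
    Putting it all together, the pieces above could combine into a single route handler roughly like this. This is a minimal sketch under the assumptions already noted (the 'files' form field name and the chunking scheme come from the question; the WorkbookReader options are common choices, not requirements):

    import { NextRequest, NextResponse } from 'next/server';
    import ExcelJS from 'exceljs';
    import { Readable } from 'stream';

    export const POST = async (req: NextRequest) => {
        try {
            const formData = await req.formData();
            const file = formData.getAll('files')[0] as File;
            if (!file) {
                throw new Error('File is missing');
            }

            // Convert the uploaded File into a Node.js readable stream
            const buffer = Buffer.from(await file.arrayBuffer());
            const readableStream = Readable.from(buffer);

            // Streaming reader: rows are parsed incrementally instead of loading the whole workbook
            const workbookReader = new ExcelJS.stream.xlsx.WorkbookReader(readableStream, {
                sharedStrings: 'cache',
                worksheets: 'emit',
            });

            const chunkSize = 1024;
            const chunkedData: any[][] = [];
            let rows: any[] = [];

            for await (const worksheet of workbookReader) {
                for await (const row of worksheet) {
                    rows.push(row.values);
                    if (rows.length === chunkSize) {
                        chunkedData.push(rows);
                        rows = [];
                    }
                }
            }
            if (rows.length > 0) {
                chunkedData.push(rows);
            }

            return NextResponse.json({ success: true, data: chunkedData });
        } catch (error) {
            console.error('Error:', error);
            const message = error instanceof Error ? error.message : 'Unknown error';
            return NextResponse.json({ success: false, error: message }, { status: 500 });
        }
    };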