Tags: node.js, arrays, json, nodejs-stream

Remove duplicates from a huge JSON file in Node.js


I have a huge JSON file, over 80,000 MB in size, containing 700,000,000 records. File content:

    {
        "rows": [
            {"empId":"1014456","blockId":"b6566"},
            {"empId":"1014456","blockId":"b6566"},
            {"empId":"1014457","blockId":"b6556"},
            {"empId":"1014458","blockId":"b6567"},
            ...
        ]
    }

I want to remove duplicates, using empId as the key. How do I do this in Node.js? Do I need to use streams?


Solution

  • You can use lodash's _.uniqBy to deduplicate by a key:

    const _ = require('lodash');

    const uniqueRows = _.uniqBy([
        {"empId":"1014456","blockId":"b6566"},
        {"empId":"1014456","blockId":"b6566"},
        {"empId":"1014457","blockId":"b6556"},
        {"empId":"1014458","blockId":"b6567"},
        ...
    ], 'empId');

    Read more about it here: https://lodash.com/docs/4.17.15#uniqBy
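
    Note, though, that _.uniqBy needs the entire rows array in memory, which is not feasible for a file of 80,000 MB, so for your case you do need streams. Below is a minimal sketch using the third-party stream-json package (npm install stream-json); the input and output file names are placeholders. It streams the rows array one element at a time and keeps a Set of empIds seen so far:

        const fs = require('fs');
        const { parser } = require('stream-json');
        const { pick } = require('stream-json/filters/Pick');
        const { streamArray } = require('stream-json/streamers/StreamArray');

        const seen = new Set();          // empIds seen so far
        const out = fs.createWriteStream('deduped.json');
        out.write('{"rows":[\n');
        let first = true;

        fs.createReadStream('input.json')
            .pipe(parser())                   // incremental JSON tokenizer
            .pipe(pick({ filter: 'rows' }))   // select the "rows" array
            .pipe(streamArray())              // emit one {key, value} per element
            .on('data', ({ value }) => {
                if (seen.has(value.empId)) return;   // skip duplicate empIds
                seen.add(value.empId);
                out.write((first ? '' : ',\n') + JSON.stringify(value));
                first = false;
            })
            .on('end', () => out.end('\n]}\n'));

    Even this approach keeps every distinct empId in memory; with hundreds of millions of distinct keys the Set alone can grow to several GB and may hit V8's collection size limits, so at that scale you may have to fall back on external sorting or a database.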