java multithreading collections java-stream parallelstream

Comparing two lists of about 1M records and saving the common entries to a new list takes a long time


I'm processing two CSV files, checking for common entries, and saving them to a new CSV file. However, the comparison is taking a lot of time. My approach is to first read all the data from both files into ArrayLists. Then, using parallelStream over the main list, I compare each line against the other list and append the common entries to a StringBuilder, which is then saved to the new CSV file. Below is my code for this.

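allReconFileLines.parallelStream().forEach(baseLine -> {
    // The regex ",|,," behaves exactly like ",": the single-comma alternative always matches first.
    String[] baseLineSplitted = baseLine.split(",");
    // Reading index 13 needs at least 14 columns (length >= 13 could throw); split never returns null.
    if (baseLineSplitted.length > 13 && baseLineSplitted[13].trim().equalsIgnoreCase("#N/A")) {
        String baseKey = baseLineSplitted[3].replaceAll("^\"|\"$", "").trim();
        // Linear scan of the whole second list for every base line: O(n * m) comparisons overall.
        for (int i = 0; i < allCompleteFileLines.size(); i++) {
            String complteFileLine = allCompleteFileLines.get(i);
            String[] reconLineSplitted = complteFileLine.split(",");
            if (reconLineSplitted.length > 3 && reconLineSplitted[3].replaceAll("^\"|\"$", "").trim().equals(baseKey)) {
                //pw.write(complteFileLine);
                // Caution: StringBuilder is not thread-safe, so appending from a parallel stream can corrupt the output.
                matchedLines.append(complteFileLine);
                break;
            }
        }
    }
});
pw.write(matchedLines.toString());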

Currently it takes hours to process. How can I make it faster?


Solution

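  • Read the keys of one file into, for example, a HashSet. Then, as you read the second file, check for each line whether its key is in the set and, if so, write the line out. A set lookup takes constant time on average, so the total work grows with the combined number of lines rather than with the product of the two file sizes. This way you also only need enough memory to keep the keys of one file.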
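
The suggestion above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `CommonRecords`, the method names, and the three file paths passed to `main` are hypothetical, while the column positions (3 for the key, 13 for the `#N/A` flag) and the naive comma split are taken from the question's code. Like the question's code, it does not handle quoted fields that contain commas.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class CommonRecords {

    // Column 3 with surrounding quotes and whitespace removed, or null if the line is too short.
    static String key(String[] cols) {
        return cols.length > 3 ? cols[3].replaceAll("^\"|\"$", "").trim() : null;
    }

    // Pass 1: collect the keys of the recon lines whose column 13 is "#N/A".
    static Set<String> loadKeys(Path reconFile) throws IOException {
        Set<String> keys = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(reconFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                if (cols.length > 13 && cols[13].trim().equalsIgnoreCase("#N/A")) {
                    keys.add(key(cols));
                }
            }
        }
        return keys;
    }

    // Pass 2: stream the complete file and write every line whose key is in the set.
    // Returns the number of lines written.
    static int writeMatches(Path completeFile, Set<String> keys, Path outFile) throws IOException {
        int written = 0;
        try (BufferedReader in = Files.newBufferedReader(completeFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String k = key(line.split(","));
                if (k != null && keys.contains(k)) {
                    out.write(line);
                    out.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    // Hypothetical usage: java CommonRecords recon.csv complete.csv matched.csv
    public static void main(String[] args) throws IOException {
        Set<String> keys = loadKeys(Paths.get(args[0]));
        int n = writeMatches(Paths.get(args[1]), keys, Paths.get(args[2]));
        System.out.println(n + " matching lines written");
    }
}
```

One behavioural difference from the question's loop is worth noting: this sketch writes every line of the second file whose key is in the set, whereas the original stopped at the first match for each base line. If only the first match per key is wanted, removing the key from the set after writing it would be one way to get that. No parallel stream is needed here, since each file is read once and sequentially.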