javaperformancecollectionsadditionremoveall

Fast way to get different lines from 2 big files in Java


In Java, I have a method that reads two files, with each line being a GUID. The lines are unordered. The output is two new files with the lines that appear only on each file.

Example files:

| Input_1 | Input_2 |         | Output_1 | Output_2 |
| ------- | ------- |         | -------- | -------- |
| abcdef  | uvwxyz  |    >    | mnopqr   | uvwxyz   |
| ghijkl  | ghijkl  |
| mnopqr  | abcdef  |

I managed to do it fine with one Collection<String> for each file and some addAll() + removeAll() shenanigans, however the files are growing in size and this whole thing is taking some time now. Each file has about 600k lines.

Is there a fast way to improve this code just using another type of collection or I need to refactor my way of doing?

Code in question:

//Read two files
Collection<String> guidFile1 = readFileGuid(pathFile1);
Collection<String> guidFile2 = readFileGuid(pathFile2);

//Add file1 and remove file2
Collection<String> leftFromFile1 = new ArrayList<String>();
leftFromFile1.addAll(guidFile1);
leftFromFile1.removeAll(guidFile2);

//Add file2 and remove file1
Collection<String> leftFromFile2 = new ArrayList<String>();
leftFromFile2.addAll(guidFile2);
leftFromFile2.removeAll(guidFile1);

//Outputs
System.out.println("Leftover from file1: " + leftFromFile1.size());
System.out.println("Leftover from file2: " + leftFromFile2.size());

Solution

  • In your code, the removeAll is the most expensive operation. If both files are 100 lines long, then removeAll would perform 20,000x operations. If you include addAll, then it would perform a total of 20,200 operations. What you want is 200 operations. The way to do that is to use a Set.

      static HashSet<String> readFileGuid(String path) throws IOException{
          HashSet<String> guidFile = new HashSet<>();
          Scanner s = new Scanner(new File(path));
          while(s.hasNextLine()) 
            guidFile.add(s.nextLine());
          s.close();
          return guidFile;
      }
      
      static List<String> subtract(HashSet<String> s1, HashSet<String> s2){
          List<String> result = new ArrayList<>();
          Iterator<String> it = s1.iterator();
          while(it.hasNext()){
              String item = it.next();
              if (!s2.contains(item))
                result.add(item);
          }
          return result;
      }
      
      public static void main (String[]args) throws IOException{
          HashSet<String> guidFile1 = readFileGuid("input1.txt");
          HashSet<String> guidFile2 = readFileGuid("input2.txt");
          
          List<String> leftFromFile1 = subtract(guidFile1, guidFile2);
          List<String> leftFromFile2 = subtract(guidFile2, guidFile1);
          
          System.out.println("file1:" + leftFromFile1);
          System.out.println("file2:" + leftFromFile2);
      }