I am starting with two gigantic input text files containing lots of lines with a single string value on each line. I'd like to write a Java program to combine the contents of those two files into one output text file, while also removing any duplicates. What method would be best for performance?
The combined file might be 250 GB, maybe larger. I did a back-of-the-napkin calculation using a smaller example file and the 250 GB file may have around 27 billion lines (very rough) if it's proportional.
So we're definitely well past the maximum Integer value in Java (2,147,483,647). We can assume the value on each line is at most 20 characters; anything longer will be discarded.
And let's assume the input files and output file (and likely any intermediate data structure used for duplicate checks) will be too large for memory. I will have the use of a computer with a large amount of memory, but it may not be enough.
Let me also note that the average length of a line is probably 7-10 characters. And I will be discarding any lines that are longer than 25 characters. I mention this because hashing was brought up as a method for duplicate checking. It seems the hash for these values would take up more space than the actual value.
Initially I was thinking of how to do the duplicate check to reduce processing time, as I'm sure I will want to make it run as fast as possible. However I realize the more immediate problem is how to deal with the large sizes and whatever I use to check for duplicates or sort will not fit in memory. Possibly sort the input files first, then merge together and discard duplicates in one pass. The output file can be sorted, doesn't need to maintain original order.
I came across this previous question- Removing duplicate strings from huge text files
The only answer mostly deals with data sizes that would fit into memory. But at the end it mentions a possible way to do this with a large file (just one file) that doesn't fit into RAM. I'm not sure I fully understand it, and it would seem that the duplicate checking mechanism itself would also be very large and not fit into memory contrary to what the author intended.
And I read up on External Sorting. https://en.wikipedia.org/wiki/External_sorting
That could definitely be something I try if I implement it with sorting first.
What method do you suggest? Many thanks for input!
Here is my solution
Based on suggestions here and elsewhere, I decided to go with a database table to do the work of removing duplicates. The table is defined with UNIQUE. When entries are added, any that are duplicates to those already in the table are discarded.
I used an SQLite database, and the SQLite JDBC library by xerial. I don't know if it's fastest among database solutions for this purpose, but it is easy to deal with. Not requiring a server, and just dealing with files, will surely be an advantage too.
I created the table in the SQLite database like this-
CREATE TABLE lines (
    content TEXT UNIQUE
);
It's a table with one column of TEXT type, with UNIQUE specified to prevent duplicates. An attempt to insert a duplicate entry gives an exception.
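For reference, the JDBC side of this setup looks roughly like the following sketch. It assumes the xerial driver is on the classpath; the class name and the IF NOT EXISTS wording are just my choices for the sketch (the real app can reuse an existing database file, as described later).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class DedupDbSetup {

    // Open (or create) the SQLite database file and make sure the table exists.
    public static Connection openDatabase(String dbFilePath) throws SQLException {
        Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbFilePath);
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS lines (content TEXT UNIQUE)");
        }
        return conn;
    }
}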
Initially my app runs through all the input files specified and counts the lines in each. I was able to do this in a multithreaded way (for each input file) and that improved speed.
Arrays.stream(inputFileArray)
        .parallel()
        .forEach(inputFile -> {
            inputFile.countLines();
            inputFile.resetToBeginning();
        });
This inputFile and its countLines() method ultimately use BufferedInputStream to read the data, checking every byte for the newline character. It uses a read size of 4 KB, since that seemed fast in my testing. If the file ends with a newline, that last one is not counted; every other newline character is counted, and that total ends up being the number of lines. (It's possible BufferedInputStream is quicker than BufferedReader in this instance, though I'm not sure; BufferedReader requires converting from bytes to String/character data.) This counting of lines ends up requiring only a negligible amount of time compared to the core processing.
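A simplified sketch of that counting approach (not my exact code; the class name is made up, and the buffer size and trailing-newline handling follow the description above):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineCounter {

    // Count '\n' bytes in 4 KB chunks; a newline as the very last byte is not counted.
    public static long countLines(Path path) throws IOException {
        long newlines = 0;
        boolean lastByteWasNewline = false;
        byte[] buf = new byte[4 * 1024];
        try (BufferedInputStream in = new BufferedInputStream(Files.newInputStream(path), buf.length)) {
            int read;
            while ((read = in.read(buf)) != -1) {
                for (int i = 0; i < read; i++) {
                    boolean isNewline = (buf[i] == '\n');
                    if (isNewline) {
                        newlines++;
                    }
                    lastByteWasNewline = isNewline;
                }
            }
        }
        return lastByteWasNewline ? newlines - 1 : newlines;
    }
}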
I have a basic loop through the input files, with a nested loop reading the lines and acting on them. This is the core line-processing portion of the app. This reading actually uses BufferedReader, as it's a perfect fit for reading lines, and I do need the lines as character data, so there is no benefit to leaving them as bytes.
For each line, I do an up-front rules check which can rule out some lines. I can do different things here like just discard every line that's longer/shorter than a certain length, or discard any lines that contain non-ASCII characters.
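As an illustration, the rules check can be something like this (a sketch; the exact limits and rules vary, and the ones shown are placeholders):

// Example up-front rules check: length bounds and ASCII-only content.
// The limits here are placeholders, not my actual configuration.
static boolean passesUpFrontRules(String line) {
    if (line.isEmpty() || line.length() > 25) {
        return false;
    }
    for (int i = 0; i < line.length(); i++) {
        if (line.charAt(i) > 127) {
            return false; // contains a non-ASCII character
        }
    }
    return true;
}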
Next an attempt is made to insert the line content to the database.
Batching
I used batching for the SQL inserts to improve speed. This was big! I have a batch size of 4 million entries right now. Rather than doing preparedStatementInsert.executeUpdate(), I do preparedStatementInsert.addBatch(). Then I check the running insert count against the batch size (using modulus) and do preparedStatementInsert.executeBatch().
psInsert.addBatch();
[...]
if (insertsCount % INSERTS_BATCH_SIZE == 0) {
    psInsert.executeBatch();
}
Transactions & commits
And I'm also using transactions + commits (auto-commit disabled). It's a similar idea. Before the SQL inserts, I call conn.setAutoCommit(false). Then within the loop over the lines, I do a conn.commit(), also based on a modulus of a chosen transaction size. I have that at 16 million right now. I'll play with these sizes more when I'm doing bigger runs.

After the loop through the lines is complete, I do one extra psInsert.executeBatch() and conn.commit() to make sure everything is executed against the database.
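Putting the batching and the transactions together, the core insert loop looks roughly like this sketch (names and constants are illustrative, it leans on the rules check shown earlier, and it uses the INSERT OR IGNORE form discussed below):

import java.io.BufferedReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class InsertLoopSketch {

    static final int INSERTS_BATCH_SIZE = 4_000_000;
    static final int TRANSACTION_SIZE = 16_000_000;

    // Read lines, apply the up-front rules check, and insert with batching + manual commits.
    static void insertLines(Connection conn, BufferedReader reader) throws SQLException, IOException {
        conn.setAutoCommit(false);
        try (PreparedStatement psInsert =
                 conn.prepareStatement("INSERT OR IGNORE INTO lines (content) VALUES (?)")) {
            long insertsCount = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!passesUpFrontRules(line)) {
                    continue; // discarded by the up-front rules check
                }
                psInsert.setString(1, line);
                psInsert.addBatch();
                insertsCount++;
                if (insertsCount % INSERTS_BATCH_SIZE == 0) {
                    psInsert.executeBatch();
                }
                if (insertsCount % TRANSACTION_SIZE == 0) {
                    conn.commit();
                }
            }
            // Flush the final partial batch and commit whatever is left.
            psInsert.executeBatch();
            conn.commit();
        }
    }

    // Placeholder; see the rules-check sketch above.
    static boolean passesUpFrontRules(String line) {
        return !line.isEmpty() && line.length() <= 25;
    }
}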
Initially I was counting the number of duplicates discarded while in progress by catching the exception on inserts and checking whether its message mentioned the constraint (basically I just checked for "UNIQUE"). With small files that worked. But with batching, and large files, I had errors trying to catch the exceptions and count them. I was getting the following error -
[SQLITE_CONSTRAINT_UNIQUE] A UNIQUE constraint failed (UNIQUE constraint failed: lines.content)
It was throwing the exception even when I was trying to catch it.
I ended up using INSERT OR IGNORE in my insert statements. This silently skips duplicates (and other constraint violations). That means I can't count duplicates by catching exceptions, and since I no longer track that total while in progress, it can't be reported to the user in the periodic progress reports.
INSERT OR IGNORE INTO lines (content) VALUES (?)
I will probably revisit that. It's possible I need to catch the exception differently. Though I think it's throwing out the whole batch if there is one exception within the batch, and that's no good.
The alternative would be to do a SELECT on the database, count how many rows were added, and compare that to how many lines were processed. That could be done after every insert, or periodically after so many lines processed, like the batch execution. Of course that would have a performance hit, so I decided to just do it at the very end.
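A sketch of that end-of-run count (the class and method names are mine, just for illustration):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RowCountSketch {

    // Count the rows currently in the table. Comparing a before/after count with the
    // number of lines processed in a run gives the number of duplicates discarded.
    static long countStoredLines(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM lines")) {
            return rs.next() ? rs.getLong(1) : 0L;
        }
    }
}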
At this point, the database has all of the entries across all input files, and they are deduplicated.
To write to output, first the app does a simple SELECT on all entries in the database table. An ORDER BY can be added here to sort alphabetically.
SELECT content FROM lines ORDER BY content ASC
Then it loops through the ResultSet of that, and writes lines to the output file. I did try making a local write buffer to batch up a bunch of lines to write, and then write them in the batches. But I found I got no performance gain from that. It's already using a buffer and I guess it does just fine. This stage of the execution is quite short compared to the core processing, but with a huge file it can still take many hours of course.
String currLineForOutput;
while (rsLinesContent.next()) {
    currLineForOutput = rsLinesContent.getString("content");
    outputFile.writeLine(currLineForOutput);
    numLinesWrittenToOutput++;
    if (numLinesWrittenToOutput % REPORTING_INTERVAL_NUM_LINES_WRITTEN == 0) {
        // report lines written so far
    }
}
rsLinesContent.close();
psLinesContent.close();
Then I optionally do some database cleanup here. Initially I emptied the database and then ran a VACUUM so the database file shrank; otherwise it would remain large (which is done for performance reasons, to reduce file re-sizing).

But then the use case was adjusted: the user wanted to run this to add entries from a new file, potentially many times. With the previous setup and an empty database, the insertions (duplicate checks) would be repeating work. To reduce processing time, especially for huge files, my app now leaves the database file as it is at the end of processing. This way the user does not need to include all previous input files, or a "master" input file, in the command; they only need to include the new files. The database retains the cumulative deduplicated content from previous runs, and significant time is saved. My app now also allows the user to specify the database file to use, so the database can be moved to a different drive due to large sizes (for a possible performance improvement), or an older copy of the database can be kept for some reason.
I did make one attempt to parallelize the processing of lines, and inserting them. But I ran into SQLite errors due to contention: [SQLITE_BUSY] The database file is locked (database is locked)
I'll have another go at it soon.
Previous versions have been run on some quite large files. For anyone curious, I'll list out some times. I don't have times for the newest versions, but I think it would be a little quicker.
A single 262.2 GB file with 23.1 billion lines
Duration: 11 hours, 39 minutes
Duration counting lines in input: 4 minutes
Duration processing lines: 6 hours, 42 minutes
Duration writing lines to output: 4 hours, 51 minutes
4 input files - 250.9 GB, 9.0 GB, 15.0 GB, 28.3 GB (303 GB, 27.7 billion lines total)
Duration: 1 day, 17 hours
Duration counting lines in input: 4 minutes
Duration processing lines: 1 day, 9 hours
Duration writing lines to output: 7 hours, 4 minutes
OldDogProgrammer had mentioned the sort command in Windows, and Toby Speight mentioned sort commands as well. These may cover part of the functionality. So I tried out the Windows sort command myself in my smaller tests. To remove duplicates, the unique option has to be used.
I did see it was slightly faster. I'll have to run a new comparison at some point, since my app is faster than it was. But my app is also designed for the data structures to be in "disk" storage, not memory (due to the huge sizes), and it's possible the sort command keeps them in memory; I haven't tried it on huge files yet. However, the sort command had some disadvantages, and in the end it's not useful for this purpose.
Disadvantages of sort command:
"The sort command doesn't distinguish between uppercase and lowercase letters"
No progress reporting - no idea how it's doing if it's really large dataset. If it's running for potentially days, this is a problem.
It doesn't allow for any extra rules to remove entries (up-front rules)
It requires sorting, can't keep original sort order if desired
The lack of distinction between uppercase and lowercase letters basically kills its usefulness for my purposes here. Those values need to be treated as distinct, not discarded as duplicates. I have not yet looked into the sort command for Unix/Linux/macOS.
I reworked my code to use a producer/consumer model. The producer reads lines from the input file(s), does the up-front rules check on each line, and puts the lines into a queue. The consumer does all the database inserts, pulling lines off the queue and inserting them. So these two are decoupled a bit.
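A stripped-down sketch of that producer/consumer structure (the queue size, poison-pill marker, and helper methods are illustrative placeholders, not my exact code):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerSketch {

    // Poison pill; longer than any line that passes the rules check, so it can't collide.
    private static final String END_OF_INPUT = "END_OF_INPUT_POISON_PILL_MARKER";

    public static void run(Path inputFile, Connection conn) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100_000);

        // Producer: read lines, apply the up-front rules check, hand survivors to the queue.
        Thread producer = new Thread(() -> {
            try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (passesUpFrontRules(line)) {
                        queue.put(line);
                    }
                }
                queue.put(END_OF_INPUT);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        // Consumer: a single thread doing the batched inserts (SQLite objected to more than one).
        Thread consumer = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(END_OF_INPUT)) {
                    insertIntoBatch(conn, line); // addBatch + periodic executeBatch/commit, as earlier
                }
                flushRemaining(conn);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }

    // Placeholder hooks; in the real app these hold the logic from the earlier sketches.
    private static boolean passesUpFrontRules(String line) { return !line.isEmpty() && line.length() <= 25; }
    private static void insertIntoBatch(Connection conn, String line) { /* batched PreparedStatement insert */ }
    private static void flushRemaining(Connection conn) { /* final executeBatch + commit */ }
}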
With this new setup, I made a second attempt at parallelizing the code that inserts the lines into the database. However, that ran into the same issue: even with just 2 threads and small batch and transaction sizes, I got that error. In the one run where I got only a single error, it seemed to skip only a minimal amount of data judging by the output size, but the execution time was much longer due to the small batch and transaction sizes. So I abandoned that.
But I also reworked the parallelization of the producer - reading lines from input and performing the up-front rules check. That seemed to help a little, but not that much.
Overall it is slightly improved in performance with these new changes. But it seems I can't meaningfully improve the writing to the SQLite database. This isn't unexpected based on what I've read out there, but I had to give it a try. And that portion is the bulk of the run time. Any further attempt to use parallelization for speed improvement would require another method, maybe a different database system or some different way of checking for duplicates, likely with intermediate sorting like External Sorting.