java, file, directory, java-stream, java-nio

How to delete duplicate files in a folder using Java Stream?


I have a folder containing copies of the same photos under different names, and I want to remove the duplicates (it doesn't matter which ones) using the Stream API, which I'm fairly new to.

I tried to use this method, but of course it's not that simple, and it's just deleting all files.

File directory = new File("D:\\Photos\\Test");

List<File> files = Arrays.asList(Objects.requireNonNull(directory.listFiles()));
files.stream().distinct().forEach(file -> Files.delete(file.toPath()));

I also tried to convert each file into a byte array and apply distinct() on the stream of byte arrays, but it didn't find any duplicates.

Is there a way to make this happen using only streams?


Solution

  • but of course it's not that simple and it's just deleting all files

    Sure thing: distinct() on a stream of File objects keeps the files that have distinct paths (File's equals() doesn't care about content, it compares paths), and since every file has a distinct path, none of them is treated as a duplicate and they all get deleted. (The byte-array attempt fails for a similar reason: arrays don't override equals(), so distinct() compares them by reference and never finds duplicates.)

    What you really need is logic for determining whether the contents of two files are the same. Since Java 12 we have the method Files.mismatch(), which compares the bytes of the specified files and returns the position of the first mismatching byte, or -1 if the files are identical.

    Another important thing to note is that in this case the Stream API isn't the right tool, because of the need to deal with checked exceptions. Both mismatch() and delete() throw IOException (which is common for methods of the Files class), and we can't propagate it outside the stream. Exception-handling logic inside a lambda looks ugly and hurts readability. You could extract the code that invokes mismatch() and delete() into two separate methods, but that would lead to duplication of the exception-handling logic.
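
    For illustration, here's roughly what a stream-based version would force you into (just a sketch; targetFolder and originalFile are assumed to be Path variables like the ones used in the method shown below):

    // the try/catch blocks inside the lambdas are what ruin readability
    try (Stream<Path> paths = Files.list(targetFolder)) {
        paths.filter(path -> !path.equals(originalFile))
             .filter(path -> {
                 try {
                     return Files.mismatch(path, originalFile) == -1; // contents match
                 } catch (IOException e) {
                     throw new UncheckedIOException(e);
                 }
             })
             .forEach(path -> {
                 try {
                     Files.delete(path);
                 } catch (IOException e) {
                     throw new UncheckedIOException(e);
                 }
             });
    } catch (IOException | UncheckedIOException e) { // wrapped exceptions escape the stream unchecked
        e.printStackTrace();
    }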

    The better option would be to use a DirectoryStream as a means of traversal and handle exceptions right on the spot:

    public static void removeDuplicates(Path targetFolder, Path originalFile) {
        
        try (DirectoryStream<Path> paths = Files.newDirectoryStream(targetFolder)) {
            
            for (Path path : paths) {
                // skip the original itself, delete every file whose contents match it
                if (!path.equals(originalFile)
                    && Files.mismatch(path, originalFile) == -1) {
    
                    Files.delete(path);
                }
            }
                
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
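
    For example, if the folder and a reference photo are known in advance, the call might look like this (the paths here are just placeholders):

    Path folder = Path.of("D:\\Photos\\Test");
    Path original = folder.resolve("original.jpg"); // hypothetical reference file
    removeDuplicates(folder, original);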
    

    Sidenote: the File class is legacy; avoid using it and stick with Path and Files instead.

    In case there is no particular original file and you just need to analyze the folder and clean it of duplicates, you could invoke the method shown above for every file in the folder. But that would mean reading the same files multiple times, which is undesirable.

    To avoid reading files multiple times, we can compute a hash of every encountered file and offer each hash to a Set. If the Set rejects the hash (i.e. add() returns false), the file is a duplicate.

    In the code below, SHA-256 is used as a hashing algorithm.

    public static void removeDuplicates(Path targetFolder) {
        try (DirectoryStream<Path> paths = Files.newDirectoryStream(targetFolder)) {
            
            Set<String> seen = new HashSet<>();
            
            for (Path path : paths) {
                if (Files.isDirectory(path)) continue; // skip subdirectories
                
                if (!seen.add(getHash(path))) { // hash has been encountered previously - hence the file is a duplicate
                    
                    Files.delete(path);
                }
            }
            
        } catch (IOException | NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
    }
    
    public static String getHash(Path path) throws NoSuchAlgorithmException, IOException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(Files.readAllBytes(path));
        return toHexadecimal(md.digest());
    }
    
    public static String toHexadecimal(byte[] bytes) {
        
        return IntStream.range(0, bytes.length)
            .mapToObj(i -> String.format("%02x", bytes[i]))
            .collect(Collectors.joining());
    }
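
    With this variant you only need to pass the folder (the path is again a placeholder):

    removeDuplicates(Path.of("D:\\Photos\\Test"));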
    

    Note that although it's possible for two different files to produce the same hash, it's extremely unlikely, and the code shown above ignores the possibility of collisions.

    If you're wondering what code that can handle collisions might look like, here is an extended version.
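
    The idea is to remember every distinct file seen per hash and, when a hash repeats, to compare the actual contents with Files.mismatch() before deleting. A sketch of that (reusing the getHash() method from above) could look like this:

    public static void removeDuplicatesCheckingCollisions(Path targetFolder) {
        try (DirectoryStream<Path> paths = Files.newDirectoryStream(targetFolder)) {
            
            // hash -> all distinct files encountered so far with that hash
            Map<String, List<Path>> seen = new HashMap<>();
            
            for (Path path : paths) {
                if (Files.isDirectory(path)) continue; // skip subdirectories
                
                List<Path> sameHash = seen.computeIfAbsent(getHash(path), k -> new ArrayList<>());
                
                boolean duplicate = false;
                for (Path candidate : sameHash) {
                    if (Files.mismatch(path, candidate) == -1) { // contents really are identical
                        duplicate = true;
                        break;
                    }
                }
                
                if (duplicate) {
                    Files.delete(path);
                } else {
                    sameHash.add(path); // new hash, or a genuine collision with different contents
                }
            }
            
        } catch (IOException | NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
    }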