java performance directory-walk

Is there a workaround for Java's poor performance on walking huge directories?


I am trying to process files one at a time that are stored over a network. Reading the files is fast thanks to buffering, so that is not the issue. The problem is just listing the contents of a folder. I have at least 10k files per folder, spread over many folders.

Performance is super slow since File.list() returns an array instead of an iterable. Java goes off and collects all the names in a folder and packs them into an array before returning.

The bug entry for this is https://bugs.java.com/bugdatabase/view_bug;jsessionid=db7fcf25bcce13541c4289edeb4?bug_id=4285834 and it doesn't list a workaround. It just says this has been fixed for JDK7.
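For reference, the JDK7 feature this seems to point at is the java.nio.file API, where Files.newDirectoryStream hands back entries as they are read instead of building an array first. A minimal sketch, assuming a JDK7 or later runtime; LazyDirectoryListing, EntryHandler and forEachEntry are hypothetical names made up for illustration, not part of the JDK:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a JDK class): visits directory entries lazily.
final class LazyDirectoryListing {

    /** Hypothetical callback, invoked once per directory entry. */
    interface EntryHandler {
        void handle(Path entry) throws IOException;
    }

    /** Visits each entry of dir as it is read and returns how many were seen. */
    static long forEachEntry(Path dir, EntryHandler handler) throws IOException {
        long count = 0;
        // DirectoryStream is Closeable; try-with-resources releases the directory handle.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                handler.handle(entry); // processing can start before the listing is complete
                count++;
            }
        }
        return count;
    }
}
```

Whether this is actually faster over a network share would still need measuring; the sketch only avoids collecting every name up front.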

A few questions:

  1. Does anybody have a workaround to this performance bottleneck?
  2. Am I trying to achieve the impossible? Is performance still going to be poor even if it just iterates over the directories?
  3. Could I use the beta JDK7 builds that have this functionality without having to build my entire project on it?

Solution

  • Although it's not pretty, I solved this kind of problem once by piping the output of dir/ls to a file before starting my app, and passing in the filename.

    If you needed to do it within the app, you could just use Runtime.getRuntime().exec(), but it would create some nastiness.

    You asked. The first form is going to be blazingly fast, the second should be pretty fast as well.

    Be sure to use your chosen command's options for one item per line (bare, no decoration, no graphics), full paths, and recursion.
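    The first form (a listing file written before the app starts) could be consumed roughly like this. It's only a sketch: ListingFileReader, PathHandler and processListing are hypothetical names, and it assumes the listing was produced by something like "dir /s /b > listing.txt" on Windows or "find . -type f > listing.txt" on Unix, one path per line, in the platform's default encoding.

    ```java
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Hypothetical helper: walks a pre-generated listing instead of the directory itself.
    final class ListingFileReader {

        /** Hypothetical callback, invoked once per listed path. */
        interface PathHandler {
            void handle(String path);
        }

        /** Calls handler for every non-blank line and returns how many paths were handled. */
        static int processListing(String listingFile, PathHandler handler) throws IOException {
            int count = 0;
            // FileReader decodes with the platform default charset (an assumption about the listing)
            BufferedReader in = new BufferedReader(new FileReader(listingFile));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().isEmpty()) {
                        continue; // skip blank lines some commands emit
                    }
                    handler.handle(line);
                    count++;
                }
            } finally {
                in.close();
            }
            return count;
        }
    }
    ```

    You'd pass the listing's filename in on the command line and do the real work inside the handler.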

    EDIT:

    30 minutes just to get a directory listing, wow.

    It just struck me that if you use exec(), you can get its stdout redirected into a pipe instead of writing it to a file.

    If you did that, you should start getting the files immediately and be able to begin processing before the command has completed.

    The interaction may actually slow things down, but maybe not--you might give it a try.

    Wow, I just went to find the syntax of the .exec command for you and came across an article that is possibly exactly what you want (it lists a directory using exec and "ls" and pipes the result into your program for processing). The original Sun link was broken by Oracle; Jörg provided a working Wayback Machine link in a comment.

    Anyway, the idea is straightforward but getting the code right is annoying. I'll go steal some codes from the internets and hack them up--brb

    
    import java.io.*;

    /**
     * Note: Only use this as a last resort!  It's specific to Windows and even
     * at that it's not a good solution, but it should be fast.
     *
     * To use it, extend FileProcessor and call processFiles("...") with a list
     * of options if you want them like /s... I highly recommend /b
     *
     * Override processFile and it will be called once for each line of output.
     */
    public abstract class FileProcessor
    {
       public void processFiles(String dirOptions)
       {
          Process theProcess = null;
          BufferedReader inStream = null;

          // run "dir" through the Windows command interpreter
          try
          {
              theProcess = Runtime.getRuntime().exec("cmd /c dir " + dirOptions);
          }
          catch(IOException e)
          {
             System.err.println("Error on exec() method");
             e.printStackTrace();
             return; // no process was started, so there is nothing to read
          }

          // read the command's standard output, one line at a time
          try
          {
             inStream = new BufferedReader(
                                    new InputStreamReader( theProcess.getInputStream() ));
             String line;
             while ((line = inStream.readLine()) != null)
             {
                processFile(line);
             }
          }
          catch(IOException e)
          {
             System.err.println("Error on inStream.readLine()");
             e.printStackTrace();
          }
          finally
          {
             // closing the reader also closes the stream we read the process's output from
             try { if (inStream != null) inStream.close(); }
             catch(IOException e) { e.printStackTrace(); }
          }

       } // end method

       /** Override this method--it will be called once for each line of output */
       public abstract void processFile(String filename);

    } // end class
    

    And thank you code donor at IBM