Tags: regex, ruby, file, glob

Efficient file matching in Ruby with special characters in filenames


I am developing a Ruby application that works with a very large number of text files (millions of them). Each file is named with a two-character string followed by .txt, and those two characters can be regular alphanumeric characters or special characters. Here are a few examples:

File named ab.txt (no special characters)

File named -b.txt (the first character is a special character)

File named %!.txt (both characters are special characters)

I need to efficiently select a subset of these files based on user input. When the user enters a two-character string, the application must select every file, in any directory, whose name matches that string.

I tried using a glob pattern to search for the files of interest directly. For example, if the user enters ab, I call Dir.glob("data/*/ab.txt") to find the ab.txt files. However, this approach doesn't work when the input string contains special characters. For example, if the input string is -b, Dir.glob("data/*/-b.txt") does not find the -b.txt files.

So my question is: how can I efficiently select files based on a two-character string, even if that string includes special characters? Because of the large number of files, methods that read all file names into memory are too inefficient. I am looking for a way to select the files I am interested in directly, similar to the way globbing works when file names do not include special characters.


Solution

  • This is what Find excels at:

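    require 'find'

    files = []
    pattern = '-b.txt' # plain file name to look for; not a glob or regex

    # Find.find yields every path under some/dir, one at a time.
    Find.find('some/dir') do |path|
      # Plain string comparison, so special characters need no escaping.
      next unless path.end_with?(pattern)
      files << path
    end
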

    After running that, files contains the paths of everything within some/dir (or any of its subdirectories) whose path ends with -b.txt.

    I would avoid regexes, shell-based tools, and globs, because each of them gives certain characters a special meaning, which makes file names containing those characters tricky to match. The nice thing about Find is that it gives you the full file path as a plain string, which you can compare to another plain string, so no special-character handling is involved at all.
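
The plain-string idea above can be taken a little further. The following is only a sketch under stated assumptions, not a definitive implementation; both helper names (find_by_basename and find_in_subdirs) are hypothetical. The first helper compares the whole base name rather than the end of the path, so a longer name such as a-b.txt is not picked up by accident. The second assumes the layout from the question, where every file sits exactly one directory below the root (data/<subdir>/<name>). Under that assumption it builds each candidate path and tests whether it exists, so it only lists the subdirectories and never enumerates the files themselves. It relies on filter_map, which needs Ruby 2.7 or later.

```ruby
require 'find'

# Hypothetical helper: walk root recursively and keep the regular files
# whose base name equals name exactly. Only plain string comparison is
# used, so characters like -, % or ! need no escaping.
def find_by_basename(root, name)
  matches = []
  Find.find(root) do |path|
    matches << path if File.file?(path) && File.basename(path) == name
  end
  matches
end

# Hypothetical helper, ASSUMING the layout is root/<subdir>/<name>:
# build one candidate path per subdirectory and test whether it exists.
# Only the entries directly under root are listed; the files inside each
# subdirectory are never enumerated.
def find_in_subdirs(root, name)
  Dir.each_child(root).filter_map do |entry|
    candidate = File.join(root, entry, name)
    candidate if File.file?(candidate)
  end
end
```

For the question's layout this might be called as find_in_subdirs('data', '-b.txt') or find_in_subdirs('data', '%!.txt'). The order of the returned paths follows the directory listing, so sort the result if a stable order matters.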