bash file-extension file-recovery

Efficiently moving half a million files based on extension in bash


Scenario:

With the Locky virus on the rampage, the computer center I work for has found that the only method of file recovery is to use tools like Recuva. The problem is that such tools dump all the recovered files into a single directory. I would like to move those files into categories based on their file extensions: all JPGs in one folder, all BMPs in another, and so on. I have looked around Stack Overflow and, based on various other questions and responses, I managed to build a small bash script (sample provided) that kind of does that. However, it takes forever to finish, and I think I have the extensions messed up.

Code:

#!/bin/bash
path=$2   # Starting path to the directory of the junk files
var=0     # How many records were processed
SECONDS=0 # reset the clock so we can time the event

clear

echo "Searching $2 for file types and then moving all files into grouped folders."

# Only want to move files from the first level, as directories are OK where they are
for FILE in `find $2 -maxdepth 1 -type f`
do
  # Split the EXT off for the directory name using AWK
  DIR=$(awk -F. '{print $NF}' <<<"$FILE")
  # DEBUG ONLY
  # echo "Moving file: $FILE into directory $DIR"
  # Make a directory in our path then Move that file into the directory
  mkdir -p "$DIR"
  mv "$FILE" "$DIR"
  ((var++))
done

diff=$SECONDS # elapsed seconds since the reset at the top
echo "$var files found and organized in:"
echo "$((diff / 3600)) hours, $(((diff / 60) % 60)) minutes and $((diff % 60)) seconds."

Question:

How can I make this more efficient when dealing with 500,000+ files? The find takes forever to grab a list of files, and in the loop it attempts to create a directory even when that path already exists. I would like to deal with those two particular aspects of the loop more efficiently, if at all possible.


Solution

  • The bottleneck of any bash script is usually the number of external processes you start. In this case, you can vastly reduce the number of calls to mv you make by recognizing that a large percentage of the files you want to move will have a common suffix like jpg, etc. Start with those.

    for ext in jpg mp3; do
        mkdir -p "$ext"
        # Restrict to regular files, and for simplicity assume your
        # mv command supports the -t option
        find "$2" -maxdepth 1 -type f -name "*.$ext" -exec mv -t "$ext" {} +
    done
    

    Using -exec mv -t "$ext" {} + means find will pass as many files as possible to each call to mv. For each extension, that is one call to find and a minimal number of calls to mv.
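    The batching is easy to observe by substituting `echo` for `mv` (a quick demo in a scratch directory; the file names are made up):

    ```shell
    # Scratch directory with a few sample files (illustrative names).
    cd "$(mktemp -d)"
    touch a.jpg b.jpg c.jpg d.mp3

    # With '+', find packs as many paths as fit onto one command line,
    # so echo runs once and prints all matches on a single line.
    find . -maxdepth 1 -name '*.jpg' -exec echo {} +

    # With ';', the command runs once per file: three echo calls,
    # three lines of output.
    find . -maxdepth 1 -name '*.jpg' -exec echo {} \;
    ```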

    Once those files are moved, you can start analyzing the remaining files one at a time.

    for f in "$2"/*; do
        # Skip directories; only first-level files should move.
        [[ -f $f ]] || continue
        base=${f##*/}
        # Files with no dot get their own bucket rather than a
        # directory named after the file itself.
        [[ $base == *.* ]] && ext=${base##*.} || ext=no-extension
        # Probably more efficient to check in-shell if the directory
        # already exists than to start a new process to make the check
        # for you.
        [[ -d $ext ]] || mkdir "$ext"
        mv "$f" "$ext"
    done
    

    The trade-off occurs in deciding how much work you want to do beforehand identifying the common extensions to minimize the number of iterations of the second for loop.
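    One way to pick those common extensions is a quick frequency count before moving anything (a sketch; `dir` stands in for the `"$2"` starting directory used above):

    ```shell
    # Point this at the recovery directory ("$2" in the script above);
    # '.' is just a placeholder for the sketch.
    dir=.

    # Tally the extensions of the top-level files, most frequent first,
    # then feed the leaders into the first loop's extension list.
    find "$dir" -maxdepth 1 -type f -name '*.*' \
        | awk -F. '{print tolower($NF)}' \
        | sort | uniq -c | sort -rn | head -20
    ```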