I have hundreds of thousands of files in a Hadoop directory and I need to clean them up. I want to delete the files that are more than 3 months old, removing them from that directory in batches of a thousand, but I'm having problems. Among the multitude of files there are some with a space in the name, like "hello word.csv". I've tried building the batches with shell arrays or by writing the list to a file, but one way or another the files are not recognised when I run hdfs dfs -rm -f.
This is how I find the files:
list_files=$(hdfs dfs -ls "${folder_in}" | awk '!/^d/ {print $0}' | awk -v days=${dias} '!/^d/ && $6 < strftime("%Y-%m-%d", systime() - days * 24 * 60 * 60) { print substr($0, index($0,$8)) }')
I wanted to delete the HDFS files in batches by loading an array in the shell script as follows:
while IFS="" read -r file; do
files+=(\"${file}\")
echo -e "\"$file\"" > ${PATH_TMP}/file_del_proof.tmp
done <<< "$list_files"
With the following script I tried to delete the HDFS files:
total_lines=$(wc -l < "${PATH_TMP}/file_del_proof.tmp")
start_line=1
while [ $start_line -le $total_lines ]; do
    end_line=$((start_line + batch_size - 1))
    end_line=$((end_line > total_lines ? total_lines : end_line))
    hdfs dfs -rm -f -skipTrash $(awk -v end_line=${end_line} -v start_line=${start_line} 'NR >= start_line && NR <= end_line' "${PATH_TMP}/file_del_proof.tmp")
    start_line=$((end_line + 1))
done
The problem is that some of the files in that list have spaces in their names, and because of them I can't find an automatic way to delete the files older than a certain age in HDFS: when the list is expanded for deletion, names like "hello word.csv" or "hello word2.csv" get split on the spaces, so for such a file the command that effectively runs is:
hdfs dfs -rm /folder/hello
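To make the failure mode concrete, here is a minimal reproduction of the splitting, with made-up names and no HDFS involved:

list_files=$'hello word.csv\nhello word2.csv'
printf '<%s> ' $list_files    # unquoted expansion splits on spaces as well as newlines
# prints: <hello> <word.csv> <hello> <word2.csv>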
One person gave me the idea of deleting the 3 oldest months this way: first move the 3 most recent months to a temporary folder, delete everything left in the original folder, and then move everything back from the temporary folder. But that fails too: the mv does not move the files that have spaces in their names.
Does anyone have any suggestions? The other idea I was given was to replace the spaces with _ in the names of the affected files, but I wanted to see if anyone knew of another option that deletes them without that renaming preprocessing.
It feels like using hdfs dfs -stat instead of hdfs dfs -ls would be a better choice; example:
$ hdfs dfs -stat '%Y,%F,%n/' some/dir/*
1391807842598,regular file,hello world.csv/
1388041686026,directory,someDir/
1388041686026,directory,otherDir/
1391807875417,regular file,File2.txt/
1391807842724,regular file,File one, two, three!.txt/
remark: I added a trailing / to the output format so that awk can use it as the record separator (it's a character that can't appear in a filename). Also, using a , as field separator makes it possible to accurately extract the first two fields (the filename might have commas in it, but you can just strip the first two fields from the record).
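To see what those two separators buy you, here is a quick standalone check of the parsing against one of the sample records above (no HDFS needed; RS, FS and sub() behave this way in any POSIX awk):

printf '1391807842598,regular file,File one, two, three!.txt/\n' |
awk 'BEGIN { RS = "/"; FS = "," }
$2 == "regular file" { print "mtime:", $1; sub(/^([^,]*,){2}/, ""); print "name :", $0 }'

It prints mtime: 1391807842598 followed by name : File one, two, three!.txt, commas included.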
Now, all that's left to do is to select the files whose "modification time" is smaller than the current time minus N days, rebuild their full path and output the latter as a NUL-delimited list for xargs -0 to process:
#!/bin/bash

folder_in=some/path
dias=30

printf '%s\0' "$folder_in"/* |
xargs -0 hdfs dfs -stat '%Y,%F,%n/' |
awk -v days="$dias" '
    BEGIN {
        RS = "/"                 # the trailing "/" added to the stat format ends each record
        FS = ","                 # fields: mtime in ms, file type, name (name may contain commas)
        basepath = ARGV[1]
        delete ARGV[1]           # keep the path as a variable only, read the records from stdin
        srand()                  # srand() returns the previous seed, so calling it twice
        modtime = (srand() - days * 86400) * 1000   # yields the current epoch (s); %Y is in ms
    }
    $2 == "regular file" && $1 < modtime {
        sub(/^([^,]*,){2}/, "")                     # strip the first two fields, keep the name
        printf("%s%c", basepath "/" $0, 0)          # output the full path, NUL-terminated
    }
' "$folder_in" |
xargs -0 hdfs dfs -rm -f
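Before letting it actually delete anything, it may be worth dry-running the script by swapping the last stage for something harmless, for instance xargs -0 -n 1 echo, so you can review the exact paths (spaces included) that would be handed to hdfs dfs -rm -f.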
notes:
Because you're dealing with "hundreds of thousands of files", I'm using the bash builtin printf for expanding the * glob. FYI, any non-builtin command would fail with an Argument list too long error.
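On the deletion side, xargs already takes care of splitting an overly long list into several hdfs dfs -rm invocations on its own. If you additionally want to cap each call at roughly a thousand files, as in your original batching attempt, -n does the chunking; for instance, the last stage of the pipeline could become:

xargs -0 -n 1000 hdfs dfs -rm -f -skipTrash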
As a consequence of using / as the record separator in awk, each $1 has a leading \n character; it doesn't matter, because $1 is used as a number so the newline is implicitly ignored. Additionally, the last record will be a single \n character, which is filtered out by the $2 == "regular file" condition.