I want to combine the directory listing of HDFS with awk. Is this workable? I mean the directory name, not the file name. Here is my awk command, which works fine locally:
awk 'NR <= 1000 && FNR == 1{print FILENAME}' ./*
And then I want to combine it with hadoop fs -ls like this:
hadoop fs -ls xxx/* | xargs awk 'NR <= 1000 && FNR == 1{print FILENAME}'
but it shows me: awk: cmd. line:2: fatal: cannot open file `-rwxrwxrwx' for reading (No such file or directory)
I have also tried:
awk 'NR <= 1000 && FNR == 1{print FILENAME}' < hadoop fs -ls xxx/*
awk 'NR <= 1000 && FNR == 1{print FILENAME}' < $(hadoop fs -ls xxx/*)
awk 'NR <= 1000 && FNR == 1{print FILENAME}' $(hadoop fs -ls xxx/*)
These all failed, unsurprisingly. I assume that awk needs to actually read every file in the directory, rather than receiving file contents passed to it as a stream. Am I right? Can anyone give me a workable solution for this?
Thanks in advance.
It seems to me that you want to access files that are on a Hadoop file system (HDFS). This is a distributed file system, and locally you only have access to the metadata of your files. If you want to operate on a file, you first have to copy it locally, which can be done using hadoop fs -get. After creating a local copy, you can start operating on the files. There is, however, an alternative: streaming the file's contents with hadoop fs -cat.
Normally I would say never parse the output of ls, but with Hadoop you don't have a choice here. The output of hadoop fs -ls is not similar to that of the standard Unix/Linux ls command; it is closely related to ls -l and returns the following output:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
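To see how the filename can be pulled out of such a line, here is a small local sketch; the sample text below merely mimics hadoop fs -ls output, and the paths in it are made up:

```shell
# Fake listing in the hadoop fs -ls format shown above (hypothetical paths).
# The real input would come from: hadoop fs -ls xxx/*
sample='Found 2 items
-rwxrwxrwx   3 user group       1024 2023-01-01 12:00 /xxx/part-00000
drwxrwxrwx   - user group          0 2023-01-01 12:00 /xxx/subdir'

# Skip the "Found N items" header and directory entries (lines starting
# with d), then print from the 8th field to the end of the line, so that
# filenames containing spaces survive intact.
printf '%s\n' "$sample" | awk '!/^d/ && !/^Found/{print substr($0, index($0,$8))}'
```

Printing substr($0, index($0,$8)) instead of just $8 is deliberate: it keeps everything from the filename onward, which matters when the filename contains spaces.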
Using this and piping it to awk, we get a list of the files that are of use. So we can now just set up a while-loop:
c=0
while read -r file; do
    [ "$c" -le 1000 ] && echo "${file}"
    nr=$(hadoop fs -cat "${file}" | wc -l)
    ((c+=nr))
done < <(hadoop fs -ls xxx/* | awk '!/^d/ && NF>=8 {print substr($0,index($0,$8))}')
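If you don't have a cluster at hand, the counting logic of that loop can be checked locally; the sketch below substitutes plain cat for hadoop fs -cat, uses a limit of 5 instead of 1000, and operates on throwaway demo files:

```shell
#!/bin/sh
# Local sketch of the loop above: plain `cat` stands in for
# `hadoop fs -cat`, and the limit is 5 instead of 1000.
dir=$(mktemp -d)
printf 'a\nb\nc\n'    > "$dir/f1"   # 3 lines
printf '1\n2\n3\n4\n' > "$dir/f2"   # 4 lines -> cumulative count 7 > 5
printf 'x\n'          > "$dir/f3"   # past the limit, should not be printed

c=0
for file in "$dir"/f1 "$dir"/f2 "$dir"/f3; do
    # Print the file name only while the running line count is within the limit
    [ "$c" -le 5 ] && basename "$file"
    nr=$(cat "$file" | wc -l)
    c=$((c + nr))
done
rm -r "$dir"
```

Note that the name of the file that crosses the limit (f2 here) is still printed, because the check happens before its lines are counted; that matches the behaviour of the Hadoop loop above.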
Note: your initial error was due to the non-Unix-like output of hadoop fs -ls. When that listing was passed through xargs, awk received -rwxrwxrwx as a filename, while it is actually the permission string of the file.