linux hadoop hdfs hadoop2

HDFS getting confused if same path is present on local node as well


I am using Hadoop provided by a CDH 5.4.1 cluster. The issue I'm facing is this: there is a directory on HDFS at the path /tmp/data containing some CSV files, say abc.csv. The same folder path also exists on one node's (say node1) local Linux filesystem, containing a CSV file xyz.csv.

When I run the following command from node1: hdfs dfs -ls /tmp/data/*.csv I expect the output to show abc.csv; however, I get an error saying

ls: `/tmp/data/xyz.csv': No such file or directory

The same command gives the correct output when run on other nodes that don't have this folder path on their local Linux filesystem.

My understanding was that since I am using the hdfs dfs command, Hadoop should look only in DFS space and not get confused with the local Linux filesystem, but that seems to be incorrect.

Can anyone provide pointers on what could be the reason behind this behaviour?


Solution

  • You are seeing the effects of Bash (or whichever shell you use) globbing: the shell expands the wildcard before passing the argument to the HDFS command. Because a file exists on your local file system at /tmp/data/xyz.csv, the command that actually gets invoked is hdfs dfs -ls /tmp/data/xyz.csv. Since xyz.csv does not exist in your HDFS cluster, it is reported as file not found.

    You can work around this by wrapping your argument in single quotes to prevent glob expansion, so the literal pattern reaches the hdfs command and is matched against the HDFS namespace instead:

    > # local file system
    > ls /tmp/data/*.csv
    /tmp/data/xyz.csv
    
    > # attempting to check HDFS, but wildcard expansion happens before invoking command
    > hdfs dfs -ls /tmp/data/*.csv
    ls: `/tmp/data/xyz.csv': No such file or directory
    
    > # wrap in single quotes to prevent globbing expansion
    > hdfs dfs -ls '/tmp/data/*.csv'
    -rw-r--r--   3 naurc001 supergroup          0 2017-02-02 11:52 /tmp/data/abc.csv
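    You can verify the shell-side expansion without touching HDFS at all. The sketch below (using an illustrative /tmp/glob_demo directory, not the asker's actual path) uses echo to show exactly what argument a command would receive in each case:

    ```shell
    # Set up a local directory with a single CSV file (illustrative path).
    mkdir -p /tmp/glob_demo
    touch /tmp/glob_demo/xyz.csv

    # Unquoted: the shell expands the wildcard first, so the command
    # receives the already-resolved local path.
    echo /tmp/glob_demo/*.csv       # prints /tmp/glob_demo/xyz.csv

    # Single-quoted: the literal pattern is passed through unchanged,
    # leaving the receiving command (e.g. hdfs dfs) free to do its own
    # glob matching against its own namespace.
    echo '/tmp/glob_demo/*.csv'     # prints /tmp/glob_demo/*.csv
    ```

    Note that if no local file matched the pattern, Bash would by default pass the pattern through literally even without quotes, which is why the unquoted command happens to work on the other nodes.
    
    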