
Hadoop: Failed to connect to HDFS (Hadoop) using Python


I am trying to connect to HDFS, which runs in an Ubuntu VM, from Python in Jupyter on Windows 10. Can anybody help me with the connection error below? Thank you.

Package used: pywebhdfs. Environment: Ubuntu 18.04 (VM), Windows 10 (host).

    from pywebhdfs.webhdfs import PyWebHdfsClient
    from pprint import pprint

    HDFS_CONNECTION = PyWebHdfsClient(host='localhost',port='9000', user_name='root-sai')

    HDFS_CONNECTION.list_dir('hdfs"//localhost:9000/New')

Error:

ConnectionError: HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /webhdfs/v1/hdfs%22//localhost%3A9000/New?op=LISTSTATUS&user.name=root-sai (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000250AB1FF438>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

Solution

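  • The WebHDFS (HTTP) port is not the same as the NameNode RPC port. Port 9000, used in the question, is the RPC port and does not serve the WebHDFS REST API. By default, the WebHDFS port is 50070 (Hadoop 2.x; Hadoop 3.x uses 9870).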

    If WebHDFS is not enabled (it is enabled by default), add this property to hdfs-site.xml:

    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    

    You can test whether WebHDFS is enabled by sending a curl request.

    The request below checks whether the /tmp directory exists; update the value of user.name as required.

    curl -i "http://localhost:50070/webhdfs/v1/tmp?user.name=hadoop-user&op=GETFILESTATUS"
    

    Then initialize PyWebHdfsClient with the WebHDFS port and pass list_dir a plain HDFS path rather than an hdfs:// URI:

    HDFS_CONNECTION = PyWebHdfsClient(host='localhost',port='50070', user_name='root-sai')
    
    HDFS_CONNECTION.list_dir('/New')
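
Before calling pywebhdfs, it can help to confirm that the WebHDFS port is reachable at all and to see the URL that will be requested. The following is a minimal sketch using only the standard library. The helper names `webhdfs_url` and `port_is_open` are hypothetical and are not part of pywebhdfs; the URL layout is taken from the error message and the curl example above, and the host and port values are assumptions to adjust for your setup.

```python
import socket
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user_name):
    """Build a WebHDFS REST URL for an absolute HDFS path such as '/New'.

    Hypothetical helper: mirrors the URL layout seen in the error message
    and the curl example, not an API of pywebhdfs.
    """
    if not path.startswith("/"):
        # Catches inputs like 'hdfs"//localhost:9000/New' from the question.
        raise ValueError("expected an absolute HDFS path such as '/New', got %r" % path)
    query = urlencode({"op": op, "user.name": user_name})
    return "http://%s:%s/webhdfs/v1%s?%s" % (host, port, quote(path), query)


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Includes "connection refused" (WinError 10061) and timeouts.
        return False


if __name__ == "__main__":
    # Assumed values: Hadoop 2.x default WebHDFS port on localhost.
    host, port = "localhost", 50070
    print(webhdfs_url(host, port, "/New", "LISTSTATUS", "root-sai"))
    print("reachable:", port_is_open(host, port))
```

If `port_is_open` returns False from Windows, the NameNode's HTTP port is probably not exposed to the host; since HDFS runs inside a VM here, `localhost` only works when the VM forwards that port, otherwise the VM's IP address would be needed as the host.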