
Connecting pyarrow with libhdfs3


I'm trying to connect to a Hadoop cluster via pyarrow's HdfsClient / hdfs.connect().

I noticed that pyarrow's have_libhdfs3() function returns False.
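For reference, a minimal sketch of that check. The `ctypes` lookup is an extra diagnostic added here, not part of pyarrow, and the helper names `libhdfs3_path` and `pyarrow_sees_libhdfs3` are made up for illustration. It assumes a pyarrow release old enough to expose `have_libhdfs3()`; where it is missing, the helper returns None instead of failing.

```python
import ctypes.util


def libhdfs3_path():
    """Return the name of the libhdfs3 shared library if the system loader can find it, else None."""
    return ctypes.util.find_library("hdfs3")


def pyarrow_sees_libhdfs3():
    """Return pyarrow's own answer, or None if pyarrow or the check is unavailable."""
    try:
        import pyarrow as pa
    except ImportError:
        return None
    # have_libhdfs3 is only present in older pyarrow releases (assumption).
    check = getattr(pa, "have_libhdfs3", None)
    if check is None:
        return None
    return bool(check())


if __name__ == "__main__":
    print("libhdfs3 visible to the loader:", libhdfs3_path())
    print("pyarrow have_libhdfs3():", pyarrow_sees_libhdfs3())
```

If the loader lookup returns None as well, the shared library is probably not installed (or not on the library search path), rather than pyarrow being misconfigured.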

How does one go about getting the required HDFS support for pyarrow? I understand there's a conda command for libhdfs3, but I need a "vanilla" approach that doesn't involve tools like conda.

In case it matters, the files I'm interested in reading are Parquet files.

EDIT:

The creators of the hdfs3 library maintain a package repository from which libhdfs3 can be installed:

http://hdfs3.readthedocs.io/en/latest/install.html


Solution

  • On Ubuntu, this worked for me:

    echo "deb https://dl.bintray.com/wangzw/deb trusty contrib" | sudo tee /etc/apt/sources.list.d/bintray-wangzw-deb.list
    sudo apt-get install -y apt-transport-https
    sudo apt-get update
    sudo apt-get install libhdfs3 libhdfs3-dev
    

    It should work on other Linux distros as well, using the appropriate package manager. Taken from:

    http://hdfs3.readthedocs.io/en/latest/install.html
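
Once libhdfs3 is installed, connecting and reading a Parquet file could look roughly like the sketch below. It assumes an older pyarrow release that still ships the legacy `pyarrow.hdfs.connect()` with its `driver` argument; as far as I know, newer releases have dropped that API together with libhdfs3 support. The function name, host, port, and file path are placeholders, not values from the question.

```python
def read_parquet_from_hdfs(host, port, path):
    """Read one Parquet file from HDFS into a pyarrow Table using the libhdfs3 driver."""
    # Imported lazily so this module can be loaded where pyarrow is absent.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Legacy API (assumption: an old pyarrow release that still provides it).
    fs = pa.hdfs.connect(host, port, driver="libhdfs3")
    with fs.open(path, "rb") as f:
        return pq.read_table(f)


# Example call with hypothetical values:
#   table = read_parquet_from_hdfs("namenode.example.com", 8020, "/data/example.parquet")
#   print(table.num_rows)
```

Opening the file through the filesystem object and handing the file handle to `read_table` keeps the sketch independent of which `filesystem` keyword a given pyarrow version accepts.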