bluedata, hpe-container-platform, ezmeral

How can I read data from DataTap using CPython?


I would like to read data from DataTap using CPython, that is, plain Python without Spark.

In Spark, I can do something like:

df = spark.read.csv("dtap://MaprClus2/tmp/airline-safety.csv")

How can I do the same from CPython, for example when I don't have a PySpark Jupyter kernel?


Solution

  • One option is to use a subprocess to call the hadoop CLI (this needs the hadoop command to be available on the PATH) and load its output into pandas:

    from io import BytesIO
    from subprocess import check_output

    import pandas as pd

    def hdfs_read(fpath):
        # Run 'hadoop fs -cat' and capture the file contents from stdout
        out = check_output(['hadoop', 'fs', '-cat', fpath])
        # Wrap the raw bytes in a file-like object that pandas can read
        return BytesIO(out)

    data = hdfs_read("dtap://MaprClus2/tmp/airline-safety.csv")

    # Row 1 contains a hadoop CLI warning, so skip it
    df = pd.read_csv(data, sep=",", skiprows=1)
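
Skipping a fixed number of rows drops the real header if the warning ever stops appearing. A possible refinement, sketched here under the assumption that you know how the header line begins, is to discard everything before that line instead. The helper name `drop_lines_before_header` and the `header_start` parameter are made up for illustration, and the `b"airline"` value in the usage comment is an assumed first column name, not something confirmed for this file.

```python
from io import BytesIO
from subprocess import check_output


def drop_lines_before_header(raw: bytes, header_start: bytes) -> BytesIO:
    """Return a buffer that starts at the first line beginning with header_start."""
    lines = raw.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.startswith(header_start):
            # Keep the header line and everything after it
            return BytesIO(b"".join(lines[i:]))
    raise ValueError(f"no line starts with {header_start!r}")


def hdfs_read(fpath: str, header_start: bytes) -> BytesIO:
    # Same 'hadoop fs -cat' call as in the answer; needs the hadoop CLI on the PATH
    out = check_output(["hadoop", "fs", "-cat", fpath])
    return drop_lines_before_header(out, header_start)


# Usage (b"airline" is an assumed header prefix, adjust it to your file):
# import pandas as pd
# data = hdfs_read("dtap://MaprClus2/tmp/airline-safety.csv", b"airline")
# df = pd.read_csv(data, sep=",")
```

With this variant the `skiprows=1` argument is no longer needed, and the read fails loudly with a `ValueError` if the expected header is not found.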