I would like to read data from DataTap using CPython.
In Spark, I can do something like:
df = spark.read.csv("dtap://MaprClus2/tmp/airline-safety.csv")
How can I do the same with plain CPython, for example when I don't have a PySpark Jupyter kernel?
One option is to use a subprocess to call out to the hadoop CLI:
from io import BytesIO
from subprocess import check_output

import pandas as pd

def hdfs_read(fpath):
    # Run `hadoop fs -cat` and wrap the captured stdout in a file-like object.
    out = check_output(['hadoop', 'fs', '-cat', fpath])
    return BytesIO(out)

data = hdfs_read("dtap://MaprClus2/tmp/airline-safety.csv")
# Row 1 contains a hadoop CLI warning, so skip it.
pd.read_csv(data, sep=",", skiprows=1)
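If the hadoop command fails (wrong path, missing client, and so on), it helps to surface the CLI's own error message rather than only an exit code. Below is a minimal sketch of the same subprocess approach with that error handling added. The function name `read_csv_via_cli` and its `cmd` parameter are hypothetical names made up for this example, not part of any library. It assumes the `hadoop` binary is on the PATH and is configured for `dtap://` URLs.

```python
import subprocess
from io import BytesIO

import pandas as pd


def read_csv_via_cli(fpath, cmd=("hadoop", "fs", "-cat"), **read_csv_kwargs):
    """Run `cmd fpath`, capture its stdout, and parse it with pandas.

    Hypothetical helper for illustration. `cmd` defaults to the hadoop CLI
    call used above; extra keyword arguments go straight to pd.read_csv.
    """
    # Capture stdout and stderr separately so stderr can be reported on failure.
    result = subprocess.run(
        [*cmd, fpath],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        message = result.stderr.decode(errors="replace").strip()
        raise RuntimeError(f"{' '.join(cmd)} failed for {fpath!r}: {message}")
    # The whole file is held in memory, which is fine for small CSVs.
    return pd.read_csv(BytesIO(result.stdout), **read_csv_kwargs)
```

With the file from the question, the call would look like `read_csv_via_cli("dtap://MaprClus2/tmp/airline-safety.csv", skiprows=1)`. Keep `skiprows=1` only if the warning line really does show up in the captured output on your cluster.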