I have been dealing with this problem for a week. I use the following code:
from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")
and I get "invalid parquet magic". However, ddf.read_parquet works fine with "webhdfs://".
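For reference, the variant that does work looks like this (IP, port, and path are the same placeholders as above; note that the "/webhdfs/v1" part of the path is not needed here, since the webhdfs filesystem adds it itself):

from dask import dataframe as ddf
ddf.read_parquet("webhdfs://IP:port/user/...")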
I would like ddf.read_parquet to work over HTTP because I want to use it in a dask-ssh cluster whose workers have no HDFS access.
Although the comments already partly answer this question, I thought I would add some information as an answer.

Dask can use HTTP(S) locations (via fsspec) as a backend filesystem; but to get partitioning within a file, you need to get the size of that file, and to resolve globs, you need to be able to get a list of links, neither of which is necessarily provided by any given server.
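For example, a single file on a plain HTTP(S) server can be read directly; this is only a sketch with a made-up URL, and it assumes the server reports the file size and honours byte-range requests:

from dask import dataframe as ddf

# Hypothetical URL; pointing at one concrete file sidesteps the
# glob-resolution problem, but partitioned reads still need the
# server to report Content-Length and support range requests.
df = ddf.read_parquet("https://example.com/data/part-0.parquet")
df.head()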
Using webHDFS (or the native "hdfs://" protocol, where workers have direct cluster access) is the right approach when the server provides it. However, kerberos-secured webHDFS can be tricky to work with, depending on how the security was set up.
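A minimal sketch of the webHDFS route, assuming a Hadoop 3 namenode listening on port 9870 and a kerberos ticket already obtained with kinit (the host and path are hypothetical):

from dask import dataframe as ddf

# storage_options are passed through to fsspec's WebHDFS filesystem;
# kerberos=True makes it authenticate with the existing ticket
# (this requires the requests-kerberos package to be installed).
df = ddf.read_parquet(
    "webhdfs://namenode.example.com:9870/user/me/data.parquet",
    storage_options={"kerberos": True},
)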