I am trying to read a Parquet file with pandas in a Databricks notebook. The cluster has permission to access ADLS.
import pandas as pd
pdf = pd.read_parquet("abfss://abc.parquet")
But pandas is not able to read it and throws the error below.
ValueError Traceback (most recent call last)
<command-2342282971496650> in <module>
1 import pandas as pd
2 parquet_file = 'abfss://abc.parquet'
----> 3 pd.read_parquet(parquet_file)
/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, use_nullable_dtypes, **kwargs)
457 """
458 impl = get_engine(engine)
--> 459 return impl.read(
460 path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
461 )
/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
212 )
213
--> 214 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
215 path,
216 kwargs.pop("filesystem", None),
/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
64 fsspec = import_optional_dependency("fsspec")
65
---> 66 fs, path_or_handle = fsspec.core.url_to_fs(
67 path_or_handle, **(storage_options or {})
68 )
/databricks/python/lib/python3.8/site-packages/fsspec/core.py in url_to_fs(url, **kwargs)
369 else:
370 protocol = split_protocol(url)[0]
--> 371 cls = get_filesystem_class(protocol)
372
373 options = cls._get_kwargs_from_urls(url)
/databricks/python/lib/python3.8/site-packages/fsspec/registry.py in get_filesystem_class(protocol)
206 if protocol not in registry:
207 if protocol not in known_implementations:
--> 208 raise ValueError("Protocol not known: %s" % protocol)
209 bit = known_implementations[protocol]
210 try:
ValueError: Protocol not known: abfss
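From the traceback, fsspec does not recognize the abfss protocol. As far as I can tell, the abfss filesystem is implemented by the adlfs package, so the error suggests adlfs is missing (or the installed fsspec predates it). A minimal check, assuming adlfs can be installed on the cluster (run in a notebook cell):

%pip install adlfs fsspec

import fsspec
# With adlfs installed, this resolves to adlfs.AzureBlobFileSystem
# instead of raising 'Protocol not known: abfss'.
print(fsspec.get_filesystem_class("abfss"))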
I tried the following workaround.
import pandas as pd
import pyspark.pandas as ps
pdf = ps.read_parquet("abfss://abc.parquet").to_pandas()
The above workaround works, but converting the pyspark.pandas DataFrame to a pandas DataFrame takes a lot of time.
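One partial mitigation, assuming a Spark 3.x runtime where the spark session is predefined in the notebook: enable Arrow-based transfer before converting, which usually makes the collect-to-driver step much faster (the path is the same placeholder as above).

# Arrow-based columnar transfer speeds up the Spark-to-pandas conversion.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Still collects everything to the driver, so the data must fit in its memory.
pdf = spark.read.parquet("abfss://abc.parquet").toPandas()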
NOTE: I cannot mount ADLS to DBFS because DBFS is disabled by the platform team, so all operations need to be done directly on ADLS.
I am looking for a faster or simpler way to read files from ADLS Gen2 using Python pandas.
Any leads would be highly appreciated.
Finally, the problem is resolved: I am now able to read the data in ADLS using the pandas library alone, with no need for Spark or a Koalas conversion.
pd.read_parquet("file_path", storage_options={...})  # storage_options must be a dict of credentials, not a string
Follow this article for the storage_options details:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-use-pandas-spark-pool
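For reference, a minimal sketch of the working call, assuming authentication with a storage account key; the account, container, and key values below are placeholders (a service principal works similarly via the tenant_id, client_id, and client_secret keys):

import pandas as pd

# Placeholder credentials: adlfs, which fsspec uses for abfss://, reads these keys.
storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",
}

pdf = pd.read_parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/abc.parquet",
    storage_options=storage_options,
)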