Tags: python, pandas, apache-spark, databricks, parquet

How to read a file stored in ADLS Gen2 using pandas?


I am trying to read a Parquet file with pandas in a Databricks notebook. The cluster has permission to access ADLS.

import pandas as pd 
pdf = pd.read_parquet("abfss://abc.parquet")

However, pandas cannot read it and raises the following error:

ValueError                                Traceback (most recent call last)
<command-2342282971496650> in <module>
  1 import pandas as pd
  2 parquet_file = 'abfss://abc.parquet'
  ----> 3 pd.read_parquet(parquet_file)

  /databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, use_nullable_dtypes, **kwargs)
457     """
458     impl = get_engine(engine)
--> 459     return impl.read(
460         path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
461     )

/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
212                 )
213 
--> 214         path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
215             path,
216             kwargs.pop("filesystem", None),

/databricks/python/lib/python3.8/site-packages/pandas/io/parquet.py in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
 64         fsspec = import_optional_dependency("fsspec")
 65 
 ---> 66         fs, path_or_handle = fsspec.core.url_to_fs(
 67             path_or_handle, **(storage_options or {})
 68         )

 /databricks/python/lib/python3.8/site-packages/fsspec/core.py in url_to_fs(url, **kwargs)
369     else:
370         protocol = split_protocol(url)[0]
--> 371         cls = get_filesystem_class(protocol)
372 
373         options = cls._get_kwargs_from_urls(url)

/databricks/python/lib/python3.8/site-packages/fsspec/registry.py in get_filesystem_class(protocol)
206     if protocol not in registry:
207         if protocol not in known_implementations:
--> 208             raise ValueError("Protocol not known: %s" % protocol)
209         bit = known_implementations[protocol]
210         try:

ValueError: Protocol not known: abfss
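The last frame shows that fsspec, which pandas uses to resolve URL protocols, has no handler registered for abfss. A minimal sketch for checking whether the relevant packages are importable on the cluster follows. It assumes the handler comes from the adlfs package and that pyarrow is the Parquet engine; neither is confirmed by the traceback itself, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: adlfs supplies the abfss handler that fsspec looks up,
    # and pyarrow is the Parquet engine pandas will pick.
    print(missing_packages(["fsspec", "adlfs", "pyarrow"]))
```

If adlfs shows up in the printed list, installing it on the cluster is a reasonable first thing to try before changing the read call.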

I tried the following workaround:

import pandas as pd
import pyspark.pandas as ps 
pdf = ps.read_parquet("abfss://abc.parquet").to_pandas() 

This works, but converting the pyspark.pandas DataFrame to a pandas DataFrame takes a long time.

NOTE: I cannot mount ADLS to DBFS because the platform team has disabled DBFS, so all operations must be done directly on ADLS.

I am looking for a faster or simpler way to read files from ADLS Gen2 using pandas.


Solution

  • The problem is resolved: I can now read the data in ADLS directly with pandas, with no Spark or Koalas conversion needed.

    pd.read_parquet("file_path", storage_options=storage_options)  # storage_options: a dict of credentials, not a string
    

    See this article for the values to put in storage_options:

    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-use-pandas-spark-pool
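To make the one-liner above concrete, here is a minimal sketch of what the call might look like with an account key. It assumes adlfs is installed so that fsspec can resolve abfss, that the URL uses the container@account form, and that `account_key` is the credential key; the helper names and the container, account, path, and key values are hypothetical placeholders, not details from the original post.

```python
def abfss_url(container, account, path):
    """Build an abfss:// URL in the container@account form (assumed layout)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def account_key_options(account_key):
    """Credentials dict that pandas forwards to the filesystem backend."""
    return {"account_key": account_key}


def read_adls_parquet(url, storage_options):
    """Read one Parquet file from ADLS Gen2 straight into a pandas DataFrame."""
    # Imported here so the helpers above can be used without pandas installed.
    import pandas as pd

    return pd.read_parquet(url, storage_options=storage_options)


if __name__ == "__main__":
    # Placeholder values; replace them with your own container, account and key.
    url = abfss_url("my-container", "myaccount", "folder/abc.parquet")
    pdf = read_adls_parquet(url, account_key_options("<account-key>"))
    print(pdf.head())
```

The data is read by pandas on the driver, so the pyspark.pandas to pandas conversion from the workaround is avoided. In a shared notebook it is usually better to load the key from a secret store than to paste it into a cell.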