python, amazon-web-services, pyspark, aws-glue, apache-iceberg

ModuleNotFoundError: Cannot find package "pyiceberg" in AWS Glue Spark Job


I am reading data from a Data Catalog table, applying some processing, and then I need to store the output in Iceberg tables in the aggregate area of my data lake on S3.

I tried the following approach:

%pip install pyiceberg

The cell reports that it ran successfully and that pyiceberg is installed. I then try to import the package in my Spark job before applying the PySpark transformation:

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, StringType, TimestampType, NestedField
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import YearTransform, MonthTransform, DayTransform
from pyiceberg.table.sorting import SortOrder, SortField
from pyiceberg.transforms import IdentityTransform

But the package is not found, no matter what I try:


ModuleNotFoundError: No module named 'pyiceberg'

I have tried pip install pyiceberg, restarting my kernel after installing the package, and pip install iceberg directly, but nothing works.

My goal is to write the transformed data to an Iceberg table that the script creates. Earlier I tried writing Parquet to my data lake, and that worked.


Solution

  • In a Glue notebook, additional packages have to be specified through the additional-python-modules configuration; you can't use pip to install packages into a vanilla notebook.

    You need to run the command below:

    %additional_python_modules pyiceberg
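
For a Glue Spark job that runs outside a notebook, the same configuration is passed as the --additional-python-modules job parameter. The following is only a minimal sketch of how that could look with boto3: the helper names, the job name, role ARN, script location, and the GlueVersion value are all hypothetical placeholders, not values from the question.

```python
def build_default_arguments(modules):
    """Build Glue job DefaultArguments that request extra pip packages."""
    # Glue expects one comma-separated string, e.g. "pyiceberg,pandas".
    return {"--additional-python-modules": ",".join(modules)}


def create_glue_job(job_name, role_arn, script_location, modules):
    """Create a Glue Spark job (requires boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    return glue.create_job(
        Name=job_name,            # hypothetical job name supplied by the caller
        Role=role_arn,            # hypothetical IAM role ARN
        Command={
            "Name": "glueetl",    # Spark ETL job type
            "ScriptLocation": script_location,  # hypothetical S3 path to the script
            "PythonVersion": "3",
        },
        DefaultArguments=build_default_arguments(modules),
        GlueVersion="4.0",        # assumption; use the version your account runs
    )


# The argument map that would be sent for a job needing pyiceberg:
print(build_default_arguments(["pyiceberg"]))
```

Only build_default_arguments runs without AWS access; create_glue_job is shown to indicate where the argument map would go.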
    

    If you have already started your session, you need to stop it and start it again for the change to take effect: