I am reading data from a Data Catalog table, applying some processing, and then I need to store my output in Iceberg tables in the aggregate area of my data lake on S3.
I tried the following approach:
%pip install pyiceberg
The cell runs successfully and reports that pyiceberg is installed. Then I try to import the package in my Spark job before applying the PySpark transformations:
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, StringType, TimestampType, NestedField
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import YearTransform, MonthTransform, DayTransform
from pyiceberg.table.sorting import SortOrder, SortField
from pyiceberg.transforms import IdentityTransform
But the import fails because the package is not found:
ModuleNotFoundError: No module named 'pyiceberg'
I have tried `pip install pyiceberg`, I have tried restarting my kernel after installing the package, and I have tried **pip install iceberg** directly. But nothing is working.
My goal is to write the transformed data to an Iceberg table that the script creates. I earlier wrote to Parquet in my data lake, and that worked successfully.
In a Glue notebook you can only specify additional packages through the `additional-python-modules` configuration; you can't use pip to install packages into a vanilla notebook.
You need to run the following command:
%additional_python_modules pyiceberg
And if you have already started your session, you need to stop it and start it again for the change to take effect.
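A minimal sketch of the cell order in the notebook (assuming `%stop_session` is available, as in Glue Studio notebooks):

```
%stop_session
%additional_python_modules pyiceberg
```

Then run your next code cell; it starts a fresh session in which pyiceberg is installed and importable.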