python amazon-web-services amazon-s3 etl aws-glue

AWS Glue 3.0 pythonshell incompatible versions error


I am building a data lake with an ETL and analytics project. I am using AWS S3 for storage, Glue to create and run the job, and a Crawler and Athena to create the tables. I am struggling with this error, reported in the CloudWatch error logs:

scipy 1.8.0 requires numpy<1.25.0,>=1.17.3, but you have numpy 2.0.2 which is incompatible.
redshift-connector 2.0.907 requires pytz<2022.2,>=2020.1, but you have pytz 2025.2 which is incompatible.
awswrangler 2.15.1 requires numpy<2.0.0,>=1.21.0, but you have numpy 2.0.2 which is incompatible.
awswrangler 2.15.1 requires pandas<2.0.0,>=1.2.0, but you have pandas 2.2.3 which is incompatible.
awscli 1.23.5 requires botocore==1.25.5, but you have botocore 1.38.9 which is incompatible.
awscli 1.23.5 requires s3transfer<0.6.0,>=0.5.0, but you have s3transfer 0.12.0 which is incompatible.
aiobotocore 2.2.0 requires botocore<1.24.22,>=1.24.21, but you have botocore 1.38.9 which is incompatible. 

Note that I am using pythonshell, so the entire job runs as regular Python. My project's packages, sample raw data, and the config file are packaged in a .whl in a dependencies/ directory in my S3 bucket. The main.py that my glue_controller.py tells Glue to run is in my scripts/ directory in S3. This is how I create the job in glue_controller.py:

        response = glue_client.create_job(  # glue_client: assumed name for the boto3 Glue client
            Name=config_data["aws_glue"]["etl_jobs"][0]["name"],
            Role=glue_arn,  # Replace with your IAM role ARN
            JobMode='SCRIPT',
            ExecutionProperty={'MaxConcurrentRuns': 1},
            Command={
                'Name': 'pythonshell',
                'ScriptLocation': config_data["aws_glue"]["S3_URI"],  
                'PythonVersion': '3.9'
            },
            DefaultArguments={
                '--TempDir': str(config_data["s3_bucket"]["bucket"] + config_data["s3_bucket"]["temp"]),
                '--extra-py-files': str(config_data["s3_bucket"]["bucket"] + config_data["s3_bucket"]["dependencies"] + 'pokemon_datalake_and_anltx-0.1.0-cp39-none-any.whl'),
                '--job-language': 'python'
            },
            MaxRetries=0,
            GlueVersion='3.0',
            Description='Job for processing raw Pokémon data.'
        )
        print(f"Job has been created: {response['Name']}")

I am not using the '--additional-python-modules' parameter. I used it before to include boto3 and pandas, but the error still came up and the job failed a couple of seconds after starting. Now I just have a requirements.txt that explicitly contains all AWS Glue version 3.0 Python modules. Can anyone tell me what I am doing wrong and how to resolve this error?


Solution

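  • The error occurs because packages in the AWS Glue 3.0 Python shell environment have specific version requirements, and the versions installed for your job fall outside them (for example, awswrangler 2.15.1 requires numpy<2.0.0 and pandas<2.0.0, but numpy 2.0.2 and pandas 2.2.3 were installed). To fix this, create a requirements.txt file that pins compatible package versions, then create a Python script that uses these packages.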
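    One way to decide which versions to pin is to collect the constraints pip reported, grouped by the package they apply to. The sketch below parses the log lines from the question. The regular expression and the helper name `collect_constraints` are illustrative, and it assumes the log lines keep pip's usual "X requires Y, but you have Z" wording.

```python
import re
from collections import defaultdict

# Conflict lines copied from the CloudWatch log in the question.
LOG = """\
scipy 1.8.0 requires numpy<1.25.0,>=1.17.3, but you have numpy 2.0.2 which is incompatible.
redshift-connector 2.0.907 requires pytz<2022.2,>=2020.1, but you have pytz 2025.2 which is incompatible.
awswrangler 2.15.1 requires numpy<2.0.0,>=1.21.0, but you have numpy 2.0.2 which is incompatible.
awswrangler 2.15.1 requires pandas<2.0.0,>=1.2.0, but you have pandas 2.2.3 which is incompatible.
awscli 1.23.5 requires botocore==1.25.5, but you have botocore 1.38.9 which is incompatible.
awscli 1.23.5 requires s3transfer<0.6.0,>=0.5.0, but you have s3transfer 0.12.0 which is incompatible.
aiobotocore 2.2.0 requires botocore<1.24.22,>=1.24.21, but you have botocore 1.38.9 which is incompatible.
"""

# Matches one pip conflict line and captures who requires what.
CONFLICT_RE = re.compile(
    r"^(?P<requirer>\S+) (?P<requirer_version>\S+) requires "
    r"(?P<package>[A-Za-z0-9_.\-]+)(?P<spec>[^ ]+), "
    r"but you have (?P=package) (?P<installed>\S+) which is incompatible\.\s*$"
)


def collect_constraints(log_text):
    """Group (requirer, version spec, installed version) by constrained package."""
    constraints = defaultdict(list)
    for line in log_text.splitlines():
        match = CONFLICT_RE.match(line.strip())
        if match:
            requirer = f"{match['requirer']} {match['requirer_version']}"
            constraints[match["package"]].append(
                (requirer, match["spec"], match["installed"])
            )
    return dict(constraints)


constraints = collect_constraints(LOG)
for package, entries in constraints.items():
    print(package)
    for requirer, spec, installed in entries:
        print(f"  {requirer} wants {package}{spec} (installed: {installed})")
```

    Any version you pin has to satisfy every entry listed for that package. Note that the two botocore entries in this log (==1.25.5 from awscli and <1.24.22,>=1.24.21 from aiobotocore) cannot both be satisfied by a single version, so pinning alone will not silence every line.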

    1. Create a Glue job with the following configuration

    2. Upload your script to S3

    3. Create the Glue job using the AWS CLI
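    The steps above can be sketched roughly as follows. This version uses boto3, as in the question's glue_controller.py, rather than the AWS CLI. The bucket name, role ARN, job name, helper names, and the pinned versions are all hypothetical: the pins were only chosen to fall inside the ranges shown in the error log and have not been verified against a Glue environment. It also assumes the Python shell runtime honors `--additional-python-modules`, as the question implies.

```python
import json

# Hypothetical pins, chosen to fall inside the ranges from the error log.
PINS = {"numpy": "1.22.4", "pandas": "1.4.2", "pytz": "2022.1"}


def format_module_pins(pins):
    """Render {"numpy": "1.22.4"} as comma-separated "numpy==1.22.4" entries."""
    return ",".join(f"{name}=={version}" for name, version in pins.items())


def build_job_definition(name, role_arn, script_uri, wheel_uri, pins):
    """Build the keyword arguments for the Glue create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",
            "ScriptLocation": script_uri,
            "PythonVersion": "3.9",
        },
        "DefaultArguments": {
            "--extra-py-files": wheel_uri,
            # Exact pins only: commas separate modules in this argument,
            # so range specifiers such as "<2.0.0,>=1.21.0" would be split.
            "--additional-python-modules": format_module_pins(pins),
        },
        "GlueVersion": "3.0",
        "MaxRetries": 0,
    }


def upload_script(local_path, bucket, key):
    """Upload the job script to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    boto3.client("s3").upload_file(local_path, bucket, key)


def create_job(job_definition):
    """Create the Glue job (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("glue").create_job(**job_definition)


# All identifiers below are placeholders, not values from the question.
job = build_job_definition(
    name="example-pythonshell-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_uri="s3://example-bucket/scripts/main.py",
    wheel_uri="s3://example-bucket/dependencies/"
    "pokemon_datalake_and_anltx-0.1.0-cp39-none-any.whl",
    pins=PINS,
)
print(json.dumps(job, indent=2))
```

    To run it against AWS, call `upload_script("main.py", "example-bucket", "scripts/main.py")` and then `create_job(job)` with your own bucket and role. Building the definition as a plain dict first makes it easy to inspect the pinned modules before anything is created.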