I am building a data lake project with ETL and analytics, using AWS S3 for storage, AWS Glue to create and run the job, and a Glue Crawler plus Athena to create the tables. The job fails with the following errors, as reported in the error logs in CloudWatch:
scipy 1.8.0 requires numpy<1.25.0,>=1.17.3, but you have numpy 2.0.2 which is incompatible.
redshift-connector 2.0.907 requires pytz<2022.2,>=2020.1, but you have pytz 2025.2 which is incompatible.
awswrangler 2.15.1 requires numpy<2.0.0,>=1.21.0, but you have numpy 2.0.2 which is incompatible.
awswrangler 2.15.1 requires pandas<2.0.0,>=1.2.0, but you have pandas 2.2.3 which is incompatible.
awscli 1.23.5 requires botocore==1.25.5, but you have botocore 1.38.9 which is incompatible.
awscli 1.23.5 requires s3transfer<0.6.0,>=0.5.0, but you have s3transfer 0.12.0 which is incompatible.
aiobotocore 2.2.0 requires botocore<1.24.22,>=1.24.21, but you have botocore 1.38.9 which is incompatible.
Note that this is a pythonshell job, so the whole thing runs on plain Python (no Spark). My project's packages, sample raw data, and the config file are packaged in a .whl under a dependencies/ directory in my S3 bucket. The main.py that my glue_controller.py tells Glue to run is in the scripts/ directory in S3. This is how glue_controller.py creates the job:
import boto3

glue_client = boto3.client("glue")

response = glue_client.create_job(
    Name=config_data["aws_glue"]["etl_jobs"][0]["name"],
    Role=glue_arn,  # IAM role ARN for the job
    JobMode='SCRIPT',
    ExecutionProperty={'MaxConcurrentRuns': 1},
    Command={
        'Name': 'pythonshell',
        'ScriptLocation': config_data["aws_glue"]["S3_URI"],
        'PythonVersion': '3.9'
    },
    DefaultArguments={
        '--TempDir': config_data["s3_bucket"]["bucket"] + config_data["s3_bucket"]["temp"],
        '--extra-py-files': config_data["s3_bucket"]["bucket"] + config_data["s3_bucket"]["dependencies"] + 'pokemon_datalake_and_anltx-0.1.0-cp39-none-any.whl',
        '--job-language': 'python'
    },
    MaxRetries=0,
    GlueVersion='3.0',
    Description='Job for processing raw Pokémon data.'
)
print(f"Job has been created: {response['Name']}")
I am not currently using the '--additional-python-modules' parameter; I did try it earlier to include boto3 and pandas, but the same error came up and the job failed a couple of seconds after starting. Now I just have a requirements.txt that explicitly lists all of the Python modules that ship with AWS Glue version 3.0. Can anyone tell me what I am doing wrong and how to resolve this error?
The error occurs because AWS Glue 3.0 pythonshell jobs come with a fixed set of preinstalled packages (scipy 1.8.0, awswrangler 2.15.1, redshift-connector 2.0.907, awscli 1.23.5, aiobotocore 2.2.0 in your log), and your wheel's dependency resolution is pulling in much newer versions (numpy 2.0.2, pandas 2.2.3, pytz 2025.2, botocore 1.38.9) that violate those packages' declared version constraints. To fix this:

1. Pin compatible package versions, either in a requirements.txt used to build your wheel or directly via '--additional-python-modules', so pip does not pull in numpy 2.x or pandas 2.x.
2. Make sure your script only imports the pinned packages.
3. Upload the script and wheel to S3.
4. Create the Glue job (via boto3, as you already do, or the AWS CLI) with the pinned versions in its DefaultArguments.
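As a concrete sketch of steps 1 and 4 (the bucket name and the exact version pins here are assumptions chosen to satisfy the ranges in your error log, not verified Glue 3.0 defaults), the pins can be assembled and passed to the job through DefaultArguments:

```python
# Pin the job's libraries to versions that satisfy the constraints in the log:
#   scipy 1.8.0                needs numpy<1.25.0,>=1.17.3
#   awswrangler 2.15.1         needs numpy<2.0.0,>=1.21.0 and pandas<2.0.0,>=1.2.0
#   redshift-connector 2.0.907 needs pytz<2022.2,>=2020.1
PINNED_MODULES = ",".join([
    "numpy==1.24.4",   # <1.25.0 satisfies both scipy and awswrangler
    "pandas==1.5.3",   # <2.0.0 satisfies awswrangler
    "pytz==2022.1",    # <2022.2 satisfies redshift-connector
])

def build_default_arguments(bucket: str, temp_prefix: str, wheel_uri: str) -> dict:
    """Assemble DefaultArguments for a pythonshell job with pinned modules."""
    return {
        "--TempDir": bucket + temp_prefix,
        "--extra-py-files": wheel_uri,
        "--additional-python-modules": PINNED_MODULES,
        "--job-language": "python",
    }

# Hypothetical bucket/paths for illustration:
args = build_default_arguments(
    "s3://my-datalake-bucket/",
    "temp/",
    "s3://my-datalake-bucket/dependencies/pokemon_datalake_and_anltx-0.1.0-cp39-none-any.whl",
)
print(args["--additional-python-modules"])  # → numpy==1.24.4,pandas==1.5.3,pytz==2022.1
```

With this, Glue's pip install resolves against your pins before the job starts. Note that if the wheel itself declares looser requirements (e.g. a bare "numpy"), you should re-pin them in the wheel's setup metadata as well, or pip may still try to upgrade past the pinned versions.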