Tags: python, python-3.x, amazon-web-services, pyspark, aws-glue

Unable to read text file in Glue job


I am trying to read a schema from a text file that sits in the same package as my code, but the AWS Glue job cannot open it. I will use that schema to create a DataFrame in PySpark. The file loads fine locally. For Glue, I zip the code files into a .zip archive, upload it to an S3 bucket, and reference it in the job. Everything else works, but the code below fails.

file_path = os.path.join(Path(os.path.dirname(os.path.relpath(__file__))), "verifications.txt")
multiline_data = None
with open(file_path, 'r') as data_file:
    multiline_data = data_file.read()
self.logger.info(f"Schema is {multiline_data}")
           

This code throws the below error:

Error Category: UNCLASSIFIED_ERROR; NotADirectoryError: [Errno 20] Not a directory: 'src.zip/src/ingestion/jobs/verifications.txt'  

I also tried an absolute path, but that did not help either. The same block of code works fine locally.

Passing "./verifications.txt" directly did not work either.

So how do I read this file?


Solution

  • As @Bogdan mentioned, the way to do this is to store verifications.txt in S3 and read it from there. Here is some example code using boto3:

    import boto3
    
    # Hardcoded S3 bucket/key (these are normally passed in as Glue Job params)
    s3_bucket = 'your-bucket-name'
    s3_key = 'path/to/verifications.txt'
    
    # Read data from S3 using boto3
    s3_client = boto3.client('s3')
    response = s3_client.get_object(Bucket=s3_bucket, Key=s3_key)
    multiline_data = response['Body'].read().decode('utf-8')
    

    If you want to read the file directly from inside the zip (given your comment), it takes a bit more work: download the archive from S3 and open it in memory.

    import boto3
    import zipfile
    import io
    
    # Initialize boto3 client for S3
    s3 = boto3.client('s3')
    
    # Define the bucket name and the zip file key
    bucket_name = 'your-bucket-name'
    zip_file_key = 'path/to/src.zip'
    
    # Download the zip file from S3
    zip_obj = s3.get_object(Bucket=bucket_name, Key=zip_file_key)
    buffer = io.BytesIO(zip_obj['Body'].read())
    
    # Open the zip file in memory
    with zipfile.ZipFile(buffer, 'r') as zip_ref:
        # List all files in the zip
        print("Files in the zip:", zip_ref.namelist())
    
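        # Open and read a specific file within the zip without extracting.
        # Member names include the full path inside the archive; the path below
        # is taken from the error message, so check it against namelist() above.
        with zip_ref.open('src/ingestion/jobs/verifications.txt') as file: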
            text_content = file.read().decode('utf-8')
            print("Contents of the text file:", text_content)
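
The NotADirectoryError in the question arises because open() treats src.zip as a directory, while it is really a single archive file. As a rough sketch of a third option, assuming `__file__` really points inside the archive as the error message suggests, the helper below walks up the path until it finds a zip file and reads the member from it, falling back to a normal read on disk. The function name `read_text_next_to_module` is hypothetical and not part of any Glue or boto3 API.

```python
import zipfile
from pathlib import Path


def read_text_next_to_module(module_file: str, filename: str) -> str:
    """Read `filename` from the directory that holds `module_file`,
    whether that directory is on disk or inside a .zip archive."""
    target = Path(module_file).parent / filename

    # Walk up from the target; if some path component is a real zip file,
    # the rest of the path is the member name inside that archive.
    for parent in target.parents:
        if parent.is_file() and zipfile.is_zipfile(parent):
            member = target.relative_to(parent).as_posix()
            with zipfile.ZipFile(parent) as archive:
                return archive.read(member).decode("utf-8")

    # No archive on the path (e.g. when running locally): plain file read.
    return target.read_text(encoding="utf-8")
```

Inside the job module this would be called as `read_text_next_to_module(__file__, "verifications.txt")`, which also keeps the local run working unchanged. If the directory is a regular package with an `__init__.py`, the standard library's `pkgutil.get_data` may be a shorter route to the same result.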