amazon-web-services, aws-glue

Missing Columns in AWS Glue


I'm trying to upskill with AWS Glue, and implement a simple test pipeline that hopefully will later become a much bigger set of pipelines to handle our ETL. But I'm seeing some behaviour that I don't understand and I'm hoping someone here can help me get to the bottom of it.

I have a number of parquet files which have different columns. Let's say there are three different groups of columns a parquet might have, with the possible columns across them being id, name, sex and gender (some of our real data is subject to the provider adding/removing/renaming columns on a whim).

For argument's sake, let's say I want to 'merge' the gender and sex columns into gender_sex, so that my output format is id, name, gender_sex regardless of the input format.

I have a Glue table set up over the parquet files, and it correctly detects there are 4 possible columns (id, name, sex and gender). However, when I run a Glue Job it only seems to recognise the id, name and gender columns. The sex column is always null (or not present).

I've tried running a simpler job (code below) that just reads from the catalogue and outputs to JSON, and the issue seems to lie in reading from the catalogue (not in my transformations/merging), as this simple test never writes a sex column (only ever id, name and gender).

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Data Catalog table
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="jb_ingest",
    table_name="jb_test_glue_dev",
    transformation_ctx="DataCatalogtable_node1",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=DataCatalogtable_node1,
    connection_type="s3",
    format="json",
    connection_options={
        "path": "s3://jb-output-20230426103815496800000001/foo/",
        "partitionKeys": [],
    },
    transformation_ctx="S3bucket_node3",
)

job.commit()
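
For completeness, printing the DynamicFrame's schema right after the read is a quick way to check what Glue actually resolves, rather than what ends up in the JSON output:

# Quick check: what schema did Glue resolve for this DynamicFrame?
DataCatalogtable_node1.printSchema()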

I'm new to AWS Glue - what am I doing wrong? How do I get it to read all the present columns and allow me to normalise the data?


Solution


    To fix this issue when working with Parquet files in AWS Glue, set the following configuration:

    # If you're working with a Glue Job, apply this when creating the GlueContext
    
    glueContext.sql("set spark.sql.parquet.mergeSchema=true")
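
    For example, in the job skeleton from the question that means setting it up front, before any DynamicFrame is created. A minimal sketch (the SparkConf route is simply an alternative way to set the same property; everything else is as in the question):

    from pyspark.conf import SparkConf
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Set the property before anything reads Parquet
    conf = SparkConf().set("spark.sql.parquet.mergeSchema", "true")
    sc = SparkContext(conf=conf)
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    # Or, equivalently, immediately after creating the GlueContext:
    # glueContext.sql("set spark.sql.parquet.mergeSchema=true")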
    

    Explanation

    I faced the same issue just yesterday and spent quite some time fixing it. In my case, the columns were not completely empty; they had values in around 100 of 70,000 records, yet Glue still didn't recognize them.

    Setting spark.sql.parquet.mergeSchema=true allows Spark to merge schemas from different Parquet files, even if some columns are missing in some of them. This is essential when your dataset has sparse columns that are not consistently present across all files.
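
    As a rough illustration (plain Spark, with a made-up path): without mergeSchema, Spark infers the schema from a subset of the Parquet files, so columns that only exist in the other files are silently dropped; with it, you get the union of all file schemas.

    # Illustration only - "s3://your-bucket/input/" is a placeholder path
    df_default = spark.read.parquet("s3://your-bucket/input/")   # may come back without 'sex'
    df_merged = (
        spark.read
        .option("mergeSchema", "true")   # merge the schemas of all files
        .parquet("s3://your-bucket/input/")
    )
    df_merged.printSchema()              # should now list id, name, gender and sex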

    Important Note

    In an AWS Glue Job, this configuration must be applied when creating the GlueContext, because the GlueContext is not dynamic and cannot be altered later in the job's execution. However, if you're working with a Glue Notebook, you can change this setting at any point during the session.
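
    For example, in a notebook session something along these lines should be enough, run at any point before (re)reading the data:

    # In an interactive session the property can be set on the existing Spark session
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")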