amazon-web-services scala apache-spark aws-glue

Why does getField() return None in AWS Glue (Scala) for an array of objects, even though the field is confirmed to exist?


When I use getField() I can retrieve simple values, but when I use it on an array of objects it returns None. Below is a simplified version of the Scala code I'm having problems with. I'm running a Glue job and, as part of it, doing some mapping on a DynamicFrame. I'm aware I could switch to a DataFrame to do this, but I'm curious why the DynamicFrame doesn't work as expected.

// some code to get the source
//
// example object
// {
//    "name" : "Mike",
//    "age": 43,
//    "kids" : [
//       {"age" : 10, "name" : "Jack"},
//       {"age" : 13, "name" : "Jill"}
//    ] 
// }
  
  val exampleFrame = dataSource.getDynamicFrame()

  val mappedExampleFrame = exampleFrame.map { record =>
    println("does kids exist = " + record.schema.containsField("kids")) //this returns true

    val name = record.getField("name").getOrElse("NA").toString //returns Mike
    val age = record.getField("age").getOrElse(0).asInstanceOf[Int] //returns 43
    val kids = record.getField("kids").getOrElse(Seq()).asInstanceOf[List[Map[String,Any]]] // returns an empty Sequence

    // do some mapping  
  } 


// some sink code

I have also done some other debugging and confirmed that, when I drop the getOrElse default, the result really is None (getField() returns an Option). This doesn't make sense to me, given that the field is in the schema and I have confirmed that the record definitely has a value there. As I said, I'm aware this can be bypassed by using a DataFrame (and in practice I have done so), but I'd still like to know why the DynamicFrame doesn't work as expected.

Additionally, I've tried using a case class:

val kids = record.getField("kids").getOrElse(Seq()).asInstanceOf[List[Kids]] // returns an empty Sequence

and pulling it out as its own record:

val kids = record.getField("kids").getOrElse(Seq()).asInstanceOf[List[DynamicRecord]] // returns an empty Sequence

Also, this is Glue v4 with Spark 3.3 and Scala 2.12.18, if that makes a difference.


Solution

  • The behavior you're seeing with getField() on AWS Glue DynamicFrames comes from the way DynamicFrames handle nested structures such as arrays of objects. DynamicFrames offer flexibility and schema inference, but they have limitations and quirks when dealing with complex data structures.

    Common Issue: Nested Array of Objects

    getField() returning None for an array of objects could be due to how the array is being accessed or how the structure is being interpreted by the DynamicFrame.
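
Before changing the cast, it can help to see what getField() actually hands back. The question's getOrElse(Seq()) hides the difference between "no value" and "a value of an unexpected type", because both end up as an empty sequence. The sketch below is plain Scala with no Glue dependency; `FieldDiagnostics` and `describe` are hypothetical names introduced only for illustration, and the only assumption about Glue is the one stated in the question, namely that getField() returns an Option.

```scala
// Hypothetical helper: turns the Option[Any] returned by getField into a
// readable description, so None and "unexpected type" can be told apart.
object FieldDiagnostics {
  def describe(fieldName: String, value: Option[Any]): String = value match {
    case None       => s"$fieldName: None (no value returned for this name)"
    case Some(null) => s"$fieldName: present but null"
    case Some(v)    => s"$fieldName: present, runtime type ${v.getClass.getName}"
  }
}
```

Inside the map closure this could be logged as FieldDiagnostics.describe("kids", record.getField("kids")). If the output names a runtime type, the field is present and only the cast needs adjusting; if it reports None, the problem lies in how the field is looked up rather than in the cast.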

    Here is one way to handle your use case using DynamicFrames in Scala:

    import com.amazonaws.services.glue.{DynamicFrame, DynamicRecord, GlueContext}
    import com.amazonaws.services.glue.log.GlueLogger
    import com.amazonaws.services.glue.util.JsonOptions
    import org.apache.spark.SparkContext
    
    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        // Create a single SparkContext and hand it to Glue. Building a SparkSession
        // first and then calling `new SparkContext()` would try to start a second context.
        val sparkContext: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(sparkContext)
    
        // Replace the database and table names with your own Data Catalog entries
        val exampleFrame: DynamicFrame = glueContext.getCatalogSource(database = "your_database", tableName = "your_table").getDynamicFrame()
    
        val mappedExampleFrame = exampleFrame.map { record =>
          val logger = new GlueLogger()
    
          logger.info("does kids exist = " + record.schema.containsField("kids")) // Should return true
    
          val name = record.getField("name").getOrElse("NA").toString // Should return "Mike"
          val age = record.getField("age").getOrElse(0).asInstanceOf[Int] // Should return 43
    
          // To handle the "kids" field, extract it as an array of DynamicRecords
          val kidsOption = record.getField("kids").map(_.asInstanceOf[List[DynamicRecord]])
    
          kidsOption match {
            case Some(kids) => 
              kids.foreach { kid =>
                val kidName = kid.getField("name").getOrElse("Unknown").toString
                val kidAge = kid.getField("age").getOrElse(0).asInstanceOf[Int]
                logger.info(s"Child Name: $kidName, Child Age: $kidAge")
              }
            case None => 
              logger.info("No kids found")
          }
    
          // Return the record as-is, or apply other transformations here
          record
        }
    
        // Assuming you have a sink to write the data back
        glueContext.getSinkWithFormat(
          connectionType = "s3",
          options = JsonOptions(Map("path" -> "s3://your-output-path")),
          format = "json"
        ).writeDynamicFrame(mappedExampleFrame)
      }
    }
    

    Instead of casting directly to List[Map[String, Any]] or another structure, first extract the field with getField("kids") and then map the result to List[DynamicRecord]. This matters because the kids field in your JSON is an array of objects, which are represented as DynamicRecord objects within the DynamicFrame.
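
One caveat about any asInstanceOf[List[...]] cast: because of JVM type erasure, only the outer List is checked at runtime, so the cast succeeds whatever the elements really are, and a wrong element type only surfaces later as a ClassCastException. A minimal sketch of a safer narrowing with pattern matching follows. It is plain Scala, and `Kid`, `SafeCast`, `kidsOf` and `uncheckedSize` are hypothetical names used for illustration; in a Glue job the element pattern would be whatever runtime type the job actually receives.

```scala
// Hypothetical element type standing in for the objects inside "kids".
final case class Kid(name: String, age: Int)

object SafeCast {
  // Narrow an Option[Any] to List[Kid] without asInstanceOf.
  // Elements of any other runtime type are dropped instead of failing later.
  def kidsOf(value: Option[Any]): List[Kid] = value match {
    case Some(items: Seq[_]) => items.collect { case k: Kid => k }.toList
    case _                   => Nil
  }

  // Erasure in action: this cast succeeds even if no element is a Kid,
  // because only the outer List type is checked at runtime.
  def uncheckedSize(raw: Any): Int = raw.asInstanceOf[List[Kid]].size
}
```

With this shape, a value that is present but holds unexpected element types yields an empty list from kidsOf, while uncheckedSize shows that the unchecked cast on its own reports nothing wrong.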