scalaapache-sparkapache-spark-sqlapache-spark-xml

Select Fields that start with a certain pattern: Spark XML Parsing


I am having to parse some very large xml files. There are a few fields within those xml files that I want to extract and then perform some work on them. However, there are some rules that I need to follow, i.e. I can only select fields if they follow a certain pattern.

Here is an example of what I'm trying to achieve:

// Some made up data
val schema = new StructType()
      .add("devices", 
        new StructType()
          .add("thermostats", MapType(StringType,
            new StructType()
              .add("device_id", StringType)
              .add("locale", StringType)
              .add("E2EDK1000", StringType)
              .add("E2EDK2000", StringType)
              .add("E2EDK3000", StringType))))

val nestDataDS2 = Seq("""{
    "devices": {
      "thermostats": {
          "peyiJNo0IldT2YlIVtYaGQ": {
            "device_id": "peyiJNo0IldT2YlIVtYaGQ",
            "locale": "en-US",
            "E2EDK1000": "4.0",
            "E2EDK2000": "VqFabWH21nwVyd4RWgJgNb292wa7hG_dUwo2i2SG7j3-BOLY0BA4sw",
            "E2EDK3000": "Hallway Upstairs"}}}}""").toDS
val nestDF2 = spark
            .read
            .schema(nestSchema2)
            .json(nestDataDS2.rdd)

root
 |-- devices: struct (nullable = true)
 |    |-- thermostats: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: struct (valueContainsNull = true)
 |    |    |    |-- device_id: string (nullable = true)
 |    |    |    |-- locale: string (nullable = true)
 |    |    |    |-- E2EDK1000: string (nullable = true)
 |    |    |    |-- E2EDK2000: string (nullable = true)
 |    |    |    |-- E2EDK3000: string (nullable = true)

Given this I want to get to value field so I do the following

val tmp = nestDF2.select($"devices.thermostats.value")
root
 |-- value: struct (nullable = true)
 |    |-- device_id: string (nullable = true)
 |    |-- locale: string (nullable = true)
 |    |-- E2EDK1000: string (nullable = true)
 |    |-- E2EDK2000: string (nullable = true)
 |    |-- E2EDK3000: string (nullable = true)

Here is my issue: I want to select all the fields within value that start with the following pattern E2EDK1. However, I'm stuck on how to do that. Here is the end result that I want:

root
 |-- E2EDK1000: string (nullable = true)

I know I can select that field outright, but in the data I'm working with it is not always the case that E2EDK1000 will always be there. What always will be there is E2EDK1.

I've tried using startsWith() but that does not seem to work, e.g.

val tmp2 = tmp
  .select($"value".getItem(_.startsWith("E2EDK1")))

Solution

  • You can use .* to select all the elements of value column into separate columns, filter in all the element names that start with E2EDK1 and finally select only those columns only as following

    //flattens the struct value column to separate columns
    val tmp = nestDF2.select($"devices.thermostats.value.*")
    //filter in the column names that starts with E2EDK1
    val e2edk1Columns = tmp.columns.filter(_.startsWith("E2EDK1"))
    //select only the columns that starts with E2EDK1
    tmp.select(e2edk1Columns.map(col):_*)
    

    which should give you all the elements of value struct column that starts with E2EDK1 as separate columns. For your given example you should have output as

    +---------+
    |E2EDK1000|
    +---------+
    |null     |
    +---------+
    
    root
     |-- E2EDK1000: string (nullable = true)
    

    You can combine them back to struct if you desire