Suppose I have following xml data:
<students>
<studentId>110</studentId>
<info>
<rollNo>2</rollNo>
<address>
<permanent>abc</permanent>
<temporary>def</temporary>
</address>
</info>
<subjects>
<subject>
<name>maths</name>
<credit>3</credit>
</subject>
<subject>
<name>science</name>
<credit>2</credit>
</subject>
</subjects>
</students>
And its schema is:
root
|-- info: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- permanent: string (nullable = true)
| | |-- temporary: string (nullable = true)
| |-- rollNo: long (nullable = true)
|-- studentId: long (nullable = true)
|-- subjects: struct (nullable = true)
| |-- subject: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- credit: long (nullable = true)
| | | |-- name: string (nullable = true)
as root tag being "students"
.
Here, I wanted to update the value of certain columns.
I wanted to update the value of "studentId"
column using UDF
. And I found a way:
df = df.withColumn("studentId", updateValue(col("studentId")))
Then, I wanted to update a nested column i.e. "info.rollNo"
.
Applying above process gave me another new column as "<info.rollNo>updated_value</info.rollNo>
". After searching for a while, I found a way:
val colStruct = df.select(col("info" + ".*")).columns
.filter(_ != "rollNo")
.map(f => col("info" + "." + f))
df = df.withColumn("info",
struct(
(colStruct :+ updateValue(col("info.rollNo")
).as("rollNo")): _*)
)
For the third nested columns, I tried the above mentioned way. But I could not figure out the process.
Here, the question is, could someone explain me the algorithm of updating the nested columns value whose nest level could be 3,4,5 and so on.
For Instance: I want to update following fields.
"info.address.permanent"
which is struct
and
"subjects.subject.credit"
which is the element of array "subject"
PS: If you know any other ways to update certain columns, then please mention it.
I got the answer. Regarding updating the column x of nested data with n1,n2,...,nn nests and c columns in each nest:
i.e. let us update the column => "n1.n2.n3...nn.x"
df = df.withColumn("n1",
struct(
1st nest's columns n1.c except the struct which holds column x,
//like col("n1.col1"), col("n2.col2"), ...,
struct(
2nd nest's columns n2.c except the struct which holds column x,
....
....
....
struct(
nth nest's nn.c columns except column x,
udfApplied(col("n1.n2...nn.x")).as("x")
).as("nn")
).as("n2")
))
val udfApplied = udf((value: String) => {
value + " updated" //update the value here
})
Example of "info.address.permanent":
df = df.withColumn("info",
struct(
col("info.rollNo"),
struct(
col("info.address.temporary"),
udfApplied(col("info.address.permanent")).as("permanent")
).as("address")
))
Example of "subjects.subject.credit": (For array types, everything is same but we need to create struct for each index of element in array)
df = df.withColumn("subjects",
struct(
array(
struct(
col("subjects.subject.name")(0).as("name"),
udfApplied(col("subjects.subject.credit")(0)).as("credit")
).as("subject"),
struct(
col("subjects.subject.name")(1).as("name"),
udfApplied(col("subjects.subject.credit")(1)).as("credit")
).as("subject")
).as("subject")
))
Hope this helps you all