I have an XML file which contains all the transformations that I need to run over a DataFrame using the withColumn function, like below. How can I apply them over the DataFrame?
I had written code using the Scala ToolBox and runtimeMirror, which internally compiles the code and runs these rules over the DataFrame. This was working perfectly for fewer than 100 columns. But now the requirement has changed and the number of columns has increased from 80 to 210, so this code is failing with a StackOverflowError, which is an open issue for Scala 2.11 (https://github.com/scala/bug/issues/10026).
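For context, this is roughly what the ToolBox-based approach looks like (a minimal sketch, assuming the rules have already been read into (columnName, expressionString) pairs; applyRulesWithToolBox and the code template are illustrative names, not my production code):

import org.apache.spark.sql.DataFrame
import scala.reflect.runtime.universe.runtimeMirror
import scala.tools.reflect.ToolBox

// Build one big ".withColumn(...)" chain as a string and compile it at runtime.
// Each rule expression must only use identifiers visible to the generated code
// (e.g. functions from org.apache.spark.sql.functions).
def applyRulesWithToolBox(df: DataFrame, rules: Seq[(String, String)]): DataFrame = {
  val tb = runtimeMirror(getClass.getClassLoader).mkToolBox()
  val chain = rules.map { case (name, expr) => s""".withColumn("$name", $expr)""" }.mkString
  val code =
    s"""
       |import org.apache.spark.sql.DataFrame
       |import org.apache.spark.sql.functions._
       |(df: DataFrame) => df$chain
       |""".stripMargin
  val transform = tb.eval(tb.parse(code)).asInstanceOf[DataFrame => DataFrame]
  transform(df)
}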
So I want to use some Spark utility instead of the Scala ToolBox. I have also tried to use foldLeft, but it also gives an error, since I am not able to pass the column functions (like lit, concat, etc.) as a Column type.
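The shape of the foldLeft attempt is roughly as below (a sketch): it needs every rule already materialised as a Column, which is exactly what I cannot produce from the XML strings without compiling them.

import org.apache.spark.sql.{Column, DataFrame}

// foldLeft itself is straightforward, but it requires each rule as a Column value,
// while the XML only gives me plain strings like """concat(col("columnC"), col("columnD"))""".
def applyColumnRules(df: DataFrame, rules: Seq[(String, Column)]): DataFrame =
  rules.foldLeft(df) { case (acc, (name, column)) => acc.withColumn(name, column) }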
XML Rule Files:
<?xml version="1.0" encoding="utf-8" ?>
<root>
  <columns>
    <column name="col1">
      <![CDATA[ data("columnA") ]]>
    </column>
    <column name="col2">
      <![CDATA[ lit("ABC") ]]>
    </column>
    <column name="col3">
      <![CDATA[ concat(col("columnC"), col("columnD")) ]]>
    </column>
    <column name="col4">
      <![CDATA[ regexp_replace(regexp_replace(regexp_replace(col("ColumnE"), "\\,|\\)", ""), "\\(", "-"), "^(-)$", "0").cast("double") ]]>
    </column>
    <column name="col5">
      <![CDATA[ lit("") ]]>
    </column>
    ...
  </columns>
</root>
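Roughly how I read the rules from this file into (columnName, expressionString) pairs (a sketch assuming the layout above; the real element and attribute names may differ):

import scala.xml.XML

// Load the rules file and collect (column name, expression string) pairs.
val rulesXml = XML.loadFile("rules.xml")
val rules: Seq[(String, String)] = (rulesXml \ "columns" \ "column").map { node =>
  ((node \ "@name").text, node.text.trim)
}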
Operations that I need to apply, as:
df.withColumn("col1", data("columnA"))
  .withColumn("col2", lit("ABC"))
  .withColumn("col3", concat(col("columnC"), col("columnD")))
  .withColumn("col4", regexp_replace(regexp_replace(regexp_replace(col("ColumnE"), "\\,|\\)", ""), "\\(", "-"), "^(-)$", "0").cast("double"))
  .withColumn("col5", lit(""))
  ...
Versions that I am using:
Scala 2.11.12
Spark 2.4.3
I have tried to find a solution within Spark itself but did not get anything other than Spark SQL. My rules are very complex, though, and using Spark SQL would make them even more complex, so I stuck with the same ToolBox logic (as the existing system works) and resolved the 100-column issue as below:
I count the number of ".withColumn" calls in the generated CodeRule; if it is greater than 100 (say 133 columns), then I simply divide the CodeRule into two parts (one from 1 to 99, FirstRule, and a second rule from 100 to 133, LastRule) and apply these rules in two steps: first apply FirstRule on the input DataFrame and get an intermediate DataFrame, then apply LastRule on that DataFrame and get the final DataFrame, as sketched below.
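A sketch of that workaround, reusing the illustrative rules, applyRulesWithToolBox and inputDf names from the sketches above (all of them my own naming, not the actual code):

// Split the rules and compile/apply them in two passes to stay under the ToolBox limit.
val finalDf: DataFrame =
  if (rules.size > 100) {
    val (firstRules, lastRules) = rules.splitAt(99)                 // e.g. rules 1..99 and 100..133
    val intermediateDf = applyRulesWithToolBox(inputDf, firstRules) // FirstRule on the input DataFrame
    applyRulesWithToolBox(intermediateDf, lastRules)                // LastRule on the intermediate DataFrame
  } else {
    applyRulesWithToolBox(inputDf, rules)
  }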