scala, dataframe, apache-spark, transformation, toolbox

How to pass multiple columns' transformation rules from an XML file to a DataFrame in Spark?


I have an XML file which contains all the transformations that I need to run over a DataFrame using the withColumn function, like below. How can I apply them to the DataFrame?

I had written code using the Scala ToolBox and a runtime mirror, which internally compiles the code and runs these rules over the DataFrame. This was working perfectly for fewer than 100 columns, but now the requirement has changed and the number of columns has increased from 80 to 210, so this code is failing with a StackOverflowError. This is an open issue for Scala 2.11 (https://github.com/scala/bug/issues/10026).
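For reference, here is a minimal sketch of that ToolBox approach (the helper name applyRules and the generated lambda are illustrative, not the exact original code); the concatenated rule string is compiled at runtime and applied to the DataFrame:

    import org.apache.spark.sql.DataFrame

    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    val toolbox = currentMirror.mkToolBox()

    // codeRule is the concatenated rules, e.g. """.withColumn("col2", lit("ABC"))""";
    // the lambda parameter is named "data" so rules like data("columnA") resolve.
    def applyRules(df: DataFrame, codeRule: String): DataFrame = {
      val code =
        s"""
           |import org.apache.spark.sql.DataFrame
           |import org.apache.spark.sql.functions._
           |(data: DataFrame) => data$codeRule
           |""".stripMargin
      toolbox.eval(toolbox.parse(code)).asInstanceOf[DataFrame => DataFrame](df)
    }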

So I want to use some Spark utility instead of the Scala ToolBox. I have also tried to use foldLeft, but it also gives an error, since I am not able to pass the column functions (like lit or concat, etc.) as the Column type.

XML Rule Files:

    <?xml version="1.0" encoding="utf-8" ?> 
    - <root>
    - <columns>
    - <column name="col1">
    - <![CDATA[ data("columnA")        
      ]]> 
      </column>
    - <column name="col2">
    - <![CDATA[lit("ABC")

      ]]> 
      </column>
    - <column name="col3">
    - <![CDATA[concat(col(columnC),col(columnD))      
      ]]> 
      </column>
      </column>
    - <column name="col4">
    - <![CDATA[         regexp_replace(regexp_replace(regexp_replace(col("ColumnE"), "\\,|\\)", "") , "\\(", "-") , "^(-)$", "0").cast("double")

      ]]> 
      </column>
    - <column name="col5">
    - <![CDATA[                 lit("")

      ]]> 
      </column>
.
.
.
.
. 
     </columns>
      </root>

The operations that I need to apply are:

df.withColumn("col1",data("columnA")).withColumn("col2",lit("ABC")).withColumn("col3",concat(col(columnC), col(columnD))).withColumn("col4",regexp_replace(regexp_replace(regexp_replace(col("ColumnE"), "\\,|\\)", "") , "\\(", "-") , "^(-)$", "0").cast("double"))withColumn("col5",lit("")).........

Versions that I am using:

Scala 2.11.12

Spark 2.4.3


Solution

  • I tried to get a solution from Spark but didn't find any other than Spark SQL. My rules are very complex, and using Spark SQL would make them even more complex, so I stuck with the same ToolBox logic (as the existing system works) and resolved the 100-column issue as below (a code sketch follows the steps):

    1. First I read all the rules from the XML and generate CodeRule by concatenating them with .withColumn (same as the existing code).
    2. Then I count the occurrences of the string ".withColumn"; if there are more than 100 (say 133 columns), I simply divide CodeRule into two parts (FirstRule covering columns 1 to 99, and LastRule covering columns 100 to 133) and apply these rules in two steps, as below.
    3. First I apply FirstRule on the input DataFrame and get an intermediate DataFrame.
    4. Then I pass the resulting DataFrame through LastRule and get the final DataFrame.
    5. It is working perfectly for my case.
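A minimal sketch of this chunked approach, assuming the rules are read with scala.xml (the helper names readRules, applyChunk, and applyAllRules are illustrative, and grouped(99) generalizes the two-part split to any number of chunks):

    import org.apache.spark.sql.DataFrame

    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    val toolbox = currentMirror.mkToolBox()

    // Step 1: read the rules from the XML file as (name, expression) pairs.
    def readRules(path: String): Seq[(String, String)] = {
      val xml = scala.xml.XML.loadFile(path)
      (xml \\ "column").map(c => ((c \ "@name").text, c.text.trim))
    }

    // Compile one chain of .withColumn calls with the ToolBox and apply it.
    def applyChunk(df: DataFrame, rules: Seq[(String, String)]): DataFrame = {
      val codeRule = rules
        .map { case (name, expr) => s""".withColumn("$name", $expr)""" }
        .mkString
      val code =
        s"""
           |import org.apache.spark.sql.DataFrame
           |import org.apache.spark.sql.functions._
           |(data: DataFrame) => data$codeRule
           |""".stripMargin
      toolbox.eval(toolbox.parse(code)).asInstanceOf[DataFrame => DataFrame](df)
    }

    // Steps 2-4: keep each compiled chunk below the ~100-withColumn limit
    // and thread the DataFrame through the chunks.
    def applyAllRules(df: DataFrame, path: String): DataFrame =
      readRules(path).grouped(99).foldLeft(df)(applyChunk)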