I need to count the records per partition in a Spark DataFrame and then write the counts to an XML file.
This is how I write the DataFrame:
dfMainOutputFinalWithoutNull.coalesce(1).write
  .partitionBy("DataPartition", "StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")
Next I have to count the number of records in each file of each partition and write the result to an XML file.
This is how I am trying to do it:
val count = dfMainOutputFinalWithoutNull.groupBy("DataPartition", "StatementTypeCode").count
count.write.format("com.databricks.spark.xml")
  .option("rootTag", "items")
  .option("rowTag", "item")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/Descr")
I can compute and print the total number of records per partition, but when I try to create the XML file I get the error below.
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
I am using Spark 2.2.0, Zeppelin 0.7.2
So do I have to import com.databricks.spark.xml? And if so, why, when for CSV files I do not have to import com.databricks.spark.csv?
Also, can I cache dfMainOutputFinalWithoutNull? I will be using it twice: once to write its data, and once to count the records per partition and write those counts to the XML file.
I then added this dependency:
<!-- https://mvnrepository.com/artifact/com.databricks/spark-xml_2.10 -->
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.10</artifactId>
<version>0.2.0</version>
</dependency>
Then I restarted the interpreter and got the following error:
java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:391)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:380)
at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)
I will answer my own question.
I added the dependency below in Zeppelin:
Scala 2.11
groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1
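The _2.11 suffix matters: as far as I can tell, Spark 2.2.0 is built against Scala 2.11 by default, so the spark-xml_2.10 artifact I tried first was probably what broke the interpreter. Here is a small sketch for checking which suffix a notebook needs. The helper names scalaBinaryVersion and sparkXmlCoordinate are made up for illustration and are not part of Spark or spark-xml.

```scala
// Binary Scala version of the running interpreter, e.g. "2.11" from "2.11.8".
def scalaBinaryVersion(full: String = scala.util.Properties.versionNumberString): String =
  full.split('.').take(2).mkString(".")

// Hypothetical helper: builds the spark-xml Maven coordinate with a matching Scala suffix.
def sparkXmlCoordinate(version: String, scalaBinary: String = scalaBinaryVersion()): String =
  s"com.databricks:spark-xml_$scalaBinary:$version"

// Prints the coordinate that matches the Scala version this paragraph runs on.
println(sparkXmlCoordinate("0.4.1"))
```

If the printed suffix is not the one in the dependency you added, the artifact is likely built for a different Scala version than your Spark.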
I added the line below in Zeppelin:
com.databricks:spark-xml_2.11:0.4.1
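To the "do I have to import it" part: no import statement is needed. The string passed to format(...) is resolved to a class at runtime, so the package only has to be on the interpreter's classpath. CSV works without anything extra because it is a built-in data source in Spark 2.x, while XML is not. Below is a rough sketch for checking whether the package was picked up after the restart. isOnClasspath is a hypothetical helper, and I am assuming that spark-xml exposes its data source as com.databricks.spark.xml.DefaultSource and that the context class loader is the one that sees Zeppelin dependencies.

```scala
import scala.util.Try

// True when the named class can be loaded through the thread's context class loader.
def isOnClasspath(className: String): Boolean =
  Try(Class.forName(className, false, Thread.currentThread().getContextClassLoader)).isSuccess

// Assumption: this is the data source class that format("com.databricks.spark.xml") resolves to.
println(isOnClasspath("com.databricks.spark.xml.DefaultSource"))
```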
After that I was able to create the files.
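On the caching part of the question: caching looks reasonable, since the same DataFrame feeds two separate write actions and would otherwise be recomputed for the second one. Below is a minimal sketch of how I would combine both steps, assuming spark-xml is already on the classpath. writeDataAndCounts, dataPath, and countPath are names I made up for illustration.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: writes the partitioned CSV output, then the per-partition counts as XML.
def writeDataAndCounts(df: DataFrame, dataPath: String, countPath: String): Unit = {
  // cache() is lazy; the first write below materializes the cached data.
  val cached = df.cache()
  try {
    cached.coalesce(1).write
      .partitionBy("DataPartition", "StatementTypeCode")
      .format("csv")
      .option("nullValue", "")
      .option("header", "true")
      .option("codec", "gzip")
      .save(dataPath)

    // The second action reuses the cached rows instead of recomputing the DataFrame.
    cached.groupBy("DataPartition", "StatementTypeCode").count()
      .write.format("com.databricks.spark.xml")
      .option("rootTag", "items")
      .option("rowTag", "item")
      .save(countPath)
  } finally {
    // Release the cached blocks once both writes are done (or one of them failed).
    cached.unpersist()
  }
}
```

Because of coalesce(1), each partition directory should end up with a single file, so the count per (DataPartition, StatementTypeCode) pair is also the count per file.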