apache-spark, apache-spark-sql, spark-streaming, hadoop2

Out of memory issue when comparing two large datasets using Spark Scala


I import 10 million records daily from MySQL to Hive using a Spark Scala program, and then compare yesterday's dataset with today's.

// Load yesterday's and today's product snapshots from Hive
val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts")
val todayDf = sqlContext.sql("select * from t_todayProducts")
// Rows present today but not yesterday
val diffDf = todayDf.except(yesterdayDf)

I am using a 3-node cluster, and the program works fine for up to 4 million records. Beyond 4 million records we run into an out-of-memory issue because the available RAM is not sufficient.

I would like to know the best way to compare two large datasets.


Solution

  • Have you tried finding out how many partitions you have? yesterdayDf.rdd.partitions.size will give you that information for the yesterdayDf DataFrame, and you can do the same for the other DataFrames too.

    You can also use yesterdayDf.repartition(1000) // (a large number) to see if the OOM problem goes away; a sketch follows below.
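
    A minimal sketch of that idea, assuming the same sqlContext and Hive table names as in the question; the partition count of 1000 is only an illustrative starting point to tune for your cluster:

    // Load both snapshots as in the question
    val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts")
    val todayDf = sqlContext.sql("select * from t_todayProducts")

    // Check how many partitions each DataFrame currently has
    println(s"yesterday partitions: ${yesterdayDf.rdd.partitions.size}")
    println(s"today partitions: ${todayDf.rdd.partitions.size}")

    // Repartition into more, smaller partitions so each task's data fits in memory,
    // then compute the difference as before
    val diffDf = todayDf.repartition(1000).except(yesterdayDf.repartition(1000))

    If the individual partitions are too large, each executor task has to hold too much data at once; splitting the DataFrames into more partitions reduces the per-task memory footprint of the shuffle that except performs.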