I am importing 10 million records daily from MySQL to Hive using a Spark Scala program, and comparing yesterday's and today's datasets.
val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts")
val todayDf = sqlContext.sql("select * from t_todayProducts")
// rows present in today's data but not in yesterday's
val diffDf = todayDf.except(yesterdayDf)
I am using a 3-node cluster, and the program works fine for up to 4 million records. Beyond 4 million we face an out-of-memory issue, as RAM is not sufficient.
I would like to know the best way to compare two such large datasets.
Have you tried finding out how many partitions you have?
yesterdayDf.rdd.partitions.size
will give you that information for the yesterdayDf DataFrame, and you can do the same for the other DataFrames too.
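For instance, a quick diagnostic that prints the current partition count of each side (the DataFrame names are taken from your code):

// Print how many partitions each DataFrame currently has
println(s"yesterdayDf: ${yesterdayDf.rdd.partitions.size} partitions")
println(s"todayDf: ${todayDf.rdd.partitions.size} partitions")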
You can also use

yesterdayDf.repartition(1000) // (a large number)

to see if the OOM problem goes away. Note that repartition returns a new DataFrame rather than modifying the original, so make sure you use the returned value in the except.
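Putting it together, here is a minimal sketch of the whole comparison with an explicit repartition on both sides before the except (the table names come from your question; 1000 is only a starting point to tune against your cluster and data volume):

// Repartition both sides so the except's shuffle is spread across
// many smaller tasks instead of a few large ones that exhaust RAM.
val numPartitions = 1000 // tune for your cluster and data volume

val yesterdayDf = sqlContext.sql("select * from t_yesterdayProducts")
  .repartition(numPartitions)
val todayDf = sqlContext.sql("select * from t_todayProducts")
  .repartition(numPartitions)

// Rows present today but not yesterday
val diffDf = todayDf.except(yesterdayDf)
diffDf.count() // or write the result out; any action forces the computation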