I have two variables. One is a DataFrame and the other is a List[DataFrame]. I wish to join them. At the moment I am using the following approach:
def joinDfList(SingleDataFrame: DataFrame, DataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame = {
  var joinedDf = SingleDataFrame
  DataFrameList.foreach(
    Df => {
      joinedDf = joinedDf.join(Df, groupByCols, "left_outer")
    }
  )
  joinedDf.na.fill(0.0)
}
Is there an approach that avoids the "var" and uses "foldLeft" instead of "foreach"?
You can simply write it without vars using foldLeft:
def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame =
  dataFrameList.foldLeft(singleDataFrame)(
    (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
  ).na.fill(0.0)
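In this code, dfAcc starts as singleDataFrame and at each step is joined with the next DataFrame from dataFrameList. The result of each join becomes the accumulator for the following step, so in the end you get a single DataFrame. If dataFrameList is empty, singleDataFrame is returned as is (after na.fill).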
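To see the fold in action, here is a minimal sketch that runs the same function on three tiny DataFrames. The object name `JoinDfListExample`, the column names (`id`, `sales`, `cost`), the sample rows, and the local `SparkSession` settings are all made up for illustration; only `joinDfList` itself comes from the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinDfListExample {
  // Same fold as above: the accumulator starts as the single DataFrame
  // and absorbs one DataFrame from the list per step.
  def joinDfList(singleDataFrame: DataFrame, dataFrameList: List[DataFrame], groupByCols: List[String]): DataFrame =
    dataFrameList.foldLeft(singleDataFrame)(
      (dfAcc, nextDF) => dfAcc.join(nextDF, groupByCols, "left_outer")
    ).na.fill(0.0)

  def main(args: Array[String]): Unit = {
    // Hypothetical local session, used only for this demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("joinDfList-example")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: `base` holds every key, the others only some of them.
    val base  = Seq("a", "b", "c").toDF("id")
    val sales = Seq(("a", 10.0), ("b", 20.0)).toDF("id", "sales")
    val costs = Seq(("a", 3.0)).toDF("id", "cost")

    // Step 1: base  left_outer sales -> columns id, sales
    // Step 2: (result of step 1) left_outer costs -> columns id, sales, cost
    val joined = joinDfList(base, List(sales, costs), List("id"))

    // Keys missing from a joined DataFrame get null, which na.fill(0.0)
    // then replaces with 0.0 in the numeric columns.
    joined.show()

    spark.stop()
  }
}
```

Because the join is a left outer join on `id`, every row of `base` survives. The key `c`, which appears in neither `sales` nor `costs`, ends up with 0.0 in both numeric columns after `na.fill(0.0)`.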
Important! Be careful: chaining too many joins in one job can degrade performance.