dataframe, scala, apache-spark

Is it possible to use the Spark DataFrame/Dataset API with accumulators?


I read and filter data, and I need to count how each filter operation affects the result. Is it possible to somehow mix in Spark accumulators while using the DataFrame/Dataset API?

Sample code:

sparkSession.read
  .format("org.apache.spark.sql.delta.sources.DeltaDataSource")
  .load(path)
  // use spark accumulator to count records that passed filter 
  .where(col("ds") >= dateFromInclusive and col("ds") < dateToExclusive)
  // same here
  .where(col("origin").isin(origins)

Solution

  • You can use count_if to count rows for multiple filters (and get all the counts in one pass), but you can't simultaneously filter the rows with it the way your code example does.

    Example from the SQL function documentation:

    > SELECT count_if(col % 2 = 0) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
     2
    > SELECT count_if(col IS NULL) FROM VALUES (NULL), (0), (1), (2), (3) AS tab(col);
     1
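
    In the Scala DataFrame API the same idea might look like the sketch below. The values dateFromInclusive, dateToExclusive and origins are hypothetical stand-ins for those in your snippet, and count_if(cond) is written as count(when(cond, 1)), which is equivalent and works on older Spark versions (newer releases also expose count_if directly in org.apache.spark.sql.functions).

    import org.apache.spark.sql.functions.{col, count, when}

    // Hypothetical values standing in for those in the question.
    val dateFromInclusive = "2024-01-01"
    val dateToExclusive   = "2024-02-01"
    val origins           = Seq("web", "mobile")

    // df is the DataFrame loaded as in the question, e.g. sparkSession.read...load(path)
    val dsPredicate     = col("ds") >= dateFromInclusive && col("ds") < dateToExclusive
    val originPredicate = col("origin").isin(origins: _*)

    // One aggregation pass that reports how many rows pass each cumulative filter;
    // note that the rows themselves are not filtered here.
    val counts = df.agg(
      count(when(dsPredicate, 1)).as("passed_ds_filter"),
      count(when(dsPredicate && originPredicate, 1)).as("passed_both_filters")
    )

    counts.show()

    This gives all the counts in a single pass over the data, but the filtered DataFrame itself still has to be produced by a separate where, as noted above.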