I am using a Spark accumulator to collect statistics for each pipeline.
In a typical pipeline I would read a DataFrame:
df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders")
df.count()   # returns 7 rows
Then I would write it to two different locations:
df.write.format("delta").option("header", "true").save("/mnt/prepared/orders")
df.write.format("delta").option("header", "true").save("/mnt/reporting/orders_current/")
Unfortunately, my accumulator statistics get updated on each write operation. It reports 14 rows read, while I have only read the input DataFrame once.
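A simplified sketch of the kind of wiring involved (the exact accumulator code is omitted here; rows_read, bump, and the output paths are placeholders) shows the double counting, because each write is a separate action that re-runs the whole lineage:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows_read = spark.sparkContext.accumulator(0)

df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders")

def bump(row):
    rows_read.add(1)  # incremented once per row, every time the lineage is evaluated
    return row

counted = spark.createDataFrame(df.rdd.map(bump), df.schema)

# two write actions => the lineage (including bump) is evaluated twice
counted.write.format("delta").save("/mnt/prepared/orders_delta")
counted.write.format("delta").save("/mnt/reporting/orders_current/")

print(rows_read.value)  # 14 for a 7-row input, not 7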
How can I make my accumulator properly reflect the number of rows that I actually read?
I am a newbie in Spark. I have checked several threads around the issue but did not find my answer: "Statistical accumulator in Python", "Spark Accumulator reset", "When are accumulators truly reliable?".
The first rule: accumulators aren't 100% reliable. They can be updated multiple times, for example if tasks are restarted or retried.
In your case, although you read once, that doesn't mean the data won't be re-read. The read operation itself only obtains metadata such as the schema (it may scan data if you use inferSchema for some data types), but it doesn't actually load the data into memory. Each write is then a separate action that triggers a full evaluation of the lineage, which is why your accumulator is updated twice. You can cache the DataFrame you read, but that only works for smaller data sets, and even then caching doesn't guarantee that data won't be evicted and need to be re-read.
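If your data is small enough, a minimal sketch of the caching approach (the output paths below are placeholders) looks like this:

df = spark.read.format("csv").option("header", "true").load("/mnt/prepared/orders").cache()

row_count = df.count()  # materialises the cache and gives an exact row count

df.write.format("delta").save("/mnt/prepared/orders_delta")
df.write.format("delta").save("/mnt/reporting/orders_current/")

df.unpersist()

Note that calling count() once on the cached DataFrame already gives you an exact figure for the rows read, without depending on accumulator semantics at all.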