I've got a process that is really bogged down by the version computation for the target Delta table.
A little bit of context: there are other processes that run, all contributing uniformly structured DataFrames that I want to persist in a Delta table. Ultimately these are all collected into lots_of_dataframes to be logged.
for df in lots_of_dataframes:
    df.write.insertInto("target_delta_table")
    # ... takes a while to compute the version
I've dug into the documentation but couldn't find any setting to skip the version computation. I did see vacuuming, but I'm not sure that will do the trick since there will still be a lot of activity in a small window of time.
I know I can union all of the DataFrames together and just do the insert once, but I'm wondering if there is a more Databricks-ian way to do it, like a configuration to maintain only one version at a time and not worry about computing anything for a restore.
Most probably (it's hard to say exactly without more details) the problem arises from the following facts:
Spark is lazy - the actual data processing doesn't happen until you perform an action, such as writing data into a destination table. So if you have a lot of transformations, they will all be executed at the point where you write the data.
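For illustration, here is a minimal, self-contained sketch (the generated data and the table name lazy_write_demo are just placeholders): nothing actually runs until the write at the end.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# These lines only build a query plan - nothing is computed yet
df = spark.range(1_000_000).withColumn("value", F.col("id") * 2)
filtered = df.filter(F.col("value") > 100)

# The write is the action: generating the data, applying the transformations
# and writing the result all happen on this single line
filtered.write.mode("append").saveAsTable("lazy_write_demo")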
You're writing data in a loop - you can potentially speed it up a bit by doing a union of all the DataFrames into a single one that is written in one go:
import functools

# Combine all DataFrames into one, then write them in a single operation
unioned = functools.reduce(lambda x, y: x.union(y), lots_of_dataframes)
unioned.write.insertInto("target_delta_table")
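Each write to a Delta table is a separate commit, i.e. a new table version, so the loop produces one version per DataFrame while the unioned write produces just one. You can confirm this with the table history (assuming the usual spark session object is available, as in a Databricks notebook):

# DESCRIBE HISTORY lists the Delta commits; after the unioned write you
# should see a single new entry instead of one per DataFrame
spark.sql("DESCRIBE HISTORY target_delta_table").show(truncate=False)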