We built a pipeline that includes several transforms. The whole pipeline build currently takes more than 30 minutes, but we need the data to be available in less than 15 minutes.
How can we reduce the total build time?
While the transforms are running, we noticed that the Spark details are greyed out. Here is an example that is representative of several transforms: the Spark details stay greyed out for more than 10 minutes, and only then does the Spark job actually get executed.
The Spark job itself runs in only 3 minutes. In other words, the Spark details are greyed out for about 80% of the build duration.
What is happening in the build while the Spark details are greyed out? How can we reduce this duration?
Spark details become available once the Spark environment initialization has completed. The most time-consuming stage of this step is the download of all packages. Could you look at the list of packages that you have installed in your Library panel and remove any package that is not used by your transforms?
Some packages, due to their size, can take more time than others to be downloaded. Therefore, removing the ones that you are not using is the most efficient way to save time and accelerate the Spark environment initialization.
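To help spot removal candidates, here is a minimal sketch in plain Python that compares a list of declared packages against the modules your transform code actually imports. It is not a Foundry feature: the function names `imported_modules`, `read_sources` and `unused_packages` are made up for this illustration, and you would have to copy the package list from your Library panel by hand. It also assumes that a package's name matches its import name, which is not always true (for example, `scikit-learn` is imported as `sklearn`), and it cannot tell whether a package is needed indirectly as a dependency of another one. Treat the result as a list to review, not as a list that is safe to delete.

```python
import ast
from pathlib import Path


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> "a"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" -> "a"; relative imports (level > 0) are skipped
            modules.add(node.module.split(".")[0])
    return modules


def read_sources(repo_root):
    """Yield the text of every .py file under repo_root (hypothetical local checkout)."""
    for path in Path(repo_root).rglob("*.py"):
        yield path.read_text(encoding="utf-8")


def unused_packages(declared, sources):
    """Return declared packages that no source imports.

    Assumption: package name == import name, ignoring case and '-' vs '_'.
    """
    used = set()
    for source in sources:
        used |= {name.lower() for name in imported_modules(source)}
    return sorted(
        pkg for pkg in declared
        if pkg.replace("-", "_").lower() not in used
    )
```

For example, with a local checkout of the repository you could call `unused_packages(["numpy", "pandas", "scipy"], read_sources("path/to/repo"))`, where the package list and the path are placeholders for your own values.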