
netcdf-java: Is concurrent read-only access supported for independent data in Spark?


I have a system with 256 cores and fast storage. I have been working with datasets (hundreds of GB) stored in a large number of small files using Spark (Scala). Each file contains a small number of frames (~50), and the frames have no overlapping data. Each frame is associated with a unique timestamp. Each Spark executor works with a set of files and processes all frames present in those files. I have been able to do all operations very efficiently (CPU utilization >95%).

Now I have to work with a small number of files, each storing a large number of frames. The current approach has become inefficient; reading appears to be the rate-limiting step. I tried to improve the reading process by spawning tasks within executors so that different frames are read in parallel, with each task opening the file through its own local file handle. The performance is still terrible.
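For reference, a minimal sketch of the per-task reading described above. The variable name "data" and the (frame, y, x) layout are assumptions about my files; as far as I know a NetcdfFile handle is not thread-safe, which is why each task opens its own:

    import ucar.nc2.{NetcdfFile, NetcdfFiles}

    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    // Read one frame through a task-local file handle; handles are not
    // shared between tasks because a NetcdfFile is not thread-safe.
    def readFrame(path: String, frameIndex: Int): ucar.ma2.Array = {
      val nc: NetcdfFile = NetcdfFiles.open(path)
      try {
        val v = nc.findVariable("data") // hypothetical variable name
        val s = v.getShape
        // origin selects the frame; shape selects exactly one frame
        v.read(Array(frameIndex, 0, 0), Array(1, s(1), s(2)))
      } finally nc.close()
    }

    // Fan out over (path, frameIndex) pairs inside an executor.
    def readFramesInParallel(work: Seq[(String, Int)])(implicit ec: ExecutionContext): Seq[ucar.ma2.Array] =
      Await.result(Future.sequence(work.map { case (p, i) => Future(readFrame(p, i)) }), Duration.Inf)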

Considering that the frames have non-overlapping data, is there anything else I can do? I could first read the timestamps and then create tasks that read the corresponding frames concurrently, but I am not sure this would improve performance, since what I am doing now is very similar (though at the executor level).

I appreciate any corrections to my approach or any suggestions. I understand that netCDF supports parallel reading through an MPI interface, but I use Spark. Is not using this interface the problem?

Thanks in advance.

netcdfAll-5.4.1.jar
Spark version 4.0.0-preview1
Using Scala version 2.13.14 (Java HotSpot(TM) 64-Bit Server VM, Java 21.0.4)

Solution

  • The netCDF files had several data arrays apart from time. So I first read time, associated each time value with the name of the netCDF file it belongs to, and repartitioned the resulting DataFrame. Subsequently I added more columns using UDFs. This approach gave almost identical performance when 200,000 frames were distributed evenly across either 4 or 4,000 files.
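A minimal sketch of this pipeline, assuming hypothetical variable names "time" and "data", a (frame, y, x) layout, and a per-frame mean as a stand-in for the real UDF columns:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}
    import ucar.nc2.NetcdfFiles

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val files = Seq("/data/a.nc", "/data/b.nc") // hypothetical paths

    // Step 1: read only the small time coordinate and pair each value
    // with the file it belongs to and its frame index within that file.
    val timeRows: Seq[(String, Int, Double)] = files.flatMap { path =>
      val nc = NetcdfFiles.open(path)
      try {
        val times = nc.findVariable("time").read()
        (0 until times.getSize.toInt).map(i => (path, i, times.getDouble(i)))
      } finally nc.close()
    }

    // Step 2: repartition so frame reads are spread across all cores,
    // regardless of how few files the frames live in.
    val frames = timeRows.toDF("path", "frameIndex", "time")
      .repartition(spark.sparkContext.defaultParallelism)

    // Step 3: a UDF that opens the file with a local handle and reduces
    // one frame to a scalar (returning whole frames through a UDF is costly).
    val frameMean = udf { (path: String, frameIndex: Int) =>
      val nc = NetcdfFiles.open(path)
      try {
        val v = nc.findVariable("data")
        val s = v.getShape
        val frame = v.read(Array(frameIndex, 0, 0), Array(1, s(1), s(2)))
        ucar.ma2.MAMath.sumDouble(frame) / frame.getSize
      } finally nc.close()
    }

    val result = frames.withColumn("frameMean", frameMean(col("path"), col("frameIndex")))

Repartitioning the small time DataFrame first is what decouples read parallelism from file count: each core ends up with its own slice of frame indices to pull, even when all frames sit in just 4 files.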