apache-spark, amazon-s3, distributed-filesystem

Spark reading from a distributed file system?


Say I have data (user events) stored in a distributed file system like S3 or HDFS. The user events are stored in date-wise directories.

Case 1: Consider that a Spark job needs to read the data for a single day. My understanding is that a single Spark job will read the data from that day's directory block by block and provide the data to the Spark cluster for computation. Will that block-by-block reading process be sequential?
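For concreteness, here is a minimal sketch of the single-day read (the bucket name and date-directory layout are made up). Spark splits the input into partitions, roughly one per file block or Parquet row group, and the executors read those partitions in parallel rather than sequentially:

```scala
import org.apache.spark.sql.SparkSession

object ReadSingleDay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-single-day")
      .getOrCreate()

    // Hypothetical layout: one directory per date under s3a://my-bucket/events/
    val events = spark.read.parquet("s3a://my-bucket/events/date=2023-01-01/")

    // Each file block / row group becomes an input partition, and the
    // executors scan those partitions concurrently.
    println(s"Input partitions: ${events.rdd.getNumPartitions}")

    spark.stop()
  }
}
```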

Case 2: Consider that a Spark job needs to read the data for more than one day (say, 2 days). Here the job has to read the data from two separate directories. Do I need to start two separate Spark processes (or threads) so that the reads from the separate directories can run in parallel?
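A single Spark job can read several directories at once: `spark.read.parquet` accepts multiple paths, so no separate processes or threads are needed. A sketch using the same hypothetical layout as above:

```scala
import org.apache.spark.sql.SparkSession

object ReadMultipleDays {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-multiple-days")
      .getOrCreate()

    // One read that lists both directories; Spark distributes the
    // resulting file splits across the executors in a single job.
    val events = spark.read.parquet(
      "s3a://my-bucket/events/date=2023-01-01/",
      "s3a://my-bucket/events/date=2023-01-02/"
    )

    println(s"Rows across both days: ${events.count()}")
    spark.stop()
  }
}
```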


Solution

  • You can achieve this by bucketing and partitioning the data while saving it. Also use the Parquet file format, which is columnar. Spark will apply partition pruning and predicate pushdown to reduce the amount of data read for a query. Using multiple executors along with multiple partitions enables parallel processing of the data; see the sketch below.
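A minimal sketch of that approach, assuming a made-up bucket and an `event_date` column (bucketing would additionally use `bucketBy` with `saveAsTable`, omitted here): write the data partitioned by date as Parquet, then filter on the partition column at read time so Spark can prune directories and push the predicate into the scan:

```scala
import org.apache.spark.sql.SparkSession

object PartitionedEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-events")
      .getOrCreate()

    // Hypothetical raw source that contains an event_date column.
    val raw = spark.read.json("s3a://my-bucket/raw-events/")

    // Write columnar Parquet with one sub-directory per date,
    // e.g. .../events/event_date=2023-01-01/.
    raw.write
      .partitionBy("event_date")
      .parquet("s3a://my-bucket/events/")

    // The filter on the partition column lets Spark prune non-matching
    // date directories (partition pruning) and push the predicate into
    // the Parquet reader (predicate pushdown).
    val twoDays = spark.read
      .parquet("s3a://my-bucket/events/")
      .where("event_date >= '2023-01-01' AND event_date <= '2023-01-02'")

    twoDays.show()
    spark.stop()
  }
}
```

With this layout, a two-day query lists and scans only the two matching date directories instead of the whole dataset, and the files within them are still read in parallel across the executors.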