apache-spark, hadoop, hive, dremio

Number of splits in dataset exceeds dataset split limit (Dremio + Hive + Spark)


We have a stack consisting of Hadoop + Hive + Spark + Dremio. Because Spark writes many HDFS files per Hive partition (depending on the number of workers), Dremio fails when querying the table: the limit on the number of HDFS splits is exceeded. Is there any way to solve this without manually configuring a smaller number of Spark workers? (We don't want to lose Spark's distributed performance and benefits.)


Solution

  • You can repartition the DataFrame on the same columns you pass to partitionBy, which produces one file per Hive partition. Since there is still at least one task per partition, the Spark job keeps enough parallelism; a fuller, self-contained sketch follows the snippet below.

    df.repartition($"a", $"b", $"c", $"d", $"e").write.partitionBy("a", "b", "c", "d", "e").mode(SaveMode.Append).parquet(s"$location")