We have a stack consisting of Hadoop + Hive + Spark + Dremio. Since Spark writes many HDFS files for a single Hive partition (depending on the number of workers), Dremio fails when querying the table because the limit on the number of HDFS files is exceeded. Is there any way to solve this without manually setting a smaller number of workers in Spark? (We don't want to lose Spark's distributed performance and benefits.)
You can use repartition on the same columns you pass to partitionBy. This groups all rows for each distinct combination of partition values into a single task, so each Hive partition is written as exactly one file, while the different partitions are still processed in parallel, so your Spark job keeps its parallelism.
df.repartition($"a", $"b", $"c", $"d", $"e").write.partitionBy("a", "b", "c", "d", "e").mode(SaveMode.Append).parquet(s"$location")
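For context, here is a minimal self-contained sketch of the same approach. It assumes a Hive-enabled SparkSession, a hypothetical source table and output path, and keeps the column names a through e from the snippet above purely for illustration:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("compact-partition-files")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._  // needed for the $"col" column syntax

// Hypothetical input table and output location
val df: DataFrame = spark.table("source_table")
val location = "/warehouse/target_table"

df
  // shuffle so all rows of each (a, b, c, d, e) combination end up in one task
  .repartition($"a", $"b", $"c", $"d", $"e")
  // write one Parquet file per Hive partition directory
  .write
  .partitionBy("a", "b", "c", "d", "e")
  .mode(SaveMode.Append)
  .parquet(location)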