Tags: pyspark, hive, databricks, azure-databricks, automl

Writing wide table (40,000+ columns) to Databricks Hive Metastore for use with AutoML


I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.

To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns.

print((df.count(), len(df.columns)))
(33030, 45502)

This data is written to a table in Hive using the following PySpark command (I believe this is standard):

df.write.mode('overwrite').saveAsTable("WIDE_TABLE")

Unfortunately this job does not finish within an acceptable time (10 hours). I cancel it, and hence don't have an error message.

When I reduce the number of columns with

df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")

it fares better and finishes in 9.87 minutes, so the method should work.

Can this be solved?

[EDIT to address questions in comments]

Runtime and driver summary:

Workers: 2-16, 112-896 GB memory, 32-256 cores (Standard_DS5_v2)

Driver: 1, 56 GB memory, 16 cores (same as worker)

Runtime: 10.4.x-scala2.12

To give an impression of the timings I've added a table below.

columns    time (mins)
10         1.94
100        1.92
200        3.04
500        9.87
1000       25.91
5000       938.4

The data type of the remaining (feature) columns is integer.

As far as I know, I'm writing the table in the same environment that I'm working in. Data flow: Azure Blob CSV -> data read and wrangling -> PySpark DataFrame -> Hive table. The last three steps run on the same cloud machine.
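
For reference, the read step is roughly along these lines (a simplified sketch; the path and options are placeholders):

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")  # infer column types; a fixed schema could be supplied instead
    .csv("wasbs://<container>@<storage-account>.blob.core.windows.net/path/to/data.csv")
)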

Hope this helps!


Solution

  • I think your case is not related to Spark resource configuration or the network connection; it's related to Spark's design itself.

    Long story short, Spark is designed for long and narrow data, which is exactly the opposite of your DataFrame. Looking at your experiment, the time consumed grows exponentially as the number of columns increases. Although it is about reading CSV rather than writing a table, you can check this post for a good explanation of why Spark does not handle wide DataFrames well: Spark csv reading speed is very slow although I increased the number of nodes

    Although I haven't used Azure AutoML before, based on your dataset and goal, I think you can try:

    1. Use a Python pandas DataFrame together with a Hive connection library and see whether there is any performance improvement
    2. Concatenate all your feature columns into a single Array / Vector column before writing to Hive (see the sketch below)
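
    For the second option, a minimal sketch (assuming df is your wide DataFrame, the response column is named "target" — replace with your actual column name — and all feature columns are integers as you mention):

    from pyspark.sql import functions as F

    # Pack every feature column into a single array column so the table becomes narrow.
    # "target" is a placeholder for the response variable's column name.
    feature_cols = [c for c in df.columns if c != "target"]

    df_packed = df.select(
        F.col("target"),
        F.array(*[F.col(c) for c in feature_cols]).alias("features"),
    )

    df_packed.write.mode("overwrite").saveAsTable("WIDE_TABLE_PACKED")

    Keep in mind that the resulting table exposes one array column instead of 40,000 individual feature columns, so check whether that representation works for your AutoML setup.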