Tags: pyspark, hive, databricks, azure-databricks, automl

Writing wide table (40,000+ columns) to Databricks Hive Metastore for use with AutoML


I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.

To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns.

print((df.count(), len(df.columns)))
(33030, 45502)

This data is written to a table in Hive using the following PySpark command (I believe this is standard):

df.write.mode('overwrite').saveAsTable("WIDE_TABLE")

Unfortunately this job does not finish within an acceptable time (10 hours). I cancel it, and hence don't have an error message.

When I reduce the number of columns with

df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")

it fares better and finishes in 9.87 minutes, so the method should work.

Can this be solved?

[EDIT to address questions in comments]

Runtime and driver summary:

Workers: 2-16, 112-896 GB memory, 32-256 cores (Standard_DS5_v2)

Driver: 1, 56 GB memory, 16 cores (same as worker)

Runtime: 10.4.x-scala2.12

To give an impression of the timings I've added a table below.

columns    time (mins)
10         1.94
100        1.92
200        3.04
500        9.87
1000       25.91
5000       938.4

The data type of the remaining (feature) columns is integer.

As far as I know, I'm writing the table in the same environment that I'm working in. Data flow: Azure Blob CSV -> data read and wrangling -> PySpark DataFrame -> Hive table. The last three steps run on the same cloud machine.
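
For reference, the read step is roughly along these lines (a simplified sketch; the path and options are placeholders):

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")  # infer column types; a fixed schema could be supplied instead
    .csv("wasbs://<container>@<storage-account>.blob.core.windows.net/path/to/data.csv")
)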

Hope this helps!


Solution

  • I think your case is not related to Spark resource configuration or the network connection; it's related to Spark's design itself.

    Long story short, Spark is designed for long and narrow data, which is exactly the opposite of your DataFrame. Looking at your experiment, the time consumed grows exponentially as the number of columns increases. Although it is about reading CSV rather than writing a table, you can check this post for a good explanation of why Spark does not handle wide DataFrames well: Spark csv reading speed is very slow although I increased the number of nodes

    Although I haven't used Azure AutoML before, based on your dataset and goal, I think you can try:

    1. Use a Python pandas DataFrame together with a Hive connection library and see whether there is any performance improvement
    2. Concatenate all your feature columns into a single Array / Vector column before writing to Hive (see the sketch below)
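
    For the second option, a minimal sketch (assuming df is your wide DataFrame, the response column is named "target" — replace with your actual column name — and all feature columns are integers as you mention):

    from pyspark.sql import functions as F

    # Pack every feature column into a single array column so the table becomes narrow.
    # "target" is a placeholder for the response variable's column name.
    feature_cols = [c for c in df.columns if c != "target"]

    df_packed = df.select(
        F.col("target"),
        F.array(*[F.col(c) for c in feature_cols]).alias("features"),
    )

    df_packed.write.mode("overwrite").saveAsTable("WIDE_TABLE_PACKED")

    Keep in mind that the resulting table exposes one array column instead of 40,000 individual feature columns, so check whether that representation works for your AutoML setup.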