I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.
To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns:
print((df.count(), len(df.columns)))
(33030, 45502)
This data is written to a table in Hive using the following PySpark command (I believe this is standard):
df.write.mode('overwrite').saveAsTable("WIDE_TABLE")
Unfortunately this job does not finish within an acceptable time (10 hours), so I cancel it and hence have no error message.
When I reduce the number of columns with
df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")
it fares better and finishes in 9.87 minutes, so the method should work.
Can this be solved?
[EDIT to address questions in comments]
Runtime and driver summary:
2-16 workers (Standard_DS5_v2), 112-896 GB memory, 32-256 cores
1 driver, 56 GB memory, 16 cores (same type as the workers)
Databricks Runtime: 10.4.x-scala2.12
To give an impression of the timings I've added a table below.
| columns | time (mins) |
|---|---|
| 10 | 1.94 |
| 100 | 1.92 |
| 200 | 3.04 |
| 500 | 9.87 |
| 1000 | 25.91 |
| 5000 | 938.4 |
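The timings above were produced with a loop along these lines (a rough sketch, not the exact code; the per-width table names are made up):

```python
import time

# Illustrative benchmark: write the first n columns of df and record
# the wall-clock time for each width.
for n in [10, 100, 200, 500, 1000, 5000]:
    start = time.time()
    df.select(df.columns[:n]).write.mode('overwrite').saveAsTable(f"WIDE_TABLE_{n}")
    print(n, round((time.time() - start) / 60, 2), "mins")
```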
Data type of the remaining columns is Integer.
As far as I know, I'm writing the table in the same environment that I'm working in. Data flow: Azure Blob CSV -> data read and wrangling -> PySpark DataFrame -> Hive table. The last three steps run on the same cloud machine.
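For completeness, a minimal sketch of that flow (the Blob path and CSV options are placeholders, and `spark` is the Databricks-provided SparkSession):

```python
# Sketch of the data flow above; the Blob path and CSV options are
# placeholders, not the actual values used.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("wasbs://<container>@<account>.blob.core.windows.net/<path>.csv"))

# ... data read and wrangling on the PySpark DataFrame ...

df.write.mode('overwrite').saveAsTable("WIDE_TABLE")
```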
Hope this helps!
I think your case is related neither to Spark resource configuration nor to the network connection; it's related to the design of Spark itself.
Long story short, Spark is designed for long, narrow data, which is exactly the opposite of your DataFrame. In your experiment, the runtime grows much faster than linearly as the number of columns increases. Although it's about reading CSV rather than writing a table, you can check this post for a good explanation of why Spark does not handle wide DataFrames well: Spark csv reading speed is very slow although I increased the number of nodes
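To illustrate how much the write path depends on schema width, here is a rough sketch of one way to keep the written schema narrow: pack the integer features into a single array column. Column and table names are placeholders, and note the AutoML GUI would still need individual feature columns, so this only helps for intermediate storage:

```python
from pyspark.sql import functions as F

# Illustrative workaround, not a full solution: pack the integer feature
# columns into one array column so the written schema has only 2 columns.
# "response" is a placeholder for the actual label column name.
feature_cols = [c for c in df.columns if c != "response"]
packed = df.select(
    F.col("response"),
    F.array(*[F.col(c) for c in feature_cols]).alias("features"),
)
packed.write.mode("overwrite").saveAsTable("WIDE_TABLE_PACKED")
```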
Although I haven't used Azure AutoML before, given your dataset and your goal, I think you can try: