I am trying to write a large dataset (millions of rows) to an SQL table (Impala) using SparkR in Databricks. I have found two options, neither of which is working.
The first, a simple insertInto, fails after five minutes with 'The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.' It does not restart:
library(SparkR)
sparkR.session()
# spark_dt_frame is an existing SparkDataFrame, sql_table the target table name
insertInto(spark_dt_frame, sql_table, overwrite = FALSE)
The second option, using COPY INTO, seems to hang (it runs forever and never completes) even when inserting just 3 rows:
sparkR.session()
sql(paste("COPY INTO ", db_name, ".sql_table",
          " FROM '", spark_data_frame, "'",
          " FILEFORMAT = PARQUET",
          sep = ""))
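For reference, the documented COPY INTO syntax reads from a file location rather than from a data frame reference, so presumably the data would need to be landed as files first. A rough sketch of that pattern, where staging_path is just a placeholder:

staging_path <- "/tmp/staging/my_data"
# Write the Spark DataFrame out as parquet files, then copy those files in
write.df(spark_data_frame, path = staging_path, source = "parquet", mode = "overwrite")
sql(paste("COPY INTO ", db_name, ".sql_table",
          " FROM '", staging_path, "'",
          " FILEFORMAT = PARQUET",
          sep = ""))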
These seem to be common issues for which Databricks' only answer is 'detach and re-attach the notebook', which makes no difference. What are my options?
For anyone else who struggles with this issue: it relates to how memory is handled for R data frames on Databricks clusters. To work around it, I have found two options so far:
Convert your R data frame to a partitioned Spark DataFrame prior to the insert (note that you may still need to increase the size of your cluster's driver):
spark_df_for_insert <- createDataFrame(r_df, numPartitions = 150)
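To make the flow concrete, a minimal end-to-end sketch of this option; the target table name and partition count are placeholders to adjust for your own data and cluster:

library(SparkR)
sparkR.session()

# Build a partitioned Spark DataFrame so the write is spread across
# executors instead of funnelled through the driver in one shot
spark_df_for_insert <- createDataFrame(r_df, numPartitions = 150)

# Append into the existing target table
insertInto(spark_df_for_insert, "db_name.sql_table", overwrite = FALSE)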
Stop using R data frames and switch to Spark data frames. This means you will need to change your code, and a package like sparklyr will certainly come in handy; a rough sketch follows below.
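For example, a minimal sparklyr sketch, where the connection method, source path, target table name and partition count are all assumptions to adapt to your setup:

library(sparklyr)
library(dplyr)

# Reuse the Spark session of the Databricks cluster the notebook is attached to
sc <- spark_connect(method = "databricks")

# Read straight into a Spark data frame instead of collecting an R data frame
spark_tbl <- spark_read_parquet(sc, name = "staging_tbl", path = "/mnt/source/data")

# Repartition and append into the existing target table
spark_tbl %>%
  sdf_repartition(partitions = 150) %>%
  spark_write_table(name = "db_name.sql_table", mode = "append")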
I hope that helps somebody.