databricks impala sparkr

Options for inserting a large dataset from Databricks into an SQL table using SparkR


I am trying to write a large dataset (millions of rows) to an SQL table (Impala) using SparkR in Databricks. I have found two options, neither of which is working.

The first, a simple insertInto, fails after five minutes with 'The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.' The driver does not restart:

sparkR.session()
insertInto(spark_dt_frame, sql_table, overwrite = FALSE)

The second, using COPY INTO, seems to hang (it runs forever and never completes) even when inserting just 3 rows:

sparkR.session()
sql(paste("COPY INTO ",db_name,'.sql_table',
  " FROM ''", spark_data_frame, "'",
  " FILEFORMAT = PARQUET",
  sep=""
))

It seems these are common issues, and the only answer Databricks gives is 'detach and reattach the notebook', which makes no difference. What are my options?


Solution

  • For anyone else who struggles with this issue: it relates to how memory is handled for R data frames on Databricks clusters. To work around it, I have found two options so far:

    1. Convert your R data frame to a partitioned Spark DataFrame prior to the insert (note: you may still need to increase your cluster's driver memory); an end-to-end sketch follows after this list:

      spark_df_for_insert <- createDataFrame(r_df, numPartitions=150)

    2. Stop using R data frames and switch to Spark DataFrames. This means you will need to change your code, and a package like sparklyr will certainly come in handy; a small sparklyr sketch also follows below.

    I hope that helps somebody.
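
    As a rough end-to-end sketch of option 1: r_df stands in for your own R data frame and db_name.sql_table for your existing target table, so treat both as placeholders rather than names from the original post:

      library(SparkR)
      sparkR.session()

      # r_df is a placeholder; replace it with your own R data frame.
      r_df <- data.frame(id = 1:1000, value = runif(1000))

      # Convert to a Spark DataFrame split across many partitions; per the
      # note above, this works around the driver memory issue.
      spark_df_for_insert <- createDataFrame(r_df, numPartitions = 150)

      # Append into the existing target table (table name is a placeholder).
      insertInto(spark_df_for_insert, "db_name.sql_table", overwrite = FALSE)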
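
    And a minimal sparklyr sketch of option 2, keeping the data in Spark the whole time; the source and target table names and the id column are placeholders, and spark_connect(method = "databricks") is the usual way to attach to the notebook's cluster:

      library(sparklyr)
      library(dplyr)

      # Attach to the Databricks cluster's existing Spark session.
      sc <- spark_connect(method = "databricks")

      # Read the data straight into a Spark DataFrame, so the rows never sit
      # in an R data frame on the driver.
      src <- spark_read_table(sc, "source_table")

      # dplyr transformations run in Spark, not in R (purely illustrative).
      result <- src %>% filter(!is.na(id))

      # Append the result into the target table.
      spark_write_table(result, "db_name.sql_table", mode = "append")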