Tags: sql, apache-spark, pyspark, apache-spark-sql, downgrade

Spark 2.3.1 array_join and array_remove


I have written a PySpark script that executes a SQL file. It worked perfectly fine on the latest Spark version, but the target machine runs 2.3.1, and there it throws this exception:

pyspark.sql.utils.AnalysisException: u"Undefined function: 'array_remove'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'

It seems these functions are not present in the older versions. Can anyone suggest something? I have searched a lot, but in vain.

The SQL fragment that is failing is:

SELECT NIEDC.*, array_join(array_remove(split(groupedNIEDC.appearedIn,'-'), StudyCode),'-') AS subjects_assigned_to_other_studies 

Solution

  • The array_remove and array_join functions were added in Spark 2.4. On Spark 2.3.1 you can write a UDF and register it so it can be called from the SQL query, as sketched below.
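
A minimal sketch of that approach, assuming a helper named remove_and_join (the name and argument order are illustrative, not part of Spark's API). It filters the element out of the array, joins the remainder with a separator, and is registered so SQL can call it:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("array_remove_join_fallback").getOrCreate()

# Fallback for array_remove + array_join on Spark 2.3.x:
# drop every occurrence of `element` from `arr`, then join the rest with `sep`.
def remove_and_join(arr, element, sep):
    if arr is None:
        return None
    return sep.join([x for x in arr if x != element])

# Register the UDF under a name that the SQL file can reference.
spark.udf.register("remove_and_join", remove_and_join, StringType())

The failing SELECT could then be rewritten along these lines, producing the same string that array_join(array_remove(...), '-') yields on Spark 2.4+:

SELECT NIEDC.*, remove_and_join(split(groupedNIEDC.appearedIn,'-'), StudyCode, '-') AS subjects_assigned_to_other_studies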