python, apache-spark, pyspark, split

Pyspark SQL not splitting column


I am trying to split a column using PySpark SQL, based on a delimiter that is stored in another column, but it doesn't seem to work for some special characters.

Here is my code:

df = spark.createDataFrame([("50000.0#0#0#", "#"),
                            ("0@1000.0@", "@"),
                            ("1$", "$"),
                            ("1000.00^Test_string", "^")],
                           ["VALUES", "Delimiter"])

df.registerTempTable("cleanTable")

df2 = spark.sql("""
                SELECT 
                    VALUES, cast(Delimiter as string),
                    split(cast(VALUES as string), cast(Delimiter as string)) as split_values
                FROM cleanTable
                """)

and here is my result:

+-------------------+---------+---------------------+
|VALUES             |Delimiter|split_values         |
+-------------------+---------+---------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]    |
|0@1000.0@          |@        |[0, 1000.0, ]        |
|1$                 |$        |[1$, ]               |
|1000.00^Test_string|^        |[1000.00^Test_string]|
+-------------------+---------+---------------------+

I don't understand why this is not working with delimiters such as $ and ^. Is there any extra config that I have to add to make it work?

Also, since this works in PySpark SQL, how can I use the split function with a column as the delimiter parameter? I mean split(col('a'), col('b')), is that possible?


Solution

  • The second argument of split is treated as a regular expression, and characters like $ and ^ are regex metacharacters (end-of-string and start-of-string anchors), so they don't split anything on their own. You can fix this by adding the escape character '\\' in front of your Delimiter field. For example:

    from pyspark.sql import functions as F

    # Prepend a literal backslash so the delimiter is treated as a
    # plain character rather than a regex metacharacter.
    df = df.withColumn(
        'DelimiterEscaped',
        F.concat(F.lit('\\'), F.col('Delimiter'))
    )
    df.createOrReplaceTempView("cleanTable")
    
    +-------------------+---------+----------------+
    |             VALUES|Delimiter|DelimiterEscaped|
    +-------------------+---------+----------------+
    |       50000.0#0#0#|        #|              \#|
    |          0@1000.0@|        @|              \@|
    |                 1$|        $|              \$|
    |1000.00^Test_string|        ^|              \^|
    +-------------------+---------+----------------+
    

    Then when you create df2, you should get the correct result:

    df2 = spark.sql("""
                    SELECT 
                        VALUES, cast(Delimiter as string),
                        split(cast(VALUES as string), cast(DelimiterEscaped as string)) as split_values
                    FROM cleanTable
                    """)
    
    +-------------------+---------+----------------------+
    |VALUES             |Delimiter|split_values          |
    +-------------------+---------+----------------------+
    |50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
    |0@1000.0@          |@        |[0, 1000.0, ]         |
    |1$                 |$        |[1, ]                 |
    |1000.00^Test_string|^        |[1000.00, Test_string]|
    +-------------------+---------+----------------------+
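
  • As for the second part of the question: at least up to recent Spark 3.x releases, the pattern argument of pyspark.sql.functions.split must be a string literal, not a Column, so split(col('a'), col('b')) won't work directly in the DataFrame API. A common workaround (a sketch, assuming the same df and column names as above) is to build the escaped delimiter column and then pass the whole call through F.expr, which evaluates the SQL split function and therefore accepts column references:

    from pyspark.sql import functions as F

    # Same idea as the SQL version: build an escaped pattern column, then
    # hand the split(...) call to F.expr so the delimiter can come from a
    # column instead of a string literal.
    df2 = (
        df.withColumn('DelimiterEscaped', F.concat(F.lit('\\'), F.col('Delimiter')))
          .withColumn('split_values', F.expr("split(VALUES, DelimiterEscaped)"))
    )
    df2.show(truncate=False)

    This should produce the same split_values column as the SQL query above, without registering a temp view.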