Tags: python, dataframe, apache-spark, pyspark, rdd

Order PySpark Dataframe by applying a function/lambda


I have a PySpark DataFrame that needs to be ordered by a column ("Reference"). The values in that column typically look like:

["AA.1234.56", "AA.1101.88", "AA.904.33", "AA.8888.88"]

I already have a key function which, when passed to Python's sorted(), orders such a list:

myFunc = lambda x: [int(a) if a.isdigit() else a for a in x.split(".")]

which yields the required order:

["AA.904.33", "AA.1101.88", "AA.1234.56", "AA.8888.88"]

I want to order the DataFrame by applying this lambda. I tried sortByKey, but it is not clear how to apply it to just a specific column of the DataFrame. Any ideas?

A related, more general question: which kinds of use cases require converting a PySpark DataFrame to an RDD? The sortByKey function seems to apply only to RDDs, not to DataFrames.
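
For context, a sketch of the RDD route I was considering (using sortBy rather than sortByKey, since sortByKey only works on key-value pair RDDs; the column name "Reference" is assumed):

# Drop to the RDD of Rows, sort by the split key, and convert back
sorted_rdd = df.rdd.sortBy(lambda row: myFunc(row['Reference']))
sorted_df = sorted_rdd.toDF()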


Solution

  • Python UDFs will slow your solution down a lot. It is much better to use Spark's native SQL functions. In that case your solution could look like this:

    from pyspark.sql import functions as F
    
    df = session.createDataFrame([("AA.1234.56",), ("AA.904.33",), ("AA.1101.88",)], ['data'])
    df.show()
    
    # +----------+
    # |      data|
    # +----------+
    # |AA.1234.56|
    # | AA.904.33|
    # |AA.1101.88|
    # +----------+
    
    # Split on the literal dot (raw string avoids an invalid escape sequence)
    df = df.withColumn('spl', F.split(F.col('data'), r'\.'))
    
    # Expose the segments as columns, cast the numeric middle one to int,
    # and order by them
    df = (df.withColumn('1', F.col('spl').getItem(0))
            .withColumn('2', F.col('spl').getItem(1).cast('int'))
            .withColumn('3', F.col('spl').getItem(2))
            .orderBy('1', '2', '3'))
    
    # Keep only the original column
    df = df.select('data')
    df.show()
    
    # +----------+
    # |      data|
    # +----------+
    # | AA.904.33|
    # |AA.1101.88|
    # |AA.1234.56|
    # +----------+
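
    As a side note (only a sketch, not part of the original answer, with the names parts and df_sorted chosen for illustration): the helper columns are not strictly required. You can order directly on the split expressions, and also cast the last segment to an int in case it can have a varying number of digits:
    
    # Order by the split parts without adding intermediate columns
    parts = F.split(F.col('data'), r'\.')
    df_sorted = df.orderBy(
        parts.getItem(0),
        parts.getItem(1).cast('int'),
        parts.getItem(2).cast('int'),
    )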