dataframeapache-sparkdatepysparkapache-spark-sql

Change month numbers to month name in a dataframe (PySpark)


I have a column of month numbers in a dataframe and want to change it to month name, so I used this which resulted in a type error:

df['monthName'] = df['monthNumber'].apply(lambda x: calendar.month_name[x]) 

TypeError: 'Column' object is not callable

How to get month name?

I'm using Spark 2.1.1 and Python 2.7.6.

This is my code for Airline data Analysis:

df_withDelay = df_mappedCarrierNames.filter(df_mappedCarrierNames.ArrDelay > 0)
sqlContext.registerDataFrameAsTable(df_withDelay,"SFO_ArrDelayAnalysisTable")
df_SFOArrDelay = sqlContext.sql \
                      ("select sfo.Month, sum(sfo.ArrDelay) as TotalArrivalDelay \
                      from SFO_ArrDelayAnalysisTable sfo \
                      where (sfo.Dest = 'SFO') \
                      group by sfo.Month")

I am trying to plot a graph with Month vs ArrDelay. From the above code I am getting Month as number. So I tried with the below option

udf = UserDefinedFunction(lambda x: calendar.month_abbr[int(x)], StringType())
new_df_mappedCarrierNames = df_mappedCarrierNames.select(*[udf(column).alias(name) if column == name else column for column in df_mappedCarrierNames.columns])

It works but, in my graph it's not in sorted order. whereas if I use the month numbers, it is in sorted order. My issue is in finding out how to map month numbers to month names in sorted order from Jan to dec.


Solution

  • Here is the clean solution:

    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import StringType
    
    #1
    month_lst = ['January', 'Feburary', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    #2
    df = sqlContext.createDataFrame( [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
    
    name = 'x1'
    #3
    udf = UserDefinedFunction(lambda x: month_lst[int(x%12) - 1], StringType())
    new_df = df.select(*[udf(column).alias(name) if column == name else column for column in df.columns])
    

    1

    You build a dictionary

    2

    Define a dataframe and select the column 3 month_lst[int(x%12) - 1] This Is the important part, where i only assume the input is a float, and return a value from the list

    If you need more help, pleas add a comment,