apache-sparkdatedatetimepysparktimestamp

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?


I have Spark DataFrame with take(5) top rows as follows:

import datetime

[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]

It's schema is defined as:

elevDF.printSchema()

root
 |-- date: timestamp (nullable = true)
 |-- hour: long (nullable = true)
 |-- value: double (nullable = true)

How do I get the Year, Month, Day values from the column 'date' of timestamp data type?


Solution

  • Since Spark 1.5 you can use a number of date processing functions:

    import datetime
    from pyspark.sql.functions import year, month, dayofmonth
    
    elevDF = sc.parallelize([
        (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
        (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
    ]).toDF(["date", "hour", "value"])
    
    elevDF.select(
        year("date").alias('year'), 
        month("date").alias('month'), 
        dayofmonth("date").alias('day')
    ).show()
    # +----+-----+---+
    # |year|month|day|
    # +----+-----+---+
    # |1984|    1|  1|
    # |1984|    1|  1|
    # |1984|    1|  1|
    # |1984|    1|  1|
    # |1984|    1|  1|
    # +----+-----+---+
    

    You can use simple map as with any other RDD:

    elevDF = sqlContext.createDataFrame(sc.parallelize([
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))
    
    (elevDF
        .map(lambda (date, hour, value): (date.year, date.month, date.day))
        .collect())
    

    and the result is:

    [(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
    

    By the way, datetime.datetime stores the hour anyway so keeping it separately seems to be a waste of memory.