Tags: apache-spark, pyspark, apache-spark-sql, week-number

weekofyear() returning seemingly incorrect results for January 1


I'm not quite sure why my code gives 52 as the answer for weekofyear("01/JAN/2017").

Does anyone have a possible explanation for this? Is there a better way to do this?

from pyspark.sql import SparkSession, functions
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName('weekOfYear').getOrCreate()

df = spark.createDataFrame(
    [(1, "01/JAN/2017"), (2, "15/FEB/2017")], ("id", "date"))

df.show()
+---+-----------+
| id|       date|
+---+-----------+
|  1|01/JAN/2017|
|  2|15/FEB/2017|
+---+-----------+

Calculate the week of the year:

df = df.withColumn("weekofyear", functions.weekofyear(to_date(df["date"], "dd/MMM/yyyy")))

df.printSchema()

root
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- weekofyear: integer (nullable = true)

df.show()

The 'error' is visible below:

+---+-----------+----------+
| id|       date|weekofyear|
+---+-----------+----------+
|  1|01/JAN/2017|        52|
|  2|15/FEB/2017|         7|
+---+-----------+----------+

Solution

  • It seems like weekofyear() will only return 1 for January 1st if the day of the week is Monday through Thursday.

    To confirm, I created a DataFrame with all "01/JAN/YYYY" from 1900 to 2018:

    df = sqlCtx.createDataFrame(
        [(1, "01/JAN/{y}".format(y=year),) for year in range(1900,2019)],
        ["id", "date"]
    )
    

    Now let's convert it to a date, get the day of the week, and count the values for weekofyear():

    import pyspark.sql.functions as f
    df.withColumn("d", f.to_date(f.from_unixtime(f.unix_timestamp('date', "dd/MMM/yyyy"))))\
        .withColumn("weekofyear", f.weekofyear("d"))\
        .withColumn("dayofweek", f.date_format("d", "E"))\
        .groupBy("dayofweek", "weekofyear")\
        .count()\
        .show()
    #+---------+----------+-----+
    #|dayofweek|weekofyear|count|
    #+---------+----------+-----+
    #|      Sun|        52|   17|
    #|      Mon|         1|   18|
    #|      Tue|         1|   17|
    #|      Wed|         1|   17|
    #|      Thu|         1|   17|
    #|      Fri|        53|   17|
    #|      Sat|        53|    4|
    #|      Sat|        52|   12|
    #+---------+----------+-----+
    

    Note: I am using Spark 2.1, where to_date() does not accept a format argument, so I had to use the method described in this answer to convert the string to a date.
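    If you are on Spark 2.2 or later (an assumption about your environment; the question's own code already passes a format to to_date()), the unix_timestamp/from_unixtime round trip is unnecessary, because to_date() gained an optional format argument in 2.2. A minimal sketch of the same experiment:

    import pyspark.sql.functions as f

    # Assumes Spark 2.2+, where to_date() accepts the format string directly.
    df.withColumn("d", f.to_date("date", "dd/MMM/yyyy"))\
        .withColumn("weekofyear", f.weekofyear("d"))\
        .withColumn("dayofweek", f.date_format("d", "E"))\
        .groupBy("dayofweek", "weekofyear")\
        .count()\
        .show()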

    As the counts above show, weekofyear() only returns 1 for January 1st when it falls on a Monday, Tuesday, Wednesday, or Thursday.


    Update

    This behavior is consistent with the ISO 8601 week date definition: week 1 is the week containing the year's first Thursday, so January 1st only belongs to week 1 when it falls on a Monday, Tuesday, Wednesday, or Thursday. Otherwise it belongs to the last week (52 or 53) of the previous ISO year.
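    As a quick cross-check outside Spark, Python's standard library implements the same ISO 8601 week numbering:

    from datetime import date

    # January 1st 2017 was a Sunday, so it belongs to the last ISO week of 2016.
    print(tuple(date(2017, 1, 1).isocalendar()))   # (2016, 52, 7)
    # February 15th 2017 was a Wednesday in ISO week 7, matching Spark's output.
    print(tuple(date(2017, 2, 15).isocalendar()))  # (2017, 7, 3)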