I'm trying to get the day of the week in int format from it saying Mon, Tues, Wed, etc.
I'm using this code right now:
from pyspark.sql.functions import date_format
df_ex1 = df.withColumn("day", date_format('Timestamp', 'E'))
Output:
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|Timestamp |Open|High|Low |Close|Volume_(BTC)|Volume_(Currency)|Weighted_Price|Year|Month|day|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
|2011-12-31 09:52:00|4.39|4.39|4.39|4.39 |0.45558086 |2.0 |4.39 |2011|12 |Sat|
|2011-12-31 09:53:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
|2011-12-31 09:54:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
|2011-12-31 09:55:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
|2011-12-31 09:56:00|NaN |NaN |NaN |NaN |NaN |NaN |NaN |2011|12 |Sat|
+-------------------+----+----+----+-----+------------+-----------------+--------------+----+-----+---+
only showing top 5 rows
There are two Spark native ways: dayofweek
and weekday
.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('2022-08-01', 'Mon'),
('2022-08-02', 'Tue'),
('2022-08-03', 'Wed'),
('2022-08-04', 'Thu'),
('2022-08-05', 'Fri'),
('2022-08-06', 'Sat'),
('2022-08-07', 'Sun')],
['Timestamp', 'day'])
df = df.withColumns({
'day_number_1': F.dayofweek('Timestamp'),
'day_number_2': F.weekday('Timestamp'), # Spark 3.5+
'day_number_3': F.expr("weekday(Timestamp)"), # Spark 2.4+
})
df.show()
# +----------+---+------------+------------+------------+
# | Timestamp|day|day_number_1|day_number_2|day_number_3|
# +----------+---+------------+------------+------------+
# |2022-08-01|Mon| 2| 0| 0|
# |2022-08-02|Tue| 3| 1| 1|
# |2022-08-03|Wed| 4| 2| 2|
# |2022-08-04|Thu| 5| 3| 3|
# |2022-08-05|Fri| 6| 4| 4|
# |2022-08-06|Sat| 7| 5| 5|
# |2022-08-07|Sun| 1| 6| 6|
# +----------+---+------------+------------+------------+