apache-spark, pyspark, apache-spark-sql, timestamp, dayofweek

How to get the weekday from a date string using PySpark


I have a DataFrame log_df with raw log lines in a single value column (screenshot of the data omitted).

I generate a new DataFrame from it with the following code:

from pyspark.sql.functions import regexp_extract

# Pull structured columns out of the raw Apache log line in `value`.
split_log_df = log_df.select(
    regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
    regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
    regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
    regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
    regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_log_df.show(10, truncate=False)

The new DataFrame has the columns host, timestamp, path, status, and content_size (screenshot omitted).

I need another column showing the day of the week. What would be the most elegant way to create it? Ideally just by adding a UDF-like expression to the select.

Update: my question is different from the one linked in the comments. I need to compute the weekday from a string column in log_df, not from an existing timestamp column, so this is not a duplicate.


Solution

  • I finally resolved the question myself; here is the complete solution (a sketch of these steps follows the list):

    1. import date_format, datetime, and DateType
    2. first, modify the regexp to extract only the date portion, e.g. 01/Jul/1995
    3. convert the 01/Jul/1995 string to DateType
    4. create a UDF dayOfWeek that returns the weekday in brief format (Mon, Tue, ...)
    5. use the UDF to convert the DateType 01/Jul/1995 to its weekday, which is Sat (screenshot of the result omitted)
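    A minimal sketch of steps 2 through 5. The exact calls are my assumption of what the answer did, not a verbatim copy: to_date with a format argument requires Spark 2.2+, and the column names date_str and date are illustrative.

    from pyspark.sql.functions import regexp_extract, to_date, udf
    from pyspark.sql.types import StringType

    # Step 2: extract only the date portion of the log line, e.g. 01/Jul/1995.
    dated_df = log_df.select(
        regexp_extract('value', r'^.*\[(\d\d/\w{3}/\d{4})', 1).alias('date_str'))

    # Step 3: convert the string to DateType (format argument needs Spark 2.2+).
    dated_df = dated_df.withColumn('date', to_date('date_str', 'dd/MMM/yyyy'))

    # Step 4: a UDF returning the abbreviated weekday name (Mon, Tue, ...).
    # Spark passes DateType values to Python UDFs as datetime.date objects.
    def day_of_week(d):
        return d.strftime('%a') if d is not None else None

    day_of_week_udf = udf(day_of_week, StringType())

    # Step 5: 01/Jul/1995 comes out as Sat.
    dated_df.withColumn('weekday', day_of_week_udf('date')).show(5, truncate=False)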

    I am not satisfied with my solution, as it feels so roundabout; I would appreciate it if anyone could come up with a more elegant solution. Thank you in advance.
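    For what it's worth, a tidier alternative (my suggestion, not part of the original answer): the built-in date_format function can produce the abbreviated day name directly from a date column, so no UDF is needed. It reuses the dated_df sketched above.

    from pyspark.sql.functions import date_format

    # The 'E' pattern yields the abbreviated day-of-week name, e.g. Sat for 01/Jul/1995.
    dated_df.withColumn('weekday', date_format('date', 'E')).show(5, truncate=False)

    This keeps everything in native Spark SQL expressions, which also avoids the serialization overhead of a Python UDF.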