I have a time series data frame, and I count the number of rows for each date with the following code:
count_dates=df_inner.groupBy('dates').count().orderBy('dates')
This code gives me a new data frame with two columns:
dates       count
2021-03-01  5
2021-03-02  44
2021-03-03  3
2021-03-04  2
2021-03-05  1
This is another data frame:
name  2021-03-01  2021-03-02  2021-03-03  2021-03-04  2021-03-05
A     40          42          30          1           8
B     80          3           54          2           7
C     10          0           52          2           8
In other words, each date in the first dataset is a column in the second dataset. How can I divide each date column of the second dataframe by the corresponding count value from the first df?
Desired output:
name  2021-03-01  2021-03-02  2021-03-03  2021-03-04  2021-03-05
A     40/5        42/44       30/3        1/2         8/1
B     80/5        3/44        54/3        2/2         7/1
C     10/5        0/44        52/3        2/2         8/1
Try the following.
Example dfs:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
[('2021-03-01', 5),
('2021-03-02', 44),
('2021-03-03', 3),
('2021-03-04', 2),
('2021-03-05', 1)],
['dates', 'count']
)
df2 = spark.createDataFrame(
[('A', 40, 42, 30, 1, 8),
('B', 80, 3, 54, 2, 7),
('C', 10, 0, 52, 2, 8)],
['name', '2021-03-01', '2021-03-02', '2021-03-03', '2021-03-04', '2021-03-05']
)
Script:
# Unpivoting df2
cols_to_unpivot = [f"`{c}`, '{c}'" for c in df2.columns if c != 'name']
stack_string = ', '.join(cols_to_unpivot)
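# stack_string now looks like: `2021-03-01`, '2021-03-01', `2021-03-02`, '2021-03-02', ...
# i.e. pairs of (column value, date label) fed to the SQL stack() function below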
df2 = df2.select(
'name',
F.expr(f'stack({len(cols_to_unpivot)}, {stack_string}) as (val, dates)')
)
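# At this point df2 has been reshaped to one row per (name, date) pair,
# roughly the following (row order may vary):
# +----+---+----------+
# |name|val|     dates|
# +----+---+----------+
# |   A| 40|2021-03-01|
# |   A| 42|2021-03-02|
# |   A| 30|2021-03-03|
# |   A|  1|2021-03-04|
# |   A|  8|2021-03-05|
# |   B| 80|2021-03-01|
# ...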
# Joining
df_joined = df2.join(df1, 'dates', 'full')
# Pivoting the result
df = df_joined.groupBy('name').pivot('dates').agg(F.first(F.col('val') / F.col('count')))
df.show()
# +----+----------+-------------------+------------------+----------+----------+
# |name|2021-03-01| 2021-03-02| 2021-03-03|2021-03-04|2021-03-05|
# +----+----------+-------------------+------------------+----------+----------+
# | B| 16.0|0.06818181818181818| 18.0| 1.0| 7.0|
# | C| 2.0| 0.0|17.333333333333332| 1.0| 8.0|
# | A| 8.0| 0.9545454545454546| 10.0| 0.5| 8.0|
# +----+----------+-------------------+------------------+----------+----------+
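As a side note, since the counts frame is tiny (one row per date), an alternative sketch is to skip the unpivot/pivot round trip and divide the wide frame directly, after collecting the counts to the driver. This assumes df2 here refers to the original wide frame (before it was reassigned by the unpivot above) and that df1 comfortably fits in driver memory; df_direct is just an illustrative name:
# Collect per-date counts into a plain Python dict, e.g. {'2021-03-01': 5, ...}
counts = {row['dates']: row['count'] for row in df1.collect()}

# Divide every date column of the wide frame by its count
df_direct = df2.select(
    'name',
    *[(df2[c] / F.lit(counts[c])).alias(c) for c in df2.columns if c != 'name']
)
df_direct.show()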