I have a PySpark dataframe that I'd like to get the row count for. Once I have the row count, I'd like to add it to the top-left corner of the dataframe, as shown below.
I've tried creating the row first and doing a union of the empty row and the dataframe, but the empty row gets overwritten. I've also tried adding the count as a literal in a column, but I'm having trouble nulling out the rest of that column as well as the rest of the row. Any advice?
dataframe:
col1 | col2 | col3 | ... | col13 |
---|---|---|---|---|
string | string | timest | ... | int |
for a few rows.
desired output:
row_count | col1 | col2 | col3 | ... | col13 |
---|---|---|---|---|---|
numofrows | | | | | |
| | string | string | timest | ... | int |
So the row count would sit where an otherwise empty row and empty column meet.
Assuming `df` is your dataframe:
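from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`
cnt = df.count()
columns_list = df.columns
# Add an all-null row_count column (long, since count() can exceed the int range) and move it to the front
df = df.withColumn("row_count", F.lit(None).cast("long")).select("row_count", *columns_list)
schema = df.schema
# One-row dataframe: the count in row_count, nulls everywhere else
cnt_line = spark.createDataFrame([[cnt] + [None] * len(columns_list)], schema=schema)
# Put the count row first; union matches columns by position
cnt_line.union(df).show()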