I have worked with pandas DataFrames and need to do some basic selecting/filtering of data, but in a PySpark DataFrame. I'm running the script as an AWS Glue job. Do I need to convert the PySpark DataFrame to a pandas DataFrame to do basic operations like filtering and selecting? The pandas version is below.
My question: is there any difference between a pandas DataFrame and a PySpark DataFrame/DynamicFrame in how we select/filter data?
pandas version
import pandas as pd
df = pd.read_csv('abc.csv')
#select and merge column values
df['combined'] = df['first_name'] + ' ' + df['last_name']
filter_result = df[df['martial_status'] == 'married']
...
from pyspark.context import SparkContext
from awsglue.context import GlueContext

ctx = SparkContext.getOrCreate()
glue_ctx = GlueContext(ctx)

# Read the CSV from S3; withHeader makes Glue treat the first row as column names
dynamic_frm = glue_ctx.create_dynamic_frame_from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://.../abc.csv']},
    format='csv',
    format_options={'withHeader': True}
)
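Side note: without withHeader the CSV columns typically come out as col0, col1, ..., so the name-based lookups below would fail. You can confirm the schema was inferred correctly with the DynamicFrame's printSchema:
# Should list first_name, last_name, martial_status, ...
dynamic_frm.printSchema()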
No, you don't have to go through pandas for this; both selecting and filtering are available natively. Here's the equivalent code if you decide to switch to a PySpark DataFrame:
from pyspark.sql.functions import concat_ws, col

# Convert DynamicFrame to a Spark DataFrame
spark_df = dynamic_frm.toDF()

# Select and merge column values (same as df['first_name'] + ' ' + df['last_name'])
spark_df = spark_df.withColumn('combined', concat_ws(' ', col('first_name'), col('last_name')))

# Row filtering (same as the boolean-mask indexing in pandas)
filter_result = spark_df.filter(col('martial_status') == 'married')
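If the rest of the Glue job expects a DynamicFrame (e.g. to feed a Glue sink), you can convert back. A minimal sketch, assuming the glue_ctx from your question; the name 'filtered' is chosen here just for illustration:
from awsglue.dynamicframe import DynamicFrame

# Wrap the filtered Spark DataFrame back into a DynamicFrame
# so it can flow into downstream Glue transforms/writers
filtered_dynamic_frm = DynamicFrame.fromDF(filter_result, glue_ctx, 'filtered')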
Here's the solution if you want to stay with a DynamicFrame; in that case you need to work with the Glue Map and Filter transforms:
from awsglue.transforms import Map, Filter

def add_combined_name(record):
    # Merge the two name columns into a new 'combined' field
    record['combined'] = f"{record['first_name']} {record['last_name']}"
    return record

def filter_married(record):
    # Keep only records whose martial_status is 'married'
    return record['martial_status'] == 'married'

dynamic_frm = Map.apply(frame=dynamic_frm, f=add_combined_name)
filtered_dynamic_frm = Filter.apply(frame=dynamic_frm, f=filter_married)
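Either way you end up with a DynamicFrame you can hand straight to a Glue writer. A minimal write-back sketch; the output path below is a placeholder, not from your job:
# Write the filtered DynamicFrame back to S3 as CSV
glue_ctx.write_dynamic_frame.from_options(
    frame=filtered_dynamic_frm,
    connection_type='s3',
    connection_options={'path': 's3://your-output-bucket/filtered/'},
    format='csv'
)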