python · pandas · dataframe · apache-spark

How to select/filter data in a PySpark DataFrame?


I have worked with pandas DataFrames and need to do some basic selecting/filtering of data, but in a PySpark DataFrame. I'm running the script as an AWS Glue job. Do I need to convert the PySpark DataFrame to a pandas DataFrame to do basic operations like filtering and selecting? The pandas version is below.

My question: is there any difference between a pandas DataFrame and a Glue DynamicFrame (or PySpark DataFrame) in how we select/filter data?

pandas version

import pandas as pd

df = pd.read_csv('abc.csv')

# combine two column values into a new column
df['combined'] = df['first_name'] + ' ' + df['last_name']

filter_result = df[df['martial_status'] == 'married']
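For reference, the same pandas pipeline run end to end on a couple of made-up rows (the sample data below is hypothetical, standing in for abc.csv):

```python
import pandas as pd

# Hypothetical sample rows standing in for abc.csv
df = pd.DataFrame([
    {'first_name': 'Ann', 'last_name': 'Lee', 'martial_status': 'married'},
    {'first_name': 'Bob', 'last_name': 'Ray', 'martial_status': 'single'},
])

# Combine the two name columns, then keep only married rows
df['combined'] = df['first_name'] + ' ' + df['last_name']
filter_result = df[df['martial_status'] == 'married']

print(filter_result['combined'].tolist())  # ['Ann Lee']
```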

...
from pyspark.context import SparkContext
from awsglue.context import GlueContext


ctx = SparkContext.getOrCreate()
glue_ctx = GlueContext(ctx)


dynamic_frm = glue_ctx.create_dynamic_frame_from_options(
  connection_type='s3',
  connection_options={'paths': ['s3://.../abc.csv']},
  format='csv',
  format_options={'withHeader': True}  # use the CSV header row as column names
)



Solution

  • Here is the equivalent code if you decide to switch to a PySpark DataFrame:

    from pyspark.sql.functions import concat_ws, col
    
    # Convert DynamicFrame to DataFrame
    spark_df = dynamic_frm.toDF()
    
    # Combine first_name and last_name into a single column
    spark_df = spark_df.withColumn('combined', concat_ws(' ', col('first_name'), col('last_name')))
    
    filter_result = spark_df.filter(col('martial_status') == 'married')
    

    Here is the solution if you want to stay with a DynamicFrame; in that case you need to use the Map and Filter transforms:

    from awsglue.transforms import Map, Filter
    
    def add_combined_name(record):
        record['combined'] = f"{record['first_name']} {record['last_name']}"
        return record
    
    def filter_married(record):
        return record['martial_status'] == 'married'
    
    
    dynamic_frm = Map.apply(frame=dynamic_frm, f=add_combined_name)
    filtered_dynamic_frm = Filter.apply(frame=dynamic_frm, f=filter_married)
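Since Map.apply and Filter.apply invoke these functions once per record, you can sanity-check them with plain dicts before running the Glue job. A minimal sketch (the sample records below are made up):

```python
# Same per-record functions the Glue transforms would call
def add_combined_name(record):
    record['combined'] = f"{record['first_name']} {record['last_name']}"
    return record

def filter_married(record):
    return record['martial_status'] == 'married'

# Hypothetical sample records standing in for the CSV rows
records = [
    {'first_name': 'Ann', 'last_name': 'Lee', 'martial_status': 'married'},
    {'first_name': 'Bob', 'last_name': 'Ray', 'martial_status': 'single'},
]

# Apply the map, then the filter, exactly as Glue would per record
mapped = [add_combined_name(r) for r in records]
kept = [r for r in mapped if filter_married(r)]

print([r['combined'] for r in kept])  # ['Ann Lee']
```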