python · pandas · dataframe · apache-spark

How to select/filter data in a PySpark DataFrame?


I have worked with pandas DataFrames and need to do some basic selecting/filtering of data, but in a PySpark DataFrame. I'm running the script as an AWS Glue job. Do I need to convert the PySpark DataFrame to a pandas DataFrame to do basic operations like filtering and selecting? The pandas version is below.

My question: is there any difference between a pandas DataFrame and a Glue DynamicFrame (or PySpark DataFrame) in how we select/filter data?

pandas version

import pandas as pd

df = pd.read_csv('abc.csv')

# combine two column values into a new column
df['combined'] = df['first_name'] + ' ' + df['last_name']

filter_result = df[df['martial_status'] == 'married']
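For reference, the same pandas pipeline run end to end on a couple of made-up rows (the sample data below is hypothetical, standing in for abc.csv):

```python
import pandas as pd

# Hypothetical sample rows standing in for abc.csv
df = pd.DataFrame([
    {'first_name': 'Ann', 'last_name': 'Lee', 'martial_status': 'married'},
    {'first_name': 'Bob', 'last_name': 'Ray', 'martial_status': 'single'},
])

# Combine the two name columns, then keep only married rows
df['combined'] = df['first_name'] + ' ' + df['last_name']
filter_result = df[df['martial_status'] == 'married']

print(filter_result['combined'].tolist())  # ['Ann Lee']
```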

...
from pyspark.context import SparkContext
from awsglue.context import GlueContext


ctx = SparkContext.getOrCreate()
glue_ctx = GlueContext(ctx)


dynamic_frm = glue_ctx.create_dynamic_frame_from_options(
  connection_type='s3',
  connection_options={'paths': ['s3://.../abc.csv']},
  format='csv',
  format_options={'withHeader': True}  # use the CSV header row as column names
)



Solution

  • Here is the equivalent code if you decide to switch to a PySpark DataFrame:

    from pyspark.sql.functions import concat_ws, col
    
    # Convert DynamicFrame to DataFrame
    spark_df = dynamic_frm.toDF()
    
    # Combine first_name and last_name into a single column
    spark_df = spark_df.withColumn('combined', concat_ws(' ', col('first_name'), col('last_name')))
    
    filter_result = spark_df.filter(col('martial_status') == 'married')
    

    Here is the solution if you want to stay with a DynamicFrame; in that case you need to use the Map and Filter transforms:

    from awsglue.transforms import Map, Filter
    
    def add_combined_name(record):
        record['combined'] = f"{record['first_name']} {record['last_name']}"
        return record
    
    def filter_married(record):
        return record['martial_status'] == 'married'
    
    
    dynamic_frm = Map.apply(frame=dynamic_frm, f=add_combined_name)
    filtered_dynamic_frm = Filter.apply(frame=dynamic_frm, f=filter_married)
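Since Map.apply and Filter.apply invoke these functions once per record, you can sanity-check them with plain dicts before running the Glue job. A minimal sketch (the sample records below are made up):

```python
# Same per-record functions the Glue transforms would call
def add_combined_name(record):
    record['combined'] = f"{record['first_name']} {record['last_name']}"
    return record

def filter_married(record):
    return record['martial_status'] == 'married'

# Hypothetical sample records standing in for the CSV rows
records = [
    {'first_name': 'Ann', 'last_name': 'Lee', 'martial_status': 'married'},
    {'first_name': 'Bob', 'last_name': 'Ray', 'martial_status': 'single'},
]

# Apply the map, then the filter, exactly as Glue would per record
mapped = [add_combined_name(r) for r in records]
kept = [r for r in mapped if filter_married(r)]

print([r['combined'] for r in kept])  # ['Ann Lee']
```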