python, dataframe, pyspark, apache-spark-sql, data-profiling

Not able to perform operations on resulting dataframe after "join" operation in PySpark


df=spark.read.csv('data.csv',header=True,inferSchema=True)
rule_df=spark.read.csv('job_rules.csv',header=True)
query_df=spark.read.csv('rules.csv',header=True)

join_df=rule_df.join(query_df,rule_df.Rule==query_df.Rule,"inner").drop(rule_df.Rule).show()
print(join_df.collect().columns)

Here I have created three dataframes: df, rule_df and query_df. I've performed an inner join on rule_df and query_df and stored the resulting dataframe in join_df. However, when I try to simply print the columns of join_df, I get the following error:

AttributeError: 'NoneType' object has no attribute 'columns' 

The resulting dataframe is not behaving like one; I'm not able to perform any dataframe operations on it.

I'm guessing this error occurs when you try to access an attribute on an object that doesn't exist, but that shouldn't be the case here, as I'm able to view the resulting join_df.

Do I need to perform a different kind of join to avoid this error? It might be a silly mistake, but I'm stumped trying to figure out what it is. Please help!


Solution

  • You are making two mistakes here.

    First, you assign the return value of .show() to join_df. .show() only prints the DataFrame to the console and returns None, so join_df ends up being None.

    Second, you call .collect(), which returns a plain Python list of Row objects, and a list has no .columns attribute. You need to call .columns directly on the DataFrame; see the sketch after the fixed code below.

    This should work:

    join_df = rule_df.join(query_df, rule_df.Rule == query_df.Rule, "inner").drop(rule_df.Rule)
    print(join_df.columns)
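
    If you want to verify the return types for yourself, here is a minimal sketch (assuming the same two CSV files from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    rule_df = spark.read.csv('job_rules.csv', header=True)
    query_df = spark.read.csv('rules.csv', header=True)

    join_df = rule_df.join(query_df, rule_df.Rule == query_df.Rule, "inner").drop(rule_df.Rule)

    # .show() only prints the DataFrame to the console and returns None
    result = join_df.show()
    print(type(result))   # <class 'NoneType'>

    # .collect() returns a plain Python list of Row objects
    rows = join_df.collect()
    print(type(rows))     # <class 'list'>

    # .columns is an attribute of the DataFrame itself
    print(join_df.columns)

    In short: keep the assignment and the display as separate statements, and only call .show() or .collect() when you actually want the output, so the join_df reference stays a DataFrame.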