pandas, apache-spark, apache-spark-sql

Convert a Spark DataFrame to Pandas DF


Is there a way to convert a Spark DF (not RDD) to a Pandas DF?

I tried the following:

var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions?


Solution

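  • The following should work. The NameError most likely occurs because some_df was defined in a Scala paragraph, and variables defined in Scala are not visible by name inside the %pyspark interpreter. Define the DataFrame in PySpark instead.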

    Sample DataFrame

        some_df = sc.parallelize([
            ("A", "no"),
            ("B", "yes"),
            ("B", "yes"),
            ("B", "no"),
        ]).toDF(["user_id", "phone_number"])
    

    Converting DataFrame to Pandas DataFrame

        pandas_df = some_df.toPandas()