pyspark

How to optimize the PySpark toPandas() with type hints


I have not seen this warning in PySpark before:

The conversion of DecimalType columns is inefficient and may take a long time. Column names: [PVPERUSER] If those columns are not necessary, you may consider dropping them or converting to primitive types before the conversion.

What is the best way to handle this? Is it controlled by a parameter passed into toPandas(), or do I need to cast the DataFrame's columns in a particular way first?

My code is a simple PySpark conversion to pandas:

df = data.toPandas()

Solution

  • Cast the DecimalType column to a primitive type (float or double) before calling toPandas(). Try this:

    df = data.select(data.PVPERUSER.cast('float'), data.another_column).toPandas()

    The warning appears because each DecimalType value is converted into a Python Decimal object one at a time, which is slow for large frames; casting to a primitive numeric type lets the conversion use a fast vectorized path instead.