Tags: python, apache-spark, pyspark

How to check if a PySpark DataFrame is empty QUICKLY


I'm trying to check if my PySpark DataFrame is empty, and I have tried different ways to do that, like:

df.count() == 0
df.rdd.isEmpty()
df.first().isEmpty()

But all these solutions are too slow, taking up to 2 minutes to run. How can I quickly check if my PySpark DataFrame is empty or not? Does anyone have a solution for that?

Thank you in advance!


Solution

  • The best way to check whether your DataFrame is empty, after reading a table or at any other point, is to apply limit(1) first; this reduces the number of rows to one and speeds up whichever check you run afterwards:

    df.limit(1).count() == 0
    df.limit(1).rdd.isEmpty()
    len(df.limit(1).take(1)) == 0
    
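    For reference, here is a minimal runnable sketch wrapping that pattern in a helper (the Parquet path is a hypothetical placeholder):

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.getOrCreate()

    def is_empty(df: DataFrame) -> bool:
        # limit(1) caps the scan at a single row, so take(1) only has
        # to materialize one record instead of the whole DataFrame.
        return len(df.limit(1).take(1)) == 0

    # Hypothetical usage; replace the path with your own data.
    df = spark.read.parquet("/tmp/events.parquet")
    print(is_empty(df))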

    If you are just doing a data-dependency check on a table and only want to know whether the table has data, it is best to apply limit 1 while reading from the table itself, e.g.

    df = spark.sql("select * from <table> limit 1")
    
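    A quick sketch of such a dependency check (the table name my_db.my_table is a made-up placeholder, and an existing SparkSession named spark is assumed):

    # Probe the table with LIMIT 1 so Spark never scans more than one row.
    probe = spark.sql("select * from my_db.my_table limit 1")
    if len(probe.take(1)) == 0:
        print("table is empty, skipping downstream job")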

    With that said about the efficiency of the empty check, the fastest of these approaches is .rdd.isEmpty(), compared to count(), first(), or take(1).

    Also, if you look at the backend implementation of first() and take(1), they are built entirely on top of collect(), which is costly and should only be used when strictly necessary.
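    A paraphrased sketch of that call chain from pyspark/sql/dataframe.py (simplified, not verbatim):

    # first() -> head() -> take(1) -> limit(1).collect()
    def take(self, num):
        return self.limit(num).collect()   # take() is limit() + collect()

    def head(self, n=None):
        if n is None:
            rs = self.head(1)
            return rs[0] if rs else None   # None when the DataFrame is empty
        return self.take(n)                # head() delegates to take()

    def first(self):
        return self.head()                 # first() is just head()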

    The timings below are based on reading a Parquet file with 2,390,491 records and 138 columns.

    >>> df.count()
    2390491
    >>> len(df.columns)
    138
    

    Time taken by count(), take(), and first() (shown as a screenshot in the original answer):

    Note: these are the times taken after applying .limit(1) to the DataFrame when checking whether it is empty.

    Lastly, df.rdd.isEmpty() took the least time, 29 ms, after reducing the number of rows to 1.
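    If you want to reproduce timings like these on your own data, here is a rough sketch (absolute numbers will vary with cluster, file format, and caching):

    import time

    def timed(label, fn):
        # Crude wall-clock timing; a real benchmark should warm up
        # the JVM and average several runs.
        start = time.time()
        fn()
        print(f"{label}: {(time.time() - start) * 1000:.0f} ms")

    small = df.limit(1)  # reduce to one row before every check
    timed("count()", lambda: small.count() == 0)
    timed("take(1)", lambda: len(small.take(1)) == 0)
    timed("first()", lambda: small.first() is None)
    timed("rdd.isEmpty()", lambda: small.rdd.isEmpty())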

    Hope that helps..!! :)

    UPDATE: If you are using Spark >= 3.3, you can now directly use:

    df.isEmpty()
    

    This is the fastest of all the options for checking whether a DataFrame is empty in Spark >= 3.3.
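    If your code has to run on both older and newer clusters, a version-aware sketch (assuming pyspark.__version__ follows the usual major.minor.patch form):

    import pyspark

    # DataFrame.isEmpty() only exists in Spark >= 3.3; fall back otherwise.
    if tuple(int(x) for x in pyspark.__version__.split(".")[:2]) >= (3, 3):
        empty = df.isEmpty()
    else:
        empty = len(df.limit(1).take(1)) == 0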