
Quick way to delete empty column [PySpark]


Is there an easy way to drop empty columns of a huge dataset (300+ columns, >100k rows) in PySpark, similar to df.dropna(axis=1, how='all') in pandas?


Solution

  • Yes, you can simply use the answer from here. I've added a threshold parameter to it:

    import pandas as pd
    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # Sample data
    df = pd.DataFrame({'x1': ['a', '1', '2'],
                       'x2': ['b', None, '2'],
                       'x3': ['c', '0', '3']})
    df = spark.createDataFrame(df)
    df.show()
    
    def drop_null_columns(df, threshold=0):
        """
        Drops every column whose null count exceeds ``threshold``.
        :param df: A PySpark DataFrame
        :param threshold: maximum number of nulls a column may contain and still be kept
        """
        # Count the nulls in every column in a single aggregation pass
        null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
        to_drop = [k for k, v in null_counts.items() if v > threshold]
        df = df.drop(*to_drop)
        return df
    
    # Drops column x2, because it contains null values
    drop_null_columns(df).show()
    

    Output

    +---+---+
    | x1| x3|
    +---+---+
    |  a|  c|
    |  1|  0|
    |  2|  3|
    +---+---+
    

    Column x2 has been dropped.

    Note that a column is dropped only when its null count exceeds the threshold, so to drop just the columns that are entirely null (the behaviour of df.dropna(axis=1, how='all') in pandas) you can pass threshold=df.count() - 1.
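
    A minimal sketch of that all-null case, reusing the drop_null_columns function and the sample DataFrame above (the extra x4 column and the df_with_empty name are only for illustration):

    # Add a column that is null in every row
    df_with_empty = df.withColumn('x4', F.lit(None).cast('string'))
    
    # With threshold = row count - 1, a column is dropped only if *all*
    # of its values are null, like pandas' dropna(axis=1, how='all')
    row_count = df_with_empty.count()
    drop_null_columns(df_with_empty, threshold=row_count - 1).show()  # only x4 is dropped

    Here x2 survives because it has only one null out of three rows, while x4 (all nulls) is removed.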