dataframe, apache-spark, pyspark, apache-spark-sql, count

Count particular characters within a column using the Spark DataFrame API


I have a column of bits in a Spark dataframe df. The column is of string type:

10001010000000100000000000000000
10001010000000100000000100000000

Is there a simple and effective way to create a new column "no_of_ones" that counts the frequency of ones, using the DataFrame API? With RDDs I can use map(lambda x: x.count('1')) (PySpark).
Additionally, how can I retrieve a list with the positions of the ones?
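
For reference, the RDD version I have working looks roughly like this (a minimal sketch; the column name bytestring is an assumption for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Example data matching the bit strings above; "bytestring" is an assumed column name
    df = spark.createDataFrame(
        [("10001010000000100000000000000000",),
         ("10001010000000100000000100000000",)],
        ["bytestring"],
    )

    # Count '1' characters row by row on the underlying RDD
    ones = df.rdd.map(lambda row: row["bytestring"].count("1"))
    print(ones.collect())  # [4, 5]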


Solution

  • One way I can think of is to remove all zeroes and then count the length of the remaining string.

    df.show
    +--------------------+
    |          bytestring|
    +--------------------+
    |10001010000000100...|
    |10001010000000100...|
    +--------------------+
    
    
    // The $"..." column syntax requires the session implicits in scope
    import spark.implicits._
    import org.apache.spark.sql.functions.{length, regexp_replace}
    
    // Strip every '0'; the length of what remains is the count of '1's
    df.withColumn("no_of_ones", length(regexp_replace($"bytestring", "0", ""))).show
    +--------------------+----------+
    |          bytestring|no_of_ones|
    +--------------------+----------+
    |10001010000000100...|         4|
    |10001010000000100...|         5|
    +--------------------+----------+
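
    A PySpark equivalent of the same trick uses the built-in length and regexp_replace functions (a sketch, assuming the column is named bytestring as in the example above):

    from pyspark.sql import functions as F

    # Strip every '0'; the length of the remainder is the number of '1's
    df = df.withColumn(
        "no_of_ones",
        F.length(F.regexp_replace(F.col("bytestring"), "0", "")),
    )

    For the second part of the question, the positions of the ones, one option is a small UDF; this is not part of the answer above, just a minimal sketch:

    from pyspark.sql import functions as F, types as T

    # 0-based positions of every '1' in the bit string
    one_positions = F.udf(
        lambda s: [i for i, c in enumerate(s) if c == "1"],
        T.ArrayType(T.IntegerType()),
    )

    df.withColumn("one_positions", one_positions(F.col("bytestring"))).show(truncate=False)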