I have a column with bit strings in a Spark DataFrame df. The column is of string type:
10001010000000100000000000000000
10001010000000100000000100000000
Is there a simple and efficient way to create a new column "no_of_ones"
that counts the occurrences of ones using the DataFrame API? Using RDDs I can do map(lambda x: x.count('1'))
(PySpark).
Additionally, how can I retrieve a list with the positions of the ones?
One way I can think of is to remove all zeros and then take the length of the remaining string.
df.show
+--------------------+
| bytestring|
+--------------------+
|10001010000000100...|
|10001010000000100...|
+--------------------+
df.withColumn("no_of_ones" , length(regexp_replace($"bytestring", "0", "")) ).show
+--------------------+----------+
| bytestring|no_of_ones|
+--------------------+----------+
|10001010000000100...| 4|
|10001010000000100...| 5|
+--------------------+----------+
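For the second part of the question (the positions of the ones), I'm not aware of a single built-in function that does it directly, but a small UDF works. This is a sketch assuming the same bytestring column; the names onePositions and one_positions are just made up for the example, and the indices are zero-based:

import org.apache.spark.sql.functions.udf

// Collect the zero-based index of every '1' character in the bit string
val onePositions = udf { bits: String =>
  bits.zipWithIndex.collect { case ('1', idx) => idx }.toSeq
}

df.withColumn("one_positions", onePositions($"bytestring")).show(false)

With the sample data above, the first row would yield [0, 4, 6, 14], which is consistent with its no_of_ones value of 4. If you'd rather avoid a UDF, something along the lines of posexplode(split($"bytestring", "")) plus a filter on the exploded value should also work, at the cost of a groupBy to re-assemble the list (and note that the behavior of split with an empty pattern can vary by Spark version).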