I have data in which it has one column named category. Categories are chair and table.
I have tried to do filtering, derived column but derived column give me additional column but i want it at row level.
As the remaining values doesn't exists in the column, Dataflow can't give the count for those.
So, you can follow below approach to achieve your requirement. In this approach, you need an array containing values of all categories like ['chair','table','bowl','plat']
.
This is starting data that I am taking in the dataflow source:
category
chair
table
chair
table
table
chair
chair
table
First use an aggregate transformation to get the category and count columns.
Give category
column in the Group by section and create a new column count
with count(category)
expression in the aggregate section like below.
It will give the result with existing values.
Now, take another aggregate transformation after this. In this, leave group by section empty and in the aggregate section, create a column for an array of existing categories using collect(category)
expression.
It will produce below results.
Next, take a derived column transformation and create two columns using the above original
array column and the array of all categories.
rem_items - toString(unfold(except(['chair','table','bowl','plat'],original)))
count2 - 0
This will first generate an array of non-existing categories and unfolds that array as rows. Those will be converted to string type column like below. Coming to count2
columns, whatever the values that are coming from this transformation, the count for those will be 0
only.
Now, use a select transformation and remove the extra original
column.
As a result, select transformation will only contains two columns.
Now, create a New branch after the aggregate1 transformation and add a Union transformation to it. Use the above select1 transformation as another input and do the union by position.
This will give the desired result.
Add your sink to the union transformation and you can run the dataflow.