I am very new to Python and I am working on a task that requires me to store rows that appear fewer than k times in a CSV file.
The CSV file has 3 columns, and the first column is the one I have to consider. So for example, if 'a' appears fewer than 5 times in that column, I need to pick out those records and store them in a new CSV file.
I managed to use pandas' df['column name'].value_counts()
to count how many times each value appears. Now I am stuck on how to actually pick out those less frequent rows and store them. I have some ideas, such as using a for loop to go through all rows and an if statement to check whether the data in the first column appears fewer than k times. The part I can't figure out is how to link the frequency count with the data in that column.
Below is a screenshot of an example of my CSV data.
Any help is much appreciated! Thanks, guys!
You mentioned pandas, and here is a pandas approach:
import pandas as pd
# create sample data frame
data = [
(1, 2, 3),
(1, 4, 5),
(1, 6, 7),
(9, 10, 11),
(9, 12, 13),
]
df = pd.DataFrame(data, columns=('x', 'y', 'z'))
# keep rows whose value in column 'x' appears at most 'ceiling' times
ceiling = 2
low_freq = df['x'].value_counts().loc[lambda x: x <= ceiling].index
# use boolean mask to find rows such that 'x' is in our low_freq list
mask = df['x'].isin(low_freq)
# print results
print(df[mask])
x y z
3 9 10 11
4 9 12 13
# use df[mask].to_csv(...) to write to csv file
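To spell out that last comment, a minimal end-to-end sketch (the filename low_freq.csv is just an example, and index=False drops the row labels from the output):

```python
import pandas as pd

# same sample frame as above
data = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (9, 10, 11), (9, 12, 13)]
df = pd.DataFrame(data, columns=('x', 'y', 'z'))

# values in 'x' that occur at most 'ceiling' times
ceiling = 2
low_freq = df['x'].value_counts().loc[lambda x: x <= ceiling].index
mask = df['x'].isin(low_freq)

# write only the low-frequency rows to a new CSV file
df[mask].to_csv('low_freq.csv', index=False)
```

Reading the file back should give just the two rows with x == 9.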
UPDATE:
Here is a way to 'take apart' the code above. For example, what is low_freq? Run each expression on its own to see each step in the transformation -- so you can modify / extend the approach.
df['x']
df['x'].value_counts()
df['x'].value_counts().loc[lambda x: x <= ceiling]
df['x'].value_counts().loc[lambda x: x <= ceiling].index
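Those four expressions can be wrapped into a runnable sketch (same sample frame as above) so you can inspect each intermediate value:

```python
import pandas as pd

data = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (9, 10, 11), (9, 12, 13)]
df = pd.DataFrame(data, columns=('x', 'y', 'z'))
ceiling = 2

counts = df['x'].value_counts()             # Series: value -> frequency
kept = counts.loc[lambda s: s <= ceiling]   # only frequencies <= ceiling
low_freq = kept.index                       # the values themselves

print(counts.to_dict())   # {1: 3, 9: 2}
print(list(low_freq))     # [9]
```

Note that value_counts() returns a Series indexed by the column's values, which is exactly the link between "the data in the column" and "how often it appears".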
UPDATE 2
Evidently the filtering logic is not working as expected. Let's try a different approach:
import pandas as pd
# create sample data frame
data = [(0, 1, 2), (1, 1, 4), (2, 1, 6),
        (3, 9, 10), (4, 9, 12), (5, 7, 21)]
df = (pd.DataFrame(data, columns=('pos_id', 'device_id', 'base_mac'))
.set_index('pos_id'))
Now use groupby() to count the number of occurrences of each device_id. This count goes into a new column.
df['dev_id_count'] = (df.groupby('device_id')['device_id']
.transform('count'))
print(df)
device_id base_mac dev_id_count
pos_id
0 1 2 3
1 1 4 3
2 1 6 3
3 9 10 2
4 9 12 2
5 7 21 1
The last step is to filter based on this new column:
mask = df['dev_id_count'] <= 2
print(df[mask])
# output not shown, to save space
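If you don't need the helper column, an equivalent sketch uses groupby().filter(), which keeps whole groups that satisfy a condition:

```python
import pandas as pd

# same sample frame as in UPDATE 2
data = [(0, 1, 2), (1, 1, 4), (2, 1, 6),
        (3, 9, 10), (4, 9, 12), (5, 7, 21)]
df = (pd.DataFrame(data, columns=('pos_id', 'device_id', 'base_mac'))
      .set_index('pos_id'))

# keep groups whose device_id occurs at most 2 times
rare = df.groupby('device_id').filter(lambda g: len(g) <= 2)
print(rare)
```

This drops device_id 1 (which occurs 3 times) and keeps the rows for device_ids 9 and 7, without adding a dev_id_count column.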