Tags: python, pandas, csv

Filter CSV rows based on count of column value


I am very new to Python, and I am working on a task that requires me to store the rows from a CSV file whose value appears fewer than k times.

Basically the CSV file has 3 columns. The first column is the one I have to consider. So for example, if 'a' appears fewer than 5 times in that column, I need to pick out those records and store them in a new CSV file.

I managed to use pandas' df['column name'].value_counts() to count how many times each value appears. Now I am stuck on how to actually pick out those less frequent rows and store them. I had some ideas, such as using a for loop over all rows with an if check on whether the value in the first column appears fewer than k times. The part I cannot figure out is how to link the frequency count back to a particular value in that column.

Below is a screenshot with an example of my CSV data.

Any help is much appreciated, thanks!


Solution

  • You mentioned pandas, and here is a pandas approach:

    import pandas as pd
    
    # create sample data frame
    data = [
        (1, 2, 3),
        (1, 4, 5),
        (1, 6, 7),
        (9, 10, 11),
        (9, 12, 13),
    ]
    df = pd.DataFrame(data, columns=('x', 'y', 'z'))
    
    # keep rows whose value in column 'x' appears at most 'ceiling' times
    ceiling = 2
    low_freq = df['x'].value_counts().loc[lambda x: x <= ceiling].index
    
    # use boolean mask to find rows such that 'x' is in our low_freq list
    mask = df['x'].isin(low_freq)
    
    # print results
    print(df[mask])
    
       x   y   z
    3  9  10  11
    4  9  12  13
    
    # use df[mask].to_csv(...) to write to csv file
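To make that last step concrete, here is a minimal sketch of writing the filtered rows to a file (the output file name is just an example; index=False keeps pandas' row labels out of the CSV):

```python
import pandas as pd

# same sample data frame as above
data = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (9, 10, 11), (9, 12, 13)]
df = pd.DataFrame(data, columns=('x', 'y', 'z'))

# keep rows whose value in column 'x' appears at most 'ceiling' times
ceiling = 2
low_freq = df['x'].value_counts().loc[lambda x: x <= ceiling].index
mask = df['x'].isin(low_freq)

# write only the matching rows; index=False drops the row labels
df[mask].to_csv('low_freq_rows.csv', index=False)
```

The resulting file contains a header line plus the two rows where x == 9.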
    

    UPDATE:

    Here is a way to 'take apart' the code above -- for example, what is low_freq? This lets you see each step in the transformation, so you can modify or extend the approach.

    df['x']
    df['x'].value_counts()
    df['x'].value_counts().loc[lambda x: x <= ceiling]
    df['x'].value_counts().loc[lambda x: x <= ceiling].index
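With the sample data frame above, those steps evaluate as follows (a sketch; .to_dict() is only used here to make the intermediate values easy to read):

```python
import pandas as pd

# same sample data frame as above
data = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (9, 10, 11), (9, 12, 13)]
df = pd.DataFrame(data, columns=('x', 'y', 'z'))
ceiling = 2

counts = df['x'].value_counts()            # how often each x value occurs
kept = counts.loc[lambda x: x <= ceiling]  # only counts at or below the ceiling
low_freq = kept.index                      # the x values themselves

print(counts.to_dict())  # {1: 3, 9: 2}
print(kept.to_dict())    # {9: 2}
print(list(low_freq))    # [9]
```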
    

    UPDATE 2

    Evidently the filtering logic is not working as expected. Let's try a different approach:

    import pandas as pd
    
    # create sample data frame
    data = [(0, 1, 2, ), (1, 1, 4, ), (2, 1, 6, ),
            (3, 9, 10,), (4, 9, 12,), (5, 7, 21,)]
    df = (pd.DataFrame(data, columns=('pos_id', 'device_id', 'base_mac'))
          .set_index('pos_id'))
    

    Now use groupby() to count the number of occurrences of each device_id. This count goes into a new column.

    df['dev_id_count'] = (df.groupby('device_id')['device_id']
                            .transform('count'))
    print(df)
    
            device_id  base_mac  dev_id_count
    pos_id                                   
    0               1         2             3
    1               1         4             3
    2               1         6             3
    3               9        10             2
    4               9        12             2
    5               7        21             1
    

    The last step is to filter based on this new column:

    mask = df['dev_id_count'] <= 2
    print(df[mask])
    # output not shown, to save space
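    Putting it together for the original question, here is a sketch that reads CSV data, keeps the rows whose device_id occurs fewer than k times, and writes them to a new CSV file. The column names, the threshold k, and the file name are assumptions from the examples above; in practice you would replace the io.StringIO stand-in with pd.read_csv('your_file.csv').

```python
import io
import pandas as pd

# stand-in for pd.read_csv('your_file.csv') -- column names are assumptions
csv_text = """pos_id,device_id,base_mac
0,1,2
1,1,4
2,1,6
3,9,10
4,9,12
5,7,21
"""
df = pd.read_csv(io.StringIO(csv_text))

# count how many times each device_id occurs, broadcast back to every row
counts = df.groupby('device_id')['device_id'].transform('count')

# keep rows whose device_id appears fewer than k times
k = 3
rare = df[counts < k]

# write the low-frequency rows to a new CSV file
rare.to_csv('rare_rows.csv', index=False)
```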