pythoncsvsuppression

Suppress one csv file against another where values contain specific strings


I am trying to remove all the lines in file1.csv that contain the strings from file2.csv with python. I want it to search for all the values in column1 of file1.csv and remove entire rows where column1 contains in its value same string as in file2.csv.

I know grep -v in bash can do same thing with just one command. However, I need to suppress file1.csv against over 40,000 possible strings in file2.csv. Bash takes forever and even crashes when executing this command.

Does anyone know a solid script that can do what grep -v does in python but when suppressing against a file with thousands of strings?

Just to make sure it's clear:

File1.csv:

column1,column2,column3
www.gamai4xheifw.com,4410,22
www.vfekjfwo11k.com,772,100
www.gosi4xnbdn.com,1793,39
www.tum33kkwfl.com,1100,2
www.eei4xelwf.com,9982,14

File2.csv:

column1
i4x

File3.csv:

column1,column2,column3
www.vfekjfwo11k.com,772,100
www.tum33kkwfl.com,1100,2

But, again, I need it in python because the number of the strings in file2.csv is over 40,000.


Solution

  • One solution which may work for your use case is 3rd party library Pandas + regex.

    However, I strongly recommend you utilise a more efficient algorithm, for example one that implements the trie-based Aho-Corasick, such as this solution.

    import pandas as pd
    from io import StringIO
    
    mystr1 = StringIO("""column1,column2,column3
    www.gamai4xheifw.com,4410,22
    www.vfekjfwo11k.com,772,100
    www.gosi4xnbdn.com,1793,39
    www.tum33kkwfl.com,1100,2
    www.eei4xelwf.com,9982,14""")
    
    mystr2 = StringIO("""column1
    i4x""")
    
    # read files, replace mystr1 / mystr2 with 'File1.csv' / 'File2.csv'
    df = pd.read_csv(mystr1)
    df_filter = pd.read_csv(mystr2)
    
    # create regex string from filter values
    str_filter = '|'.join(df_filter['column1'])
    
    # apply filtering
    df = df[~df['column1'].str.contains(str_filter)]
    
    # export back to csv
    df.to_csv('file_out.csv', index=False)
    
    print(df)
    
                   column1  column2  column3
    1  www.vfekjfwo11k.com      772      100
    3   www.tum33kkwfl.com     1100        2