Tags: python, text, duplicates, line-by-line

Remove duplicates in text file line by line


I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.

For example, the text file might contain:

þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG

Thus, in the above example, the script should only remove the second occurrence of each duplicated string (the repeated 10 ABC\ABCD\ABCDE and 12 EFG\EFG values), leaving the first occurrence of each in place.

I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.

Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.
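Concretely, splitting the first sample line on ';' shows that structure: the þ field markers and the empty field come through as items of their own, and the duplicate pair sits at the end (a quick illustration, not part of the script):

```python
# First sample line from the question; a raw string keeps the backslashes.
line = r"þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE"
items = line.split(';')
# items[0], items[2], items[4] are the 'þ' field markers,
# items[3] is the empty field, and the last two items are the duplicate pair.
```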

Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.


Solution

  • # Split each line on ';' and keep only the first occurrence of each
    # value, leaving the 'þ' field markers and empty items untouched.
    with open('file', 'r') as f:
        for line in f:
            items = line.rstrip('\n').split(';')
            seen = set()
            deduped = [x for x in items
                       if x in ('', 'þ') or not (x in seen or seen.add(x))]
            print(';'.join(deduped))
    

    Read the file by lines, then rebuild each line from the items not seen before on that line. Tracking seen values (rather than matching adjacent pairs with re.sub) also removes a duplicate that does not directly follow the first occurrence, as in Update 2, and it handles the last item on a line, which has no trailing semicolon.
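As a quick sanity check on the example data, here is a seen-set pass applied inline to the last sample line, so it can be run without the input file (the variable names are mine, not from the question). It also catches the non-adjacent repeat from Update 2:

```python
# Last sample line from the question; a raw string keeps the backslashes.
line = r"þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG"
seen = set()
# Keep 'þ' markers and empty items; drop any other value seen earlier on the line.
deduped = [x for x in line.split(';')
           if x in ('', 'þ') or not (x in seen or seen.add(x))]
result = ';'.join(deduped)
# Only the trailing repeat of "12 EFG\EFG" is removed, even though it
# does not directly follow the first occurrence.
```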