I'm trying to write a Python script that will remove duplicate strings in a text file. However, the de-duplication should only occur within each line.
For example, the text file might contain:
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;10 ABC\ABCD\ABCDE;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;12 EFG\EFG;þ
þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ
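Thus, in the above example, the script should remove only the repeated strings (the second occurrence of 10 ABC\ABCD\ABCDE and of 12 EFG\EFG), which were shown in bold in the original post.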
I've searched Stack Overflow and elsewhere to try to find a solution, but haven't had much luck. There seem to be many solutions that will remove duplicate lines, but I'm trying to remove duplicates within a line, line-by-line.
Update: Just to clarify - þ is the delimiter for each field, and ; is the delimiter for each item within each field. Within each line, I'm attempting to remove any duplicate strings contained between semicolons.
Update 2: Example edited to reflect that the duplicate value may not always follow directly after the first instance of the value.
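import re

with open('file', 'r') as f:
    lines = f.readlines()

for line in lines:
    # Collapse runs of an identical, adjacent item ("X;X;" -> "X;").
    # (?<![^;]) anchors the match at an item boundary; non-adjacent repeats are not caught.
    # end='' keeps print from doubling each line's own newline.
    print(re.sub(r'(?<![^;])([^;]+;)\1+', r'\1', line), end='')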
Read the file line by line, then use re.sub to collapse any item that is immediately followed by an identical copy of itself.
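Since Update 2 says the duplicate may not directly follow the first occurrence, a backreference on adjacent items may not be enough. A minimal sketch of an alternative, not using re.sub: split each line on þ, then drop repeated items within each field while keeping the first occurrence and the original order. The names FIELD_SEP, ITEM_SEP, dedupe_field and dedupe_line are made up for illustration, and scoping the de-duplication to each þ-delimited field (rather than the whole line) is an assumption about the data.

```python
FIELD_SEP = "þ"  # field delimiter, per the question
ITEM_SEP = ";"   # item delimiter within a field, per the question


def dedupe_field(field):
    """Drop repeated non-empty items in one field, keeping first occurrences in order."""
    seen = set()
    kept = []
    for item in field.split(ITEM_SEP):
        # Empty strings come from leading, trailing, or doubled semicolons;
        # always keep them so the field's delimiters survive unchanged.
        if item == "" or item not in seen:
            kept.append(item)
            seen.add(item)
    return ITEM_SEP.join(kept)


def dedupe_line(line):
    """Apply dedupe_field to every þ-delimited field of a single line."""
    # Assumption: duplicates are only removed within a field, not across fields.
    return FIELD_SEP.join(dedupe_field(f) for f in line.split(FIELD_SEP))


sample = r"þ;ABC.001.123.1234;þ;;þ;10 ABC\ABCD\ABCDE;12 EFG\EFG;09 XYZ\XYZ\XYZ;12 EFG\EFG;þ"
print(dedupe_line(sample))
```

To run this over a file, call dedupe_line on each line inside the same with open(...) loop and print the result with end=''. If duplicates should instead be removed across the whole line, one option would be to share a single seen set between the fields of a line.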