python string-comparison text-comparison

Compare text across multiple files and remove the duplicate blocks


I have 3 Notepad files in a directory. I want to compare the 1st file to the other 2, drop the duplicate blocks, and keep only the unique blocks. For example:

File 1:

  User enter email id {
  email id:(xyz@gamil.com)
  action:enter
  data:string }

User enter password {
passoword:(12345678)
action:enter
data:string }

 User click login {
 action:click
 data:NAN }

File 2:

User enter email id {
email id:(xyz@gamil.com)
action:enter
data:string }

User enter password {
passoword:(12345678)
action:enter
data:string }

 User navigates another page {
 action:navigates
 data:NAN }

File 3:

 User enter email id {
 email id:(abc@gamil.com)
 action:enter
 data:string }

 User enter password {
 passoword:(12345678)
 action:enter
 data:string }

 User submit to login {
 action:submit
 data:NAN }

The output I want for file 2 and file 3 is:

File 2:

 User navigates another page {
 action:navigates
 data:NAN }

File 3:

 User enter email id {
 email id:(abc@gamil.com)
 action:enter
 data:string }
 
 User submit to login {
 action:submit
 data:NAN }

Solution

  • Open the first file and split it into a list of paragraphs (blocks separated by a blank line):

    with open('file1.txt', 'r') as f:
        paragraphs = f.read().split('\n\n')
    

    Now open the second file, split it into paragraphs the same way, and drop every paragraph that also appears in the first file:

    with open('file2.txt', 'r') as f:
        paragraphs2 = f.read().split('\n\n')
        paragraphs2 = [x for x in paragraphs2 if x not in paragraphs]
    

    Now write the remaining paragraphs back to the second file:

    with open('file2.txt', 'w') as f:
        f.write('\n\n'.join(paragraphs2))
    

    Perform the same operations for the third file too

    with open('file3.txt', 'r') as f:
        paragraphs3 = f.read().split('\n\n')
        paragraphs3 = [x for x in paragraphs3 if x not in paragraphs]
    
    with open('file3.txt', 'w') as f:
        f.write('\n\n'.join(paragraphs3))
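
One caveat with the exact comparison above: it is sensitive to whitespace. A block that is indented by a space in one file, a separator line that contains a stray space, or a trailing newline at the end of a file is enough for two otherwise identical blocks to be treated as different. Below is a minimal sketch of a more tolerant comparison. The helper names `split_blocks` and `unique_blocks` are made up for illustration, and it assumes that leading and trailing whitespace on each line carries no meaning in your files.

```python
import re

def split_blocks(text):
    """Split text into blocks separated by blank or whitespace-only lines."""
    # '\n\s*\n' also matches separator lines that contain only spaces or tabs.
    raw_blocks = re.split(r'\n\s*\n', text.strip())
    # Strip every line so that indentation differences are ignored (assumption:
    # indentation is not significant in these files).
    return [
        '\n'.join(line.strip() for line in block.splitlines())
        for block in raw_blocks
        if block.strip()
    ]

def unique_blocks(reference_text, other_text):
    """Return the blocks of other_text that do not occur in reference_text."""
    reference = set(split_blocks(reference_text))
    return [block for block in split_blocks(other_text) if block not in reference]
```

With this, `'\n\n'.join(unique_blocks(text_of_file1, text_of_file2))` would be the new content for the second file. Note that the blocks are written back in their stripped form, so the original indentation is not preserved.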
    

    What if there are many files? Use a loop, as demonstrated below.

    First, create the list of paragraphs from the first file:

    with open('file1.txt', 'r') as f:
        paragraphs = f.read().split('\n\n')
    

    Create a list of all the files from which duplicates have to be removed:

    import os
    lst = [f for f in os.listdir('.') if f.endswith('.txt') and f != 'file1.txt']
    

    Now loop through the list of files and modify them

    for name in lst:
        # Read the file and keep only the paragraphs that are not in file1.txt
        with open(name, 'r') as file:
            other_paragraphs = file.read().split('\n\n')
        unique_paragraphs = [p for p in other_paragraphs if p not in paragraphs]
    
        # Overwrite the file with the paragraphs that remain
        with open(name, 'w') as file:
            file.write('\n\n'.join(unique_paragraphs))
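
If you need this more than once, the loop can be wrapped in a function. The following is only a sketch: the function name `remove_duplicate_blocks` and its parameters are my own choices, not something from the question. It keeps the same exact-match comparison as above, but stores the reference paragraphs in a set for faster lookups and only rewrites a file when something was actually removed.

```python
from pathlib import Path

def remove_duplicate_blocks(reference, directory='.', pattern='*.txt'):
    """Remove from each matching file the paragraphs found in `reference`.

    Returns the names of the files that were rewritten.
    """
    reference = Path(reference)
    # A set gives fast membership checks for exact paragraph matches.
    seen = set(reference.read_text().split('\n\n'))
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.resolve() == reference.resolve():
            continue  # never rewrite the reference file itself
        paragraphs = path.read_text().split('\n\n')
        unique = [p for p in paragraphs if p not in seen]
        if len(unique) != len(paragraphs):
            # Something was dropped, so write the remaining paragraphs back.
            path.write_text('\n\n'.join(unique))
            changed.append(path.name)
    return changed
```

Calling `remove_duplicate_blocks('file1.txt')` in the directory that holds the three files should then process `file2.txt` and `file3.txt`. Since the files are overwritten in place, it is a good idea to try it on copies first.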