Tags: python, python-2.7, set, difflib, file-comparison

Python 2.7 - Compare two text files and write only the unique values from first file


I am trying to compare two text files (Masterfile and usedfile) and write the values that appear only in Masterfile (not in usedfile) to a third file (Newdata). Both files have one word per line. Example:

Masterfile content

Johnny
transfer
hello
kitty

usedfile content

transfer
hello

expected output in Newdata

Johnny
kitty

I have two solutions, but both have problems.

Solution 1: This adds prefixes such as - and + to the data in the final output.

import difflib

with open(r'C:\Master_Data.txt','r') as masterfile:
    with open(r'C:\Used_Data.txt','r') as usedfile:
        with open(r'c:\Ready_to_use.txt','w+') as Newdata:
            tempmaster = masterfile.readlines()
            tempusedfile = usedfile.readlines()
            d = difflib.Differ()
            diff = d.compare(tempmaster,tempusedfile)
            for line in diff:
                Newdata.write(line)

Solution 2: I tried using a set. The result looks fine when I print it, but I don't know how to write it to a file.

with open(r'C:\Master_Data.txt','r') as masterfile:
    with open(r'C:\Used_Data.txt','r') as usedfile:
        with open(r'c:\Ready_to_use.txt','w+') as Newdata:
           difference = set(masterfile).difference(set(usedfile))
           print difference

Can anyone suggest:

  1. How can I correct solution 2 so that it writes to a file?
  2. Can I use difflib to accomplish the task?
  3. Is there a better way to achieve the end result?

Solution

  • Ok,

    1) You can use solution 2 to write to a file by adding this:

    difference = set(masterfile).difference(set(usedfile))
    [Newdata.write(x) for x in difference]
    

    This is shorthand for the following loop:

    for x in difference:
        Newdata.write(x)
    

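    However, this just writes each element of the difference set to the Newdata file as-is, and in arbitrary order, because sets are unordered. If you use this method, make sure the difference set contains the correct values to begin with. For example, a last line with no trailing newline ('kitty') will not match the same word read with one ('kitty\n').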
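As a minimal sketch of the set approach with those caveats handled, the lines can be stripped before comparing and sorted before writing. The helper name `unique_lines` is made up for illustration, and the demo uses in-memory lists instead of the real files:

```python
def unique_lines(master_lines, used_lines):
    """Return the set of stripped lines that occur in master_lines only."""
    master = set(line.strip() for line in master_lines)
    used = set(line.strip() for line in used_lines)
    return master - used

# Stand-ins for the file contents; note the last master line has no "\n".
master = ["Johnny\n", "transfer\n", "hello\n", "kitty"]
used = ["transfer\n", "hello\n"]

difference = unique_lines(master, used)

# Sort to get a stable order, and add the newline back for each word.
output = "".join(word + "\n" for word in sorted(difference))

# With the question's files, the same idea would look like this:
# with open(r'C:\Master_Data.txt') as masterfile:
#     with open(r'C:\Used_Data.txt') as usedfile:
#         with open(r'c:\Ready_to_use.txt', 'w') as Newdata:
#             Newdata.write("".join(w + "\n" for w in sorted(unique_lines(masterfile, usedfile))))
```

Sorting is only one way to get a predictable order. If the original order of Masterfile matters, a set difference alone is not enough.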

    2) I wouldn't bother with difflib. It is built for producing human-readable diffs (hence the - and + prefixes in your output), which is more than a small task like this needs.
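That said, if you do want to try difflib, one possible sketch is to keep only the lines that `Differ` prefixes with "- ", which marks lines found only in the first sequence. The helper name `removed_lines` is made up for illustration, and the inputs are assumed to be already stripped of newlines:

```python
import difflib

def removed_lines(master_lines, used_lines):
    """Return the lines that Differ reports as present only in master_lines."""
    diff = difflib.Differ().compare(master_lines, used_lines)
    # Each diff line starts with a two-character code; "- " means
    # "in the first sequence but not in the second".
    return [line[2:] for line in diff if line.startswith("- ")]

master = ["Johnny", "transfer", "hello", "kitty"]
used = ["transfer", "hello"]

only_in_master = removed_lines(master, used)
```

Be aware that `Differ` compares the two sequences in order. If the shared words appear in a different order in the two files, some of them can still be reported with "- ", so this is not a true set difference.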

    3) This is how I would do it, without using any libraries and simple comparison statements:

    with open(r'Master_Data.txt','r') as masterdata:
        with open(r'Used_Data.txt','r') as useddata:
            with open(r'Ready_to_use.txt','w+') as Newdata:

                usedfile = [ x.strip('\n') for x in list(useddata) ] #1
                masterfile = [ x.strip('\n') for x in list(masterdata) ] #2

                for line in masterfile: #3
                    if line not in usedfile: #4
                        Newdata.write(line + '\n') #5
    

    Here's the explanation:

    First, I opened all the files like you did and only changed the variable names. Here are the pieces that I changed:

    #1 - This is shorthand for looping through each line in the Used_Data.txt file and removing the \n at the end of each line, so we can compare the words properly.

    #2 - This does the same thing as #1 except with the Master_Data.txt file

    #3 - I loop through each line in the Master_Data.txt file

    #4 - I check whether the current line from the masterfile list is absent from the usedfile list.

    #5 - If the if statement is true, then the line from Master_Data.txt that we are checking does not appear in Used_Data.txt, so we write it to the Ready_to_use.txt file by calling Newdata.write(line + '\n'). We need the '\n' at the end so that the next write starts on a new line.
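For larger files, a variation on the steps above is to keep the used words in a set, since `not in` on a list scans the whole list for every line. This is only a sketch: the function name `lines_not_used` is made up for illustration, and the demo uses in-memory lists in place of the open files:

```python
def lines_not_used(master_lines, used_lines):
    """Return master lines (stripped, in original order) that are absent from used_lines."""
    used = set(line.strip() for line in used_lines)  # fast membership checks
    result = []
    for line in master_lines:
        word = line.strip()
        if word not in used:
            result.append(word)
    return result

master = ["Johnny\n", "transfer\n", "hello\n", "kitty\n"]
used = ["transfer\n", "hello\n"]

new_data = lines_not_used(master, used)

# Writing the result would then be:
# for word in new_data:
#     Newdata.write(word + '\n')
```

Unlike a plain set difference, this keeps the words in the order they appear in the master file and keeps any duplicates among them.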