Tags: python, regex, duplicates, file-manipulation

How do I remove duplicates from the output of my Python script?


I used a regex search to filter results from a text file (searching for ".js"), which gives me around 16 results, some of which are duplicates. I want to remove the duplicates from that output and either print it to the console or redirect it to a file. I have tried sets and dict.fromkeys with no success. Here is what I have at the moment. Thank you in advance:

#!/usr/bin/python

import re
import sys

pattern = re.compile("[^/]*\.js")

for i, line in enumerate(open('access_log.txt')):
    for match in re.findall(pattern, line):
        x = str(match)
        print x

Solution

  • Using sets to eliminate duplicates:

    #!/usr/bin/python
    
    import re
    
    pattern = re.compile(r"[^/]*\.js")
    
    matches = set()  # every distinct match seen so far
    with open('access_log.txt') as f:
        for line in f:
            for match in pattern.findall(line):
                # match is already a string, so str(match) is unnecessary
                if match not in matches:
                    print(match)
                    matches.add(match)
    

    But I question your regex:

    You are doing a findall on each line, which suggests that each line might have multiple "hits", such as:

    file1.js file2.js file3.js
    

    But in your regex:

    [^/]*\.js
    

    [^/]* is greedy: it consumes as much as it can, so the match runs up to the last ".js" on the line and you get only one match, namely the complete line.

    If you made the match non-greedy, i.e. [^/]*?, then you would get 3 matches:

    'file1.js'
    ' file2.js'
    ' file3.js'
    

    But that highlights another potential problem. Do you really want the leading spaces in the second and third matches? On the other hand, in a case such as /abc/ def.js you would perhaps want to keep the leading blank that follows /abc/.

    So I would suggest:

    #!/usr/bin/python
    
    import re
    
    pattern = re.compile(r"""
        (?:             # first alternative:
            (?<=/)      # positive lookbehind assertion: preceded by '/'
            [^/]*?      # matches non-greedily 0 or more non-'/' characters
        |               # second alternative:
            (?<!/)      # negative lookbehind assertion: not preceded by '/'
            [^/\s]*?    # matches non-greedily 0 or more characters that are neither '/' nor whitespace
        )
        \.js            # matches '.js'
        """, re.VERBOSE)  # verbose mode: whitespace and comments in the pattern are ignored
    
    matches = set()
    with open('access_log.txt') as f:
        for line in f:
            for match in pattern.findall(line):
                if match not in matches:
                    print(match)
                    matches.add(match)
    

    If the filename cannot have any whitespace, then just use:

    pattern = re.compile(r"[^\s/]*?\.js")
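
To see how the patterns discussed above differ, here is a small sketch that runs each of them against the two example strings from this answer. It is written for Python 3. The names `samples`, `greedy`, `lazy`, `suggested`, `no_space` and the helper `unique_matches` are made up for illustration, and the sample strings stand in for real log lines, which will look different.

```python
import re

# Example strings taken from the discussion above; real log lines will differ.
samples = ["file1.js file2.js file3.js", "/abc/ def.js"]

greedy = re.compile(r"[^/]*\.js")        # original pattern
lazy = re.compile(r"[^/]*?\.js")         # non-greedy variant
suggested = re.compile(r"(?:(?<=/)[^/]*?|(?<!/)[^/\s]*?)\.js")  # compact form of the verbose pattern
no_space = re.compile(r"[^\s/]*?\.js")   # filenames without whitespace


def unique_matches(lines, pattern):
    """Return each match once, in first-seen order (hypothetical helper)."""
    seen = set()
    ordered = []
    for line in lines:
        for match in pattern.findall(line):
            if match not in seen:
                seen.add(match)
                ordered.append(match)
    return ordered


for name, pattern in [("greedy", greedy), ("lazy", lazy),
                      ("suggested", suggested), ("no_space", no_space)]:
    print(name, [pattern.findall(s) for s in samples])

# greedy    -> [['file1.js file2.js file3.js'], [' def.js']]
# lazy      -> [['file1.js', ' file2.js', ' file3.js'], [' def.js']]
# suggested -> [['file1.js', 'file2.js', 'file3.js'], [' def.js']]
# no_space  -> [['file1.js', 'file2.js', 'file3.js'], ['def.js']]
```

The helper keeps a list next to the set because a set alone does not preserve the order in which matches were first seen. To send the result to a file instead of the console, one option is to redirect the script's output in the shell, for example `python script.py > unique.txt` (both names are placeholders).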