python search aho-corasick

Using Python (acora) to find lines containing keywords


I'm writing a program that reads in a directory of text files and finds specific combinations of strings that overlap (i.e. are shared among all files). My current approach is to take one file from this directory, parse it, build a list of every string combo, and then search for each combo in the other files. For instance, if I had ten files, I'd read one file, parse it, store the keywords I need, then search the other nine files for this combination. I'd repeat this for every file (making sure that a file is never searched against itself). To do this, I'm trying to use Python's acora module.

The code I have so far is:

def match_lines(f, *keywords):
    """Taken from [https://pypi.python.org/pypi/acora/], FAQs and Recipes #3."""
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()

    line_start = 0
    matches = False
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield f[line_start:pos]
                matches = False
            line_start = pos + 1
        else:
            matches = True
    if matches:
        yield f[line_start:]


def find_overlaps(f_in, fl_in, f_out):
    """f_in: input file to extract string combo from & use to search other files.
    fl_in: list of other files to search against.
    f_out: output file that'll have all lines and file names that contain the matching string combo from f_in.
    """
    string_list = build_list(f_in)  # Open the first file, read each line & build a list of tuples (string #1, string #2). The "build_list" function isn't shown in my pasted code.
    found_lines = []  # Create a list to hold all the lines (and file names, from fl_in) that are found to have the matching (string #1, string #2).
    for keywords in string_list:  # For each tuple (string #1, string #2) in the list of tuples
        for f in fl_in:  # For each file in the input file list
            for line in match_lines(f, *keywords):
                found_lines.append(line)

As you can probably tell, I took the function match_lines from the acora web page ("FAQs and Recipes" #3). I changed it to search files with ac.filefind(), which is also described on that page.

The code seems to work, but it only yields the name of the file that contains the matching string combination. My desired output is the entire line from each of the other files that contains the matching string combination (tuple).


Solution

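  • The file names most likely come from the slices f[line_start:pos] and f[line_start:]. Here f is the path of the file, so slicing it returns pieces of the name rather than pieces of the contents. The positions reported by ac.filefind(f) are offsets into the file, so the slices have to be taken from the file's text instead.

    To get line numbers as well, you just need to count the line breaks as you pass them in match_lines():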

    line_start = 0
    line_number = 0  # zero-based index of the current line
    matches = False
    # Read the contents once so the reported offsets can be sliced from the
    # text rather than from the path string f.
    with open(f, 'r') as fh:
        text = fh.read()
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield line_number, text[line_start:pos]
                matches = False
            line_start = pos + 1
            line_number += 1
        else:
            matches = True
    if matches:
        yield line_number, text[line_start:]
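
To try the line-tracking logic without acora installed, here is a minimal standard-library sketch of the same idea. It is not acora's API: find_all is a hypothetical stand-in that uses repeated str.find calls instead of an Aho-Corasick automaton, and match_lines_in_text is an illustrative name. It assumes the text is already in memory and uses '\n' line endings.

```python
def find_all(text, keywords):
    """Hypothetical stand-in for a multi-keyword search.

    Yields (keyword, position) for every occurrence, ordered by position.
    This is a plain repeated str.find scan, not Aho-Corasick.
    """
    hits = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            hits.append((start, kw))
            start = text.find(kw, start + 1)
    for pos, kw in sorted(hits):
        yield kw, pos


def match_lines_in_text(text, *keywords):
    """Yield (line_number, line) for each line containing any keyword.

    Line numbers are zero-based. The newline is searched for like a
    keyword so that line boundaries arrive in the same stream as matches.
    """
    line_start = 0
    line_number = 0
    matches = False
    for kw, pos in find_all(text, ('\n',) + keywords):
        if kw == '\n':
            if matches:
                # Slice the text itself, never the file name.
                yield line_number, text[line_start:pos]
                matches = False
            line_start = pos + 1
            line_number += 1
        else:
            matches = True
    if matches:
        # The last line may not end with a newline.
        yield line_number, text[line_start:]


sample = "alpha beta\ngamma\nbeta delta\nlast alpha"
for number, line in match_lines_in_text(sample, "alpha", "delta"):
    print(number, line)
```

Like the acora recipe, this reports a line as soon as it contains any one of the keywords. If both strings of a tuple must appear on the same line, the yielded lines would still need a second check.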