python file generator

Handling a file in a generator function


I need to create a generator that accepts either a file name or a file object, a list of words to look for in each line, and a list of stop words: as soon as a line contains any stop word, that line should be skipped.

I wrote a generator function, but I noticed that in my implementation I can't be sure that a file I open will be closed afterwards, because the generator is not guaranteed to reach the end of its iteration.

def gen_reader(file, lookups, stopwords):
    is_file = False
    try:
        if isinstance(file, str):
            file = open(file, 'r', encoding='UTF-8')
            is_file = True
    except FileNotFoundError:
        raise FileNotFoundError('File not found')

    else:
        lookups = list(map(str.lower, lookups))
        stop_words = list(map(str.lower, stopwords))
        for line in file:
            original_line = line
            line = line.lower()
            if any(lookup in line for lookup in lookups) \
                    and not any(stop_word in line for stop_word in stop_words):
                yield original_line.strip()
    if is_file:
        file.close()

I was going to use a "with" context manager and put the search code inside it, but when I'm given an already-open file object I'd have to write the same code again, and that wouldn't be nice, would it?

What are your ideas for how I can improve my code? I've run out of them.


Solution

  • You could write an iterable class that is also a context manager, like this:

    from pathlib import Path
    from collections.abc import Iterator
    from typing import TextIO, Self
    
    
    class Reader:
        def __init__(self, filename: Path, lookups: list[str], stopwords: list[str]):
            self._lookups: set[str] = {e.lower() for e in lookups}
            self._stopwords: set[str] = {e.lower() for e in stopwords}
            self._fd: TextIO = filename.open()
    
        def __enter__(self) -> Self:
            return self
    
        def __exit__(self, *_) -> None:
            self._fd.close()
    
        def __iter__(self) -> Iterator[str]:
            for line in self._fd:
                words: set[str] = {e.lower() for e in line.split()}
                if (self._lookups & words) and not (self._stopwords & words):
                    yield line.rstrip()
    
    
    with Reader(Path("foo.txt"), ["world"], ["banana"]) as reader:
        for line in reader:
            print(line)
    

    Now let's assume that foo.txt looks like this:

    hello world
    banana world
    goodbye my friend
    

    Then the output would be:

    hello world
    

    Thus we ensure that the file is closed when the with block exits, even if the loop stops before reaching the end of the file.

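    Note also the use of sets for fast membership checks against the lookups and stop words. Be aware that this matches whole whitespace-separated words, whereas the code in the question matches substrings anywhere in the line.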
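
If you would rather keep a plain generator function that accepts either a file name or an already-open file object, one possible sketch is to wrap the "caller owns the file" case in contextlib.nullcontext, so the search loop is written only once. This is an alternative to the class above, not a replacement for it. It assumes the substring matching from the question is what you want, and the sample text is made up for illustration.

```python
import io
import os
from collections.abc import Iterable, Iterator
from contextlib import nullcontext
from typing import TextIO, Union


def gen_reader(
    file: Union[str, os.PathLike, TextIO],
    lookups: Iterable[str],
    stopwords: Iterable[str],
) -> Iterator[str]:
    """Yield stripped lines that contain a lookup and no stop word."""
    lookups = [word.lower() for word in lookups]
    stopwords = [word.lower() for word in stopwords]

    if isinstance(file, (str, os.PathLike)):
        # We open the file here, so the with block below will close it.
        context = open(file, "r", encoding="UTF-8")
    else:
        # The caller owns this file object; nullcontext leaves it open.
        context = nullcontext(file)

    with context as lines:
        for line in lines:
            lowered = line.lower()
            # Substring matching, as in the question's code.
            if any(word in lowered for word in lookups) and not any(
                word in lowered for word in stopwords
            ):
                yield line.strip()


# Usage with a file object supplied by the caller (in-memory for the demo).
sample = io.StringIO("hello world\nbanana world\ngoodbye my friend\n")
print(list(gen_reader(sample, ["world"], ["banana"])))  # ['hello world']
print(sample.closed)  # False
```

When the generator opened the file itself, the with block closes it once the generator finishes or is closed. If the consumer may stop early, calling close() on the generator (or wrapping it in contextlib.closing) makes that explicit instead of relying on garbage collection.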