python python-3.x whoosh

How to return the corresponding line that matches our search keyword in whoosh?


Let's say we are given the file a.txt:

hello world
good morning world
good night world

Given the search keyword morning, I want to use the Whoosh Python library to return the line in a.txt that matches it. So the search should return good morning world. How can I achieve this?

Update: Here is my schema:

schema = Schema(title=TEXT(stored=True),
              path=ID(stored=True),
              content=TEXT(stored=True))

Then I use the writer's add_document to add the file's text to the content field.


Solution

  • Index the text file line by line, one document per line. Store the line number in a NUMERIC field and the entire line in a stored ID field (storage is cheap, right?). Each search hit then carries the matching line in its stored fields.

    Something like the following (untested):

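    import os

    from whoosh import index
    from whoosh.fields import ID, NUMERIC, TEXT, Schema

    schema = Schema(
        title=TEXT(stored=True),
        path=ID(stored=True),
        content=TEXT(stored=True),
        line_number=NUMERIC(int, 32, stored=True, signed=False),
        line_text=ID(stored=True),
    )

    # create_in() needs an existing directory and builds a new, empty index
    # with this schema (open_dir() would reuse the schema of an old index).
    os.makedirs("index", exist_ok=True)
    ix = index.create_in("index", schema)
    writer = ix.writer()

    with open('a.txt') as f:
        # One document per line. line_number is 0-based here;
        # use enumerate(f, start=1) for 1-based numbering.
        for line_number, line in enumerate(f):
            text = line.rstrip('\n')  # drop the trailing newline
            writer.add_document(
                title='This is a title',
                path='a.txt',
                content=text,
                line_number=line_number,
                line_text=text,
            )

    # Nothing is written to the index until the writer is committed.
    writer.commit()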
    

    Clearly you could extend this to index multiple text files:

    files_to_index = [
        {'title': 'Title A', 'path': 'a.txt'},
        {'title': 'Title B', 'path': 'b.txt'},
        {'title': 'Title C', 'path': 'c.txt'},
    ]
    
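    # Open the existing index (it must have been created with the schema above).
    ix = index.open_dir("index")
    writer = ix.writer()

    for file_to_index in files_to_index:
        with open(file_to_index['path']) as f:
            # line_number is 0-based and restarts for every file.
            for line_number, line in enumerate(f):
                text = line.rstrip('\n')  # drop the trailing newline
                writer.add_document(
                    title=file_to_index['title'],
                    path=file_to_index['path'],
                    content=text,
                    line_number=line_number,
                    line_text=text,
                )

    # Commit once after all files have been added.
    writer.commit()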