Tags: python, matcher, whoosh

Using whoosh as matcher without an index


Is it possible to use whoosh as a matcher without building an index?

My situation is that I have subscriptions pre-defined with strings, and documents coming through in a stream. I check each document matches the subscriptions and send them if so. I don't need to store the documents, or recall them later. Once they've been sent to the subscriptions, they can be discarded.

Currently just using simple matching, but as consumers ask for searches based on fields, and/or logic, etc, I'm wondering if it's possible to use a whoosh matcher and allow whoosh query syntax for this.

I could build an index for each document, query it, and then throw it away, but that seems very wasteful, is it possible to directly construct a Matcher? I couldn't find any docs or questions online indicating a way to do this and my attempts haven't worked. Alternatively, is this just the wrong library for this task, and is there something better suited?


Solution

  • The short answer is no.

    Search indices and matchers work quite differently. For example, when searching for the phrase "hello world", a matcher would simply check whether the document text contains the substring "hello world". A search index cannot work that way: it would have to scan every document, which would be very slow.

    Instead, as documents are added, every word in them is added to a per-word index. So the entry for "hello" will record that document 1 matches at position 0, and the entry for "world" will record that document 1 matches at position 6. A search for "hello world" then looks up all document IDs under "hello", all under "world", and checks whether any document has a "world" position exactly 6 characters after its "hello" position.
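    The mechanics above can be sketched in plain Python. This is a toy model (the `build_index` and `phrase_match` names and the dict-of-dicts layout are illustrative, not whoosh's actual postings format), but it shows how a phrase query works against an inverted index:

```python
# Toy inverted index: each term maps to {doc_id: [character positions]}.
def build_index(docs: dict) -> dict:
    index = {}
    for doc_id, text in docs.items():
        pos = 0
        for word in text.lower().split():
            start = text.lower().index(word, pos)
            index.setdefault(word, {}).setdefault(doc_id, []).append(start)
            pos = start + len(word)
    return index

def phrase_match(index: dict, first: str, second: str, offset: int) -> set:
    """Doc IDs where `second` starts `offset` characters after `first`."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        later = index.get(second, {}).get(doc_id, [])
        if any(p + offset in later for p in positions):
            hits.add(doc_id)
    return hits

index = build_index({1: "hello world"})
# "hello" is at position 0 and "world" at position 6, so the phrase matches:
phrase_match(index, "hello", "world", offset=6)  # → {1}
```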

    So whoosh and a matcher approach the problem in completely different ways.

    It is possible to do this with whoosh, using a new index for each document, like so:

    # Assumes `schema` (a whoosh.fields.Schema with title, description and
    # keywords fields) and a `Document` type are defined elsewhere.
    from whoosh.filedb.filestore import RamStorage
    from whoosh.query import Query

    def matches_subscription(doc: Document, q: Query) -> bool:
        # Build a throwaway in-memory index holding only this document
        ix = RamStorage().create_index(schema)
        writer = ix.writer()
        writer.add_document(
            title=doc.title,
            description=doc.description,
            keywords=doc.keywords,
        )
        writer.commit()
        with ix.searcher() as searcher:
            results = searcher.search(q)
            return bool(results)
    

    This takes about 800 milliseconds per check, which is quite slow.

    A better solution is to build a parser with pyparsing and then create your own nested query classes that do the matching, tailored to your specific search queries. It's also quite extensible that way. That brings it down to ~40 microseconds per check, so about 20,000 times faster.
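    A minimal sketch of the nested-query-class idea. The class names (`Term`, `And`, `Or`, `Not`), the dict-based document, and the whitespace tokenisation are all illustrative assumptions, and the pyparsing grammar that would produce the tree from a query string is omitted:

```python
from dataclasses import dataclass
from typing import List

class Node:
    """Base class for nodes in the parsed query tree."""
    def matches(self, doc: dict) -> bool:
        raise NotImplementedError

@dataclass
class Term(Node):
    field: str
    word: str
    def matches(self, doc: dict) -> bool:
        # Simple lowercase whitespace tokenisation; a real matcher would
        # mirror whatever analysis the subscription syntax promises.
        return self.word in doc.get(self.field, "").lower().split()

@dataclass
class And(Node):
    children: List[Node]
    def matches(self, doc: dict) -> bool:
        return all(c.matches(doc) for c in self.children)

@dataclass
class Or(Node):
    children: List[Node]
    def matches(self, doc: dict) -> bool:
        return any(c.matches(doc) for c in self.children)

@dataclass
class Not(Node):
    child: Node
    def matches(self, doc: dict) -> bool:
        return not self.child.matches(doc)

# A subscription like `title:hello AND (keywords:python OR NOT keywords:java)`
# would parse (via pyparsing, not shown) into:
query = And([
    Term("title", "hello"),
    Or([Term("keywords", "python"), Not(Term("keywords", "java"))]),
])

query.matches({"title": "Hello world", "keywords": "python whoosh"})  # → True
```

    Matching is then just a recursive walk over a small tree per document, with no index to build or tear down, which is where the speedup comes from.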