fast and effective search algorithm for large json file in python

I have a large JSON file with a dictionary-like structure, containing values and keys. I want to search through the values efficiently and effectively. I'm looking for a search algorithm that is fast and returns the most relevant results.

Currently, I'm using fuzzy search, but it only matches from the beginning of the string. For instance, searching for dimix doesn't yield a match if the actual string is ada_dimix. I need an algorithm that can find the closest substring matches as well. For example, searching for mix should return ada_dimix.

Could you suggest a suitable search algorithm?

Here is sample data to try:

{
    "tf-checking-feeders.md": "tf-checking-feeders.md",
    "rule-setting-your-local-data-directory.md": "rule-setting-your-local-data-directory.md",
    "formula-using-conditional-logic.md": "formula-using-conditional-logic.md",
    "allocations-feeding-qty-required-kgs.md": "allocations-feeding-qty-required-kgs.md",
    "rules-using-db-functions-move-data-between-cubes.md": "rules-using-db-functions-move-data-between-cubes.md",
    "feeders-skipcheck.md": "feeders-skipcheck.md",
    "rule-components-calculation-statement.md": "rule-components-calculation-statement.md",
    "window-inserting-function.md": "window-inserting-function.md",
    "market-writing-exchange-rate-rule-statement.md": "market-writing-exchange-rate-rule-statement.md",
    "series-hard-coded-feeders.md": "series-hard-coded-feeders.md",
    "allocations-calculating-quantities-fish-required-by-fishcake-type.md": "allocations-calculating-quantities-fish-required-by-fishcake-type.md",
    "rule-purchase-cost-calculation.md": "rule-purchase-cost-calculation.md",
    "options-setting-areas-appearance.md": "options-setting-areas-appearance.md",
    "solutions-third-approach-using-dimix-comparisons.md": "solutions-third-approach-using-dimix-comparisons.md",
    "process-feeding-first-statement.md": "process-feeding-first-statement.md",
    "flows-implementing-depletion-model-using-rules.md": "flows-implementing-depletion-model-using-rules.md",
    "rf-miscellaneous-rules-functions.md": "rf-miscellaneous-rules-functions.md",
    "formula-external-cube-references.md": "formula-external-cube-references.md",
    "costs-calculating-daily-fish-in-inventory-cube.md": "costs-calculating-daily-fish-in-inventory-cube.md",
    "rf-consolidation-calculation-rules-functions.md": "rf-consolidation-calculation-rules-functions.md",
    "replace-replacing-text.md": "replace-replacing-text.md",
    "market-rule.md": "market-rule.md",
    "eirf-elementindex.md": "functions/eirf-elementindex.md",
    "mrf-round.md": "functions/mrf-round.md",
    "dtrf-today.md": "functions/dtrf-today.md",
    "eirf-elementtype.md": "functions/eirf-elementtype.md",
    "eirf-elparn.md": "functions/eirf-elparn.md",
    "trf-char.md": "functions/trf-char.md",
    "dtrf-time.md": "functions/dtrf-time.md",
    "mrf-cos.md": "functions/mrf-cos.md",
    "trfdf-dimix.md": "functions/trfdf-dimix.md",
    "ada-dimix.md": "functions/ada-dimix.md",
    "trf-char-dimix.md": "functions/trf-char-dimix.md"
}

please keep in mind that the JSON data is large and will get even larger.

Solution

If you want to search many times, you can search in O(1) after you index your JSON file, for example with whoosh library:

import os
import json
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, NGRAM
from whoosh.qparser import QueryParser

json_data = your_JSON_data
# Create schema for indexing, here I use NGRAM to allow search with substrings of size 2-10
schema = Schema(filename=TEXT(stored=True), content=NGRAM(minsize=2, maxsize=10, stored=True))

# Create index directory
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

# Create index
ix = create_in("indexdir", schema)
writer = ix.writer()

# Add JSON data to index
for key, value in json_data.items():
    writer.add_document(filename=key, content=value)
writer.commit()

# Searching the indexed data
def search_index(query_str):
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(query_str)
        results = searcher.search(query, limit=None)
        for result in results:
            print(result["filename"])

search_index("dimix")
# Should return:
# ada-dimix.md
# trfdf-dimix.md
# trf-char-dimix.md
# solutions-third-approach-using-dimix-comparisons.md

You can read more on whoosh indexing here.