Tags: python, json, search

Fast and effective search algorithm for a large JSON file in Python


I have a large JSON file with a dictionary-like structure, containing keys and values. I want to search through the values efficiently. I'm looking for a search algorithm that is fast and returns the most relevant results.

Currently, I'm using fuzzy search, but it only matches from the beginning of the string. For instance, searching for dimix doesn't yield a match if the actual string is ada_dimix. I need an algorithm that can find the closest substring matches as well. For example, searching for mix should return ada_dimix.

Could you suggest a suitable search algorithm?

Here is sample data to try:

{
    "tf-checking-feeders.md": "tf-checking-feeders.md",
    "rule-setting-your-local-data-directory.md": "rule-setting-your-local-data-directory.md",
    "formula-using-conditional-logic.md": "formula-using-conditional-logic.md",
    "allocations-feeding-qty-required-kgs.md": "allocations-feeding-qty-required-kgs.md",
    "rules-using-db-functions-move-data-between-cubes.md": "rules-using-db-functions-move-data-between-cubes.md",
    "feeders-skipcheck.md": "feeders-skipcheck.md",
    "rule-components-calculation-statement.md": "rule-components-calculation-statement.md",
    "window-inserting-function.md": "window-inserting-function.md",
    "market-writing-exchange-rate-rule-statement.md": "market-writing-exchange-rate-rule-statement.md",
    "series-hard-coded-feeders.md": "series-hard-coded-feeders.md",
    "allocations-calculating-quantities-fish-required-by-fishcake-type.md": "allocations-calculating-quantities-fish-required-by-fishcake-type.md",
    "rule-purchase-cost-calculation.md": "rule-purchase-cost-calculation.md",
    "options-setting-areas-appearance.md": "options-setting-areas-appearance.md",
    "solutions-third-approach-using-dimix-comparisons.md": "solutions-third-approach-using-dimix-comparisons.md",
    "process-feeding-first-statement.md": "process-feeding-first-statement.md",
    "flows-implementing-depletion-model-using-rules.md": "flows-implementing-depletion-model-using-rules.md",
    "rf-miscellaneous-rules-functions.md": "rf-miscellaneous-rules-functions.md",
    "formula-external-cube-references.md": "formula-external-cube-references.md",
    "costs-calculating-daily-fish-in-inventory-cube.md": "costs-calculating-daily-fish-in-inventory-cube.md",
    "rf-consolidation-calculation-rules-functions.md": "rf-consolidation-calculation-rules-functions.md",
    "replace-replacing-text.md": "replace-replacing-text.md",
    "market-rule.md": "market-rule.md",
    "eirf-elementindex.md": "functions/eirf-elementindex.md",
    "mrf-round.md": "functions/mrf-round.md",
    "dtrf-today.md": "functions/dtrf-today.md",
    "eirf-elementtype.md": "functions/eirf-elementtype.md",
    "eirf-elparn.md": "functions/eirf-elparn.md",
    "trf-char.md": "functions/trf-char.md",
    "dtrf-time.md": "functions/dtrf-time.md",
    "mrf-cos.md": "functions/mrf-cos.md",
    "trfdf-dimix.md": "functions/trfdf-dimix.md",
    "ada-dimix.md": "functions/ada-dimix.md",
    "trf-char-dimix.md": "functions/trf-char-dimix.md"
}

Please keep in mind that the JSON data is large and will get even larger.


Solution

  • If you will run many searches, index the JSON file once so that each query avoids a full scan of the data, for example with the Whoosh library:

    import os
    import json
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, NGRAM
    from whoosh.qparser import QueryParser
    
    json_data = your_JSON_data  # e.g. json.load(open("data.json"))
    # Schema for indexing; the NGRAM field enables substring search with grams of size 2-10
    schema = Schema(filename=TEXT(stored=True), content=NGRAM(minsize=2, maxsize=10, stored=True))
    
    # Create index directory
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    
    # Create index
    ix = create_in("indexdir", schema)
    writer = ix.writer()
    
    # Add JSON data to index
    for key, value in json_data.items():
        writer.add_document(filename=key, content=value)
    writer.commit()
    
    # Searching the indexed data
    def search_index(query_str):
        with ix.searcher() as searcher:
            query = QueryParser("content", ix.schema).parse(query_str)
            results = searcher.search(query, limit=None)
            for result in results:
                print(result["filename"])
    
    search_index("dimix")
    # Should return:
    # ada-dimix.md
    # trfdf-dimix.md
    # trf-char-dimix.md
    # solutions-third-approach-using-dimix-comparisons.md
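To see what the NGRAM field is doing under the hood, here is a minimal pure-Python sketch of the same idea: an inverted index that maps each trigram to the set of documents containing it. This is illustrative only (stdlib, no Whoosh) and the class and method names are my own; a real index would also handle scoring and persistence:

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character n-grams in text (lowercased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NGramIndex:
    """Toy inverted index: trigram -> set of document ids."""

    def __init__(self):
        self.index = defaultdict(set)
        self.docs = []

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for gram in ngrams(doc):
            self.index[gram].add(doc_id)

    def search(self, query):
        grams = ngrams(query)
        if not grams:
            return []
        # A matching doc must contain every trigram of the query,
        # so intersect the posting sets.
        ids = set.intersection(*(self.index[g] for g in grams))
        return [self.docs[i] for i in sorted(ids)]
```

Because each query only touches the posting sets for its own trigrams, the cost per search depends on the number of candidate matches rather than the total number of keys, which is what makes the indexed approach scale as the JSON grows.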
    

    You can read more about indexing in the Whoosh documentation.
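If you only search occasionally and want to avoid building an index, a stdlib-only baseline is to rank values by the longest substring they share with the query, using `difflib.SequenceMatcher.find_longest_match`. This handles the `mix` → `ada_dimix` case from the question; the function name and threshold are my own choices, and note this scans every value per query, so it won't scale as well as the indexed approach:

```python
from difflib import SequenceMatcher


def fuzzy_substring_search(query, candidates, threshold=0.6):
    """Rank candidates by the longest substring they share with the
    query, relative to the query's length (1.0 means the whole query
    appears somewhere in the candidate)."""
    q = query.lower()
    scored = []
    for cand in candidates:
        c = cand.lower()
        matcher = SequenceMatcher(None, q, c)
        block = matcher.find_longest_match(0, len(q), 0, len(c))
        score = block.size / len(q) if q else 0.0
        if score >= threshold:
            scored.append((score, cand))
    # Best matches first; ties keep input order (sort is stable).
    scored.sort(key=lambda pair: -pair[0])
    return [cand for _, cand in scored]
```

For the sample data, `fuzzy_substring_search("mix", keys)` would rank the `*-dimix.md` entries at the top, since the whole query appears inside them.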