I have a large JSON file with a dictionary-like structure, containing values and keys. I want to search through the values efficiently and effectively. I'm looking for a search algorithm that is fast and returns the most relevant results.
Currently, I'm using fuzzy search, but it only matches from the beginning of the string. For instance, searching for dimix
doesn't yield a match if the actual string is ada_dimix
. I need an algorithm that can find the closest substring matches as well. For example, searching for mix
should return ada_dimix
.
Could you suggest a suitable search algorithm?
Here is sample data to try:
{
"tf-checking-feeders.md": "tf-checking-feeders.md",
"rule-setting-your-local-data-directory.md": "rule-setting-your-local-data-directory.md",
"formula-using-conditional-logic.md": "formula-using-conditional-logic.md",
"allocations-feeding-qty-required-kgs.md": "allocations-feeding-qty-required-kgs.md",
"rules-using-db-functions-move-data-between-cubes.md": "rules-using-db-functions-move-data-between-cubes.md",
"feeders-skipcheck.md": "feeders-skipcheck.md",
"rule-components-calculation-statement.md": "rule-components-calculation-statement.md",
"window-inserting-function.md": "window-inserting-function.md",
"market-writing-exchange-rate-rule-statement.md": "market-writing-exchange-rate-rule-statement.md",
"series-hard-coded-feeders.md": "series-hard-coded-feeders.md",
"allocations-calculating-quantities-fish-required-by-fishcake-type.md": "allocations-calculating-quantities-fish-required-by-fishcake-type.md",
"rule-purchase-cost-calculation.md": "rule-purchase-cost-calculation.md",
"options-setting-areas-appearance.md": "options-setting-areas-appearance.md",
"solutions-third-approach-using-dimix-comparisons.md": "solutions-third-approach-using-dimix-comparisons.md",
"process-feeding-first-statement.md": "process-feeding-first-statement.md",
"flows-implementing-depletion-model-using-rules.md": "flows-implementing-depletion-model-using-rules.md",
"rf-miscellaneous-rules-functions.md": "rf-miscellaneous-rules-functions.md",
"formula-external-cube-references.md": "formula-external-cube-references.md",
"costs-calculating-daily-fish-in-inventory-cube.md": "costs-calculating-daily-fish-in-inventory-cube.md",
"rf-consolidation-calculation-rules-functions.md": "rf-consolidation-calculation-rules-functions.md",
"replace-replacing-text.md": "replace-replacing-text.md",
"market-rule.md": "market-rule.md",
"eirf-elementindex.md": "functions/eirf-elementindex.md",
"mrf-round.md": "functions/mrf-round.md",
"dtrf-today.md": "functions/dtrf-today.md",
"eirf-elementtype.md": "functions/eirf-elementtype.md",
"eirf-elparn.md": "functions/eirf-elparn.md",
"trf-char.md": "functions/trf-char.md",
"dtrf-time.md": "functions/dtrf-time.md",
"mrf-cos.md": "functions/mrf-cos.md",
"trfdf-dimix.md": "functions/trfdf-dimix.md",
"ada-dimix.md": "functions/ada-dimix.md",
"trf-char-dimix.md": "functions/trf-char-dimix.md"
}
please keep in mind that the JSON data is large and will get even larger.
If you want to search many times, you can search in O(1) after you index your JSON file, for example with whoosh library:
import os
import json
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, NGRAM
from whoosh.qparser import QueryParser
json_data = your_JSON_data
# Create schema for indexing, here I use NGRAM to allow search with substrings of size 2-10
schema = Schema(filename=TEXT(stored=True), content=NGRAM(minsize=2, maxsize=10, stored=True))
# Create index directory
if not os.path.exists("indexdir"):
os.mkdir("indexdir")
# Create index
ix = create_in("indexdir", schema)
writer = ix.writer()
# Add JSON data to index
for key, value in json_data.items():
writer.add_document(filename=key, content=value)
writer.commit()
# Searching the indexed data
def search_index(query_str):
with ix.searcher() as searcher:
query = QueryParser("content", ix.schema).parse(query_str)
results = searcher.search(query, limit=None)
for result in results:
print(result["filename"])
search_index("dimix")
# Should return:
# ada-dimix.md
# trfdf-dimix.md
# trf-char-dimix.md
# solutions-third-approach-using-dimix-comparisons.md
You can read more on whoosh indexing here.