
Create an inverted index from a dictionary with document IDs as keys and a list of terms as the value for each document


I have created the following dictionary from the Cranfield Collection:

{
    'd1'   : ['experiment', 'studi', ..., 'configur', 'experi', '.'], 
    'd2'   : ['studi', 'high-spe', ..., 'steadi', 'flow', '.'],
    ..., 
    'd1400': ['report', 'extens', ..., 'graphic', 'form', '.']
}

Each key-value pair represents a single document: the key is the document ID and the value is a list of its tokenized, stemmed words with stopwords removed. I need to create an inverted index from this dictionary in the following format:

{
    'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...},
    'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...},
    ...
}

Here each key is a term, and its value is a dictionary that maps every document containing the term to a two-element list: the number of times the term occurs in that document, and the positions (indices in the document's term list) at which it occurs. I am not sure how to approach this conversion, so I am looking for some pointers on how to think about the problem. Thank you.


Solution

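  • If I've understood your question correctly, you can build the index in two passes: first collect the positions of each term per document, then replace each position list with [count, positions]: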

    dct = {
        "d1": ["experiment", "studi", "configur", "experi", "."],
        "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
        "d1400": ["report", "extens", "graphic", "form", "."],
    }
    
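    out = {}
    # First pass: record every position of each term, grouped by document.
    for doc_id, terms in dct.items():
        for idx, word in enumerate(terms):
            out.setdefault(word, {}).setdefault(doc_id, []).append(idx)

    # Second pass: turn each position list into [count, positions] in place.
    for postings in out.values():
        for positions in postings.values():
            positions[:] = [len(positions), list(positions)]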
    
    print(out)
    

    Prints (reformatted here for readability):

    {
        "experiment": {"d1": [1, [0]]},
        "studi": {"d1": [1, [1]], "d2": [1, [0]]},
        "configur": {"d1": [1, [2]]},
        "experi": {"d1": [1, [3]]},
        ".": {"d1": [1, [4]], "d2": [1, [5]], "d1400": [1, [4]]},
        "high-spe": {"d2": [1, [1]]},
        "steadi": {"d2": [1, [2]]},
        "flow": {"d2": [2, [3, 4]]},
        "report": {"d1400": [1, [0]]},
        "extens": {"d1400": [1, [1]]},
        "graphic": {"d1400": [1, [2]]},
        "form": {"d1400": [1, [3]]},
    }
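
As a possible variation, the same structure can be built in a single pass by creating each [count, positions] entry the first time a term is seen in a document and updating it afterwards. The sketch below is only one way to do it, not the approach from the answer above; the names build_inverted_index, docs, and index are illustrative and do not come from the original post.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [count, positions]} in one pass."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            # Create the [count, positions] entry the first time this
            # term is seen in this document, then update it.
            entry = index[term].setdefault(doc_id, [0, []])
            entry[0] += 1
            entry[1].append(pos)
    # Return a plain dict so missing terms raise KeyError on lookup.
    return dict(index)


# Same small sample as in the answer above.
docs = {
    "d1": ["experiment", "studi", "configur", "experi", "."],
    "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
    "d1400": ["report", "extens", "graphic", "form", "."],
}

index = build_inverted_index(docs)
print(index["flow"])   # {'d2': [2, [3, 4]]}
print(index["studi"])  # {'d1': [1, [1]], 'd2': [1, [0]]}
```

Looking up a term then gives its postings directly: the keys of index[term] are the documents that contain it, and each value holds the term frequency and positions for that document.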