I have created the following dictionary from the Cranfield Collection:
{
'd1' : ['experiment', 'studi', ..., 'configur', 'experi', '.'],
'd2' : ['studi', 'high-spe', ..., 'steadi', 'flow', '.'],
...,
'd1400': ['report', 'extens', ..., 'graphic', 'form', '.']
}
Each key-value pair represents a single document: the key is the document ID and the value is a list of tokenized, stemmed words with stopwords removed. I need to create an inverted index from this dictionary with the following format:
{
'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...},
'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}
...
}
Here each key is a term, and its value is a dictionary mapping every document that contains the term to a two-element list: the number of times the term occurs in that document, and the positions (indices into the document's token list) where it occurs. I am not sure how to approach this conversion, so I am just looking for some starter pointers on how to think about this problem. Thank you.
I hope I've understood your question well:
dct = {
    "d1": ["experiment", "studi", "configur", "experi", "."],
    "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
    "d1400": ["report", "extens", "graphic", "form", "."],
}

out = {}
# First pass: collect the positions of every word in every document.
for k, v in dct.items():
    for idx, word in enumerate(v):
        out.setdefault(word, {}).setdefault(k, []).append(idx)

# Second pass: replace each position list in place with [count, positions].
for v in out.values():
    for l in v.values():
        l[:] = [len(l), list(l)]

print(out)
Prints:
{
"experiment": {"d1": [1, [0]]},
"studi": {"d1": [1, [1]], "d2": [1, [0]]},
"configur": {"d1": [1, [2]]},
"experi": {"d1": [1, [3]]},
".": {"d1": [1, [4]], "d2": [1, [5]], "d1400": [1, [4]]},
"high-spe": {"d2": [1, [1]]},
"steadi": {"d2": [1, [2]]},
"flow": {"d2": [2, [3, 4]]},
"report": {"d1400": [1, [0]]},
"extens": {"d1400": [1, [1]]},
"graphic": {"d1400": [1, [2]]},
"form": {"d1400": [1, [3]]},
}
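If it helps, the same two-pass idea can be wrapped in a function with `collections.defaultdict` instead of chained `setdefault` calls. This is only a sketch under the assumptions in the question: the function name `build_inverted_index` and the variable names are my own, and the sample data is the shortened dictionary from above, not the real Cranfield documents.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [count, [positions]]}."""
    # term -> doc_id -> list of positions; missing keys are created on access.
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            positions[term][doc_id].append(pos)

    # Convert to plain dicts and attach the count to each position list.
    return {
        term: {doc_id: [len(pos_list), pos_list] for doc_id, pos_list in postings.items()}
        for term, postings in positions.items()
    }


# Shortened sample data (hypothetical, same shape as in the question).
docs = {
    "d1": ["experiment", "studi", "configur", "experi", "."],
    "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
}
index = build_inverted_index(docs)
print(index["flow"])   # {'d2': [2, [3, 4]]}
print(index["studi"])  # {'d1': [1, [1]], 'd2': [1, [0]]}
```

Looking up a term is then a plain dictionary access, and `index.get(term, {})` gives an empty result for terms that never occur.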