I have an issue with a LlamaIndex query for which I can't find a resolution. Basically, I am trying to build a classifier using LlamaIndex. I have around 700 docs (not huge; each is just a couple of paragraphs). I split them into train and test sets and build the index on the train set. The issue is that each query takes around 1 minute per doc, so with 100+ docs in the test set the evaluation takes more than 2 hours. Is there a way around this? Below is a code snippet of how I am evaluating.
vector_query_engine = my_index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=text_qa_template,
)
df['PredictedOutcome'] = df['doc_text'].apply(lambda x: vector_query_engine.query(x))
Here is an approach that works much faster. This is only starter code; you can experiment with different indexes, retrievers, and queries to make it faster still.
Starter Steps & Code
--> Total time for 1,000 test files: about 10 minutes (see the timing output at the end).
import os
import time

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Attach each file's name as metadata so it travels with its chunks.
filename_fn = lambda filename: {"file_name": filename}

documents = SimpleDirectoryReader(
    "train",
    file_metadata=filename_fn,
).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
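Optionally, if you rerun the evaluation often, you can persist the index to disk so the train-set embeddings are computed only once. A minimal sketch using LlamaIndex's storage API (in recent versions these imports live under llama_index.core rather than llama_index):

from llama_index import StorageContext, load_index_from_storage

# Save the freshly built index once...
index.storage_context.persist(persist_dir="./index_storage")

# ...then on later runs load it instead of re-embedding the train set.
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)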
def read_files_in_folder(folder_path):
    # Read every file in the folder into a list of strings.
    file_contents = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r') as file:
                file_contents.append(file.read())
    return file_contents

file_contents = read_files_in_folder('test')
start_time = time.time()

content_categories = []
for content in file_contents:
    prompt = f'''
    Take the {content} and tell the category. Possible categories are:
    [business, entertainment, politics, sport, tech]
    '''
    # query() returns a Response object; keep just the text of the answer.
    response = query_engine.query(prompt)
    content_categories.append(str(response))

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
Elapsed time: 599.567540884018 seconds
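That works out to roughly 0.6 seconds per document, versus about a minute per document in the original approach.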
Further Development
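As mentioned above, trying different retrievers can make this faster still. Since most of the per-document cost is the LLM call in the response-synthesis step, one direction worth trying is a retrieval-only classifier: label each test doc with the category of its nearest training neighbour and skip the LLM entirely. A minimal sketch, assuming (hypothetically) that each training file's name encodes its label, e.g. business_001.txt; adapt the label parsing to your own naming scheme:

retriever = index.as_retriever(similarity_top_k=1)

def classify(text):
    # A single embedding lookup per document; no LLM call is made here.
    nodes = retriever.retrieve(text)
    file_name = nodes[0].node.metadata["file_name"]
    # Hypothetical convention: the label is the filename prefix before "_".
    return os.path.basename(file_name).split("_")[0]

content_categories = [classify(content) for content in file_contents]

Each prediction then costs only an embedding lookup, so throughput is far higher; accuracy may differ from the query-engine approach, so compare the two on your test set.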
If you have any more questions, don't hesitate to reach out. I hope this answer was useful for you 🍀