I have an issue with a LlamaIndex query for which I can't find a resolution. Basically, I am trying to build a classifier using LlamaIndex. I have around 700 docs (not huge; each is just a couple of paragraphs). I split them into train and test sets and build the index on the train set. The issue is that each query takes around 1 minute per doc, so with 100+ docs in the test set the evaluation takes more than 2 hours. Is there a way around this? Below is a code snippet of how I am evaluating.
vector_query_engine = my_index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=text_qa_template,
)
df['PredictedOutcome'] = df['doc_text'].apply(lambda x: vector_query_engine.query(x))
Here is an approach that works much faster. This is only starter code; you can experiment with different indexes, retrievers, and queries to make it faster still.
Starter Steps & Code
--> Total time for 1,000 test files: about 10 minutes (see the timing output at the end).
import os
import time

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Attach each file's name as metadata so it travels with its chunks.
filename_fn = lambda filename: {"file_name": filename}

documents = SimpleDirectoryReader(
    "train",
    file_metadata=filename_fn,
).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
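Optionally, if you rerun the evaluation often, you can persist the index to disk so the train-set embeddings are computed only once. A minimal sketch using LlamaIndex's storage API (in recent versions these imports live under llama_index.core rather than llama_index):

from llama_index import StorageContext, load_index_from_storage

# Save the freshly built index once...
index.storage_context.persist(persist_dir="./index_storage")

# ...then on later runs load it instead of re-embedding the train set.
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)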
def read_files_in_folder(folder_path):
    # Read every file in the folder into a list of strings.
    file_contents = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r') as file:
                file_contents.append(file.read())
    return file_contents

file_contents = read_files_in_folder('test')
start_time = time.time()

content_categories = []
for content in file_contents:
    prompt = f'''
    Take the {content} and tell the category. Possible categories are:
    [business, entertainment, politics, sport, tech]
    '''
    # query() returns a Response object; keep just the text of the answer.
    response = query_engine.query(prompt)
    content_categories.append(str(response))

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time} seconds")
Elapsed time: 599.567540884018 seconds
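That works out to roughly 0.6 seconds per document, versus about a minute per document in the original approach.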
Further Development
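As mentioned above, trying different retrievers can make this faster still. Since most of the per-document cost is the LLM call in the response-synthesis step, one direction worth trying is a retrieval-only classifier: label each test doc with the category of its nearest training neighbour and skip the LLM entirely. A minimal sketch, assuming (hypothetically) that each training file's name encodes its label, e.g. business_001.txt; adapt the label parsing to your own naming scheme:

retriever = index.as_retriever(similarity_top_k=1)

def classify(text):
    # A single embedding lookup per document; no LLM call is made here.
    nodes = retriever.retrieve(text)
    file_name = nodes[0].node.metadata["file_name"]
    # Hypothetical convention: the label is the filename prefix before "_".
    return os.path.basename(file_name).split("_")[0]

content_categories = [classify(content) for content in file_contents]

Each prediction then costs only an embedding lookup, so throughput is far higher; accuracy may differ from the query-engine approach, so compare the two on your test set.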
If you have any more questions, don't hesitate to reach out. I hope this answer was useful for you 🍀