python azure azure-ai azure-ai-search

How can I display references/citations in a Streamlit application?


I have created an Azure AI Search service, built an index from Azure Blob Storage, deployed a web application, and set up a chat playground using Azure OpenAI.

I have also built a Streamlit application in VS Code. In it, I upload a document, ask a query, and get an answer based on the uploaded document using the Azure AI Search index and Azure OpenAI. However, I also want the reference/citation to be displayed below the answer.

It should show the page/source information from which the answer was extracted.

The fields in my Index are:

  • content: Full text content of the document.
  • metadata_storage_path: Path where the document is stored.
  • metadata_author: Author of the document.
  • metadata_title: Title of the document.
  • metadata_creation_date: Creation date of the document.
  • language: Language of the document.
  • split_text: Segmented parts of the document text.
  • keywords: Keywords extracted from the document.
  • summary: Summary of the document content.
  • section_titles: Titles of sections within the document.
  • metadata_file_type: File type of the document (e.g., PDF, DOCX).
  • merged_content: Combined content from different parts of the document.
  • text: Main text of the document.
  • layoutText: Text layout information of the document.

Here is my code:

import os
import streamlit as st
from openai import AzureOpenAI
from azure.identity import AzureCliCredential
from azure.core.credentials import AccessToken

# Environment variables
endpoint = os.getenv("ENDPOINT_URL", "https://****************.azure.com/")
deployment = os.getenv("DEPLOYMENT_NAME", "openai-gpt-35-1106")
search_endpoint = os.getenv("SEARCH_ENDPOINT", "https://****************windows.net")
search_key = os.getenv("SEARCH_KEY", "********************************")
search_index = os.getenv("SEARCH_INDEX_NAME", "azureblob-index")

# Setup Azure OpenAI client
credential = AzureCliCredential()

def get_bearer_token() -> str:
    token = credential.get_token("https://****************windows.net")
    return token.token

client = AzureOpenAI(
    azure_endpoint=endpoint,
    azure_ad_token_provider=get_bearer_token,
    api_version="2024-05-01-preview"
)

# Streamlit UI
st.title("Document Uploader and Query Tool")

# File upload
uploaded_file = st.file_uploader("Upload a document", type=["pdf", "docx", "pptx", "xlsx", "txt"])

if uploaded_file is not None:
    file_content = uploaded_file.read()
    st.write("Document uploaded successfully!")

# Send query to Azure AI Search and OpenAI
query = st.text_input("Enter your query:")

if st.button("Get Answer"):
    if query:
        try:
            completion = client.chat.completions.create(
                model=deployment,
                messages=[
                    {
                        "role": "user",
                        "content": query
                    }
                ],
                max_tokens=800,
                temperature=0,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0,
                stop=None,
                stream=False,
                extra_body={
                    "data_sources": [{
                        "type": "azure_search",
                        "parameters": {
                            "endpoint": search_endpoint,
                            "index_name": search_index,
                            "semantic_configuration": "docs_test",
                            "query_type": "semantic",
                            "fields_mapping": {
                                "content_fields_separator": "\n",
                                "content_fields": ["content", "merged_content"]
                            },
                            "in_scope": True,
                            "role_information": "You are an AI assistant that helps people find information. The information should be small and crisp. It should be accurate.",
                            "authentication": {
                                "type": "api_key",
                                "key": search_key
                            }
                        }
                    }]
                }
            )

            response = completion.to_dict()
            answer = response["choices"][0]["message"]["content"]

            references = []
            if "references" in response["choices"][0]["message"]:
                references = response["choices"][0]["message"]["references"]

            st.write("Response from OpenAI:")
            st.write(answer)

            if references:
                st.write("References:")
                for i, ref in enumerate(references):
                    st.write(f"{i + 1}. {ref['title']} ({ref['url']})")

        except Exception as e:
            st.error(f"Error: {e}")
    else:
        st.warning("Please enter a query.")

The answer I am getting looks like this:

The primary purpose of Tesla's existence, as stated in the 2019 Impact Report, is to accelerate the world's transition to sustainable energy [doc1]. This mission is aimed at minimizing the environmental impact of products and their components, particularly in the product-use phase, by providing information on both the manufacturing and consumer-use aspects of Tesla products [doc1].

The [doc1] is a placeholder for the source information, but I want it to look like this:

Reference: the source information/page from which the answer was extracted.

Can you help?

Thanks in advance!


Solution

  • You can use the code below to extract the title and URL from the references.

    The [doc1] marker in the content of the message object is itself the reference.

    doc1 refers to the first entry in the citations list, which is stored in the message's context.

    The code below extracts it.

    First, find the unique references.

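    import re

    # simple_res is the chat completion response
    # (the object named `completion` in the question's code).
    pattern = r'\[(doc\d+)\]'
    text = simple_res.choices[0].message.content
    matches = re.findall(pattern, text)
    # Remove duplicates while keeping the order of first appearance.
    documents = list(dict.fromkeys(matches))

    print(documents)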
    

    Output:

    ['doc1']
    

    Next, create a dictionary of citations. The citations are mapped in increasing order: doc1 is the first citation, doc2 is the second, and so on.

    references = {}
    # docN maps to the N-th citation (1-based).
    for i, citation in enumerate(simple_res.choices[0].message.context['citations'], start=1):
        references[f"doc{i}"] = citation
    

    Now fetch the title and URL.

    if references:
        print("References:")
        for i, ref in enumerate(documents, start=1):
            print(f"{i}. {references[ref]['title']} ({references[ref]['url']})")
    

    Output:

    References:
    1. 78782543_7_23_24.html (https://xxxxx.blob.core.windows.net/data/pdf/78782543_7_23_24.html)
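
To connect the steps above to the question's code, which works on `completion.to_dict()`, here is a minimal sketch that does the same extraction on a plain dict. The helper names `cited_documents` and `build_reference_lines` are made up for illustration, and the sketch assumes the citations sit under `message["context"]["citations"]` with `title` and `url` keys, as in the output above. The sample message is hand-written, not a real API response.

```python
import re

# Matches markers such as [doc1], [doc2] in the answer text.
DOC_MARKER = re.compile(r"\[(doc\d+)\]")


def cited_documents(answer: str) -> list[str]:
    """Return the docN keys used in the answer, in order of first appearance."""
    return list(dict.fromkeys(DOC_MARKER.findall(answer)))


def build_reference_lines(message: dict) -> list[str]:
    """Turn the citations of one chat message into numbered reference lines.

    Assumption: `message` is response["choices"][0]["message"] taken from
    completion.to_dict(), with citations under message["context"]["citations"].
    """
    citations = (message.get("context") or {}).get("citations") or []
    # docN refers to the N-th citation (1-based).
    by_key = {f"doc{i}": c for i, c in enumerate(citations, start=1)}

    lines = []
    for key in cited_documents(message.get("content") or ""):
        citation = by_key.get(key)
        if citation is None:
            continue  # marker without a matching citation entry
        title = citation.get("title") or "Untitled"
        url = citation.get("url") or "no URL"
        lines.append(f"{len(lines) + 1}. {title} ({url})")
    return lines


# Hand-written sample shaped like the message in the question (illustrative only).
sample_message = {
    "content": (
        "The primary purpose of Tesla's existence is to accelerate the world's "
        "transition to sustainable energy [doc1]. This mission is aimed at "
        "minimizing the environmental impact of products [doc1]."
    ),
    "context": {
        "citations": [
            {
                "title": "78782543_7_23_24.html",
                "url": "https://xxxxx.blob.core.windows.net/data/pdf/78782543_7_23_24.html",
            }
        ]
    },
}

reference_lines = build_reference_lines(sample_message)
print("\n".join(reference_lines))
```

In the Streamlit app, the `references` lookup on `response["choices"][0]["message"]` could be replaced by `build_reference_lines(response["choices"][0]["message"])`, writing each returned line with `st.write`. If `title` or `url` come back empty, one possible cause is that the index fields for them are not mapped in `fields_mapping`; the sketch falls back to placeholder text in that case.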