python, azure, azure-cognitive-search, azure-openai

How to Chat with my Data in Python using Microsoft Azure OpenAI and Azure Cognitive Search


I have written code that extracts text from a PDF document and converts it into vectors using the text-embedding-ada-002 model from Azure OpenAI. These vectors are then stored in a Microsoft Azure Cognitive Search index and can be queried. However, I now want to use Azure OpenAI to interact with this data and retrieve a generated result. My code works fine so far, but I don't know how to implement the interaction with my custom data in Azure Cognitive Search through Azure OpenAI in Python.

This is my code:

# imports required by the code below
import os

import fitz  # PyMuPDF
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from google.colab import drive
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch

OPENAI_API_BASE = "https://xxxxx.openai.azure.com"
OPENAI_API_KEY = "xxxxxx"
OPENAI_API_VERSION = "2023-05-15"

openai.api_type = "azure"
openai.api_key = OPENAI_API_KEY
openai.api_base = OPENAI_API_BASE
openai.api_version = OPENAI_API_VERSION

AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT = "https://xxxxxx.search.windows.net"
AZURE_COGNITIVE_SEARCH_API_KEY = "xxxxxxx"
AZURE_COGNITIVE_SEARCH_INDEX_NAME = "test"
AZURE_COGNITIVE_SEARCH_CREDENTIAL = AzureKeyCredential(AZURE_COGNITIVE_SEARCH_API_KEY)

llm = AzureChatOpenAI(deployment_name="gpt35", openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)
embeddings = OpenAIEmbeddings(deployment_id="ada002", chunk_size=1, openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION)

acs = AzureSearch(azure_search_endpoint=AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT,
                  azure_search_key = AZURE_COGNITIVE_SEARCH_API_KEY,
                  index_name = AZURE_COGNITIVE_SEARCH_INDEX_NAME,
                  embedding_function = embeddings.embed_query)


def generate_embeddings(s):
  # important: "engine" must be the name of my Azure OpenAI deployment!
  response = openai.Embedding.create(
      input=s,
      engine="ada002"
  )

  embeddings = response['data'][0]['embedding']

  return embeddings

def generate_tokens(s, f):
  # split the text into overlapping chunks and wrap each chunk in a Document
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
  splits = text_splitter.split_text(s)

  documents = []
  for i, split in enumerate(splits):
    metadata = {"index": i, "file_source": f}
    documents.append(Document(page_content=split, metadata=metadata))

  return documents

drive.mount('/content/drive')
folder = "/content/drive/docs/pdf/"

for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    if os.path.isfile(file_path):
        print(f"Processing file: {file_path}")

        # extract the plain text of the current PDF (reset for every file)
        doc_content = ''
        doc = fitz.open(file_path)
        for page in doc:  # iterate the document pages
            doc_content += page.get_text()  # get plain text encoded as UTF-8

        # split into chunks and upload to Azure Cognitive Search
        d = generate_tokens(doc_content, file_path)
        print(d)

        acs.add_documents(documents=d)

        print("Done.")


query = "What are the advantages of an open-source ai model?"
search_client = SearchClient(AZURE_COGNITIVE_SEARCH_SERVICE_ENDPOINT, AZURE_COGNITIVE_SEARCH_INDEX_NAME, credential=AZURE_COGNITIVE_SEARCH_CREDENTIAL)

# build the vector query from the embedded question
# (assumes azure-search-documents 11.4+, which provides VectorizedQuery)
vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="content_vector")

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["content_vector", "metadata"],
)

print(results)

for result in results:
  print(result)

The fields in the Azure Cognitive Search index are content_vector for the vectors and content for the plain text content. I have looked at a lot of GitHub code, including some published by Microsoft, so I know this has been implemented, but I obviously have some problems understanding how this particular piece works.

So I am looking for some help or a hint on how to extend this code to interact with the content in Azure Cognitive Search via Azure OpenAI chat.


Solution

  • What your code has done so far is perform a similarity search in Azure Cognitive Search and retrieve the data relevant to your question.

    The next step is to pass the query and this relevant data to an LLM so that it can generate an answer to the query from that data. The way you would do it is to create a prompt, populate it with this information, and send it to the LLM to answer the query.

    Here's some code to do the same:

    # "content" field contains the text content of your data. make sure that it is retrieved.
    results = search_client.search(
        search_text=None,
        vector_queries= [vector_query],
        select=["content", "content_vector", "metadata"],
    )
    
    context = ""
    for result in results:
      context += result.content + "\n\n"
    
    
    # setup prompt template
    template = """
    Use the following pieces of context to answer the question at the end. Question is enclosed in <question></question>.
    Do keep the following things in mind when answering the question:
    - If you don't know the answer, just say that you don't know, don't try to make up an answer.
    - Keep the answer as concise as possible.
    - Use only the context to answer the question. Context is enclosed in <context></context>
    - If the answer is not found in context, simply output "I'm sorry but I do not know the answer to your question.".
    
    
    <context>{context}</context>
    <question>{question}</question>
    """
    prompt_template = PromptTemplate.from_template(template)
    
    # initialize LLM
    llm = AzureChatOpenAI(deployment_name="gpt35", openai_api_key=OPENAI_API_KEY, openai_api_base=OPENAI_API_BASE, openai_api_version=OPENAI_API_VERSION, temperature=0)
    prompt = prompt_template.format(context=context, question= query)
    message = HumanMessage(content=prompt)
    result = llm([message])
    print(result.content)
    

    This is a classic Retrieval Augmented Generation (RAG) technique. I created a simple application using this to query Azure Documentation using natural language. The code above is based on the code I wrote for that application. You can read more about the application and see the source code here: https://github.com/gmantri/azure-docs-copilot.
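
    If you want an actual back-and-forth chat with your data rather than a single question/answer call, one simple option is to keep the conversation as a list of LangChain messages and rerun the retrieval step for every new user question. The following is only a rough sketch of that idea: it reuses the search_client, generate_embeddings, llm and prompt_template objects defined above, and the helper names (retrieve_context, chat, chat_history) are made up for illustration.

    from azure.search.documents.models import VectorizedQuery
    from langchain.schema import AIMessage, HumanMessage, SystemMessage

    # running conversation state, seeded with a system instruction
    chat_history = [SystemMessage(content="You are a helpful assistant that answers questions about the user's documents.")]

    def retrieve_context(question, top_k=3):
        # embed the question and run a vector search against the "content_vector" field
        vq = VectorizedQuery(vector=generate_embeddings(question), k_nearest_neighbors=top_k, fields="content_vector")
        hits = search_client.search(search_text=None, vector_queries=[vq], select=["content"])
        return "\n\n".join(hit["content"] for hit in hits)

    def chat(question):
        # retrieve fresh context for every turn and wrap it in the same prompt template as above
        prompt = prompt_template.format(context=retrieve_context(question), question=question)
        chat_history.append(HumanMessage(content=prompt))
        answer = llm(chat_history)
        chat_history.append(AIMessage(content=answer.content))
        return answer.content

    print(chat("What are the advantages of an open-source ai model?"))
    print(chat("Can you summarize that in one sentence?"))

    Note that this stores the full prompt (including the retrieved context) in the history, which keeps follow-up questions grounded but makes the token count grow quickly; a common refinement is to keep only the plain questions and answers in the history. LangChain's RetrievalQA chain combined with the AzureSearch vector store (acs.as_retriever()) is another way to package the retrieve-then-prompt pattern.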