
How do I extract data from a document using the OpenAI API?


I want to extract key terms from rental agreements.

To do this, I want to send the PDF of the contract to an AI service that must return some key terms in JSON format.

What are some of the different libraries and companies that can do this? So far, I've explored the OpenAI API, but it isn't as straightforward as I would have imagined.

When using the ChatGPT interface, it works very well, so I thought using the API should be equally simple.

It seems like I need to read the PDF text first and then send the text to the OpenAI API.

Any other ideas to achieve this will be appreciated.


Solution

  • Note: The code below works with the OpenAI Assistants API v1. In April 2024, the OpenAI Assistants API v2 was released, in which the retrieval tool was renamed to file_search. See the migration guide.


    What you want to use is the Assistants API.

    As of today, there are 3 tools available: Code Interpreter, Knowledge Retrieval, and Function calling.

    You need to use the Knowledge Retrieval tool. As stated in the official OpenAI documentation:

    Retrieval augments the Assistant with knowledge from outside its model, such as proprietary product information or documents provided by your users. Once a file is uploaded and passed to the Assistant, OpenAI will automatically chunk your documents, index and store the embeddings, and implement vector search to retrieve relevant content to answer user queries.
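
    To make the chunk, embed, and search steps from the quote more concrete, here is a toy sketch in plain Python. It is not how OpenAI implements retrieval: the bag-of-words "embedding", the chunk size, and all function names (chunk_text, embed, cosine_similarity, retrieve) are assumptions chosen only for illustration, and the contract text is made up.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (real systems use learned embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # Made-up contract text, used only to exercise the functions above
    contract = (
        "The monthly rent is 1200 EUR payable on the first day of each month. "
        "The tenant shall pay a security deposit of 2400 EUR before moving in. "
        "The lease term is twelve months starting on 1 March."
    )
    chunks = chunk_text(contract, max_words=14)
    print(retrieve("How much is the security deposit?", chunks))
```

    With the Retrieval tool enabled, you do not write any of this yourself. The sketch only shows why a question about the deposit ends up being answered from the part of the contract that mentions it.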

    I've built a customer support chatbot in the past, which you can take as an example. In your case, you want the assistant to use your PDF file instead of the knowledge.txt file I used. Take a look at my GitHub and YouTube.

    customer_support_chatbot.py

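    import os
    from openai import OpenAI

    # Pass the API key to the client explicitly (OpenAI() also reads OPENAI_API_KEY from the environment by default)
    client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))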
    
    # Step 1: Upload a File with an "assistants" purpose
    my_file = client.files.create(
      file=open("knowledge.txt", "rb"),
      purpose='assistants'
    )
    print(f"This is the file object: {my_file} \n")
    
    # Step 2: Create an Assistant
    my_assistant = client.beta.assistants.create(
        model="gpt-3.5-turbo-1106",
        instructions="You are a customer support chatbot. Use your knowledge base to best respond to customer queries.",
        name="Customer Support Chatbot",
        tools=[{"type": "retrieval"}]
    )
    print(f"This is the assistant object: {my_assistant} \n")
    
    # Step 3: Create a Thread
    my_thread = client.beta.threads.create()
    print(f"This is the thread object: {my_thread} \n")
    
    # Step 4: Add a Message to a Thread
    my_thread_message = client.beta.threads.messages.create(
      thread_id=my_thread.id,
      role="user",
      content="What can I buy in your online store?",
      file_ids=[my_file.id]
    )
    print(f"This is the message object: {my_thread_message} \n")
    
    # Step 5: Run the Assistant
    my_run = client.beta.threads.runs.create(
      thread_id=my_thread.id,
      assistant_id=my_assistant.id,
      instructions="Please address the user as Rok Benko."
    )
    print(f"This is the run object: {my_run} \n")
    
    # Step 6: Periodically retrieve the Run to check on its status to see if it has moved to completed
    import time  # imported here so this step is self-contained; used to pause between polls

    while my_run.status in ["queued", "in_progress"]:
        time.sleep(1)  # wait before polling again instead of calling the API in a tight loop
        my_run = client.beta.threads.runs.retrieve(
            thread_id=my_thread.id,
            run_id=my_run.id
        )
        print(f"Run status: {my_run.status}")

    if my_run.status == "completed":
        print("\n")

        # Step 7: Retrieve the Messages added by the Assistant to the Thread
        all_messages = client.beta.threads.messages.list(
            thread_id=my_thread.id
        )

        print("------------------------------------------------------------ \n")

        print(f"User: {my_thread_message.content[0].text.value}")
        print(f"Assistant: {all_messages.data[0].content[0].text.value}")
    else:
        # Any other status (e.g. failed, cancelled, expired) ends the script without an answer
        print(f"Run ended with status: {my_run.status}")
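
    To adapt the chatbot to the rental agreement use case, you mainly need different instructions and a way to turn the assistant's reply into a Python dict. Below is a minimal sketch of those two pieces. The field names in KEY_TERMS and the helpers build_instructions and parse_json_reply are hypothetical and not part of the OpenAI library. The parser assumes the reply contains exactly one JSON object, possibly surrounded by extra text.

```python
import json
import re

# Hypothetical field names; replace them with the terms you need from your contracts
KEY_TERMS = ["landlord", "tenant", "monthly_rent", "security_deposit", "lease_start", "lease_end"]


def build_instructions(fields):
    """Build assistant instructions that ask for a JSON object with exactly these keys."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        "You extract key terms from rental agreements. "
        "Use the attached contract to answer. "
        f"Reply with a single JSON object containing exactly these keys: {keys}. "
        "Use null for any term that is not stated in the contract."
    )


def parse_json_reply(reply_text):
    """Return the JSON object found in the reply, or raise ValueError."""
    # Models sometimes wrap JSON in prose, so search from the first "{" to the last "}"
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the assistant's reply")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch ValueError only
    return json.loads(match.group(0))


if __name__ == "__main__":
    print(build_instructions(KEY_TERMS))
    # Made-up reply, used only to show the parsing step
    example_reply = 'Here are the terms: {"monthly_rent": 1200, "tenant": null}'
    print(parse_json_reply(example_reply))
```

    You would pass build_instructions(KEY_TERMS) as the instructions argument in Step 2, upload the contract PDF instead of knowledge.txt in Step 1, and call parse_json_reply(all_messages.data[0].content[0].text.value) in Step 7. Because the model is not guaranteed to return valid JSON, wrap the parsing call in a try/except ValueError block and retry or log the raw reply when it fails.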