Tags: openai-api, chatgpt-api, azure-openai, azure-ai

Does prompt_tokens usage affect my billing when using Azure OpenAI models with your own data?


I have set up Azure OpenAI on your data (Chat with Azure OpenAI models using your own data). My goal was to reduce token usage in each request.

However, I have noticed additional prompt_tokens usage, even when I send only user content with an empty system prompt. For example, if I send just the text hello there, it results in a total of 2,628 tokens, whereas I expected only around 24. If a longer text (7 words) is sent, again without any prompt, the total is approximately 3.4k tokens.

Example:

[{'role': 'system', 'content': ''}, {'role': 'user', 'content': 'hello there'}]
total_tokens: {'completion_tokens': 24, 'prompt_tokens': 2604, 'total_tokens': 2628}

----------------------------------------------------

[{'role': 'system', 'content': ''},
{'role': 'user', 'content': 'I worked overtime what should I do?'}]
total_tokens: {'completion_tokens': 52, 'prompt_tokens': 3334, 'total_tokens': 3386}
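For reference, requests like the two above are typically made against the Azure OpenAI chat completions endpoint with the "on your data" extension enabled. A minimal sketch with the openai Python SDK might look like the following; the endpoint, key, deployment, and search-index values are placeholders, and the exact data_sources field names depend on the api_version you target:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-azure-openai-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="<your-deployment-name>",
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": "hello there"},
    ],
    # The "on your data" extension: the service retrieves chunks from this
    # Azure AI Search index and injects them into the prompt before generation.
    # (Older api_versions used different field names, e.g. dataSources.)
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://<your-search-resource>.search.windows.net",
                    "index_name": "<your-index-name>",
                    "authentication": {"type": "api_key", "key": "<your-search-key>"},
                },
            }
        ]
    },
)

# response.usage carries the prompt_tokens / completion_tokens shown above.
print(response.usage)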

As you can see, the usage shows prompt_tokens close to ~3.5k. Where does this prompt_tokens usage come from, given that I do not provide any prompt or system message? Isn't the whole purpose of using Azure OpenAI models with your own data to reduce token usage? An additional ~3.5k tokens per request is very expensive. Will it affect my billing, i.e. will prompt_tokens be counted as input tokens?


The pricing page (pricing/details/cognitive-services/openai-service/) states that Input (per 1,000 tokens) is $0.0025, so as I understand it, 4,000 tokens should cost $0.01.
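If that rate is right, the arithmetic for one of the requests above is just the prompt (input) token count scaled by the per-1,000-token price; a quick sketch, assuming the $0.0025 figure quoted above and that prompt_tokens are billed at the input rate:

# Assumed input price from the pricing page quoted above: $0.0025 per 1,000 tokens.
input_price_per_1k = 0.0025

# Usage from the second example above.
usage = {"completion_tokens": 52, "prompt_tokens": 3334, "total_tokens": 3386}

# If prompt_tokens count as input tokens, ~3.3k of them cost roughly
# 3334 / 1000 * 0.0025 ≈ $0.0083 per request (completion tokens are billed
# separately at the output rate).
input_cost = usage["prompt_tokens"] / 1000 * input_price_per_1k
print(f"input cost for this request: ${input_cost:.4f}")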


Solution

  • There is a bit of magic going on when they "use your data"; it is essentially Retrieval Augmented Generation (RAG). Basically, there are a few more steps than simply your prompt text; the rough sketch after this explanation shows how those extra prompt tokens add up.

    They explain it fairly well in the documentation:

    In total, there are two calls made to the model:

    For processing the intent: The token estimate for the intent prompt includes those for the user question, conversation history, and the instructions sent to the model for intent generation.

    For generating the response: The token estimate for the generation prompt includes those for the user question, conversation history, the retrieved list of document chunks, role information, and the instructions sent to it for generation.
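To make the numbers above plausible, here is a rough, illustrative estimate (not Azure's actual internal prompt) of how a two-word question turns into thousands of prompt tokens once the retrieved chunks and grounding instructions are counted. The instruction text and chunk sizes below are made-up placeholders, and tiktoken's cl100k_base encoding is used as a stand-in for the deployed model's tokenizer:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the gpt-3.5/gpt-4 family

def n_tokens(text: str) -> int:
    return len(enc.encode(text))

user_question = "hello there"

# Illustrative stand-in for the grounding/system instructions the service injects.
grounding_instructions = (
    "You are an AI assistant. Answer using only the retrieved documents below. "
    "If the answer is not contained in the documents, say you don't know."
)

# Illustrative stand-in for the top-N document chunks pulled from the search index.
retrieved_chunks = ["(roughly a paragraph of your document text) " * 40] * 5

generation_prompt_tokens = (
    n_tokens(user_question)
    + n_tokens(grounding_instructions)
    + sum(n_tokens(chunk) for chunk in retrieved_chunks)
)

# With realistic chunk sizes this lands in the low thousands, and the separate
# intent call adds its own prompt tokens on top -- hence the ~2.6k-3.4k totals.
print(generation_prompt_tokens)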