I have a Node.js program that connects to OpenAI's Assistants API to create messages. I followed this documentation from OpenAI to create the steps below:
await openai.beta.threads.messages.create(threadId, {
  role: "user",
  content: createMessage(),
});
// Keep the returned run so that its ID can be used to check the status.
const run = await openai.beta.threads.runs.create(threadId, {
  assistant_id: assistantId,
  instructions:
    "Please address the user as Mahesh. The user is an administrator.",
});
await openai.beta.threads.runs.retrieve(threadId, run.id);
const messages = await openai.beta.threads.messages.list(threadId, {
  limit: 1,
});
This code takes around 250,000 tokens to complete. The image shows today's token usage for three requests.
There could be multiple reasons why your cost of running an assistant is very high.
If you take a look at the official OpenAI documentation, you'll see that they use the gpt-4-1106-preview model. They state:
We recommend using OpenAI’s latest models with the Assistants API for best results and maximum compatibility with tools.
But an older model might be good enough, depending on what your assistant is used for. You can lower the cost of running the assistant just by changing the model. Of course, if the assistant then performs considerably worse, you will need to go back to the latest models. The table below shows how much difference the choice of model makes: gpt-3.5-turbo-1106 is 10 times cheaper per input token and 15 times cheaper per output token.
MODEL | INPUT | OUTPUT |
---|---|---|
gpt-4-1106-preview | $0.01 / 1K tokens | $0.03 / 1K tokens |
gpt-3.5-turbo-1106 | $0.001 / 1K tokens | $0.002 / 1K tokens |
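
To make the price gap concrete, here is a minimal sketch that applies the prices from the table above to a single request. The helper name `estimateCostUsd` and the split of 250,000 input tokens and 1,000 output tokens are assumptions for illustration only; the question does not say how its tokens divide between input and output.

```javascript
// Prices in USD per 1,000 tokens, copied from the table above.
const PRICES_PER_1K = {
  "gpt-4-1106-preview": { input: 0.01, output: 0.03 },
  "gpt-3.5-turbo-1106": { input: 0.001, output: 0.002 },
};

// Hypothetical helper: estimates the cost of one request from its token counts.
function estimateCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES_PER_1K[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens / 1000) * price.input + (outputTokens / 1000) * price.output;
}

// Assumed token split, for illustration only.
for (const model of Object.keys(PRICES_PER_1K)) {
  console.log(model, estimateCostUsd(model, 250000, 1000).toFixed(2));
}
```

Under these assumed numbers the same request costs roughly ten times less on the older model, which is why the model choice is worth checking first.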
As stated in the official OpenAI documentation:
Assistants can access persistent threads. Threads simplify AI application development by storing message history and truncating it when the conversation gets too long for the model’s context length. You create a thread once, and simply append messages to it as your users reply.
/ ... /
Threads and messages represent a conversation session between an assistant and a user. There is no limit to the number of messages you can store in a thread. Once the size of the messages exceeds the context window of the model, the thread will attempt to include as many messages as possible that fit in the context window and drop the oldest messages.
The thread is storing the message history! The gpt-4-1106-preview model has a context window of 128,000 tokens. So, if you chat with your assistant using the same thread for long enough, the thread will fill up to the context window of your chosen model.
If you choose gpt-4-1106-preview, this means that after chatting with your assistant on the same thread for some time, a single question can consume up to 128,000 tokens. Your latest question might contain only 1,000 tokens, but the hundreds of earlier messages, both your questions and the assistant's answers, are sent to the Assistants API along with it.
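The growth described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: every turn adds the same number of tokens to the thread, and each run sends the whole history truncated to the context window. The function name `simulateContextTokens` and the sample numbers are made up for the example; real usage depends on your messages and on how the API truncates.

```javascript
// Simulates the context tokens sent per run as a thread grows.
// Assumption: each turn adds `tokensPerTurn` tokens to the stored history,
// and a run never sends more than `contextWindow` tokens.
function simulateContextTokens(turns, tokensPerTurn, contextWindow) {
  const perRun = [];
  let historyTokens = 0;
  for (let turn = 0; turn < turns; turn++) {
    historyTokens += tokensPerTurn; // the new exchange joins the history
    perRun.push(Math.min(historyTokens, contextWindow)); // capped by the model
  }
  return perRun;
}

// 200 turns of 1,000 tokens each on a model with a 128,000-token window.
const perRun = simulateContextTokens(200, 1000, 128000);
const total = perRun.reduce((sum, tokens) => sum + tokens, 0);
console.log("first run:", perRun[0], "last run:", perRun[199], "total:", total);
```

The first run sends only the new message, while later runs hit the cap, so the total grows far faster than the number of questions asked.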
In your case, you can see that today you spent 760,564 context tokens. You have probably been using the same thread for quite some time.
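If a long-lived thread is indeed the cause, one possible way to limit the history is to start a fresh thread from time to time. The sketch below is an assumption, not something taken from the OpenAI documentation quoted above: `createThreadRotator` and `maxRuns` are hypothetical names, and `openai` is expected to be an already configured client exposing `openai.beta.threads.create()`. The trade-off is that the assistant no longer sees messages stored in the old thread.

```javascript
// Hypothetical helper: hands out a thread ID and replaces the thread after
// `maxRuns` uses, so that the stored message history starts empty again.
function createThreadRotator(openai, maxRuns) {
  let threadId = null;
  let runsOnThread = 0;
  return async function nextThreadId() {
    if (threadId === null || runsOnThread >= maxRuns) {
      const thread = await openai.beta.threads.create(); // no history yet
      threadId = thread.id;
      runsOnThread = 0;
    }
    runsOnThread += 1;
    return threadId;
  };
}
```

You would call `const nextThreadId = createThreadRotator(openai, 20);` once, and then use `await nextThreadId()` wherever the code currently passes `threadId`. The value 20 is arbitrary.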
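You said that you check every 5 seconds whether the run status has moved to completed. Try increasing this interval, to 10 seconds for example, to make fewer API calls. Status checks are API requests too, so polling less often keeps the request count down, although the token counts above are driven by the thread history rather than by polling.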
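A polling loop with a configurable interval could look like the sketch below. The helper names `sleep` and `waitForRun` are hypothetical, `openai` is assumed to be an already configured client, and the `retrieve(threadId, runId)` call uses the same signature as the code in the question; other SDK versions may expect different arguments.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run statuses that mean the run has not reached a final state yet.
const PENDING_STATUSES = ["queued", "in_progress", "cancelling"];

// Hypothetical helper: polls the run every `intervalMs` milliseconds and
// returns it once its status is no longer pending.
async function waitForRun(openai, threadId, runId, intervalMs = 10000) {
  for (;;) {
    const run = await openai.beta.threads.runs.retrieve(threadId, runId);
    if (!PENDING_STATUSES.includes(run.status)) {
      return run; // e.g. completed, failed, expired, or requires_action
    }
    await sleep(intervalMs);
  }
}
```

Inside an async function, `const run = await waitForRun(openai, threadId, runId);` would then replace the manual status check, and you should inspect `run.status` before reading the messages because the run may have failed instead of completed.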