Tags: python, tokenize, openai-api

Calculating total tokens for API request to ChatGPT including functions


Hello Stack Overflow community,

I've been working on integrating ChatGPT's API into my project, and I'm having some trouble calculating the total number of tokens for my API requests. Specifically, I'm passing both messages and functions in my API calls.

I've managed to figure out how to calculate the token count for the messages, but I'm unsure about how to account for the tokens used by the functions.

Could someone please guide me on how to properly calculate the total token count, including both messages and functions, for a request to ChatGPT's API?

Any help or insights would be greatly appreciated!

Thank you in advance.

I have been working on brute-forcing a solution by formatting the data in the call in different ways, and using OpenAI's tokenizer and Tiktokenizer to test my formats.


Solution

  • I am going to walk you through calculating the tokens for gpt-3.5 and gpt-4. You can apply a similar method to other models; you just need to find the right settings.

    We are going to calculate the tokens used by the messages and the functions separately, then add them together at the end to get the total.

    Messages

    Start by getting the encoder for your model using tiktoken. We will use this to tokenize all the custom text in the messages and functions. (The step-by-step snippets below hard-code the extra token counts the API adds to the request; the full code at the end moves them into constants.)

    import tiktoken  # pip install tiktoken

    enc = tiktoken.encoding_for_model(model)
    

    Make a variable to hold the total tokens for the messages and set it to 0.

    msg_token_count = 0
    

    Loop through the messages, and for each message add 3 to msg_token_count. Then loop through each element in the message and encode the value, adding the length of the encoded object to msg_token_count. If the dictionary has the "name" key set add an additional token to msg_token_count.

    for message in messages:
        msg_token_count += 3  # Add tokens for each message
        for key, value in message.items():
            msg_token_count += len(enc.encode(value))  # Add tokens in set message
            if key == "name":
                msg_token_count += 1  # Add token if name is set
    

    Finally, we need to add 3 to msg_token_count to account for the tokens at the end of the message list.

    msg_token_count += 3  # Add tokens to account for ending
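
To see how the message constants add up, here is a minimal, self-contained sketch. It uses a stand-in encoder that splits on whitespace purely for illustration; real counts must come from tiktoken's enc.encode:

```python
# Stand-in encoder, for illustration only: one "token" per
# whitespace-separated word. Real code uses tiktoken's enc.encode.
def enc_encode(text):
    return text.split()

messages = [
    {"role": "user", "content": "Hello there"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]

msg_token_count = 0
for message in messages:
    msg_token_count += 3  # per-message overhead
    for key, value in message.items():
        msg_token_count += len(enc_encode(value))
        if key == "name":
            msg_token_count += 1  # extra token when name is set
msg_token_count += 3  # ending tokens

# 2 messages * 3 + ("user"=1 + "Hello there"=2)
#               + ("assistant"=1 + "Hi, how can I help?"=5) + 3 = 18
print(msg_token_count)
```

With the real encoder the word counts would differ, but the fixed overheads (3 per message, 1 per name, 3 at the end) stay the same.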
    

    Functions

    Now we are going to calculate the number of tokens the functions will take.

    Start by making a variable to hold the total tokens used by functions and set it to 0.

    func_token_count = 0
    

    Next, we are going to loop through the functions and add tokens to func_token_count: add 7 for each function, then the length of the encoded name and description, joined as name:description with any trailing period stripped.

    For each function that has properties, add 3 to func_token_count. Then for each property add another 3 plus the length of its encoded name:type:description line, subtracting 3 if the property has an "enum" key and adding 3 plus the encoded length for each item in the enum list.

    Finally, add 12 to func_token_count to account for the tokens at the end of all the functions.

    for function in functions:
        func_token_count += 7  # Add tokens for start of each function
        f_name = function["name"]
        f_desc = function["description"]
        if f_desc.endswith("."):
            f_desc = f_desc[:-1]
        line = f_name + ":" + f_desc
        func_token_count += len(enc.encode(line))  # Add tokens for set name and description
        if len(function["parameters"]["properties"]) > 0:
            func_token_count += 3  # Add tokens for start of each property
            for key in list(function["parameters"]["properties"].keys()):
                func_token_count += 3  # Add tokens for each set property
                p_name = key
                p_type = function["parameters"]["properties"][key]["type"]
                p_desc = function["parameters"]["properties"][key]["description"]
                if "enum" in function["parameters"]["properties"][key].keys():
                    func_token_count -= 3  # Subtract tokens if property has enum list
                    for item in function["parameters"]["properties"][key]["enum"]:
                        func_token_count += 3
                        func_token_count += len(enc.encode(item))
                if p_desc.endswith("."):
                    p_desc = p_desc[:-1]
                line = f"{p_name}:{p_type}:{p_desc}"
                func_token_count += len(enc.encode(line))
    func_token_count += 12
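
The function bookkeeping can be checked the same way with a stand-in whitespace encoder (illustrative only) and a hypothetical get_weather function definition:

```python
def enc_encode(text):
    # Stand-in for tiktoken's enc.encode, purely for illustration
    return text.split()

# Hypothetical function in the OpenAI functions format
functions = [{
    "name": "get_weather",
    "description": "Get the current weather.",
    "parameters": {
        "type": "object",
        "properties": {
            "unit": {
                "type": "string",
                "description": "Temperature unit",
                "enum": ["celsius", "fahrenheit"],
            },
        },
    },
}]

func_token_count = 0
for function in functions:
    func_token_count += 7  # start-of-function overhead
    f_desc = function["description"]
    if f_desc.endswith("."):
        f_desc = f_desc[:-1]  # strip trailing period
    func_token_count += len(enc_encode(function["name"] + ":" + f_desc))
    props = function["parameters"]["properties"]
    if len(props) > 0:
        func_token_count += 3  # start-of-properties overhead
        for key, prop in props.items():
            func_token_count += 3  # per-property overhead
            if "enum" in prop:
                func_token_count -= 3  # enum offsets the property overhead
                for item in prop["enum"]:
                    func_token_count += 3 + len(enc_encode(item))
            p_desc = prop["description"]
            if p_desc.endswith("."):
                p_desc = p_desc[:-1]
            func_token_count += len(enc_encode(f"{key}:{prop['type']}:{p_desc}"))
func_token_count += 12  # end-of-functions overhead

# 7 + 4 + 3 + 3 - 3 + (3+1) + (3+1) + 2 + 12 = 36
print(func_token_count)
```

Again, only the encoded lengths are fake here; the overhead constants (7, 3, 3, -3, 3, 12) match the walkthrough above.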
    

    Here is the full code. Note that instead of hard-coding the additional token counts, I use constants so settings for other models can be slotted in.

    def get_token_count(model, messages, functions):
        # Initialize message settings to 0
        msg_init = 0
        msg_name = 0
        msg_end = 0
        
        # Initialize function settings to 0
        func_init = 0
        prop_init = 0
        prop_key = 0
        enum_init = 0
        enum_item = 0
        func_end = 0
        
        if model in [
            "gpt-3.5-turbo-0613",
            "gpt-4-0613"
        ]:
            # Set message settings for above models
            msg_init = 3
            msg_name = 1
            msg_end = 3
            
            # Set function settings for the above models
            func_init = 7
            prop_init = 3
            prop_key = 3
            enum_init = -3
            enum_item = 3
            func_end = 12
        
        enc = tiktoken.encoding_for_model(model)
        
        msg_token_count = 0
        for message in messages:
            msg_token_count += msg_init  # Add tokens for each message
            for key, value in message.items():
                msg_token_count += len(enc.encode(value))  # Add tokens in set message
                if key == "name":
                    msg_token_count += msg_name  # Add tokens if name is set
        msg_token_count += msg_end  # Add tokens to account for ending
        
        func_token_count = 0
        if len(functions) > 0:
            for function in functions:
                func_token_count += func_init  # Add tokens for start of each function
                f_name = function["name"]
                f_desc = function["description"]
                if f_desc.endswith("."):
                    f_desc = f_desc[:-1]
                line = f_name + ":" + f_desc
                func_token_count += len(enc.encode(line))  # Add tokens for set name and description
                if len(function["parameters"]["properties"]) > 0:
                    func_token_count += prop_init  # Add tokens for start of each property
                    for key in list(function["parameters"]["properties"].keys()):
                        func_token_count += prop_key  # Add tokens for each set property
                        p_name = key
                        p_type = function["parameters"]["properties"][key]["type"]
                        p_desc = function["parameters"]["properties"][key]["description"]
                        if "enum" in function["parameters"]["properties"][key].keys():
                            func_token_count += enum_init  # Add tokens if property has enum list
                            for item in function["parameters"]["properties"][key]["enum"]:
                                func_token_count += enum_item
                                func_token_count += len(enc.encode(item))
                        if p_desc.endswith("."):
                            p_desc = p_desc[:-1]
                        line = f"{p_name}:{p_type}:{p_desc}"
                        func_token_count += len(enc.encode(line))
            func_token_count += func_end
        
        return msg_token_count + func_token_count
    

    Please let me know if something is not clear, or if you have a suggestion to make my post better.