Hello Stack Overflow community,
I've been working on integrating ChatGPT's API into my project, and I'm having some trouble calculating the total number of tokens for my API requests. Specifically, I'm passing both messages and functions in my API calls.
I've managed to figure out how to calculate the token count for the messages, but I'm unsure about how to account for the tokens used by the functions.
Could someone please guide me on how to properly calculate the total token count, including both messages and functions, for a request to ChatGPT's API?
Any help or insights would be greatly appreciated!
Thank you in advance.
I have been working on brute-forcing a solution by formatting the data in the call in different ways, using the OpenAI tokenizer and Tiktokenizer to test my formats.
I am going to walk you through calculating the tokens for gpt-3.5 and gpt-4. You can apply a similar method to other models; you just need to find the right settings.
We are going to calculate the tokens used by the messages and the functions separately, then add them together at the end to get the total.
Start by getting the tokenizer using tiktoken. We will use this to tokenize all the custom text in the messages and functions. Also add constants for the extra tokens the API will add to the request.
enc = tiktoken.encoding_for_model(model)
Make a variable to hold the total tokens for the messages and set it to 0.
msg_token_count = 0
Loop through the messages, and for each message add 3 to `msg_token_count`. Then loop through each key/value pair in the message, encode the value, and add the length of the encoded value to `msg_token_count`. If the message has the "name" key set, add 1 additional token to `msg_token_count`.
for message in messages:
    msg_token_count += 3  # Add tokens for each message
    for key, value in message.items():
        msg_token_count += len(enc.encode(value))  # Add tokens for each value in the message
        if key == "name":
            msg_token_count += 1  # Add a token if name is set
Finally, we need to add 3 to `msg_token_count` for the ending tokens.
msg_token_count += 3  # Add tokens to account for ending
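Wrapped up as a function, the message-counting logic above looks like this. The `toy_encode` stand-in below (one "token" per whitespace-separated word) is only for illustration; in practice you would pass `enc.encode` from tiktoken:

```python
def count_message_tokens(messages, encode):
    """Estimate message tokens: 3 per message, plus the encoded values,
    plus 1 if "name" is set, plus 3 for the ending."""
    total = 0
    for message in messages:
        total += 3                        # per-message overhead
        for key, value in message.items():
            total += len(encode(value))   # tokens in each value
            if key == "name":
                total += 1                # extra token when name is set
    total += 3                            # ending tokens
    return total

# Toy encoder for illustration only; use enc.encode with a real model
toy_encode = lambda text: text.split()

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
]
print(count_message_tokens(messages, toy_encode))  # (3+1+3) + (3+1+1) + 3 = 15
```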
Now we are going to calculate the number of tokens the functions will take.
Start by making a variable to hold the total tokens used by functions and set it to 0.
func_token_count = 0
Next we are going to loop through the functions and add tokens to `func_token_count`. For each function, add 7 to `func_token_count`, then add the length of the encoded name and description.
If the function has properties, add 3 to `func_token_count`. Then for each key in the properties, add another 3 plus the length of the encoded property, making sure to subtract 3 if it has an "enum" key and to add 3 for each item in the enum list.
Finally, add 12 to `func_token_count` to account for the tokens at the end of all the functions.
for function in functions:
    func_token_count += 7  # Add tokens for the start of each function
    f_name = function["name"]
    f_desc = function["description"]
    if f_desc.endswith("."):
        f_desc = f_desc[:-1]
    line = f_name + ":" + f_desc
    func_token_count += len(enc.encode(line))  # Add tokens for the name and description
    if len(function["parameters"]["properties"]) > 0:
        func_token_count += 3  # Add tokens for the start of the properties
        for key in list(function["parameters"]["properties"].keys()):
            func_token_count += 3  # Add tokens for each property
            p_name = key
            p_type = function["parameters"]["properties"][key]["type"]
            p_desc = function["parameters"]["properties"][key]["description"]
            if "enum" in function["parameters"]["properties"][key].keys():
                func_token_count -= 3  # Subtract tokens if the property has an enum list
                for item in function["parameters"]["properties"][key]["enum"]:
                    func_token_count += 3
                    func_token_count += len(enc.encode(item))
            if p_desc.endswith("."):
                p_desc = p_desc[:-1]
            line = f"{p_name}:{p_type}:{p_desc}"
            func_token_count += len(enc.encode(line))
func_token_count += 12
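The same walkthrough for functions, as a self-contained sketch. Again, `toy_encode` is only a stand-in for `enc.encode`, and the weather function is a made-up example, not part of the original question:

```python
def count_function_tokens(functions, encode):
    total = 0
    for function in functions:
        total += 7                                  # per-function overhead
        desc = function["description"]
        if desc.endswith("."):
            desc = desc[:-1]
        total += len(encode(function["name"] + ":" + desc))
        props = function["parameters"]["properties"]
        if len(props) > 0:
            total += 3                              # start of the properties
            for name, spec in props.items():
                total += 3                          # per-property overhead
                if "enum" in spec:
                    total -= 3                      # enum adjustment
                    for item in spec["enum"]:
                        total += 3 + len(encode(item))
                p_desc = spec["description"]
                if p_desc.endswith("."):
                    p_desc = p_desc[:-1]
                total += len(encode(f"{name}:{spec['type']}:{p_desc}"))
    total += 12                                     # end of all functions
    return total

toy_encode = lambda text: text.split()  # illustration only; use enc.encode in practice

functions = [{
    "name": "get_weather",
    "description": "Get the weather.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "description": "Unit",
                     "enum": ["celsius", "fahrenheit"]},
        },
    },
}]
print(count_function_tokens(functions, toy_encode))  # 39
```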
Here is the full code. Note that instead of hard-coding the additional token counts, I used constants to hold the values.
import tiktoken

def get_token_count(model, messages, functions):
    # Initialize message settings to 0
    msg_init = 0
    msg_name = 0
    msg_end = 0
    # Initialize function settings to 0
    func_init = 0
    prop_init = 0
    prop_key = 0
    enum_init = 0
    enum_item = 0
    func_end = 0
    if model in [
        "gpt-3.5-turbo-0613",
        "gpt-4-0613"
    ]:
        # Set message settings for the above models
        msg_init = 3
        msg_name = 1
        msg_end = 3
        # Set function settings for the above models
        func_init = 7
        prop_init = 3
        prop_key = 3
        enum_init = -3
        enum_item = 3
        func_end = 12
    enc = tiktoken.encoding_for_model(model)
    msg_token_count = 0
    for message in messages:
        msg_token_count += msg_init  # Add tokens for each message
        for key, value in message.items():
            msg_token_count += len(enc.encode(value))  # Add tokens for each value in the message
            if key == "name":
                msg_token_count += msg_name  # Add tokens if name is set
    msg_token_count += msg_end  # Add tokens to account for ending
    func_token_count = 0
    if len(functions) > 0:
        for function in functions:
            func_token_count += func_init  # Add tokens for the start of each function
            f_name = function["name"]
            f_desc = function["description"]
            if f_desc.endswith("."):
                f_desc = f_desc[:-1]
            line = f_name + ":" + f_desc
            func_token_count += len(enc.encode(line))  # Add tokens for the name and description
            if len(function["parameters"]["properties"]) > 0:
                func_token_count += prop_init  # Add tokens for the start of the properties
                for key in list(function["parameters"]["properties"].keys()):
                    func_token_count += prop_key  # Add tokens for each property
                    p_name = key
                    p_type = function["parameters"]["properties"][key]["type"]
                    p_desc = function["parameters"]["properties"][key]["description"]
                    if "enum" in function["parameters"]["properties"][key].keys():
                        func_token_count += enum_init  # Subtract tokens if the property has an enum list
                        for item in function["parameters"]["properties"][key]["enum"]:
                            func_token_count += enum_item
                            func_token_count += len(enc.encode(item))
                    if p_desc.endswith("."):
                        p_desc = p_desc[:-1]
                    line = f"{p_name}:{p_type}:{p_desc}"
                    func_token_count += len(enc.encode(line))
        func_token_count += func_end
    return msg_token_count + func_token_count
Please let me know if something is not clear, or if you have a suggestion to make my post better.