python huggingface-transformers tokenize embedding llama-index

How can I match the token count used by BGE-M3 embedding model before embedding?


For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model would break a string down into before I embed the text. I could embed the string and count the tokens using the following code

Settings.callback_manager = callback_manager
embedding_vector = Settings.embed_model.get_text_embedding(text)
embedding_tokens = token_counter.total_embedding_token_count

but unfortunately, embedding large amounts of text is computationally expensive, so I would prefer not to use this method. After doing some digging, I realized that I could use

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

to tokenize text, but oddly enough any text I tokenize this way gives a different token count than the previously mentioned method. I think the crux of my issue might be that the BGE-M3 embedding model pre-processes text prior to embedding. I have tried to Google exactly what this pre-processing step looks like, but have been unable to find it so far. Below is some code that will allow us to re-create the issue I am talking about. Please note that this script assumes you have the model saved in ./embeddings. If you don't, it will attempt to automatically download the model, which is > 2GB.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from transformers import AutoTokenizer
import os

# String to test on
text = "Random words. This is a test! A very exciting test, indeed."
# just set chunk_size to 512
chunk_size = 512


# Load in the embedding model. If it does not exist, go ahead and download it
def create_embedding_model(_chunk_size=None):
    print('loading embeddings...')
    if os.path.exists('./embeddings/models--BAAI--bge-m3'):
        _cache_path = f"./embeddings/models--BAAI--bge-m3/snapshots/{os.listdir('./embeddings/models--BAAI--bge-m3/snapshots')[0]}"
        _embed_model = HuggingFaceEmbedding(model_name=_cache_path)
    else:
        os.makedirs("./embeddings", exist_ok=True)
        _emb_model_name = "BAAI/bge-m3"
        _embed_model = HuggingFaceEmbedding(model_name=_emb_model_name, max_length=_chunk_size,
                                            cache_folder='./embeddings')
    print('embeddings loaded')
    return _embed_model


# Grab embedding model
embed_model = create_embedding_model(_chunk_size=chunk_size)

# Create a token counting handler
token_counter = TokenCountingHandler()
callback_manager = CallbackManager([token_counter])

Settings.embed_model = embed_model
Settings.callback_manager = callback_manager

# Grab the embedding vector from the model
embedding_vector = Settings.embed_model.get_text_embedding(text)

# Grab the count of tokens from the embedding vector
embedding_tokens = token_counter.total_embedding_token_count

# Just the tokenizer
model_name = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenized_text = tokenizer(text)
token_count = len(tokenized_text['input_ids'])

print(f"Original text: {text}")
print(f"The embedding model broke the text into: {embedding_tokens} tokens")
print(f"The tokenizer broke the text into {token_count} tokens")

When I run the above, I get the following output

Original text: Random words. This is a test! A very exciting test, indeed.
The embedding model broke the text into: 15 tokens
The tokenizer broke the text into 18 tokens

How can I replicate the token count that the BGE-M3 embedding model uses, without running the embedding itself? Is there a way to pre-process the text in the same way the embedding model does, so that the tokenizer gives the same token count?


Solution

  • What you face here is a good example of the Python principle that explicit is better than implicit.

    The TokenCountingHandler initializes a TokenCounter object with the tokenizer you provide as a parameter, or with the default tokenizer (code reference). The default tokenizer of llama_index is a tiktoken (i.e. OpenAI) tokenizer, not the one your model uses:

    from llama_index.core import Settings
    
    print(Settings.tokenizer)
    

    Output:

    functools.partial(<bound method Encoding.encode of <Encoding 'cl100k_base'>>, allowed_special='all')
    
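    If you want to confirm that the 15 tokens reported in the question really come from this default cl100k_base encoding rather than from the BGE-M3 tokenizer, a minimal check (assuming the tiktoken package is installed) looks like this:

    import tiktoken

    # The handler's default tokenizer is tiktoken's cl100k_base encoding,
    # so counting with it directly should reproduce the 15 tokens the
    # TokenCountingHandler reported, not the 18 from the BGE-M3 tokenizer.
    enc = tiktoken.get_encoding("cl100k_base")
    text = "Random words. This is a test! A very exciting test, indeed."
    print(len(enc.encode(text)))  # expected: 15
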

    In order to get the correct number of tokens, which is 18, you need to initialize the TokenCounter object with the tokenizer your model actually uses. Since llama_index's TokenCounter implementation expects the tokenizer to return the list of input_ids rather than Hugging Face's standard BatchEncoding object, you need to wrap the tokenizer first (otherwise the count will always be 2, i.e. the number of keys in the BatchEncoding: input_ids and attention_mask).

    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.core import Settings
    from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
    from transformers import AutoTokenizer
    from transformers import XLMRobertaTokenizerFast
    import os
    
    text = "Random words. This is a test! A very exciting test, indeed."
    chunk_size = 512
    
    model_id = "BAAI/bge-m3"
    
    
    # AutoTokenizer is just a factory
    # BAAI/bge-m3 uses an XLMRobertaTokenizer
    class MyTokenCounterTokenizerForLlamaIndex(XLMRobertaTokenizerFast):
        def __call__(self, *args, **kwargs):
            return super().__call__(*args, **kwargs).input_ids
    
    llama_index_tokenizer = MyTokenCounterTokenizerForLlamaIndex.from_pretrained(model_id)
    
    embed_model = HuggingFaceEmbedding(model_name=model_id, max_length=chunk_size)
    
    # Create a token counting handler
    # You could of course also make it the default via the settings object
    token_counter = TokenCountingHandler(tokenizer=llama_index_tokenizer)
    callback_manager = CallbackManager([token_counter])
    
    Settings.embed_model = embed_model
    Settings.callback_manager = callback_manager
    
    # Grab the embedding vector from the model
    embedding_vector = Settings.embed_model.get_text_embedding(text)
    
    # Grab the count of tokens from the embedding vector
    embedding_tokens = token_counter.total_embedding_token_count
    
    # Just the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    token_count = len(tokenizer(text).input_ids)
    
    print(f"Original text: {text}")
    print(f"The embedding model broke the text into: {embedding_tokens} tokens")
    print(f"The tokenizer broke the text into {token_count} tokens")
    

    Output:

    Original text: Random words. This is a test! A very exciting test, indeed.
    The embedding model broke the text into: 18 tokens
    The tokenizer broke the text into 18 tokens
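
    If you prefer not to subclass the tokenizer, wrapping it in a small callable should work just as well, since the handler only needs a function that maps a string to a list of token ids. A minimal sketch based on the reasoning above, not an official llama_index recipe:

    from llama_index.core.callbacks import TokenCountingHandler
    from transformers import AutoTokenizer

    hf_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

    # Give the handler a callable that returns the plain list of input_ids
    # instead of the BatchEncoding object.
    token_counter = TokenCountingHandler(
        tokenizer=lambda t: hf_tokenizer(t).input_ids
    )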