For my particular project, it would be very helpful to know how many tokens the BGE-M3 embedding model breaks a string down into before I actually embed the text. I could embed the string and count the tokens using the following code
Settings.callback_manager = callback_manager
embedding_vector = Settings.embed_model.get_text_embedding(text)
embedding_tokens = token_counter.total_embedding_token_count
but unfortunately, embedding large amounts of text is computationally heavy, so I would prefer not to use this method. After doing some digging, I realized that I could use
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
to tokenize the text, but oddly enough this method gives a different token count than the previously mentioned one. I think the crux of my issue might be that the BGE-M3 embedding model pre-processes text prior to embedding. I have tried to google exactly what this pre-processing step looks like, but have been unable to find it so far. Below is some code that reproduces the issue I am talking about. Please note that this script assumes you have the model saved in ./embeddings. If you don't, it will attempt to automatically download the model, which is > 2 GB.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from transformers import AutoTokenizer
import os
# String to test on
text = "Random words. This is a test! A very exciting test, indeed."
# just set chunk_size to 512
chunk_size = 512
# Load in the embedding model. If it does not exist, go ahead and download it
def create_embedding_model(_chunk_size=None):
    print('loading embeddings...')
    if os.path.exists('./embeddings/models--BAAI--bge-m3'):
        _cache_path = f"./embeddings/models--BAAI--bge-m3/snapshots/{os.listdir('./embeddings/models--BAAI--bge-m3/snapshots')[0]}"
        _embed_model = HuggingFaceEmbedding(model_name=_cache_path)
    else:
        os.makedirs("./embeddings", exist_ok=True)
        _emb_model_name = "BAAI/bge-m3"
        _embed_model = HuggingFaceEmbedding(model_name=_emb_model_name, max_length=_chunk_size,
                                            cache_folder='./embeddings')
    print('embeddings loaded')
    return _embed_model
# Grab embedding model
embed_model = create_embedding_model(_chunk_size=chunk_size)
# Create a token counting handler
token_counter = TokenCountingHandler()
callback_manager = CallbackManager([token_counter])
Settings.embed_model = embed_model
Settings.callback_manager = callback_manager
# Grab the embedding vector from the model
embedding_vector = Settings.embed_model.get_text_embedding(text)
# Grab the count of tokens from the embedding vector
embedding_tokens = token_counter.total_embedding_token_count
# Just the tokenizer
model_name = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_text = tokenizer(text)
token_count = len(tokenized_text['input_ids'])
print(f"Original text: {text}")
print(f"The embedding model broke the text into: {embedding_tokens} tokens")
print(f"The tokenizer broke the text into {token_count} tokens")
When I run the above, I get the following output:
Original text: Random words. This is a test! A very exciting test, indeed.
The embedding model broke the text into: 15 tokens
The tokenizer broke the text into 18 tokens
How can I replicate the token count that the BGE-M3 embedding model uses, without running the embedding itself? Is there a way to pre-process the text in the same way the embedding model does, so that the tokenizer gives the same token count?
What you are facing here is a good example of the Python principle that explicit is better than implicit.
The TokenCountingHandler initializes a TokenCounter object with the tokenizer you provide as a parameter, or with the default tokenizer (code reference). The default tokenizer of llama_index is a tiktoken (i.e. OpenAI) tokenizer, not the one your model uses:
from llama_index.core import Settings
print(Settings.tokenizer)
Output:
functools.partial(<bound method Encoding.encode of <Encoding 'cl100k_base'>>, allowed_special='all')
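To see where the 15 in your output comes from, you can count the same string with that default encoding directly; a minimal sketch, assuming tiktoken is available (it should be, since llama_index uses it for its default tokenizer):
import tiktoken

text = "Random words. This is a test! A very exciting test, indeed."
# cl100k_base is what Settings.tokenizer points to by default, so it is
# what the TokenCountingHandler counted with in the question's script
encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode(text)))  # 15 -- matches the "embedding model" count above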
In order to get the correct number of tokens, which is 18, you need to initialize the TokenCounter with the tokenizer your model actually uses. Since llama_index's TokenCounter implementation requires the tokenizer to return the list of input_ids instead of Hugging Face's "standard" BatchEncoding object, you need to wrap the tokenizer first (otherwise the count will always be 2, the number of keys input_ids and attention_mask).
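You can see both numbers directly with the raw Hugging Face tokenizer (the 18 matches the tokenizer count from your output, and the 2 is the key count just mentioned):
from transformers import AutoTokenizer

text = "Random words. This is a test! A very exciting test, indeed."
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
encoding = tokenizer(text)       # a dict-like BatchEncoding object
print(len(encoding))             # 2  -> number of keys (input_ids, attention_mask)
print(len(encoding.input_ids))   # 18 -> the token count we actually want
With that in mind, here is the full script, passing the wrapped tokenizer to the TokenCountingHandler: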
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from transformers import AutoTokenizer
from transformers import XLMRobertaTokenizerFast
import os
text = "Random words. This is a test! A very exciting test, indeed."
chunk_size = 512
model_id = "BAAI/bge-m3"
# AutoTokenizer is just a factory
# BAAI/bge-m3 uses an XLMRobertaTokenizer
class MyTokenCounterTokenizerForLlamaIndex(XLMRobertaTokenizerFast):
    # Return only the input_ids list so llama_index's TokenCounter can take its len()
    def __call__(self, *args, **kwargs):
        return super().__call__(*args, **kwargs).input_ids
llama_index_tokenizer = MyTokenCounterTokenizerForLlamaIndex.from_pretrained(model_id)
embed_model = HuggingFaceEmbedding(model_name=model_id, max_length=chunk_size)
# Create a token counting handler
# You could of course also make it the default via the settings object
token_counter = TokenCountingHandler(tokenizer=llama_index_tokenizer)
callback_manager = CallbackManager([token_counter])
Settings.embed_model = embed_model
Settings.callback_manager = callback_manager
# Grab the embedding vector from the model
embedding_vector = Settings.embed_model.get_text_embedding(text)
# Grab the count of tokens from the embedding vector
embedding_tokens = token_counter.total_embedding_token_count
# Just the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
token_count = len(tokenizer(text).input_ids)
print(f"Original text: {text}")
print(f"The embedding model broke the text into: {embedding_tokens} tokens")
print(f"The tokenizer broke the text into {token_count} tokens")
Output:
Original text: Random words. This is a test! A very exciting test, indeed.
The embedding model broke the text into: 18 tokens
The tokenizer broke the text into 18 tokens
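As the comment in the script hints, instead of passing the wrapped tokenizer to every TokenCountingHandler you can also make it the global default; a minimal sketch, reusing the llama_index_tokenizer defined above:
from llama_index.core import Settings
from llama_index.core.callbacks import TokenCountingHandler

# Replace the default cl100k_base tokenizer with the wrapped BGE-M3 one,
# so handlers created without an explicit tokenizer count with it as well
Settings.tokenizer = llama_index_tokenizer
token_counter = TokenCountingHandler()  # now uses the model's tokenizer by default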