
SentenceTransformer (SBERT): encode_multi_process(): difference between batch_size and chunk_size


Assume I have a few thousand sentences to encode on 4 CPU cores.

I believe I understand what batch_size means. A batch_size of 32 would mean that groups of 32 sentences are sent together to be encoded (the usual meaning of batching in deep learning).

If I run 4 processes (4 CPU cores), batches of 32 sentences would be sent to each core to be encoded.

I don't see what "chunk_size" is for... or what it means in this context. Thanks for any help, clarification, and your time...


Solution

  • You're passing a list of sentences to the model to encode. When running in parallel, there are multiple worker processes, each with its own copy of the model, performing the encoding.

    In summary: chunk_size controls how many sentences each worker process receives at a time, while batch_size is internal to each worker and controls how many of those sentences are encoded together in a single forward pass.
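
    To make the two parameters concrete, here is a minimal sketch of a 4-process CPU setup. The model name and the sentence list are placeholders for illustration; if chunk_size is left at its default of None, the library picks a value automatically.

        from sentence_transformers import SentenceTransformer

        if __name__ == "__main__":
            model = SentenceTransformer("all-MiniLM-L6-v2")

            # Placeholder input: a few thousand sentences, as in the question.
            sentences = [f"This is sentence number {i}." for i in range(5000)]

            # Start 4 worker processes, one per CPU core.
            pool = model.start_multi_process_pool(target_devices=["cpu"] * 4)

            # chunk_size: how many sentences are handed to each worker at a time.
            # batch_size: how many sentences each worker encodes per forward pass.
            embeddings = model.encode_multi_process(
                sentences, pool, batch_size=32, chunk_size=256
            )

            model.stop_multi_process_pool(pool)
            print(embeddings.shape)  # (5000, embedding_dim)

    With these (illustrative) numbers, each worker is handed 256 sentences at a time and encodes them internally in 8 batches of 32. A larger chunk_size means less inter-process communication overhead but coarser load balancing across workers.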