Hello, I was checking the sentiment of a text using a pretrained transformers model, but doing so gave me this error:
RuntimeError: The size of tensor a (1954) must match the size of tensor b (512) at non-singleton dimension 1
I went through a few posts which suggested that setting max_length to 512 would resolve the error.
It did resolve the error, but I want to know how it affects the quality of the output. Does it truncate my text? For example, if the length of my text is 1195, will it only process up to 512, something like text[:512]?
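For reference, a minimal sketch of what I'm running; the model name and input are placeholders, not my real ones:

```python
from transformers import pipeline

# placeholder checkpoint and text; my actual model and input differ
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
long_text = "a long review " * 500  # long enough to exceed 512 tokens

# classifier(long_text)  # -> RuntimeError: The size of tensor a ... must match ...
print(classifier(long_text, truncation=True, max_length=512))  # runs fine
```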
Yes. It means the sentiment will be based on the first 512 tokens, and any tokens after that will not influence the result.
Note that this is tokens, not characters. If text was your raw string, and if we assume that on average each token is 2.5 characters, then truncating at 512 tokens would be the same as text[:1280].
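You can see exactly what the model receives by tokenizing the text yourself. A quick sketch (the checkpoint name is just an example; swap in whichever model you're using):

```python
from transformers import AutoTokenizer

# illustrative checkpoint; substitute whichever model you are actually using
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

text = "a long review " * 500        # stand-in for your input
ids = tokenizer(text)["input_ids"]
print(len(ids))                      # token count, not character count

# truncation keeps only the first max_length tokens (special tokens included)
kept = tokenizer(text, truncation=True, max_length=512)["input_ids"]
print(len(kept))                     # 512
print(tokenizer.decode(kept))        # the prefix the model actually sees
```

Decoding the truncated ids shows you precisely which prefix of your text is being scored.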
(The number of characters per token can vary a lot depending on the model, the tokenizer, the language, the domain, and above all on how unusual the string is compared to the text the tokenizer was trained on.)
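If you'd rather measure the ratio on your own data than guess, a tiny sketch using the same illustrative tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
)

# everyday text yields longer tokens; gibberish fragments into many short pieces
for s in ["the cat sat on the mat", "zqxjv wkqzr pflm"]:
    n = len(tokenizer(s, add_special_tokens=False)["input_ids"])
    print(f"{len(s) / n:.1f} chars/token: {s!r}")
```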
By the way, according to https://huggingface.co/docs/transformers/pad_truncation, if you don't specify truncation then no truncation is applied; and if you do, but don't specify max_length, then it defaults to the maximum length the model supports. So setting max_length without changing anything else shouldn't have fixed it. (I haven't tested this or read the code; that is just my understanding of the documentation.)
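An untested sketch of that reading: with the default sentiment pipeline, passing truncation=True alone should suffice, because max_length then falls back to the model's maximum:

```python
from transformers import pipeline

clf = pipeline("sentiment-analysis")  # default checkpoint; illustrative
long_text = "a long review " * 500

# clf(long_text)                        # no truncation by default -> size-mismatch error
print(clf(long_text, truncation=True))  # truncates to the model's max (512 here)
```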