I am trying to follow this to translate English sentences to Japanese.
I load the model with this code:
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
quantized_model_dir = "webbigdata/ALMA-7B-Ja-GPTQ-Ja-En"
model_basename = "gptq_model-4bit-128g"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0",
)
Using this:
prompt1="Translate this from Japanese to English:\nJapanese: 量子化するとモデルの性能はどのくらい劣化してしまうのでしょうか?\nEnglish:"
input_ids = tokenizer(prompt1, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
input_ids.shape
I tokenize the input and print its shape:
torch.Size([1, 52])
which is a sequence of 52 tokens.
If I input a different sentence, this may vary.
From the model config I see max_length=512
(which I guess is the input size).
Shouldn't the result from the tokenizer always be of size 512, to match the model's input size? Or does the padding happen when the input is given to the model?
The arguments of the tokenizer imply that the padding is done there.
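It depends on what you want to do with the padded tokens. If you're just going to run inference or feed the data to the Trainer object, you most probably won't need special arguments to get a fixed-length shape. The model's forward() function accepts variable-length input, and the Trainer object usually pads each batch for you.
Note that padding=True only pads to the longest sequence in the batch, so a single prompt comes back unpadded. That is why you see 52 tokens rather than 200 or 512.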
P.S. It looks like you're using the ALMA machine translation model. I'm guessing you're trying to tune or use the model, in which case the tokenizer's output doesn't need to include the pad tokens.
But if you would like the tokenizer to return output padded to a fixed length with pad tokens, try this:
prompt1="Translate this from Japanese to English:\nJapanese: 量子化するとモデルの性能はどのくらい劣化してしまうのでしょうか?\nEnglish:"
input_ids = tokenizer(
    prompt1, return_tensors="pt",
    padding="max_length", max_length=200,
    truncation=True).input_ids
input_ids.shape
[out]:
torch.Size([1, 200])
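To make the difference between padding=True (pad to the longest sequence in the batch) and padding="max_length" concrete, here is a minimal plain-Python sketch of what the two strategies do conceptually. It is not the tokenizer's actual implementation: pad_batch and PAD_ID are hypothetical names made up for this illustration, and it assumes right-side padding, while the real tokenizer may pad on the left depending on its configuration.

```python
from typing import Dict, List, Optional

PAD_ID = 0  # hypothetical pad token id, chosen only for this illustration


def pad_batch(
    batch: List[List[int]],
    padding: str = "longest",
    max_length: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Right-pad lists of token ids and build the matching attention mask."""
    if padding == "longest":
        # Pad only up to the longest sequence in this batch.
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:target]  # truncate anything longer than the target
        n_pad = target - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


single = [list(range(1, 53))]  # one prompt of 52 tokens, as in the question
longest = pad_batch(single, padding="longest")
fixed = pad_batch(single, padding="max_length", max_length=200)

print(len(longest["input_ids"][0]))     # 52: a batch of one has nothing to pad against
print(len(fixed["input_ids"][0]))       # 200: padded up to max_length
print(sum(fixed["attention_mask"][0]))  # 52 real tokens, the rest is padding
```

The attention mask is what lets the model ignore the pad positions, so if you do pad to a fixed length, pass the tokenizer's attention_mask to the model along with input_ids.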