I am trying to save the GPT2 tokenizer as follows:
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = GPT2Tokenizer.eos_token

dataset_file = "x.csv"
df = pd.read_csv(dataset_file, sep=",")
input_ids = tokenizer.batch_encode_plus(list(df["x"]), max_length=1024, padding='max_length', truncation=True)["input_ids"]

# saving the tokenizer
tokenizer.save_pretrained("tokenfile")
I am getting the following error: TypeError: Object of type property is not JSON serializable
More details:
TypeError Traceback (most recent call last)
Cell In[x], line 3
1 # Save the fine-tuned model
----> 3 tokenizer.save_pretrained("tokenfile")
File /3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2130, in PreTrainedTokenizerBase.save_pretrained(self, save_directory, legacy_format, filename_prefix, push_to_hub, **kwargs)
2128 write_dict = convert_added_tokens(self.special_tokens_map_extended, add_type_field=False)
2129 with open(special_tokens_map_file, "w", encoding="utf-8") as f:
-> 2130 out_str = json.dumps(write_dict, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
2131 f.write(out_str)
2132 logger.info(f"Special tokens file saved in {special_tokens_map_file}")
File /3tb/share/anaconda3/envs/ak_env/lib/python3.10/json/__init__.py:238, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
232 if cls is None:
233 cls = JSONEncoder
234 return cls(
235 skipkeys=skipkeys, ensure_ascii=ensure_ascii,
236 check_circular=check_circular, allow_nan=allow_nan, indent=indent,
237 separators=separators, default=default, sort_keys=sort_keys,
--> 238 **kw).encode(obj)
File /3tb/share/anaconda3/envs/ak_env/lib/python3.10/json/encoder.py:201, in JSONEncoder.encode(self, o)
199 chunks = self.iterencode(o, _one_shot=True)
...
178 """
--> 179 raise TypeError(f'Object of type {o.__class__.__name__} '
180 f'is not JSON serializable')
TypeError: Object of type property is not JSON serializable
How can I solve this issue?
The problem is on this line:
tokenizer.pad_token = GPT2Tokenizer.eos_token
Here eos_token is accessed on the GPT2Tokenizer class instead of on the tokenizer instance. On the class, eos_token is a property object, so this line assigns that property object (not the actual end-of-text string) to pad_token. When save_pretrained later tries to write the special tokens map to JSON, it hits that property object and raises "Object of type property is not JSON serializable".
The fix is to change the line to:
tokenizer.pad_token = tokenizer.eos_token
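If you want to see the difference yourself, here is a minimal check (just an illustration, nothing beyond the classes already used above): accessing the attribute on the class returns a property descriptor, while accessing it on an instance returns the token string.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# On the class, eos_token is a property descriptor, not a string.
print(type(GPT2Tokenizer.eos_token))   # <class 'property'>

# On the instance, the property is resolved to the actual token string.
print(type(tokenizer.eos_token))       # <class 'str'>
print(tokenizer.eos_token)             # GPT-2's end-of-text token, "<|endoftext|>"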
For reference, your final code will look like this:
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

dataset_file = "x.csv"
df = pd.read_csv(dataset_file, sep=",")
input_ids = tokenizer.batch_encode_plus(list(df["x"]), max_length=1024, padding='max_length', truncation=True)["input_ids"]

# saving the tokenizer
tokenizer.save_pretrained("tokenfile")
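As a quick sanity check, you can reload the tokenizer from the directory you just saved (the "tokenfile" name below is simply the one from your snippet) and confirm that the pad token round-trips as a plain string:

from transformers import GPT2Tokenizer

# Reload the tokenizer that was just written to disk.
reloaded = GPT2Tokenizer.from_pretrained("tokenfile")

# The pad token should now be the EOS string, not a property object.
print(reloaded.pad_token)
assert reloaded.pad_token == reloaded.eos_token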