I am trying to get document embeddings using pre-trained models from the HuggingFace Transformers library. The input is a document and the output is an embedding for that document produced by a pre-trained model, but I get the error below and don't know how to fix it.
Code:
from transformers import pipeline, AutoTokenizer, AutoModel
from transformers import RobertaTokenizer, RobertaModel
import fitz
from openpyxl import load_workbook
import os
from tqdm import tqdm
PRETRAIN_MODEL = 'distilbert-base-cased'
DIR = "dataset"
# Load and process the text
all_files = os.listdir(DIR)
pdf_texts = {}
for filename in all_files:
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(DIR, filename)
        with fitz.open(pdf_path) as doc:
            text_content = ""
            for page in doc:
                text_content += page.get_text()
        text = text_content.split("PUBLIC CONSULTATION")[0]
        project_code = os.path.splitext(filename)[0]
        pdf_texts[project_code] = text
# Generate embeddings for the documents
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embedding = pipe(text, return_tensors="pt")
    embeddings[project_code] = embedding[0][0].numpy()
Error:
The error occurs on the line embedding = pipe(text, return_tensors="pt"). The output is as follows:
Generating embeddings: 0%| | 0/58 [00:00<?, ?doc/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3619 > 512). Running this sequence through the model will result in indexing errors
Generating embeddings: 0%| | 0/58 [00:00<?, ?doc/s]
RuntimeError: The size of tensor a (3619) must match the size of tensor b (512) at non-singleton dimension 1
The input documents: https://drive.google.com/file/d/17yFOR0dQ8UMbefFed5QPZUXqU0vzifUw/view?usp=sharing
Thank you!
Your document tokenizes to 3619 tokens, while distilbert-base-cased accepts a maximum sequence length of 512 tokens. You can fix this by splitting the text into chunks of at most 512 tokens and aggregating the chunk embeddings, or by switching to a model that accepts longer sequences.
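A minimal sketch of the chunking approach, calling the tokenizer and model directly instead of the pipeline. The helper name embed_long_text, the 512-token window, and the mean-pooling aggregation are my own choices, not part of your code:

import torch

def embed_long_text(text, tokenizer, model, max_length=512, stride=0):
    # Split the text into windows of at most max_length tokens; with a fast
    # tokenizer, return_overflowing_tokens=True yields one row per window.
    enc = tokenizer(
        text,
        max_length=max_length,
        truncation=True,
        return_overflowing_tokens=True,
        stride=stride,
        padding=True,
        return_tensors="pt",
    )
    enc.pop("overflow_to_sample_mapping", None)  # not a model input
    with torch.no_grad():
        out = model(**enc)
    # Mean-pool token embeddings within each chunk (ignoring padding),
    # then average the chunk vectors into one document vector.
    mask = enc["attention_mask"].unsqueeze(-1)
    chunk_vecs = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    return chunk_vecs.mean(0).numpy()

embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embeddings[project_code] = embed_long_text(text, tokenizer, model)

If you would rather not chunk, a model with a longer context window such as allenai/longformer-base-4096 accepts up to 4096 tokens per sequence, at the cost of more memory per document.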