Tags: huggingface-transformers, large-language-model, word-embedding, huggingface, pre-trained-model

Tensor size error when generating embeddings for documents using HuggingFace pre-trained models


I am trying to generate document embeddings using pre-trained models from the HuggingFace Transformers library: the input is a document, and the output is an embedding for that document produced by a pre-trained model. However, I get the error below and don't know how to fix it.

Code:

from transformers import pipeline, AutoTokenizer, AutoModel
from transformers import RobertaTokenizer, RobertaModel
import fitz
from openpyxl import load_workbook
import os
from tqdm import tqdm

PRETRAIN_MODEL = 'distilbert-base-cased'
DIR = "dataset"

# Load and process the text
all_files = os.listdir(DIR)
pdf_texts = {}
for filename in all_files:
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(DIR, filename)
        with fitz.open(pdf_path) as doc:
            text_content = ""
            for page in doc:
                text_content += page.get_text()
            text = text_content.split("PUBLIC CONSULTATION")[0]
            project_code = os.path.splitext(filename)[0]
            pdf_texts[project_code] = text 

# Generate embeddings for the documents
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embedding = pipe(text, return_tensors="pt")
    embeddings[project_code] = embedding[0][0].numpy()

Error:

The error occurs on the line embedding = pipe(text, return_tensors="pt"). The output is as follows:

Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3619 > 512). Running this sequence through the model will result in indexing errors
Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]
RuntimeError: The size of tensor a (3619) must match the size of tensor b (512) at non-singleton dimension 1

The input documents: https://drive.google.com/file/d/17yFOR0dQ8UMbefFed5QPZUXqU0vzifUw/view?usp=sharing

Thank you!


Solution

  • The tokenized text is 3619 tokens long, while the model accepts a maximum sequence length of 512 tokens (the warning and the RuntimeError both report exactly these two numbers). You can fix this by truncating the input to 512 tokens, by splitting the text into chunks of at most 512 tokens and combining the chunk embeddings, or by switching to a model that accepts longer sequences (e.g. a Longformer variant).
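A minimal sketch of the first two options, assuming the same distilbert-base-cased checkpoint as in the question; the helper names (`split_into_chunks`, `embed_truncated`, `embed_chunked`) are illustrative, not part of the library. The chunk size of 510 leaves room for the [CLS] and [SEP] special tokens:

```python
import torch


def split_into_chunks(token_ids, chunk_size=510):
    """Split a list of token ids into chunks of at most chunk_size,
    leaving room for the [CLS]/[SEP] special tokens (510 + 2 = 512)."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


def embed_truncated(text, tokenizer, model):
    """Option 1: keep only the first 512 tokens of the document."""
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Mean-pool the token embeddings into a single document vector.
    return out.last_hidden_state.mean(dim=1).squeeze(0)


def embed_chunked(text, tokenizer, model, chunk_size=510):
    """Option 2: embed each 510-token chunk and average the chunk vectors."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    vectors = []
    for chunk in split_into_chunks(token_ids, chunk_size):
        # Re-add the special tokens around each chunk manually.
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        attention_mask = torch.ones_like(input_ids)
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask)
        vectors.append(out.last_hidden_state.mean(dim=1).squeeze(0))
    return torch.stack(vectors).mean(dim=0)


if __name__ == "__main__":
    # Requires downloading the checkpoint, so kept out of the helpers above.
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
    model = AutoModel.from_pretrained("distilbert-base-cased")
    embedding = embed_chunked("some long document text", tokenizer, model)
```

If you only need truncation, recent transformers versions also let you pass tokenizer arguments straight through the pipeline (e.g. `pipeline('feature-extraction', model=model, tokenizer=tokenizer, tokenize_kwargs={"truncation": True, "max_length": 512})`), which avoids the manual loop entirely; check your installed version's pipeline docs, as this keyword is not available in older releases.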