pythonmachine-learningnlpbert-language-modelknn

RuntimeError when trying to extract text features from a BERT model then using KNN for classification


I'm trying to use camembert model to just to extract text features. After that, I'm trying to use a KNN classifier to classify the feature vectors as inputs.

This is the code I wrote

import torch
from transformers import AutoTokenizer, CamembertModel
from sklearn.neighbors import KNeighborsClassifier

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

data = df.to_dict(orient='split')
data = dict(zip(data['index'], data['data']))

# Collect all the input texts into a list of strings
input_texts = [str(text) for text in data.values()]

# Tokenize all the input texts together
inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)

# Get the model outputs for all the input texts
with torch.no_grad():
    outputs = model(**inputs)

# Extract the last hidden states and convert them to a numpy array
last_hidden_states = outputs.last_hidden_state
input_features = last_hidden_states[:, 0, :].numpy()

# Extract the labels from the data dictionary
input_labels = list(data.keys())

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(input_features, input_labels)

However, I get this error

RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 19209424896 bytes.

The data I'm using in the dictionary has this form:

{
    'index': [row_index_1, row_index_2, ...],
    'columns': [column_name_1, column_name_2, ...],
    'data': [
        [cell_value_row_1_col_1, cell_value_row_1_col_2, ...],
        [cell_value_row_2_col_1, cell_value_row_2_col_2, ...],
        ...
    ]
}

Solution

  • It seems that you are feeding ALL your data to the model at once and you don't have enough memory to do that. Instead of doing that, you can invoke the model sentence by sentence or with small sentence batches, so that you keep the needed memory within the available system resources.