huggingface, huggingface-tokenizers, fine-tuning

Fine-tuning a Hugging Face LLM on two books using LoRA


I have been trying to get into fine-tuning LLMs on my own hardware (Ryzen 3960X, RTX 3090, 64 GB RAM) as efficiently as possible and am running into some problems while doing so. As a test, I wanted to fine-tune GPT-2 on David Copperfield by Charles Dickens to see what kind of result one could expect, so I extracted the text with PyPDF2's PdfReader and tokenized it with my model's tokenizer. This seemed to work. Then I wanted to fine-tune the model on this tokenized dataset, but I ran into some issues with the CUDA installation. Every time I run my code, I get this error:

    bin C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\bitsandbytes\libbitsandbytes_cpu.so
    False
    C:\Users\salom\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\bitsandbytes\cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
      warn("The installed version of bitsandbytes was compiled without GPU support. "
    'NoneType' object has no attribute 'cadam32bit_grad_fp32'
    CUDA SETUP: Required library version not found: libbitsandbytes_cpu.so. Maybe you need to compile it from source?
    CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
    
    ================================================ERROR=====================================
    CUDA SETUP: CUDA detection failed! Possible reasons:
    1. CUDA driver not installed
    2. CUDA not installed
    3. You have multiple conflicting CUDA libraries
    4. Required library not pre-compiled for this bitsandbytes release!
    CUDA SETUP: If you compiled from source, try again with `make CUDA_VERSION=DETECTED_CUDA_VERSION` for example, `make CUDA_VERSION=113`.
    CUDA SETUP: The CUDA version for the compile might depend on your conda install. Inspect CUDA version via `conda list | grep cuda`.
    ================================================================================
    
    CUDA SETUP: Problem: The main issue seems to be that the main CUDA library was not detected.
    CUDA SETUP: Solution 1): Your paths are probably not up-to-date. You can update them via: sudo ldconfig.
    CUDA SETUP: Solution 2): If you do not have sudo rights, you can do the following:
    CUDA SETUP: Solution 2a): Find the cuda library via: find / -name libcuda.so 2>/dev/null
    CUDA SETUP: Solution 2b): Once the library is found add it to the LD_LIBRARY_PATH: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:FOUND_PATH_FROM_2a
    CUDA SETUP: Solution 2c): For a permanent solution add the export from 2b into your .bashrc file, located at ~/.bashrc
    CUDA SETUP: Setup Failed!
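
As a sanity check, one can first verify whether the installed PyTorch wheel is a CUDA build at all, for example with a quick snippet like this; if the last line prints False, the wheel is CPU-only and bitsandbytes falls back to its CPU library exactly as in the log above:

    import torch

    print(torch.__version__)          # a "+cpu" suffix means a CPU-only wheel
    print(torch.version.cuda)         # None on a CPU-only build
    print(torch.cuda.is_available())  # must be True for GPU training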

This is my code:


import PyPDF2

# Function to extract text from a PDF file
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text() or ""  # extract_text() can return None for empty pages
        return text

# Load the PDF file and extract text
pdf_file_path = "DavidCopperfield.pdf"
book_text = extract_text_from_pdf(pdf_file_path)

import re

# Function to filter and clean the text
def filter_text(text):
    # Remove chapter titles and page numbers
    text = re.sub(r'CHAPTER \d+', '', text)
    text = re.sub(r'\d+', '', text)

    # Remove unwanted characters
    text = re.sub(r'[^\w\s\'.-]', '', text)

    # Remove lines with all uppercase letters (potential noise)
    text = '\n'.join(line for line in text.split('\n') if not line.isupper())

    # Collapse runs of spaces and tabs, keeping newlines so the paragraph
    # split on "\n\n" further down still has line breaks to work with
    text = re.sub(r'[ \t]+', ' ', text)

    return text

# Apply text filtering to the book text
filtered_text = filter_text(book_text)

# Partition the filtered text into training texts with a maximum size
max_text_size = 150
train_texts = []
current_text = ""
for paragraph in filtered_text.split("\n\n"):
    if len(current_text) + len(paragraph) < max_text_size:
        current_text += paragraph + "\n\n"
    else:
        train_texts.append(current_text)
        current_text = paragraph + "\n\n"
if current_text:
    train_texts.append(current_text)


from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.utils.data import Dataset, DataLoader
import torch
# Define your dataset class
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.texts = [text for text in texts if len(text) >= max_length]  # Drop texts shorter than max_length characters (a rough character-count filter, not a token count)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoded_input = self.tokenizer.encode_plus(text, max_length=self.max_length, padding='max_length', truncation=True, return_tensors='pt')
        input_ids = encoded_input['input_ids'].squeeze()
        attention_mask = encoded_input['attention_mask'].squeeze()
        return input_ids, attention_mask

# Load pre-trained LM and tokenizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
lm_model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # Add padding token
lm_model.resize_token_embeddings(len(tokenizer))  # Resize embeddings so the new [PAD] token has an entry

# Prepare your training data
train_dataset = TextDataset(train_texts, tokenizer, max_length=128)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Configure LM training
lm_model.train()
optimizer = torch.optim.AdamW(lm_model.parameters(), lr=1e-5)
num_epochs = 10

# Training loop
for epoch in range(num_epochs):
    for batch in train_dataloader:
        input_ids, attention_mask = batch
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)

        # Mask out padding positions so they do not contribute to the loss
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100

        outputs = lm_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Print loss or other metrics for monitoring
    print(f"Epoch {epoch + 1}/{num_epochs} - last batch loss: {loss.item():.4f}")

# Save the fine-tuned LM
lm_model.save_pretrained('fine_tuned_lm')
tokenizer.save_pretrained('fine_tuned_lm')
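
For reference, since the title mentions LoRA but the script above does a plain full fine-tune: a minimal sketch of wrapping the model with LoRA adapters via the peft library (this assumes the peft package is installed; target_modules=["c_attn"] is GPT-2's fused attention projection) would look roughly like this, and the training loop above would then only update the adapter weights:

    from peft import LoraConfig, TaskType, get_peft_model

    # LoRA configuration for a causal LM; r and lora_alpha are common starting values
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,              # scaling factor applied to the update
        lora_dropout=0.05,
        target_modules=["c_attn"],  # GPT-2's fused query/key/value projection
    )

    lm_model = get_peft_model(lm_model, lora_config)
    lm_model.print_trainable_parameters()  # only a small fraction of weights should be trainable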


Solution

  • If anyone else is running into this issue: the fix for me was to remove all the CUDA/PyTorch versions I had installed and use the PyTorch installation that also installs CUDA. Then I installed bitsandbytes again and it ran. The results from the fine-tune were mediocre though, as I had expected.
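
    After the clean reinstall, a quick smoke test that the GPU build of bitsandbytes is actually being loaded could be a single step of its 8-bit Adam optimizer on the GPU, roughly like this:

        import torch
        import bitsandbytes as bnb

        print(torch.cuda.is_available())  # should now print True

        # One optimizer step exercises the native CUDA library; on a CPU-only
        # bitsandbytes build this fails with errors like the ones above
        layer = torch.nn.Linear(16, 16).cuda()
        optimizer = bnb.optim.Adam8bit(layer.parameters(), lr=1e-3)
        layer(torch.randn(4, 16).cuda()).sum().backward()
        optimizer.step()
        print("bitsandbytes optimizer step succeeded")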