Tags: python, gpu, huggingface-transformers, huggingface-datasets

Efficiently using Hugging Face transformers pipelines on GPU with large datasets


I'm relatively new to Python and facing some performance issues while using Hugging Face Transformers for sentiment analysis on a fairly large dataset. I've created a DataFrame with 6000 rows of text data in Spanish, and I'm applying a sentiment analysis pipeline to each row of text. Here's a simplified version of my code:

import pandas as pd
import torch
from tqdm import tqdm
from transformers import pipeline


data = {
    'TD': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'text': [
        # ... (your text data here)
    ]
}

df_model = pd.DataFrame(data)

# use the GPU if one is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
py_sentimiento = pipeline("sentiment-analysis", model="finiteautomata/beto-sentiment-analysis", tokenizer="finiteautomata/beto-sentiment-analysis", device=device, truncation=True)

# run the pipeline on each row, then keep only the predicted label
tqdm.pandas()
df_model['py_sentimiento'] = df_model['text'].progress_apply(py_sentimiento)
df_model['py_sentimiento'] = df_model['py_sentimiento'].apply(lambda x: x[0]['label'])

However, I've encountered a warning message that suggests I should use a dataset for more efficient processing. The warning message is as follows:

"You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset."

I have two questions:

What does this warning mean, and why should I use a dataset for efficiency?

How can I modify my code to batch my data and use parallel computing to make better use of my GPU resources? What code, function, or library should be used with Hugging Face Transformers?

I'm eager to learn and optimize my code.


Solution

  • I think you can ignore this message. It has been reported on different websites this year, and if I understand it correctly, this GitHub issue on the Hugging Face transformers repository (https://github.com/huggingface/transformers/issues/22387) shows that the warning can be safely ignored. In addition, batching or using datasets might not remove the warning or automatically make the best use of your resources. You can set call_count = 0 here (https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/base.py#L1100) to silence the warning, as explained by Martin Weyssow.
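
    If you just want to silence the warning from your own code instead of editing the library, one option (a minimal sketch, assuming the pipeline object exposes the call_count counter used in that check) is to reset the counter before each call:

    def analyze(text):
        py_sentimiento.call_count = 0  # reset the counter that triggers the GPU warning
        return py_sentimiento(text)

    df_model["py_sentimiento"] = df_model["text"].progress_apply(analyze)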

    How can I modify my code to batch my data and use parallel computing to make better use of my GPU resources?

    You can add batching like this:

    py_sentimiento = pipeline("sentiment-analysis", model="finiteautomata/beto-sentiment-analysis", tokenizer="finiteautomata/beto-sentiment-analysis", batch_size=8, device=device, truncation=True)
    

    Most importantly, experiment to find the batch size that results in the highest GPU utilization on your device for your particular task.
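
    Note that batch_size only has an effect when the pipeline receives an iterable of texts rather than one string per call, so instead of progress_apply you can pass the whole column at once. A rough sketch, reusing the df_model and py_sentimiento from your post:

    texts = df_model["text"].tolist()
    results = py_sentimiento(texts)  # the pipeline batches internally according to batch_size
    df_model["py_sentimiento"] = [result["label"] for result in results]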

    Hugging Face provides some rules of thumb for batching here: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching. Getting the best resource/GPU usage usually takes some experimentation and depends on the use case you are working on.

    What does this warning mean, and why should I use a dataset for efficiency?

    This means the GPU utilization is not optimal: the data is not grouped together, so it is not processed efficiently. Using a dataset from the Hugging Face datasets library will use your resources more efficiently. However, it is not easy to tell exactly what is going on, especially since we don't know exactly what the data looks like, what the device is, and how the model handles the data internally. The warning might go away if you use the datasets library, but that does not necessarily mean the resources are used optimally.

    What code, function, or library should be used with Hugging Face Transformers?

    Here is a code example with pipelines and the datasets library: https://huggingface.co/docs/transformers/v4.27.1/pipeline_tutorial#using-pipelines-on-a-dataset. It mentions that iterating over a dataset feeds the GPU as fast as possible, and that batching may additionally improve computation time.
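
    For completeness, here is roughly what that tutorial pattern looks like applied to your case (a sketch, assuming the py_sentimiento pipeline and df_model from your post; KeyDataset is the helper used in the linked tutorial):

    from datasets import Dataset
    from transformers.pipelines.pt_utils import KeyDataset
    from tqdm import tqdm

    # wrap the pandas column in a datasets.Dataset so the pipeline can stream it
    ds = Dataset.from_pandas(df_model[["text"]])

    labels = []
    for output in tqdm(py_sentimiento(KeyDataset(ds, "text"), batch_size=8)):
        labels.append(output["label"])
    df_model["py_sentimiento"] = labels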

    In your case you seem to be doing a relatively small POC (inference on under 10,000 documents with a medium-sized model), so I don't think you need to use pipelines at all. I assume the sentiment analysis model is a classifier and that you want to keep using pandas as shown in your post, so here is how you can combine both. This is usually fast enough for my experiments and prints no warnings about resources.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch as t
    import pandas as pd
    from tqdm import tqdm

    tqdm.pandas()  # needed for DataFrame.progress_apply

    model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")
    tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")

    # move the model to the GPU if one is available, otherwise stay on CPU
    device = "cuda" if t.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    def classify_dataframe_row(
        example: pd.Series,
    ):
        # tokenize a single row of text and run it through the classifier
        inputs = tokenizer(example["text"], return_tensors="pt", truncation=True).to(device)
        with t.no_grad():
            output = model(**inputs)
        # the index of the highest logit is the predicted class
        return t.argmax(output.logits, dim=-1).item()

    dataset = pd.read_csv("file")
    dataset = dataset.assign(
        prediction=dataset.progress_apply(classify_dataframe_row, axis=1)
    )
    

    As soon as your inference starts, either with this snippet or with the datasets library code, you can run nvidia-smi in a terminal, check the GPU utilization, and play around with the parameters to optimize it. Beware that running the code on your local machine with a GPU vs. on a larger machine, e.g., a Linux server with a more powerful GPU, may lead to different performance and require different tuning. If you want to run the code on larger document collections, you can split the data to avoid GPU memory errors locally, or to speed up inference with concurrent runs on a server.
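
    For larger collections, a simple way to split the work (a rough sketch; the chunk size is just a placeholder to tune) is to process the DataFrame in slices, which keeps memory usage bounded and also makes it easy to distribute the chunks over concurrent runs:

    chunk_size = 1000  # placeholder value, tune it to your GPU memory
    predictions = []
    for start in range(0, len(dataset), chunk_size):
        chunk = dataset.iloc[start:start + chunk_size]
        predictions.append(chunk.progress_apply(classify_dataframe_row, axis=1))
    dataset = dataset.assign(prediction=pd.concat(predictions))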