python, nlp, huggingface-transformers, sentence-transformers, large-language-model

How to use cross-encoder with Huggingface transformers pipeline?


There's a set of models on the Hugging Face Hub that come from the sentence_transformers library, e.g. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

The suggested usage examples are:

# Using sentence_transformers

from sentence_transformers import CrossEncoder

model_name = 'cross-encoder/mmarco-mMiniLMv2-L12-H384-v1'
model = CrossEncoder(model_name)
scores = model.predict([
  ['How many people live in Berlin?', 'How many people live in Berlin?'], 
  ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
])
scores

[out]:

array([ 0.36782095, -4.2674575 ], dtype=float32)

Or

# From transformers.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch

# cross-encoder/ms-marco-MiniLM-L-12-v2
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], 
                     ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],  
                     padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)

[out]:

tensor([[10.7615],
        [-8.1277]])
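
Squeezing the trailing dimension of those logits gives one raw relevance score per text pair, in the same flat format that CrossEncoder.predict returns. A minimal sketch, assuming the `scores` tensor from the snippet above:

plain_scores = scores.squeeze(-1).tolist()  # drop the (N, 1) logits down to one float per pair
print(plain_scores)  # e.g. [10.7615..., -8.1277...]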

If a user wants to use transformers.pipeline with these cross-encoder models:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch

# cross-encoder/ms-marco-MiniLM-L-12-v2
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

pipe = pipeline(model=model, tokenizer=tokenizer)

It throws an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_108/785368641.py in <module>
----> 1 pipe = pipeline(model=model, tokenizer=tokenizer)

/opt/conda/lib/python3.7/site-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
    711         if not isinstance(model, str):
    712             raise RuntimeError(
--> 713                 "Inferring the task automatically requires to check the hub with a model_id defined as a `str`."
    714                 f"{model} is not a valid model_id."
    715             )

RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a `str`.

Q: How to use cross-encoder with Huggingface transformers pipeline?

Q: If a model_id is needed, is it possible to pass the model_id as an arg or kwarg to pipeline?

There's a similar question, Error: Inferring the task automatically requires to check the hub with a model_id defined as a `str`. AraBERT model, but I'm not sure it's the same issue, since that question is about 'aubmindlab/bert-base-arabertv02' and not the cross-encoder class of models from sentence_transformers.


Solution

  • After much trial and error and some digging through the transformers source code, here goes...

    TL;DR

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from transformers import pipeline
    import torch
    
    # cross-encoder/ms-marco-MiniLM-L-12-v2
    model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
    tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
    
    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
    
    
    pipe([{"text": 'How many people live in Berlin?', "text_pair": 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'},
          {"text": 'How many people live in Berlin?', "text_pair": 'New York City is famous for the Metropolitan Museum of Art.'},
          {"text": 'Hello how are you?', "text_pair": "I'm fine, thank you"},
         ])

    [out]:

    [{'label': 'LABEL_0', 'score': 0.99997878074646},
     {'label': 'LABEL_0', 'score': 0.0002951461647171527},
     {'label': 'LABEL_0', 'score': 0.027012893930077553}]
    

    But the output isn't the same as using sentence_transformers!

    Yes it isn't, because an activation function was applied to the raw logits: a sigmoid in this case, since the model's config has a single label.
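
    As a quick sanity check (a small sketch; the logit values are the ones printed by the manual forward pass above and by the second pipeline example below), applying a sigmoid to the raw logits reproduces the pipeline's scores:

    import torch

    # Raw logits for the three text pairs used throughout this post
    # (taken from the outputs shown above and below).
    logits = torch.tensor([10.7615, -8.1277, -3.5841])
    print(torch.sigmoid(logits))  # ~[0.99998, 0.000295, 0.0270] -> the pipeline scores above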

    There's a classification function that is applied after model inference; see https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L27

    class ClassificationFunction(ExplicitEnum):
        SIGMOID = "sigmoid"
        SOFTMAX = "softmax"
        NONE = "none"
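
    The pipeline also appears to accept these values as plain lowercase strings and map them back onto the enum; treat this as an assumption about your installed transformers version and verify it:

    # Assumption: the string form is mapped onto the enum internally,
    # so this would be equivalent to passing ClassificationFunction.NONE.
    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer,
                    function_to_apply="none")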
    
    

    And particularly at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_classification.py#L184

        def postprocess(self, model_outputs, function_to_apply=None, top_k=1, _legacy=True):
            # `_legacy` is used to determine if we're running the naked pipeline and in backward
            # compatibility mode, or if running the pipeline with `pipeline(..., top_k=1)` we're running
            # the more natural result containing the list.
            # Default value before `set_parameters`
            if function_to_apply is None:
                if self.model.config.problem_type == "multi_label_classification" or self.model.config.num_labels == 1:
                    function_to_apply = ClassificationFunction.SIGMOID
                elif self.model.config.problem_type == "single_label_classification" or self.model.config.num_labels > 1:
                    function_to_apply = ClassificationFunction.SOFTMAX
                elif hasattr(self.model.config, "function_to_apply") and function_to_apply is None:
                    function_to_apply = self.model.config.function_to_apply
                else:
                    function_to_apply = ClassificationFunction.NONE
    

    TL;DR (this time for real)

    To replicate the results of rolling your own tokenize + forward pass, you'll have to explicitly set the classification function to NONE so that no activation is applied during post-processing, i.e.

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from transformers import pipeline
    from transformers.pipelines.text_classification import ClassificationFunction
    
    
    model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
    tokenizer = AutoTokenizer.from_pretrained('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
    
    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, function_to_apply=ClassificationFunction.NONE)
    
    
    pipe([{"text": 'How many people live in Berlin?', "text_pair": 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'},
          {"text": 'How many people live in Berlin?', "text_pair": 'New York City is famous for the Metropolitan Museum of Art.'},
          {"text": 'Hello how are you?', "text_pair": "I'm fine, thank you"},
         ])
    

    [out]:

    [{'label': 'LABEL_0', 'score': 10.761542320251465},
     {'label': 'LABEL_0', 'score': -8.127744674682617},
     {'label': 'LABEL_0', 'score': -3.5840566158294678}]
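
    If you only need the raw relevance scores (e.g. for re-ranking passages), you can pull them straight out of the pipeline output; a small sketch reusing the pipe defined above:

    results = pipe([
        {"text": 'How many people live in Berlin?',
         "text_pair": 'Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.'},
        {"text": 'How many people live in Berlin?',
         "text_pair": 'New York City is famous for the Metropolitan Museum of Art.'},
    ])
    scores = [r["score"] for r in results]  # raw logits, same as the manual forward pass
    print(scores)  # ~[10.7615, -8.1277]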