python · machine-learning · huggingface-transformers · large-language-model · nlp-question-answering

How to tune an LLM to give full-length and detailed answers


I am building an application in which you can select an open-source model from a list of models and ask it general questions. I am using searxng to search the web for context. All of this is working well and I am able to get results, but what I am not able to get are detailed or full-length answers. For example, if I ask who the 2007 F1 world champion is, the answer I get is just "Raikkonen".

I want my answer to be structured properly. My ideal answer would be: "The 2007 F1 world champion is Kimi Raikkonen."

I am using the hugging face transformer pipeline in this manner:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = question.model
question_answerer = pipeline(
    "question-answering",
    model=AutoModelForQuestionAnswering.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name),
    device=0,  # use GPU if available
)

response = question_answerer(question=question.question, context=summarized_content, batch_size=16)

return response

Currently, I am using the deepset/roberta-base-squad2 model.

I have tried sending a large amount of context to the model in hopes of getting a detailed answer. I have also tried a number of different models, but got similar results.


Solution

  • You are using deepset/roberta-base-squad2, which is trained on the squad_v2 dataset. If you take a look at this dataset, you can see that all the answers are short extractive spans, so no matter how long your context is, the answer will always be short. Using other models (the T5 family, BART, or anything else trained on this dataset) would likewise yield a short answer. You should instead be looking for models that were trained on datasets with longer answers. Here are a few to consider:

    1. ELI5: https://yjernite.github.io/lfqa.html#generation
    2. CoQA: https://stanfordnlp.github.io/coqa/
    3. QuAC: https://quac.ai/

    You can search for any of the above datasets on Hugging Face and find models that were trained on them; a sketch of how such a model could be plugged into your pipeline follows below.
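
    As a minimal sketch, the extractive question-answering pipeline could be swapped for a generative text2text-generation pipeline. The model name vblagoje/bart_lfqa and the prompt format here are assumptions for illustration (it is one ELI5-style long-form QA model on the Hub); check the model card of whichever model you pick for its expected input format.

    from transformers import pipeline

    # Hypothetical example: a generative (seq2seq) model trained on
    # long-form answers instead of an extractive SQuAD-style model.
    # "vblagoje/bart_lfqa" is one ELI5-style model on the Hub; swap in
    # whichever long-form QA model you choose.
    generator = pipeline(
        "text2text-generation",
        model="vblagoje/bart_lfqa",
        device=0,  # use GPU if available
    )

    # In your application this context would be the summarized_content from searxng.
    context = "Kimi Raikkonen won the 2007 Formula One World Championship driving for Ferrari."
    question_text = "Who is the 2007 F1 world champion?"

    # Long-form QA models typically take the question and context in a single
    # prompt; the exact format depends on the model card.
    prompt = f"question: {question_text} context: {context}"

    result = generator(prompt, max_length=256, num_beams=4)
    print(result[0]["generated_text"])  # a full-sentence answer rather than a span

    Because the model generates text rather than extracting a span from the context, you get full sentences such as "The 2007 F1 world champion is Kimi Raikkonen", which is the behaviour you are after.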