python nlp huggingface-transformers language-model nlp-question-answering

How to structure data for question-answering task to fine-tune a model with Huggingface run_qa.py example?


import sagemaker
import boto3
from sagemaker.huggingface import HuggingFace

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
        
hyperparameters = {
    'model_name_or_path':'t5-base',
    'output_dir':'/opt/ml/model'
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.26.0/examples/pytorch/question-answering
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_qa.py',
    source_dir='./examples/pytorch/question-answering',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    hyperparameters = hyperparameters
)

# starting the train job
huggingface_estimator.fit()

Given the above script (launch_training.py) which can be found here: https://huggingface.co/t5-base, how should my data be structured for a generative question-answering task?

For context: I am training T5 on some synthetic company text data so that I can then prompt it with questions such as "How can CompanyX improve sales?" or "How can CompanyX reduce the turnover rate?"

I have tried formatting my data as question-answer pairs, e.g. {"question": "How can CompanyX improve the performance of their marketing campaigns?", "answer": "The recent marketing campaign of CompanyX attracted a 20% increase in new customers. It suggests that if CompanyX focuses on customer-centric strategies and amplifies their digital marketing efforts, they might achieve even better results."} but this gives ValueError: Need either a dataset name or a training/validation file

I am passing an S3 URI to huggingface_estimator.fit(), namely huggingface_estimator.fit({"train_data_uri": "s3://fine-tuning/q-a_pairs.json"})


Solution

  • The run_qa.py example that your SageMaker snippet launches comes from https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering

    The example expects data in the same format as the SQuAD dataset, https://huggingface.co/datasets/squad

    Each example should look like this:

    {
        "answers": {
            "answer_start": [8],
            "text": ["a test context"]
        },
        "context": "This is a test context.",
        "id": "1",
        "question": "Is this a test?",
        "title": "train test"
    }
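
    Because the answer is a span of the context, each entry in "text" should be found in "context" at the matching "answer_start" character offset. A minimal sketch of a sanity check for that, assuming records shaped like the one above (check_answer_spans is a hypothetical helper, not part of transformers or datasets):

```python
def check_answer_spans(example):
    """Return a list of problems found in one SQuAD-style example (empty if none)."""
    problems = []
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    if len(texts) != len(starts):
        problems.append("answers.text and answers.answer_start differ in length")
    for text, start in zip(texts, starts):
        # The answer text must be the substring of the context at its offset.
        found = context[start:start + len(text)]
        if found != text:
            problems.append(f"expected {text!r} at offset {start}, found {found!r}")
    return problems


# Made-up example record, for illustration only.
example = {
    "id": "1",
    "title": "train test",
    "context": "This is a test context.",
    "question": "What is this?",
    "answers": {"text": ["a test context"], "answer_start": [8]},
}
print(check_answer_spans(example))  # prints []
```

    Running a check like this over your own file before training can catch offsets that drifted when the text was cleaned or re-encoded.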
    

    The raw SQuAD data file comes from https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json, and each paragraph in it looks something like:

    {
        "context": "Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits \"D\u00e9j\u00e0 Vu\", \"Irreplaceable\", and \"Beautiful Liar\". Beyonc\u00e9 also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for \"Single Ladies (Put a Ring on It)\". Beyonc\u00e9 took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyonc\u00e9 (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",
        "qas": [{
            "answers": [{
                "answer_start": 207,
                "text": "acting"
            }],
            "question": "After her second solo album, what other entertainment venture did Beyonce explore?",
            "id": "56be86cf3aeaaa14008c9076"
        }, {
            "answers": [{
                "answer_start": 369,
                "text": "Jay Z"
            }],
            "question": "Which artist did Beyonce marry?",
            "id": "56be86cf3aeaaa14008c9078"
        }, {
            "answers": [{
                "answer_start": 565,
                "text": "six"
            }],
            "question": "To set the record for Grammys, how many did Beyonce win?",
            "id": "56be86cf3aeaaa14008c9079"
        }, {
            "answers": [{
                "answer_start": 260,
                "text": "Dreamgirls"
            }],
            "question": "For what movie did Beyonce receive  her first Golden Globe nomination?",
            "id": "56bf6e823aeaaa14008c9627"
        }, {
            "answers": [{
                "answer_start": 586,
                "text": "2010"
            }],
            "question": "When did Beyonce take a hiatus in her career and take control of her management?",
            "id": "56bf6e823aeaaa14008c9629"
        }, {
            "answers": [{
                "answer_start": 180,
                "text": "Beyonc\u00e9"
            }],
            "question": "Which album was darker in tone from her previous work?",
            "id": "56bf6e823aeaaa14008c962a"
        }, {
            "answers": [{
                "answer_start": 406,
                "text": "Cadillac Records"
            }],
            "question": "After what movie portraying Etta James, did Beyonce create Sasha Fierce?",
            "id": "56bf6e823aeaaa14008c962b"
        }, {
            "answers": [{
                "answer_start": 48,
                "text": "June 2005"
            }],
            "question": "When did Destiny's Child end their group act?",
            "id": "56d43da72ccc5a1400d830bd"
        }, {
            "answers": [{
                "answer_start": 95,
                "text": "B'Day"
            }],
            "question": "What was the name of Beyonc\u00e9's second solo album?",
            "id": "56d43da72ccc5a1400d830be"
        }, {
            "answers": [{
                "answer_start": 260,
                "text": "Dreamgirls"
            }],
            "question": "What was Beyonc\u00e9's first acting job, in 2006?",
            "id": "56d43da72ccc5a1400d830bf"
        }, {
            "answers": [{
                "answer_start": 369,
                "text": "Jay Z"
            }],
            "question": "Who is Beyonc\u00e9 married to?",
            "id": "56d43da72ccc5a1400d830c0"
        }, {
            "answers": [{
                "answer_start": 466,
                "text": "Sasha Fierce"
            }],
            "question": "What is the name of Beyonc\u00e9's alter-ego?",
            "id": "56d43da72ccc5a1400d830c1"
        }]
    }
    

    Breaking it down a little, if you have data in JSON format that looks like this:

    
    import json
    
    from datasets import load_dataset
    
    
    two_qas = {
        "data": [{
            "title": "Destinys_Child",
            "paragraphs": [{
                    "context": "Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits \"D\u00e9j\u00e0 Vu\", \"Irreplaceable\", and \"Beautiful Liar\". Beyonc\u00e9 also ventured into acting, with a Golden Globe-nominated performance in Dreamgirls (2006), and starring roles in The Pink Panther (2006) and Obsessed (2009). Her marriage to rapper Jay Z and portrayal of Etta James in Cadillac Records (2008) influenced her third album, I Am... Sasha Fierce (2008), which saw the birth of her alter-ego Sasha Fierce and earned a record-setting six Grammy Awards in 2010, including Song of the Year for \"Single Ladies (Put a Ring on It)\". Beyonc\u00e9 took a hiatus from music in 2010 and took over management of her career; her fourth album 4 (2011) was subsequently mellower in tone, exploring 1970s funk, 1980s pop, and 1990s soul. Her critically acclaimed fifth studio album, Beyonc\u00e9 (2013), was distinguished from previous releases by its experimental production and exploration of darker themes.",
                    "qas": [{
                        "answers": [{
                            "answer_start": 207,
                            "text": "acting"
                        }],
                        "question": "After her second solo album, what other entertainment venture did Beyonce explore?",
                        "id": "56be86cf3aeaaa14008c9076"
                    }, {
                        "answers": [{
                            "answer_start": 369,
                            "text": "Jay Z"
                        }],
                        "question": "Which artist did Beyonce marry?",
                        "id": "56be86cf3aeaaa14008c9078"
                    }, {
                        "answers": [{
                            "answer_start": 466,
                            "text": "Sasha Fierce"
                        }],
                        "question": "What is the name of Beyonc\u00e9's alter-ego?",
                        "id": "56d43da72ccc5a1400d830c1"
                    }]
                },
    
                {
                    "context": "A self-described \"modern-day feminist\", Beyonc\u00e9 creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic, highly choreographed performances have led to critics hailing her as one of the best entertainers in contemporary popular music. Throughout a career spanning 19 years, she has sold over 118 million records as a solo artist, and a further 60 million with Destiny's Child, making her one of the best-selling music artists of all time. She has won 20 Grammy Awards and is the most nominated woman in the award's history. The Recording Industry Association of America recognized her as the Top Certified Artist in America during the 2000s decade. In 2009, Billboard named her the Top Radio Songs Artist of the Decade, the Top Female Artist of the 2000s and their Artist of the Millennium in 2011. Time listed her among the 100 most influential people in the world in 2013 and 2014. Forbes magazine also listed her as the most powerful female musician of 2015.",
                    "qas": [{
                        "answers": [{
                            "answer_start": 104,
                            "text": "love, relationships, and monogamy"
                        }],
                        "question": "In her music, what are some recurring elements in them?",
                        "id": "56be88473aeaaa14008c9080"
                    }, {
                        "answers": [{
                            "answer_start": 935,
                            "text": "influential"
                        }],
                        "question": "Time magazine named her one of the most 100 what people of the century?",
                        "id": "56be88473aeaaa14008c9083"
                    }, {
                        "answers": [{
                            "answer_start": 985,
                            "text": "Forbes"
                        }],
                        "question": "Which magazine declared her the most dominant woman musician?",
                        "id": "56be88473aeaaa14008c9084"
                    }, {
                        "answers": [{
                            "answer_start": 736,
                            "text": "2000s"
                        }],
                        "question": "In which decade did the Recording Industry Association of America recognize Beyonce as the The Top Certified Artist?",
                        "id": "56bf725c3aeaaa14008c9643"
                    }]
                }
            ]
    
        }]
    }
    
    
    with open('my_qas_dataset.json', 'w') as fout:
        json.dump(two_qas, fout)
        
        
    ds = load_dataset(
        "json",
        data_files={"train": "my_qas_dataset.json"},
        field="data",
    )
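
    Note that the nested file keeps the article > paragraphs > qas layout, while the single-example format shown earlier is flat (one record per question). A minimal sketch of flattening one into the other, assuming the nested layout above; flatten_squad, the sample text, and the output file name are all hypothetical:

```python
import json


def flatten_squad(nested):
    """Turn the article > paragraphs > qas nesting into one flat record per question."""
    records = []
    for article in nested["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                records.append({
                    "id": qa["id"],
                    "title": article.get("title", ""),
                    "context": paragraph["context"],
                    "question": qa["question"],
                    # Lists of answer texts and their character offsets.
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
    return records


# Tiny made-up stand-in for the nested structure, for illustration only.
nested = {
    "data": [{
        "title": "Example_Article",
        "paragraphs": [{
            "context": "CompanyX was founded in 2010. It sells software.",
            "qas": [
                {"id": "q1", "question": "When was CompanyX founded?",
                 "answers": [{"text": "2010", "answer_start": 24}]},
                {"id": "q2", "question": "What does CompanyX sell?",
                 "answers": [{"text": "software", "answer_start": 39}]},
            ],
        }],
    }]
}

flat = flatten_squad(nested)
print(len(flat), flat[0]["question"])  # prints: 2 When was CompanyX founded?

# Keep the flat records under a top-level "data" key so that
# load_dataset("json", ..., field="data") can read them.
with open("my_qas_flat.json", "w") as fout:
    json.dump({"data": flat}, fout)
```

    As far as I can tell, run_qa.py also passes field="data" when it loads a JSON train or validation file, so a flat file written this way should match what the script reads; check the script version you pin before relying on that.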
    
    

    Then, to train a model, the easiest route is to push the dataset to the Hugging Face Hub: https://huggingface.co/docs/datasets/upload_dataset#upload-with-python

    After that, edit the load_dataset call in https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py#LL286C1-L293C10 so that it loads your dataset, and save the modified script locally, e.g. as ./scripts/run_qa.py

    Finally, instead of using the git_config, point the estimator at your local script directory:

    hyperparameters = {
        'model_name_or_path':'t5-base',
        'output_dir':'/opt/ml/model'
        # add your remaining hyperparameters
        # more info here https://github.com/huggingface/transformers/tree/v4.26.0/examples/pytorch/question-answering
    }
    
    # creates Hugging Face estimator
    huggingface_estimator = HuggingFace(
        entry_point='run_qa.py',
        source_dir='./scripts',
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.26.0',
        pytorch_version='1.13.1',
        py_version='py39',
        hyperparameters=hyperparameters
    )
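
    If you would rather keep the file on S3 than push it to the Hub, the ValueError in the question most likely arises because no train_file (or dataset_name) hyperparameter reaches run_qa.py; the S3 URI passed to fit() only makes the file available inside the container. A minimal sketch, assuming SageMaker's convention of mounting each input channel under /opt/ml/input/data/<channel name>; channel_file_path and the file name are hypothetical:

```python
import posixpath

# SageMaker convention: each fit() channel appears under this directory.
SAGEMAKER_INPUT_DIR = "/opt/ml/input/data"


def channel_file_path(channel, s3_uri):
    """Container path where a single S3 object passed on a channel is expected to appear."""
    return posixpath.join(SAGEMAKER_INPUT_DIR, channel, posixpath.basename(s3_uri))


# Hypothetical object name in the bucket from the question.
train_s3_uri = "s3://fine-tuning/my_qas_flat.json"

# Extra entries to merge into the hyperparameters dict passed to the estimator.
extra_hyperparameters = {
    "do_train": True,
    "train_file": channel_file_path("train", train_s3_uri),
}

# Channel name -> S3 URI; the key must match the channel used in train_file.
inputs = {"train": train_s3_uri}

print(extra_hyperparameters["train_file"])
# prints /opt/ml/input/data/train/my_qas_flat.json

# Then, with the estimator defined above:
# huggingface_estimator.fit(inputs)
```

    The channel key in fit() is arbitrary ("train_data_uri" in the question would work too), as long as the train_file path uses the same name.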