
How to create a dataset for a model like Falcon-7b/40b?


I have my data as DOCX files and I want to use them to fine-tune the Falcon model. From what I can see, the data used to train the model was in JSON format. How can I convert my data into a format the model can use?

I am currently converting my data to JSON by hand, but it is tedious work.


Solution

  • To convert your data from DOCX to JSON, you can use the python-docx library to extract the text from a DOCX file and write it out as JSON:

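    import json
    from docx import Document  # provided by the python-docx package
    
    # Load the DOCX file
    doc = Document('input.docx')
    
    # Extract the text of each paragraph, skipping empty ones
    text = [p.text for p in doc.paragraphs if p.text.strip()]
    
    # Build one JSON object per paragraph
    json_data = [{"text": paragraph} for paragraph in text]
    
    # Save as JSON (UTF-8, keeping non-ASCII characters readable)
    with open('output.json', 'w', encoding='utf-8') as json_file:
        json.dump(json_data, json_file, ensure_ascii=False)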
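
If you have many DOCX files, the same idea can be applied to a whole folder at once. Below is a minimal sketch, not a definitive implementation. It assumes your fine-tuning script accepts JSON Lines (one JSON object per line) with a "text" field, which you should check against whatever training code you use. The function names (`paragraphs_to_records`, `write_jsonl`, `convert_folder`) and the extra "source" field are my own choices for illustration, not something Falcon requires.

```python
import json
from pathlib import Path


def paragraphs_to_records(paragraphs, source, min_chars=1):
    """Turn raw paragraph strings into {"text", "source"} records, skipping blanks."""
    records = []
    for paragraph in paragraphs:
        cleaned = " ".join(paragraph.split())  # collapse runs of whitespace
        if len(cleaned) >= min_chars:
            records.append({"text": cleaned, "source": source})
    return records


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def convert_folder(input_dir, output_path):
    """Convert every .docx file in input_dir into a single JSON Lines file."""
    # python-docx is imported here so the two helpers above work without it
    from docx import Document

    records = []
    for docx_path in sorted(Path(input_dir).glob("*.docx")):
        doc = Document(str(docx_path))
        texts = (p.text for p in doc.paragraphs)
        records.extend(paragraphs_to_records(texts, docx_path.name))
    write_jsonl(records, output_path)
    return len(records)
```

Calling `convert_folder("docs", "train.jsonl")` (both paths are placeholders) would write one record per non-empty paragraph and return the number of records. Keeping the file name in "source" makes it easier to trace a bad sample back to its document. You can drop that field if your training script expects only "text".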