I have my data as DOCX files and want to use them to fine-tune the Falcon model. From what I can see, the data used to train the model was in JSON format. How can I convert my data into a format the model can use?
Currently I am converting my data to JSON by hand, but it is tedious work.
To convert your data from DOCX to JSON, you can use the python-docx library to extract the text from a DOCX file and write it out as JSON:
import json
from docx import Document

# Load the DOCX file
doc = Document('input.docx')

# Extract the text of each paragraph, skipping empty ones
text = [p.text for p in doc.paragraphs if p.text.strip()]

# Create one JSON object per paragraph
json_data = []
for paragraph in text:
    json_data.append({"text": paragraph})

# Save as JSON (UTF-8, keeping non-ASCII characters readable)
with open('output.json', 'w', encoding='utf-8') as json_file:
    json.dump(json_data, json_file, ensure_ascii=False, indent=2)
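
If you have many DOCX files, the same idea can be wrapped in a loop so that nothing has to be done by hand. The following is only a minimal sketch under some assumptions: the function names, the "source" field, and the choice of JSON Lines output (one JSON object per line) are illustrative and not something Falcon requires. The exact schema you need depends on the fine-tuning script you use, so check what field names it expects before relying on this.

```python
import json
from pathlib import Path


def read_docx_paragraphs(path):
    """Return the non-empty paragraph texts of one DOCX file."""
    # Imported here so the other helpers work even without python-docx installed.
    from docx import Document

    doc = Document(str(path))
    return [p.text.strip() for p in doc.paragraphs if p.text.strip()]


def paragraphs_to_records(paragraphs, source):
    """Turn paragraph strings into {"text": ..., "source": ...} records.

    The "source" field is a hypothetical extra that records the file name.
    """
    return [{"text": text, "source": source} for text in paragraphs]


def write_jsonl(records, out_path):
    """Write one JSON object per line (JSON Lines)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def convert_folder(input_dir, out_path):
    """Convert every .docx file in input_dir into a single JSONL file."""
    records = []
    for docx_path in sorted(Path(input_dir).glob("*.docx")):
        paragraphs = read_docx_paragraphs(docx_path)
        records.extend(paragraphs_to_records(paragraphs, docx_path.name))
    write_jsonl(records, out_path)
    return len(records)
```

Calling convert_folder("docs", "train.jsonl") would read every .docx file in a folder named docs (a placeholder path) and return the number of records written. Note that doc.paragraphs only covers body paragraphs, so text inside tables would need separate handling.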