Tags: python, nlp, transformer-model, summarization

How do I fine-tune FLAN-T5 for a summarization task on a custom dataset of legal documents in pt-BR?


I would like to create a small proof of concept using roughly 4,000 legal texts (already extracted to .txt files), divided into:

  1. 2,000 initial petitions / complaints (.txt files)
  2. 2,000 summaries, one per initial petition (.txt files as well)

P.S.: all text files are in Brazilian Portuguese (pt-BR).

So how can I use these .txt files to fine-tune a transformer (FLAN-T5) to generate new summaries?
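Before any model work, the petition files need to be paired with their summary files. A minimal sketch of that step, assuming a hypothetical layout where each summary shares its petition's filename (e.g. `petitions/0001.txt` ↔ `summaries/0001.txt`):

```python
from pathlib import Path

def load_pairs(petitions_dir: str, summaries_dir: str) -> list[dict]:
    """Pair each petition .txt with the summary .txt of the same name."""
    pairs = []
    for petition_path in sorted(Path(petitions_dir).glob("*.txt")):
        summary_path = Path(summaries_dir) / petition_path.name
        if not summary_path.exists():
            continue  # skip petitions that have no matching summary
        pairs.append({
            "text": petition_path.read_text(encoding="utf-8"),
            "summary": summary_path.read_text(encoding="utf-8"),
        })
    return pairs
```

If your files follow a different naming convention, adjust the matching logic accordingly; the important part is ending up with `{"text": ..., "summary": ...}` examples.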


Solution

  • I wrote a post and published a Colab covering exactly this, if you want all of the details and code. (Post), (Colab Notebook)

    The basic steps that I would recommend are:

    1. Install the adapter-transformers library. (Docs)
    2. Use Flan-T5's tokenizer to convert each example from Unicode to the tokens used by Flan-T5. (Docs)
    3. Fine-tune a set of changes to the weights using LoRA. (Docs)
    4. Merge the low-rank changes back into the original weights.

    Another way of doing it would be to fine-tune all of the model's weights without adapter methods, but that takes longer and uses more VRAM without noticeably improving performance.

    Note: Flan-T5 was mostly trained on English text, which means it won't perform as well on other languages.