Tags: python, nlp, transformer-model, summarization

How do I fine-tune FLAN-T5 for a summarization task on a custom dataset of legal documents in pt-BR?


I would like to create a small proof of concept using roughly 4,000 legal texts (already extracted to .txt files), divided into:

  1. 2,000 initial petitions / complaints (.txt files)
  2. 2,000 summaries, one per initial petition (.txt files as well)

P.S.: all text files are in Brazilian Portuguese (pt-BR).

So how can I use these .txt files to fine-tune a transformer (FLAN-T5) to generate new summaries?
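Before any model work, the petition files need to be paired with their summary files. A minimal sketch of that step, assuming a hypothetical layout where each summary shares its petition's filename (e.g. `petitions/0001.txt` ↔ `summaries/0001.txt`):

```python
from pathlib import Path

def load_pairs(petitions_dir: str, summaries_dir: str) -> list[dict]:
    """Pair each petition .txt with the summary .txt of the same name."""
    pairs = []
    for petition_path in sorted(Path(petitions_dir).glob("*.txt")):
        summary_path = Path(summaries_dir) / petition_path.name
        if not summary_path.exists():
            continue  # skip petitions that have no matching summary
        pairs.append({
            "text": petition_path.read_text(encoding="utf-8"),
            "summary": summary_path.read_text(encoding="utf-8"),
        })
    return pairs
```

If your files follow a different naming convention, adjust the matching logic accordingly; the important part is ending up with `{"text": ..., "summary": ...}` examples.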


Solution

  • I wrote a post and published a Colab covering exactly this, if you want all of the details and code. (Post), (Colab Notebook)

    The basic steps that I would recommend are:

    1. Install the adapter-transformers library. (Docs)
    2. Use Flan-T5's tokenizer to convert each example from Unicode to the tokens used by Flan-T5. (Docs)
    3. Fine-tune a set of changes to the weights using LoRA. (Docs)
    4. Merge the low-rank changes back into the original weights.

    Another way of doing it would be to fine-tune all of the model's weights without adapter methods, but that takes longer and uses more VRAM without noticeably improving performance.

    Note: Flan-T5 was mostly trained on English text, which means it won't perform as well on other languages.