python spacy summary pytextrank

What is the optimal value of limit_phrases for the summary method in pyTextRank?


I am summarizing documents using the TextRank pipeline in spaCy. I need to summarize both long and short documents. Can you suggest a good approach for choosing the value of the limit_phrases parameter?

This is the approach I am currently using, but I am sure it can be improved:

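import spacy
import pytextrank  # imported for its side effect: registers the "textrank" factory

# spacy_model, text and percentage are assumed to be defined earlier,
# e.g. spacy_model = "en_core_web_sm" and percentage = 0.2
nlp = spacy.load(spacy_model)
nlp.add_pipe("textrank", last=True)

# Process the input text
doc = nlp(text)

doc_sentences = len(list(doc.sents))
print(f'Number of document sentences = {doc_sentences}')
# Keep at least one sentence so that short documents still get a summary
limit_sentences = max(1, int(doc_sentences * percentage))
# Heuristic: use twice as many top-ranked phrases as summary sentences
limit_phrases = limit_sentences * 2

# summary() yields sentences lazily; materialize them into a list
top_sentences = list(doc._.textrank.summary(limit_phrases=limit_phrases, limit_sentences=limit_sentences, preserve_order=True))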

Solution

  • The optimal value for limit_phrases will depend strongly on your content. Do you have any kind of benchmark against which you could run tests, essentially doing a grid search to find a reasonable setting for this parameter?

    FWIW, I'm one of the authors of pytextrank, and this is a really good question. As far as our team knows, there's no analytic way to determine how to set this parameter.
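
The grid search suggested above could look roughly like the sketch below. It is only an illustration under several assumptions: the benchmark is a list of (text, reference summary) pairs that you supply, the score is a crude unigram-overlap F1 standing in for a proper metric such as ROUGE, and the names unigram_f1, grid_search_limit_phrases and make_textrank_summarizer are hypothetical helpers, not part of pytextrank. The default model name and percentage are placeholders as well.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

# A summarizer takes (text, limit_phrases) and returns the summary as one string.
Summarizer = Callable[[str, int], str]


def unigram_f1(candidate: str, reference: str) -> float:
    """Crude ROUGE-1-style F1 on lowercased, whitespace-split tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def grid_search_limit_phrases(
    summarize: Summarizer,
    benchmark: Sequence[Tuple[str, str]],
    candidates: Sequence[int],
) -> Tuple[int, Dict[int, float]]:
    """Score each candidate limit_phrases by mean F1 over a non-empty benchmark."""
    scores: Dict[int, float] = {}
    for limit_phrases in candidates:
        f1s = [unigram_f1(summarize(text, limit_phrases), ref)
               for text, ref in benchmark]
        scores[limit_phrases] = sum(f1s) / len(f1s)
    best = max(scores, key=scores.get)
    return best, scores


def make_textrank_summarizer(model_name: str = "en_core_web_sm",
                             percentage: float = 0.2) -> Summarizer:
    """Wrap the pipeline from the question; model and percentage are placeholders."""
    import spacy
    import pytextrank  # noqa: F401  (registers the "textrank" factory)

    nlp = spacy.load(model_name)
    nlp.add_pipe("textrank", last=True)

    def summarize(text: str, limit_phrases: int) -> str:
        doc = nlp(text)
        limit_sentences = max(1, int(len(list(doc.sents)) * percentage))
        sents = doc._.textrank.summary(limit_phrases=limit_phrases,
                                       limit_sentences=limit_sentences,
                                       preserve_order=True)
        return " ".join(sent.text for sent in sents)

    return summarize
```

With a benchmark in hand, calling grid_search_limit_phrases(make_textrank_summarizer(), benchmark, [2, 5, 10, 20, 40]) returns the best-scoring candidate along with the mean score for every value tried. Since long and short documents may favour different settings, it is probably worth running the search separately per length bucket, or searching over the phrases-per-sentence multiplier rather than an absolute count.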