I am summarizing documents with the TextRank pipeline (pytextrank) in spaCy. I need to summarize both long and short documents. Can you suggest a good approach for choosing the right value of the `limit_phrases` parameter?
This is the approach I am currently using, but I am sure it can be improved:
import spacy
import pytextrank

# Example values -- substitute your own model and summary ratio
spacy_model = "en_core_web_sm"
percentage = 0.2  # fraction of sentences to keep in the summary

nlp = spacy.load(spacy_model)
nlp.add_pipe("textrank", last=True)

# Process the input text
doc = nlp(text)
doc_sentences = len(list(doc.sents))
print(f"Number of document sentences = {doc_sentences}")

# Scale both limits with the document length
limit_sentences = int(doc_sentences * percentage)
limit_phrases = limit_sentences * 2

top_sentences = doc._.textrank.summary(
    limit_phrases=limit_phrases,
    limit_sentences=limit_sentences,
    preserve_order=True,
)
The optimal value for `limit_phrases` will depend strongly on your content. Do you have any kind of benchmark against which you could run tests, essentially doing a grid search to find a reasonable setting for this parameter?
FWIW, I'm one of the authors of pytextrank, and this is a really good question. As far as our team knows, there is no analytic way to determine how to set this parameter.
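To make the grid-search suggestion concrete, here is a minimal sketch. It assumes you have a small corpus of (document, reference summary) pairs; the `overlap_f1` metric below is a crude token-overlap stand-in for a real metric such as ROUGE, and the function names are hypothetical, not part of pytextrank:

```python
from typing import Callable, Iterable, List, Tuple


def overlap_f1(candidate: str, reference: str) -> float:
    """Crude token-overlap F1 score; a stand-in for a real metric like ROUGE."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    common = len(cand & ref)
    if common == 0:
        return 0.0
    precision = common / len(cand)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def grid_search_limit_phrases(
    summarize: Callable[[str, int], str],
    corpus: List[Tuple[str, str]],       # (document, reference summary) pairs
    candidates: Iterable[int],
) -> int:
    """Return the limit_phrases value with the best mean score on the corpus."""
    best_value, best_score = None, -1.0
    for limit in candidates:
        score = sum(
            overlap_f1(summarize(text, limit), ref) for text, ref in corpus
        ) / len(corpus)
        if score > best_score:
            best_value, best_score = limit, score
    return best_value
```

In practice, `summarize` would be a wrapper that runs your spaCy/pytextrank pipeline with the given `limit_phrases` (keeping `limit_sentences` tied to document length, as in your code) and joins the returned sentences into a string.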