If this is not the best place to ask this question, please lead me to the most accurate one.
I am planning to use one of the Huggingface summarization models (https://huggingface.co/models?pipeline_tag=summarization) to summarize my lecture video transcriptions.
So far I have tested facebook/bart-large-cnn
and sshleifer/distilbart-cnn-12-6
, but they only support a maximum of 1,024 tokens as input.
So, here are my questions:
Are there any summarization models that support longer inputs such as 10,000 word articles?
What are the optimal output lengths for given input lengths? Let's say for a 1,000 word input, what is the optimal (minimum) output length (the min. length of the summarized text)?
Which model would likely work on programming related articles?
Are there any summarization models that support longer inputs such as 10,000 word articles?
Yes, the Longformer Encoder-Decoder (LED) [1] model published by Beltagy et al. is able to process up to 16k tokens. Various LED models are available here on HuggingFace. There is also PEGASUS-X [2] published recently by Phang et al. which is also able to process up to 16k tokens. Models are also available here on HuggingFace.
Alternatively, you can look at either:
max_input_length
(e.g. 1024), summarise each, and then concatenate together. Care will have to be taken as to how the documents are chunked as to avoid chunking mid-way through particular topics, or having a relatively short final chunk that may produce an unusable summary.What are the optimal output lengths for given input lengths? Let's say for a 1,000 word input, what is the optimal (minimum) output length (i.e. the min. length of the summarized text)?
This is a very difficult question to answer as it hard to empirically evaluate the quality of a summarisation. I would suggest running a few tests yourself with varied output length limits (e.g. 20, 50, 100, 200) and find what subjectively works best. Each model and document genre will be different. Anecdotally, I would say 50 words will a good minimum, with 100-150 offering better results.
Which model would likely to work on programming related articles?
I can imagine three possible cases for what constitutes a programming related article.
For case (1), I'm not aware of any implementations on HuggingFace that focus on this problem. However, it is an active research topic (see [3], [4], [5]).
For case (2), you can use the models you've been using already, and if feasible, fine tune on your own specific dataset of programming related articles.
For case (3), simply look at combining implementations from both (1) and (2) based on whether the input is categorised as either formal (code) or informal (natural) language.
[1] Beltagy, I., Peters, M.E. and Cohan, A., 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
[2] Phang, J., Zhao, Y. and Liu, P.J., 2022. Investigating Efficiently Extending Transformers for Long Input Summarization. arXiv preprint arXiv:2208.04347.
[3] Ahmad, W.U., Chakraborty, S., Ray, B. and Chang, K.W., 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653.
[4] Wei, B., Li, G., Xia, X., Fu, Z. and Jin, Z., 2019. Code generation as a dual task of code summarization. Advances in neural information processing systems, 32.
[5] Wan, Y., Zhao, Z., Yang, M., Xu, G., Ying, H., Wu, J. and Yu, P.S., 2018, September. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE international conference on automated software engineering (pp. 397-407).