Question 1
Are there any summarization models that support longer inputs, such as 10,000-word articles?
Yes. The Longformer Encoder-Decoder (LED) [1] model published by Beltagy et al. can process up to 16k tokens, and various pre-trained LED models are available on HuggingFace. There is also PEGASUS-X [2], published more recently by Phang et al., which can likewise process up to 16k tokens; pre-trained models are also available on HuggingFace.
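As a minimal sketch, the snippet below loads an LED checkpoint through the `transformers` summarization pipeline; the checkpoint name `allenai/led-large-16384-arxiv` is one publicly listed option, and the generation lengths are illustrative rather than tuned.

```python
# A minimal sketch, assuming the transformers library and the publicly
# listed checkpoint "allenai/led-large-16384-arxiv" (16k-token input limit).
from transformers import pipeline

summarizer = pipeline("summarization", model="allenai/led-large-16384-arxiv")

with open("article.txt") as f:  # hypothetical input file
    article = f.read()

# LED accepts up to 16k tokens, so a ~10,000-word article typically
# fits without any chunking.
summary = summarizer(article, max_length=256, min_length=64, truncation=True)
print(summary[0]["summary_text"])
```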
Alternatively, you can look at either:
- Extractive followed by abstractive summarisation, or
- Splitting a large document into chunks of `max_input_length` tokens (e.g. 1024), summarising each chunk, and then concatenating the summaries (a sketch follows this list). Care must be taken with how the document is chunked, so as to avoid splitting mid-way through a topic, or leaving a relatively short final chunk that may produce an unusable summary.
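As a rough sketch of the second option, assuming a BART-style checkpoint with a 1024-token input limit (`facebook/bart-large-cnn` is one such model) and a naive fixed-size split on token boundaries:

```python
# A rough sketch of chunked summarisation, assuming the transformers
# library and the facebook/bart-large-cnn checkpoint (1024-token limit).
# The fixed-size split is naive; a better version would split on topic
# or paragraph boundaries, as noted above.
from transformers import AutoTokenizer, pipeline

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name)

def summarise_long(text: str, max_input_length: int = 1024) -> str:
    # Tokenise once, then split the token ids into windows that fit the model.
    ids = tokenizer(text, truncation=False)["input_ids"]
    chunks = [
        tokenizer.decode(ids[i : i + max_input_length], skip_special_tokens=True)
        for i in range(0, len(ids), max_input_length)
    ]
    # Summarise each chunk, then concatenate the partial summaries.
    summaries = [
        summarizer(chunk, max_length=128, min_length=32, truncation=True)[0]["summary_text"]
        for chunk in chunks
    ]
    return " ".join(summaries)

print(summarise_long(open("article.txt").read()))  # hypothetical input file
```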
Question 2
What are the optimal output lengths for given input lengths? Let's say for a 1,000-word input, what is the optimal (minimum) output length (i.e. the minimum length of the summarized text)?
This is a very difficult question to answer, as it is hard to empirically evaluate the quality of a summarisation. I would suggest running a few tests yourself with varied output length limits (e.g. 20, 50, 100, 200) and finding what subjectively works best; each model and document genre will behave differently. Anecdotally, I would say 50 words is a good minimum, with 100-150 offering better results.
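As a minimal sketch of such a test, assuming the same pipeline API as above, you can sweep the `min_length` generation parameter and inspect the outputs by eye:

```python
# A minimal sketch of a manual length sweep, assuming the transformers
# pipeline API; the limits tried here are illustrative, not recommendations.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = open("article.txt").read()  # hypothetical input file

for min_len in (20, 50, 100, 200):
    out = summarizer(
        article,
        min_length=min_len,
        max_length=min_len + 100,  # arbitrary headroom above the minimum
        truncation=True,
    )
    print(f"--- min_length={min_len} ---")
    print(out[0]["summary_text"])
```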
Question 3
Which model would likely work on programming-related articles?
I can imagine three possible cases for what constitutes a programming-related article.
- Source code summarisation, i.e. producing a natural (informal) language summary of code (a formal language).
- Traditional abstractive summarisation, i.e. a natural language summary of natural language, for articles that discuss programming but contain no code.
- A combination of both (1) and (2).
For case (1), I'm not aware of any implementations on HuggingFace that focus on this problem. However, it is an active research topic (see [3], [4], [5]).
For case (2), you can use the models you've been using already and, if feasible, fine-tune them on your own dataset of programming-related articles.
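As a rough fine-tuning sketch, assuming the `transformers` Seq2SeqTrainer and a dataset with hypothetical "article" and "summary" columns; the hyperparameters are placeholders, not recommendations:

```python
# A rough fine-tuning sketch using the transformers Seq2SeqTrainer.
# The "article"/"summary" column names and all hyperparameters are
# hypothetical placeholders; substitute your own dataset and values.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Placeholder dataset of your own programming-related articles.
train_dataset = Dataset.from_dict({
    "article": ["full text of a programming-related article ..."],
    "summary": ["its reference summary ..."],
})

def preprocess(batch):
    # Tokenise inputs and targets; 1024/128 reflect BART's usual limits.
    inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = train_dataset.map(
    preprocess, batched=True, remove_columns=train_dataset.column_names
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-programming", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```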
For case (3), simply look at combining implementations from both (1) and (2) based on whether the input is categorised as either formal (code) or informal (natural) language.
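As a purely illustrative sketch of that routing, where `summarise_code` and `summarise_text` are hypothetical stand-ins for case (1) and case (2) implementations, and a naive keyword heuristic stands in for a real formal/informal language classifier:

```python
# A purely illustrative routing sketch. summarise_code and summarise_text
# are hypothetical stand-ins for case (1) and case (2) implementations,
# and the keyword check is a naive placeholder for a real classifier.
CODE_MARKERS = ("def ", "class ", "import ", "{", "};", "->")

def looks_like_code(segment: str) -> bool:
    # Naive heuristic: treat the segment as code if any marker appears.
    return any(marker in segment for marker in CODE_MARKERS)

def summarise_mixed(segments: list[str]) -> str:
    parts = []
    for segment in segments:
        if looks_like_code(segment):
            parts.append(summarise_code(segment))  # case (1), hypothetical
        else:
            parts.append(summarise_text(segment))  # case (2), hypothetical
    return " ".join(parts)
```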
References
[1] Beltagy, I., Peters, M.E. and Cohan, A., 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
[2] Phang, J., Zhao, Y. and Liu, P.J., 2022. Investigating Efficiently Extending Transformers for Long Input Summarization. arXiv preprint arXiv:2208.04347.
[3] Ahmad, W.U., Chakraborty, S., Ray, B. and Chang, K.W., 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653.
[4] Wei, B., Li, G., Xia, X., Fu, Z. and Jin, Z., 2019. Code generation as a dual task of code summarization. Advances in Neural Information Processing Systems, 32.
[5] Wan, Y., Zhao, Z., Yang, M., Xu, G., Ying, H., Wu, J. and Yu, P.S., 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (pp. 397-407).