Which HuggingFace summarization models support more than 1024 tokens? Which model is more suitable for programming related articles?
If this is not the best place to ask this question, please point me to a more suitable one.

I am planning to use one of the HuggingFace summarization models (https://huggingface.co/models?pipeline_tag=summarization) to summarize my lecture video transcriptions.

So far I have tested facebook/bart-large-cnn and sshleifer/distilbart-cnn-12-6, but they only support a maximum of 1,024 tokens as input.

So, here are my questions:

  1. Are there any summarization models that support longer inputs, such as 10,000-word articles?

  2. What are the optimal output lengths for given input lengths? Let's say for a 1,000-word input, what is the optimal (minimum) output length (i.e. the minimum length of the summarized text)?

  3. Which model would likely work on programming related articles?

Geneva answered 27/10, 2022 at 21:45 Comment(0)

Question 1

Are there any summarization models that support longer inputs such as 10,000 word articles?

Yes, the Longformer Encoder-Decoder (LED) [1] model published by Beltagy et al. can process up to 16k tokens, and various LED checkpoints are available on HuggingFace. There is also PEGASUS-X [2], published recently by Phang et al., which likewise handles inputs of up to 16k tokens, with models available on HuggingFace as well.
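
As a minimal sketch of how one of these might be used (the allenai/led-large-16384-arxiv checkpoint and the transcript.txt filename are assumptions for illustration; any LED or PEGASUS-X summarisation checkpoint would slot in the same way):

    from transformers import pipeline

    # Assumed checkpoint: an LED model fine-tuned for summarisation
    # (accepts up to 16,384 input tokens).
    summarizer = pipeline("summarization", model="allenai/led-large-16384-arxiv")

    with open("transcript.txt") as f:  # hypothetical transcript file
        transcript = f.read()

    summary = summarizer(
        transcript,
        max_length=256,   # upper bound on generated summary tokens
        min_length=64,    # lower bound on generated summary tokens
        truncation=True,  # safety net for inputs beyond the 16k limit
    )
    print(summary[0]["summary_text"])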

Alternatively, you can look at either:

  1. Extractive followed by abstractive summarisation, or
  2. Splitting a large document into chunks of max_input_length (e.g. 1024 tokens), summarising each chunk, and then concatenating the summaries together (see the sketch below this list). Care must be taken in how the document is chunked, so as to avoid splitting mid-way through a topic, or producing a relatively short final chunk that yields an unusable summary.
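
A rough sketch of approach (2), assuming the facebook/bart-large-cnn checkpoint from the question and a simple paragraph-packing heuristic (both are illustrative choices, not the only way to do this):

    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    tokenizer = summarizer.tokenizer
    MAX_TOKENS = 1024  # the model's maximum input length

    def chunk_text(text, max_tokens=MAX_TOKENS):
        """Greedily pack paragraphs into chunks that stay under the token limit."""
        chunks, current = [], ""
        for para in text.split("\n\n"):
            candidate = (current + "\n\n" + para).strip()
            if len(tokenizer.encode(candidate)) < max_tokens:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = para  # a single over-long paragraph is truncated later
        if current:
            chunks.append(current)
        return chunks

    document = open("transcript.txt").read()  # hypothetical input file
    partial_summaries = [
        summarizer(chunk, max_length=150, min_length=50, truncation=True)[0]["summary_text"]
        for chunk in chunk_text(document)
    ]
    print(" ".join(partial_summaries))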

Question 2

What are the optimal output lengths for given input lengths? Let's say for a 1,000 word input, what is the optimal (minimum) output length (i.e. the min. length of the summarized text)?

This is a very difficult question to answer, as it is hard to empirically evaluate the quality of a summarisation. I would suggest running a few tests yourself with varied output length limits (e.g. 20, 50, 100, 200) to see what subjectively works best; each model and document genre will be different. Anecdotally, I would say 50 words is a good minimum, with 100-150 offering better results. A quick way to compare is sketched below.
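
For example, a quick subjective comparison could loop over the candidate minimums above (the checkpoint and input file are again placeholder choices):

    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    text = open("sample_lecture.txt").read()  # hypothetical input file

    # Try each candidate minimum length and eyeball the resulting summaries.
    for min_len in (20, 50, 100, 200):
        result = summarizer(
            text,
            min_length=min_len,
            max_length=min_len + 100,  # arbitrary headroom above the minimum
            truncation=True,
        )
        print(f"--- min_length={min_len} ---")
        print(result[0]["summary_text"])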

Question 3

Which model would likely work on programming related articles?

I can imagine three possible cases for what constitutes a programming related article.

  1. Source code summarisation, i.e. producing a natural (informal) language summary of code (a formal language).
  2. Traditional abstractive summarisation, i.e. a natural language summary of natural language, for articles that discuss programming but contain no code.
  3. Combination of both 1 and 2.

For case (1), I'm not aware of any implementations on HuggingFace that focus on this problem. However, it is an active research topic (see [3], [4], [5]).

For case (2), you can use the models you've been using already and, if feasible, fine-tune them on your own dataset of programming related articles (a rough sketch of this follows).
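
A minimal fine-tuning sketch with the Seq2SeqTrainer API, assuming you have collected a hypothetical programming_articles.csv with article and summary columns (all file names and hyperparameters here are illustrative, not tuned recommendations):

    from datasets import load_dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    # Hypothetical CSV of (article, summary) pairs you have collected.
    dataset = load_dataset("csv", data_files="programming_articles.csv")["train"]

    checkpoint = "facebook/bart-large-cnn"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def preprocess(batch):
        model_inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
        labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir="bart-programming-articles",  # illustrative output directory
            per_device_train_batch_size=2,
            num_train_epochs=3,
            learning_rate=2e-5,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()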

For case (3), simply look at combining implementations from both (1) and (2) based on whether the input is categorised as either formal (code) or informal (natural) language.

References

[1] Beltagy, I., Peters, M.E. and Cohan, A., 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

[2] Phang, J., Zhao, Y. and Liu, P.J., 2022. Investigating Efficiently Extending Transformers for Long Input Summarization. arXiv preprint arXiv:2208.04347.

[3] Ahmad, W.U., Chakraborty, S., Ray, B. and Chang, K.W., 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653.

[4] Wei, B., Li, G., Xia, X., Fu, Z. and Jin, Z., 2019. Code generation as a dual task of code summarization. Advances in neural information processing systems, 32.

[5] Wan, Y., Zhao, Z., Yang, M., Xu, G., Ying, H., Wu, J. and Yu, P.S., 2018, September. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE international conference on automated software engineering (pp. 397-407).

Thoma answered 28/10, 2022 at 0:48 Comment(6)
Thank you so much for the answers. By programming I didn't exactly mean source code; it is more that I am teaching programming, e.g. a C# course. Do you know why the models don't support sliding-window splitting of longer texts out of the box? I found a solution, but it splits on exact character counts. I will check the answers. – Meatiness
@MonsterMMORPG You're welcome. Probably because a "sliding window" that chunks the input and produces intermediate summaries feeding into an overall summary is not as straightforward as it sounds (see this paper by Gidiotis & Tsoumakas), so the development effort may not be warranted. There does seem to be a concerted research effort to produce efficient models that scale to large documents, so we will get there eventually. – Thoma
@MonsterMMORPG In the meantime, also check out this HuggingFace thread on the topic of long-document summarisation; you might find some more useful answers there. – Thoma
Thank you so much for the answers. Where can I see the hyperparameters of a model? Most models also have no usage example. I got this one working, but I have no idea what hyperparameters it has or what they do: i.imgur.com/gUYw1HC.png (pszemraj/led-large-book-summary). – Meatiness
Also, can I somehow display the progress of the summarization? Currently I have no idea whether it will take minutes, hours, or days. – Meatiness
Amazing, thank you for such a detailed answer, Kyle! – Wishful
