I have a piece of text of 4226 characters (316 words + special characters), and I am trying different combinations of min_length and max_length to get a summary, e.g.
print(summarizer(INPUT, max_length=1000, min_length=500, do_sample=False))
The code is:
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
INPUT = """We see ChatGPT as an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. As ChatGPT stated, large language models can be put to work as a communication engine in a variety of applications across a number of vertical markets. Glaringly absent in its answer is the use of ChatGPT in search engines. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The use of a large language model enables more complex and more natural searches and extract deeper meaning and better context from source material. This is ultimately expected to deliver more robust and useful results. Is AI coming for your job? Every wave of new and disruptive technology has incited fears of mass job losses due to automation, and we are already seeing those fears expressed relative to AI generally and ChatGPT specifically. The year 1896, when Henry Ford rolled out his first automobile, was probably not a good year for buggy whip makers. When IBM introduced its first mainframe, the System/360, in 1964, office workers feared replacement by mechanical brains that never made mistakes, never called in sick, and never took vacations. There are certainly historical cases of job displacement due to new technology adoption, and ChatGPT may unseat some office workers or customer service reps. However, we think AI tools broadly will end up as part of the solution in an economy that has more job openings than available workers. However, economic history shows that technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth. How big is the opportunity? The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data. We expect the market to grow by 20% CAGR to reach USD 90bn by 2025. 
Given the relatively early monetization stage of conversational AI, we estimate that the segment accounted for 10% of the broader AI’s addressable market in 2020, predominantly from enterprise and consumer subscriptions. That said, user adoption is rapidly rising. ChatGPT reached its first 1 million user milestone in a week, surpassing Instagram to become the quickest application to do so. Similarly, we see strong interest from enterprises to integrate conservational AI into their existing ecosystem. As a result, we believe conversational AI’s share in the broader AI’s addressable market can climb to 20% by 2025 (USD 18–20bn). Our estimate may prove to be conservative; they could be even higher if conversational AI improvements (in terms of computing power, machine learning, and deep learning capabilities), availability of talent, enterprise adoption, spending from governments, and incentives are stronger than expected. How to invest in AI? We see artificial intelligence as a horizontal technology that will have important use cases across a number of applications and industries. From a broader perspective, AI, along with big data and cybersecurity, forms what we call the ABCs of technology. We believe these three major foundational technologies are at inflection points and should see faster adoption over the next few years as enterprises and governments increase their focus and investments in these areas. Conservational AI is currently in its early stages of monetization and costs remain high as it is expensive to run. Instead of investing directly in such platforms, interested investors in the short term can consider semiconductor companies, and cloud-service providers that provides the infrastructure needed for generative AI to take off. In the medium to long term, companies can integrate generative AI to improve margins across industries and sectors, such as within healthcare and traditional manufacturing. 
Outside of public equities, investors can also consider opportunities in private equity (PE). We believe the tech sector is currently undergoing a new innovation cycle after 12–18 months of muted activity, which provides interesting and new opportunities that PE can capture through early-stage investments."""
print(summarizer(INPUT, max_length = 1000, min_length=500, do_sample=False))
Questions I have are: Why do I get this warning?
Your max_length is set to 1000, but you input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=428)
And why does the summary end with text that appears nowhere in my input?
For confidential support call the Samaritans on 08457 90 90 90 or visit a local Samaritans branch, see www.samaritans.org for details. For support …
Q5: What is the max input that I can actually give to this summarizer?
A: The length that the model sees is the no. of subword tokens, not the no. of characters, so Q2 as phrased is an out-of-scope question. It's more appropriate to check whether the output of the model is shorter than the input's no. of subword tokens.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
tokenized_text = tokenizer(text)
print(len(tokenized_text['input_ids']))
[out]:
800
Your max_length is set to 1000 ...
The warning message is as such:
Your max_length is set to 1000, but you input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=428)
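Note that the suggested value in that warning is simply half of the input token length (856 // 2 = 428). A minimal sketch of my reading of that heuristic (a paraphrase, not the library's actual code):

```python
def suggested_max_length(input_length, max_length):
    # When max_length exceeds the input's token length, the pipeline's
    # warning proposes half of the input length as a saner cap.
    return input_length // 2 if max_length > input_length else max_length

print(suggested_max_length(856, 1000))  # 428, as in the warning above
```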
[code]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model.generate(tokenized_text['input_ids'])
tokenizer.decode(outputs[0], skip_special_tokens=True)
[stderr]:
/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py:1288:
UserWarning: Using `max_length`'s default (142) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
[stdout]:
ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data.
Checking the output shape (no. of tokens) and the length of the decoded summary (no. of characters):
print(outputs.shape)
print(len(tokenizer.decode(outputs[0], skip_special_tokens=True)))
[out]:
torch.Size([1, 73])
343
Not sure how you got an output of 2k+ characters though, so let's try with the pipeline.
[code]:
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
output = summarizer(text)
print(output)
[out]:
[{'summary_text': 'ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data.'}]
Checking the size of the output:
print(len(output[0]['summary_text']))
[out]:
343
This is consistent with how we used the model without the pipeline: a 343-character summary.
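As a back-of-envelope check on those numbers (my own arithmetic, nothing from the library): 73 generated subword tokens decoding to 343 characters works out to roughly 4.7 characters per token, which is why character counts and token counts should never be conflated.

```python
# Back-of-envelope check using the numbers reported above:
# 73 generated subword tokens decoded to a 343-character summary.
tokens, chars = 73, 343
chars_per_token = chars / tokens
print(round(chars_per_token, 1))  # 4.7 characters per subword token
```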
Do I need to set max_new_tokens? Yeah, kind of; you don't have to do anything, since the summary is already shorter than the input text.
What does max_new_tokens do? We know that the default output summary gives us 73 tokens. Let's try and see what happens if we set it down to 30 tokens!
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model.generate(tokenized_text['input_ids'], max_new_tokens=30)
[stderr]:
ValueError Traceback (most recent call last)
<ipython-input-26-665cd5fbe802> in <module>
3 tokenized_text = tokenizer(text, return_tensors="pt")
4
----> 5 model.generate(tokenized_text['input_ids'], max_new_tokens=30)
1 frames
/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
1304
1305 if generation_config.min_length is not None and generation_config.min_length > generation_config.max_length:
-> 1306 raise ValueError(
1307 f"Unfeasible length constraints: the minimum length ({generation_config.min_length}) is larger than"
1308 f" the maximum length ({generation_config.max_length})"
ValueError: Unfeasible length constraints: the minimum length (56) is larger than the maximum length (31)
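Where do 56 and 31 come from? bart-large-cnn ships with a default min_length of 56 in its generation config, and for an encoder-decoder model the effective max_length becomes max_new_tokens plus one (the decoder's single leading start token), i.e. 30 + 1 = 31. A minimal sketch of the check that raises (a paraphrase, not the library's actual code):

```python
def check_length_constraints(config_min_length, max_new_tokens):
    # For an encoder-decoder model the decoder prompt is one start
    # token, so the effective max_length is max_new_tokens + 1.
    max_length = max_new_tokens + 1
    if config_min_length > max_length:
        raise ValueError(
            f"Unfeasible length constraints: the minimum length "
            f"({config_min_length}) is larger than the maximum length ({max_length})"
        )
    return max_length

# bart-large-cnn's generation config defaults to min_length=56,
# so max_new_tokens=30 trips the check, exactly as in the traceback.
```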
So let's just try setting it to 60:
tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model.generate(tokenized_text['input_ids'], max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
[out]:
ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The broad AI hardware and services market was nearly USD 36bn
And if we check print(len(outputs[0])), we get 61 subword tokens; the additional one on top of max_new_tokens accounts for the end-of-sentence symbol. If you print the outputs, you'll see that the first token id is 2, which is represented by the </s> token. When you specify skip_special_tokens=True, it will delete the </s> token, as well as the start-of-sentence token <s>.
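To make the special-token handling concrete, here is a toy sketch of what skip_special_tokens=True conceptually does. This is a simplification with a made-up vocab (only the special-token ids 0/1/2 mirror BART's <s>/<pad>/</s>), not BART's real decode:

```python
# Toy id-to-token table; ids 0, 1, 2 mirror BART's <s>, <pad>, </s>.
ID2TOK = {0: "<s>", 1: "<pad>", 2: "</s>", 10: "Chat", 11: "GPT", 12: "rocks"}
SPECIAL = {"<s>", "<pad>", "</s>"}

def decode_sketch(ids, skip_special_tokens=False):
    toks = [ID2TOK[i] for i in ids]
    if skip_special_tokens:
        # Drop special markers before joining, like tokenizer.decode does.
        toks = [t for t in toks if t not in SPECIAL]
    return " ".join(toks)

# The generated sequence starts with id 2 (</s>), like outputs[0] above.
print(decode_sketch([2, 0, 10, 11, 12, 2]))                            # </s> <s> Chat GPT rocks </s>
print(decode_sketch([2, 0, 10, 11, 12, 2], skip_special_tokens=True))  # Chat GPT rocks
```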
Given the above examples, the min_length is actually hard to determine, since the model has to decide the minimum no. of subword tokens it needs to produce a good summary output. Remember the "Unfeasible length constraints: the minimum length (56) ..." error?
The sensible max_length, or more appropriately max_new_tokens, is most probably going to be lower than your input length, and if there are UI limitations or compute/latency limitations, it's best to keep it low and close to whatever is needed.
I.e., to set max_new_tokens, just make sure it's lower than the input text's no. of tokens and sensible enough for your application. If you want a ballpark number, try the model without setting the limit, see if the summary output is how you expect the model to behave, then adjust appropriately.
Like seasoning while cooking: "Add/reduce max_new_tokens as desired."
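That seasoning advice can be written down as a tiny rule-of-thumb helper. This is my own heuristic, not anything from transformers; the function name, the half-the-input cap, and the floor of 60 are all assumptions:

```python
def pick_max_new_tokens(input_token_count, desired_tokens, floor=60):
    # Cap the summary at half the input length (a summary approaching
    # the input's length defeats the purpose), but keep a floor so the
    # model isn't forced below its configured minimum (56 for bart-large-cnn).
    return max(floor, min(desired_tokens, input_token_count // 2))

print(pick_max_new_tokens(800, 1000))  # 400 -- matches the warning's suggestion
print(pick_max_new_tokens(800, 60))    # 60
```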
When setting the min_length to some arbitrarily large number, way larger than the default output of the model (i.e. 73 subword tokens):
print(summarizer(text, max_length=900, min_length=300, do_sample=False))
print(summarizer(text, max_length=900, min_length=500, do_sample=False))
Then it will warn you:
[stderr]:
Your max_length is set to 900, but you input_length is only 800. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=400)
It will start hallucinating things beyond the first 300-ish subword tokens. Possibly, the model thinks that beyond 300-ish subwords, nothing else from the input text is important.
And output looks something like:
[{'summary_text': 'ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. ... They recommend semiconductor companies, cloud-service providers that provides the infrastructure needed for generative AI to take off, and private equity firms that provide the infrastructure for cloud-based services. They also suggest investors can consider opportunities in private equity (PE) to invest in AI platforms in the short-term and in the medium to long-term.'}]
[{'summary_text': "ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. ... They say AI tools broadly will end up as part of the solution in an economy that has more job openings than available workers. The technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth, they say. The authors believe the tech sector is currently undergoing a new innovation cycle after 12–18 months of muted activity, which provides interesting and new opportunities that PE can capture through early-stage investments. They recommend semiconductor companies, cloud-service providers that provides the infrastructure needed for generative AI to take off, and private equity firms that provide the infrastructure for cloud-based services. They also suggest investors can consider opportunities in private equity (PE) to invest in AI platforms in the short-term and in the medium to long-term, such as within healthcare and traditional manufacturing. The author's firm is based in New York and they have worked with Microsoft, Google, Facebook, and others on AI projects in the past. The firm has also worked with Google, Microsoft, Facebook and others to develop AI products and services in the U.S. and abroad. For confidential support, call the National Suicide Prevention Lifeline at 1-800-273-8255 or visit http://www.suicidepreventionlifeline.org/. For confidential. support on suicide matters call the Samaritans on 08457 90 90 90 or visit a local Samaritans branch or click here for details. 
In the UK, contact Samaritans at 08457 909090 or visit\xa0the Samaritans’\xa0online helpline at http:// www.samaritans.org\xa0or\xa0click\xa0here for details on how to get involved in the UK’s national suicide prevention Lifeline (in the UK or the UK). For confidential help in the United States, call\xa0the National suicide Prevention Line at\xa0800\xa0273\xa08255."}]
Why does it hallucinate? Good question, and also an active research area; see https://aclanthology.org/2022.naacl-main.387/ and there are many more papers in that area.
[Opinion]: Personally, my hunch is that it's most probably because in most of the data the model learnt from, the texts are 800-ish subwords and the summaries it trained on are between 80 and 300 subwords long, and the training data points with 300-500 subwords in the summary always contain the SOS helpline. So the model starts to overfit whenever it reaches a min_length that is >300.
To test the hunch, try another random text of 800-ish subwords and set min_length to 500 again; it will most probably hallucinate the SOS sentence again beyond 300-ish subwords.