I have a piece of text of 4226 characters (316 words + special characters), and I am trying different combinations of min_length and max_length to get a summary, e.g.
print(summarizer(INPUT, max_length=1000, min_length=500, do_sample=False))
The code is:
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
INPUT = """We see ChatGPT as an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. As ChatGPT stated, large language models can be put to work as a communication engine in a variety of applications across a number of vertical markets. Glaringly absent in its answer is the use of ChatGPT in search engines. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The use of a large language model enables more complex and more natural searches and extract deeper meaning and better context from source material. This is ultimately expected to deliver more robust and useful results. Is AI coming for your job? Every wave of new and disruptive technology has incited fears of mass job losses due to automation, and we are already seeing those fears expressed relative to AI generally and ChatGPT specifically. The year 1896, when Henry Ford rolled out his first automobile, was probably not a good year for buggy whip makers. When IBM introduced its first mainframe, the System/360, in 1964, office workers feared replacement by mechanical brains that never made mistakes, never called in sick, and never took vacations. There are certainly historical cases of job displacement due to new technology adoption, and ChatGPT may unseat some office workers or customer service reps. However, we think AI tools broadly will end up as part of the solution in an economy that has more job openings than available workers. However, economic history shows that technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth. How big is the opportunity? The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data. We expect the market to grow by 20% CAGR to reach USD 90bn by 2025. 
Given the relatively early monetization stage of conversational AI, we estimate that the segment accounted for 10% of the broader AI’s addressable market in 2020, predominantly from enterprise and consumer subscriptions. That said, user adoption is rapidly rising. ChatGPT reached its first 1 million user milestone in a week, surpassing Instagram to become the quickest application to do so. Similarly, we see strong interest from enterprises to integrate conservational AI into their existing ecosystem. As a result, we believe conversational AI’s share in the broader AI’s addressable market can climb to 20% by 2025 (USD 18–20bn). Our estimate may prove to be conservative; they could be even higher if conversational AI improvements (in terms of computing power, machine learning, and deep learning capabilities), availability of talent, enterprise adoption, spending from governments, and incentives are stronger than expected. How to invest in AI? We see artificial intelligence as a horizontal technology that will have important use cases across a number of applications and industries. From a broader perspective, AI, along with big data and cybersecurity, forms what we call the ABCs of technology. We believe these three major foundational technologies are at inflection points and should see faster adoption over the next few years as enterprises and governments increase their focus and investments in these areas. Conservational AI is currently in its early stages of monetization and costs remain high as it is expensive to run. Instead of investing directly in such platforms, interested investors in the short term can consider semiconductor companies, and cloud-service providers that provides the infrastructure needed for generative AI to take off. In the medium to long term, companies can integrate generative AI to improve margins across industries and sectors, such as within healthcare and traditional manufacturing. 
Outside of public equities, investors can also consider opportunities in private equity (PE). We believe the tech sector is currently undergoing a new innovation cycle after 12–18 months of muted activity, which provides interesting and new opportunities that PE can capture through early-stage investments."""
print(summarizer(INPUT, max_length = 1000, min_length=500, do_sample=False))
Questions I have are: Why do I get this warning?
Your max_length is set to 1000, but you input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=428)
And why does the summary end with text that appears nowhere in my input?
For confidential support call the Samaritans on 08457 90 90 90 or visit a local Samaritans branch, see www.samaritans.org for details. For support …
Q5: What is the max input that I can actually give to this summarizer?
A: The length that the model sees is the no. of subword tokens, not the no. of characters, so Q2 as phrased is an out-of-scope question. It's more appropriate to check whether the output of the model is shorter than the input's no. of subword tokens.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
tokenized_text = tokenizer(text)
print(len(tokenized_text['input_ids']))
[out]:
800
Your max_length is set to 1000 ...
The warning message is as such:
Your max_length is set to 1000, but you input_length is only 856. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=428)
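Note that the suggested value in that warning is simply half of the input token length (856 // 2 = 428). A minimal sketch of my reading of that heuristic (a paraphrase, not the library's actual code):

```python
def suggested_max_length(input_length, max_length):
    # When max_length exceeds the input's token length, the pipeline's
    # warning proposes half of the input length as a saner cap.
    return input_length // 2 if max_length > input_length else max_length

print(suggested_max_length(856, 1000))  # 428, as in the warning above
```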
[code]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model.generate(tokenized_text['input_ids'])
tokenizer.decode(outputs[0], skip_special_tokens=True)
[stderr]:
/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py:1288:
UserWarning: Using `max_length`'s default (142) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
[stdout]:
ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data.
Checking the output shape (no. of tokens) and the length of the decoded summary (no. of characters):
print(outputs.shape)
print(len(tokenizer.decode(outputs[0], skip_special_tokens=True)))
[out]:
torch.Size([1, 73])
343
Not sure how you got an output of 2k+ characters though, so let's try with the pipeline.
[code]:
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
output = summarizer(text)
print(output)
[out]:
[{'summary_text': 'ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The broad AI hardware and services market was nearly USD 36bn in 2020, based on IDC and Bloomberg Intelligence data.'}]
Checking the size of the output:
print(len(output[0]['summary_text']))
[out]:
343
This is consistent with how we used the model without the pipeline: a 343-character summary.
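As a back-of-envelope check on those numbers (my own arithmetic, nothing from the library): 73 generated subword tokens decoding to 343 characters works out to roughly 4.7 characters per token, which is why character counts and token counts should never be conflated.

```python
# Back-of-envelope check using the numbers reported above:
# 73 generated subword tokens decoded to a 343-character summary.
tokens, chars = 73, 343
chars_per_token = chars / tokens
print(round(chars_per_token, 1))  # 4.7 characters per subword token
```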
Do I need to set max_new_tokens? Yeah, kind of; you don't have to do anything, since the summary is already shorter than the input text.
What does max_new_tokens do? We know that the default output summary gives us 73 tokens. Let's try and see what happens if we set it down to 30 tokens!
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
text = INPUT  # the same article text as in the question
tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model.generate(tokenized_text['input_ids'], max_new_tokens=30)
[stderr]:
ValueError Traceback (most recent call last)
<ipython-input-26-665cd5fbe802> in <module>
3 tokenized_text = tokenizer(text, return_tensors="pt")
4
----> 5 model.generate(tokenized_text['input_ids'], max_new_tokens=30)
1 frames
/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
1304
1305 if generation_config.min_length is not None and generation_config.min_length > generation_config.max_length:
-> 1306 raise ValueError(
1307 f"Unfeasible length constraints: the minimum length ({generation_config.min_length}) is larger than"
1308 f" the maximum length ({generation_config.max_length})"
ValueError: Unfeasible length constraints: the minimum length (56) is larger than the maximum length (31)
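Where do 56 and 31 come from? bart-large-cnn ships with a default min_length of 56 in its generation config, and for an encoder-decoder model the effective max_length becomes max_new_tokens plus one (the decoder's single leading start token), i.e. 30 + 1 = 31. A minimal sketch of the check that raises (a paraphrase, not the library's actual code):

```python
def check_length_constraints(config_min_length, max_new_tokens):
    # For an encoder-decoder model the decoder prompt is one start
    # token, so the effective max_length is max_new_tokens + 1.
    max_length = max_new_tokens + 1
    if config_min_length > max_length:
        raise ValueError(
            f"Unfeasible length constraints: the minimum length "
            f"({config_min_length}) is larger than the maximum length ({max_length})"
        )
    return max_length

# bart-large-cnn's generation config defaults to min_length=56,
# so max_new_tokens=30 trips the check, exactly as in the traceback.
```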
So let's just try setting it to 60:
tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model.generate(tokenized_text['input_ids'], max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
[out]:
ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. The broad AI hardware and services market was nearly USD 36bn
And if we check print(len(outputs[0])), we get 61 subword tokens; the additional one on top of max_new_tokens accounts for the end-of-sentence symbol. If you print the outputs, you'll see that the first token id is 2, which is represented by the </s> token. When you specify skip_special_tokens=True, it will delete the </s> token, as well as the start-of-sentence token <s>.
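To make the special-token handling concrete, here is a toy sketch of what skip_special_tokens=True conceptually does. This is a simplification with a made-up vocab (only the special-token ids 0/1/2 mirror BART's <s>/<pad>/</s>), not BART's real decode:

```python
# Toy id-to-token table; ids 0, 1, 2 mirror BART's <s>, <pad>, </s>.
ID2TOK = {0: "<s>", 1: "<pad>", 2: "</s>", 10: "Chat", 11: "GPT", 12: "rocks"}
SPECIAL = {"<s>", "<pad>", "</s>"}

def decode_sketch(ids, skip_special_tokens=False):
    toks = [ID2TOK[i] for i in ids]
    if skip_special_tokens:
        # Drop special markers before joining, like tokenizer.decode does.
        toks = [t for t in toks if t not in SPECIAL]
    return " ".join(toks)

# The generated sequence starts with id 2 (</s>), like outputs[0] above.
print(decode_sketch([2, 0, 10, 11, 12, 2]))                            # </s> <s> Chat GPT rocks </s>
print(decode_sketch([2, 0, 10, 11, 12, 2], skip_special_tokens=True))  # Chat GPT rocks
```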
Given the above examples, the min_length is actually hard to determine, since the model has to decide the minimum no. of subword tokens it needs to produce a good summary output. Remember the "Unfeasible length constraints: the minimum length (56) ..." error?
The sensible max_length, or more appropriately max_new_tokens, is most probably going to be lower than your input length, and if there are UI limitations or compute/latency limitations, it's best to keep it low and close to whatever is needed.
I.e., to set max_new_tokens, just make sure it's lower than the input text's no. of tokens and sensible enough for your application. If you want a ballpark number, try the model without setting the limit, see if the summary output is how you expect the model to behave, then adjust appropriately.
Like seasoning while cooking: "Add/reduce max_new_tokens as desired."
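That seasoning advice can be written down as a tiny rule-of-thumb helper. This is my own heuristic, not anything from transformers; the function name, the half-the-input cap, and the floor of 60 are all assumptions:

```python
def pick_max_new_tokens(input_token_count, desired_tokens, floor=60):
    # Cap the summary at half the input length (a summary approaching
    # the input's length defeats the purpose), but keep a floor so the
    # model isn't forced below its configured minimum (56 for bart-large-cnn).
    return max(floor, min(desired_tokens, input_token_count // 2))

print(pick_max_new_tokens(800, 1000))  # 400 -- matches the warning's suggestion
print(pick_max_new_tokens(800, 60))    # 60
```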
When setting the min_length to some arbitrarily large number, way larger than the default output of the model (i.e. 73 subword tokens):
print(summarizer(text, max_length=900, min_length=300, do_sample=False))
print(summarizer(text, max_length=900, min_length=500, do_sample=False))
Then it will warn you:
[stderr]:
Your max_length is set to 900, but you input_length is only 800. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=400)
It will start hallucinating things beyond the first 300-ish subword tokens. Possibly, the model thinks that beyond 300-ish subwords, nothing else from the input text is important.
And output looks something like:
[{'summary_text': 'ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. ... They recommend semiconductor companies, cloud-service providers that provides the infrastructure needed for generative AI to take off, and private equity firms that provide the infrastructure for cloud-based services. They also suggest investors can consider opportunities in private equity (PE) to invest in AI platforms in the short-term and in the medium to long-term.'}]
[{'summary_text': "ChatGPT is an engine that will eventually power human interactions with computer systems in a familiar, natural, and intuitive way. Microsoft, which is an investor in OpenAI, is integrating ChatGPT into its Bing search engine. ... They say AI tools broadly will end up as part of the solution in an economy that has more job openings than available workers. The technology of any sort (i.e., manufacturing technology, communications technology, information technology) ultimately makes productive workers more productive and is net additive to employment and economic growth, they say. The authors believe the tech sector is currently undergoing a new innovation cycle after 12–18 months of muted activity, which provides interesting and new opportunities that PE can capture through early-stage investments. They recommend semiconductor companies, cloud-service providers that provides the infrastructure needed for generative AI to take off, and private equity firms that provide the infrastructure for cloud-based services. They also suggest investors can consider opportunities in private equity (PE) to invest in AI platforms in the short-term and in the medium to long-term, such as within healthcare and traditional manufacturing. The author's firm is based in New York and they have worked with Microsoft, Google, Facebook, and others on AI projects in the past. The firm has also worked with Google, Microsoft, Facebook and others to develop AI products and services in the U.S. and abroad. For confidential support, call the National Suicide Prevention Lifeline at 1-800-273-8255 or visit http://www.suicidepreventionlifeline.org/. For confidential. support on suicide matters call the Samaritans on 08457 90 90 90 or visit a local Samaritans branch or click here for details. 
In the UK, contact Samaritans at 08457 909090 or visit\xa0the Samaritans’\xa0online helpline at http:// www.samaritans.org\xa0or\xa0click\xa0here for details on how to get involved in the UK’s national suicide prevention Lifeline (in the UK or the UK). For confidential help in the United States, call\xa0the National suicide Prevention Line at\xa0800\xa0273\xa08255."}]
Why does it hallucinate? Good question, and also an active research area; see https://aclanthology.org/2022.naacl-main.387/ and there are many more papers in that area.
[Opinion]: Personally, my hunch is that it's most probably because in most of the data the model learnt from, the texts are 800-ish subwords and the summaries it trained on are between 80 and 300 subwords long, and the training data points with 300-500 subwords in the summary always contain the SOS helpline. So the model starts to overfit whenever it reaches a min_length that is >300.
To test the hunch, try another random text of 800-ish subwords and set min_length to 500 again; it will most probably hallucinate the SOS sentence again beyond 300-ish subwords.