
How to handle a large dataset in spaCy


I use the following code to clean my dataset and print all tokens (words).

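import re
import spacy

with open(".data.csv", "r", encoding="utf-8") as file:
    text = file.read()
# Replace every character that is not an ASCII letter, a digit, ß, or one of . , ! ? - with a space
text = re.sub(r"[^a-zA-Z0-9ß\.,!\?-]", " ", text)
text = text.lower()
nlp = spacy.load("de_core_news_sm")
doc = nlp(text)
for token in doc:
    print(token.text)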

When I execute this code with a small string, it works fine. But when I use a 50 MB CSV file, I get the following message:

Text of length 62235045 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

When I increase the limit to this size, my computer struggles badly. How can I fix this? Wanting to tokenize this amount of data can't be that unusual.


Solution

  • de_core_news_sm isn't just tokenizing. It runs a number of pipeline components, including a parser and NER, and these are where you are most likely to run out of RAM on long texts. This is why spaCy includes this default limit.

    If you only want to tokenize, use spacy.blank("de") and then you can probably increase nlp.max_length to a fairly large limit without running out of RAM. (You'll still eventually run out of RAM if the text gets extremely long, but that happens at much, much longer lengths than with the parser or NER.)

    If you want to run the full de_core_news_sm pipeline, then you'd need to break your text up into smaller units. Meaningful units like paragraphs or sections can make sense. The linguistic analysis from the provided pipelines mostly depends on local context within a few neighboring sentences, so having longer texts isn't helpful. Use nlp.pipe to process batches of text more efficiently, see: https://spacy.io/usage/processing-pipelines#processing

    If you have CSV input, then it might make sense to use the individual text fields as the units.
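A minimal sketch of how these suggestions could fit together, assuming the text lives in one column of the CSV. The helper names (iter_text_fields, tokenize_texts), the column index, the has_header flag, and the batch size are hypothetical choices for illustration, not part of spaCy. The idea is to stream one field at a time instead of reading the whole file into a single string, and to pass those fields to nlp.pipe:

```python
import csv
from typing import Iterable, Iterator, List


def iter_text_fields(csv_path: str, column: int = 0,
                     has_header: bool = False) -> Iterator[str]:
    """Yield one text field per CSV row instead of loading the whole file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row if there is one
        for row in reader:
            # Ignore rows that are too short or whose text cell is empty
            if len(row) > column and row[column].strip():
                yield row[column]


def tokenize_texts(texts: Iterable[str], full_pipeline: bool = False,
                   batch_size: int = 256) -> Iterator[List[str]]:
    """Yield the token texts of each input, one list per text."""
    # Imported here so that the CSV helper above also works without spaCy.
    import spacy

    if full_pipeline:
        # Tagger, parser, NER etc.; needs the model to be installed.
        nlp = spacy.load("de_core_news_sm")
    else:
        # Tokenizer only, which is far cheaper in memory.
        nlp = spacy.blank("de")

    # nlp.pipe processes the texts in batches and yields one Doc per text.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [token.text for token in doc]
```

Usage would look something like `for tokens in tokenize_texts(iter_text_fields("data.csv", column=1, has_header=True)): print(tokens)`, where the file name and column are placeholders. Because every field is its own short Doc, nlp.max_length should not need to be raised as long as no single field exceeds it.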