I'm trying to use the LangChain text splitters library to "chunk" (divide) a massive string containing sci-fi books. I want to split it into n_chunks, with an overlap of n_length characters between consecutive chunks.
This is my code:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    chunk_size=30,
    chunk_overlap=5,
)
text_raw = """
Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""
chunks = text_splitter.split_text(text_raw)
print(chunks)
print(f'\n\n {len(chunks)}')
But this is my output:
["Water is life's matter and matrix, mother, and medium. There is no life without water.\nSave water, secure the future.\nConservation is the key to a sustainable water supply.\nEvery drop saved today is a resource for tomorrow.\nLet's work together to keep our rivers flowing and our oceans blue."]
1
My intention is to split every 30 characters, repeating the last/leading 5 characters between chunks.
for instance if this is one chunk:
'This is one Chunk after text splitting ABC'
Then I want my following chunk to be something like :
'splitting ABC This is my Second Chunk ---'
Notice how the beginning of the next chunk overlaps the last characters of the previous chunk?
That's what I'm looking for, but that is clearly not how this splitter works. I'm very new to LangChain. I have checked the official documentation but haven't found an example or tutorial like the one I'm describing.
I would also like to write a function to save the chunks from LangChain locally. Or do we have to stick to base Python for that?
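For reference, here is the exact windowing behavior I'm after, expressed in plain Python (my own sketch, not LangChain code):

```python
def sliding_window(text, size=30, overlap=5):
    # Step forward by (size - overlap) so each chunk repeats the
    # last `overlap` characters of the previous chunk.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

With `size=30, overlap=5`, every chunk starts 25 characters after the previous one, regardless of newlines or word boundaries.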
I think the subclass CharacterTextSplitter does not quite solve this problem, because it splits the long string on manually specified separators such as "\n". A better fit seems to be TokenTextSplitter, which splits a long string based on tokens. Here is what I tried based on your example:
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    chunk_size=20,
    chunk_overlap=4,
)
text_raw = """Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""
chunks = text_splitter.split_text(text_raw)
# chunks is a List[str]
for line in chunks:
    print(line.replace("\n", " "))
This is the output I got:
Water is life's matter and matrix, mother, and medium. There is no life without water.
life without water. Save water, secure the future. Conservation is the key to a
the key to a sustainable water supply. Every drop saved today is a resource for tomorrow.
for tomorrow. Let's work together to keep our rivers flowing and our oceans blue.
Note that for TokenTextSplitter, chunk_size and chunk_overlap are counted in tokens (here, tiktoken tokens), unlike CharacterTextSplitter, where they are counted in characters. So I think this is the better choice for your case. You do have to replace mechanical separators like "\n" with something more natural, though.
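As for saving the chunks locally: `split_text` just returns a list of strings, so as far as I know there is no dedicated LangChain saver and plain Python file I/O is the usual route. A minimal sketch (the directory and file-naming scheme are my own convention):

```python
from pathlib import Path

def save_chunks(chunks, out_dir="chunks"):
    # Write one numbered .txt file per chunk; nothing LangChain-specific here.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, chunk in enumerate(chunks):
        (out / f"chunk_{i:04d}.txt").write_text(chunk, encoding="utf-8")
```

You could just as easily dump the whole list to one JSON file with `json.dump(chunks, f)` if you prefer a single artifact.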