My default assumption was that the chunk_size parameter would set a ceiling on the size of the chunks/splits that come out of the split_text method, but that's clearly not right:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 6
chunk_overlap = 2
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# 26 characters, with no occurrence of the splitter's separator anywhere
text = 'abcdefghijklmnopqrstuvwxyz'
c_splitter.split_text(text)
prints: ['abcdefghijklmnopqrstuvwxyz'], i.e. one single chunk that is much larger than chunk_size=6.
So I understand that it didn't split the text into chunks because it never encountered the separator. But then the question is: what is chunk_size even doing?
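For comparison, here is a minimal sketch of the separator-based case (my assumption being that the default separator is '\n\n'; text_with_separators is just a made-up example string):

from langchain.text_splitter import CharacterTextSplitter

c_splitter = CharacterTextSplitter(chunk_size=6, chunk_overlap=2)  # same settings as above
# Unlike the alphabet string, this text does contain the default separator '\n\n'
text_with_separators = 'abc\n\ndef\n\nghi\n\njkl'
c_splitter.split_text(text_with_separators)

which, as far as I can tell, prints something like ['abc', 'def', 'ghi', 'jkl'], i.e. here the chunks do stay within chunk_size=6, which makes the behaviour on the separator-free string all the more confusing.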
I checked the documentation page for langchain.text_splitter.CharacterTextSplitter here but did not see an answer to this question. I also asked the "mendable" chat-with-langchain-docs search functionality and got the answer "The chunk_size parameter of the CharacterTextSplitter determines the maximum number of characters in each chunk of text."...which is not true, as the code sample above shows.