What does langchain CharacterTextSplitter's chunk_size param even do?

Asked 7/7, 2023 at 3:50 Answered 22/10, 2023 at 22:36

Solved python machine-learning text nlp langchain

My default assumption was that the chunk_size parameter would set a ceiling on the size of the chunks/splits that come out of the split_text method, but that's clearly not right:

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 6
chunk_overlap = 2

c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

text = 'abcdefghijklmnopqrstuvwxyz'

c_splitter.split_text(text)

prints: ['abcdefghijklmnopqrstuvwxyz'], i.e. one single chunk that is much larger than chunk_size=6.

So I understand that it didn't split the text into chunks because it never encountered the separator. But then what is chunk_size actually doing?

I checked the documentation page for langchain.text_splitter.CharacterTextSplitter here but did not see an answer to this question. And I asked the "mendable" chat-with-langchain-docs search functionality, but got the answer "The chunk_size parameter of the CharacterTextSplitter determines the maximum number of characters in each chunk of text."...which is not true, as the code sample above shows.

Hydrometer answered 7/7, 2023 at 3:50 Comment(1)

Thank you for this question. I feel like crazy people! Now, can someone explain how to split on character count in a clean and simple manner? – Aye 7/3 at 23:3

CharacterTextSplitter only splits on the separator (which is '\n\n' by default). chunk_size is the maximum size a chunk is allowed to reach, provided splitting is possible. If a string starts with n characters, then has a separator, and then has m more characters before the next separator, the first chunk will have size n if chunk_size < n + m + len(separator).

Your example string has no matching separators so there's nothing to split on.

Basically, it attempts to make chunks that are <= chunk_size, but will still produce chunks > chunk_size if the minimum size chunks that can be created are > chunk_size.
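To make the rule above concrete, here is a minimal sketch in plain Python of a split-then-merge strategy. It is a simplified model, not LangChain's actual code: `split_and_merge` is a hypothetical helper, it ignores chunk_overlap, and it assumes a non-empty separator.

```python
def split_and_merge(text: str, separator: str = "\n\n", chunk_size: int = 6) -> list[str]:
    """Hypothetical helper: split only at `separator`, then greedily merge
    neighbouring pieces while the merged chunk stays within `chunk_size`.
    Simplified model of the behaviour described above; overlap is ignored."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for piece in pieces:
        # Cost of adding this piece to the chunk being built.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The piece does not fit: close the current chunk, start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


# No separator in the text, so there is nothing to split on.
print(split_and_merge("abcdefghijklmnopqrstuvwxyz"))  # ['abcdefghijklmnopqrstuvwxyz']
# n = 4, m = 4: chunk_size (6) < n + m + len(separator) (10), so the first chunk has size 4.
print(split_and_merge("aaaa\n\nbbbb"))  # ['aaaa', 'bbbb']
# Small pieces are merged up to chunk_size; an oversized piece is kept whole.
print(split_and_merge("ab\n\ncd\n\nefghijklmnop"))  # ['ab\n\ncd', 'efghijklmnop']
```

In this model, chunk_size only decides when neighbouring pieces stop being merged. It never cuts inside a piece, which is why a chunk can end up larger than chunk_size.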

Santoro answered 23/7, 2023 at 7:45 Comment(0)

CharacterTextSplitter behaves differently from what you expected.

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=6,
)

It does not cut after exactly 6 characters. It only cuts at a separator, so each new chunk starts at the closest separator, not at the 7th character.

As stated in the docs, the default separator is "\n\n":

This is the simplest method. This splits based on characters (by default "\n\n") and measures chunk length by number of characters.

You can test the behaviour with some sample code. First, create a test.txt file with this content:

1.Respect for Others: Treat others with kindness.
2.Honesty and Integrity: Be truthful and act with integrity in your interactions with others.
3.Fairness and Justice: Treat people equitably.
4.Respect for Property: Respect public and private property.
5.Good Citizenship: Contribute positively to your community by obeying laws, voting, volunteering, and supporting communal well-being.

Then write this code:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# The text is split only at "\n"; pieces are then merged while a chunk stays within 20 characters
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=20,
    chunk_overlap=0
)

loader = TextLoader("test.txt")
docs = loader.load_and_split(
    text_splitter=text_splitter
)

for doc in docs:
    print(doc.page_content)
    print("\n")

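This is what the output looks like: each of the five lines is longer than 20 characters, so each one comes out as its own chunk.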

Pammy answered 22/10, 2023 at 22:36 Comment(0)

RecursiveCharacterTextSplitter is similar to CharacterTextSplitter, and its documentation made more sense to me:

Recursively split by character

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

Reference > https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
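The last separator in that default list is the empty string, which amounts to cutting by character count. For anyone who only wants fixed-size chunks by character count, the idea can be sketched in plain Python as below. This is an illustration under stated assumptions, not the library's implementation: `chunk_by_chars` is a hypothetical helper, and its output is not guaranteed to match what RecursiveCharacterTextSplitter returns.

```python
def chunk_by_chars(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Hypothetical helper: cut `text` into windows of at most `chunk_size`
    characters, where consecutive windows share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
print(chunk_by_chars(text, chunk_size=6, chunk_overlap=2))
# ['abcdef', 'efghij', 'ijklmn', 'mnopqr', 'qrstuv', 'uvwxyz']
```

Here every chunk respects chunk_size because the text is cut regardless of separators. The trade-off is that words and sentences can be cut in the middle, which is what the separator list in RecursiveCharacterTextSplitter tries to avoid.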

Gothart answered 8/9, 2023 at 7:40 Comment(0)
