RAG - Chunking
Source: Dev.to
What is chunking
Chunking is the process of breaking data into smaller pieces called chunks. It happens before embedding: an embedding model converts each chunk into a vector (a point in vector space), and the vectors are then stored in a vector database.
Why chunking matters in RAG
A single piece of text can cover several distinct contexts while still relating to one topic.
For example, a paragraph about the Redis database may contain multiple contexts. Without chunking, an embedding model such as nomic-embed-text converts the entire paragraph into a single vector, so all of those contexts are blended into one point in the database.
Proper chunking helps retrieve only the most relevant information and avoids unrelated content. If a chunk mixes information about both Python and Java, a query about Python might also retrieve Java‑related information because both topics exist in the same chunk. Effective chunking prevents such irrelevant retrieval.
An entire document can technically be stored as a single chunk, but the purpose of chunking is to split the data into smaller, meaningful sections so that only the data relevant to the user query is retrieved.
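The Python and Java case above can be illustrated with a small sketch. It is only a toy: word counts stand in for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical helpers written for this example, not part of any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str]) -> str:
    """Return the single chunk most similar to the query."""
    query_vec = embed(query)
    return max(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)))


# One chunk that mixes both topics versus one chunk per topic.
mixed = ["Python uses indentation. Java uses braces."]
split = ["Python uses indentation.", "Java uses braces."]

print(retrieve("python indentation", mixed))  # the Java sentence comes along
print(retrieve("python indentation", split))  # only the Python sentence
```

With the mixed chunk, the Java sentence is returned together with the Python one because they share a vector; with one chunk per topic, only the Python sentence is returned.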
Chunking methods
Fixed chunking
- Fixed chunking assigns a fixed character or token limit to every chunk.
- There is no single best chunking strategy for all datasets; choosing the right chunk size usually requires a trial‑and‑error approach.
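A minimal sketch of fixed chunking by character count follows. The function name `fixed_chunks` and the chunk size of 10 are assumptions chosen for illustration; a token-based variant would apply the same slicing to a list of tokens instead of a string.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


print(fixed_chunks("Redis is an in-memory data store.", 10))
# ['Redis is a', 'n in-memor', 'y data sto', 're.']
```

The output shows the main weakness of this method: the cut points ignore word and sentence boundaries, which is one reason the chunk size usually has to be tuned by trial and error.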
Overlapping chunking
- In some cases, related information may end up far apart in vector space because of how the embedding model represents each chunk, so retrieval can miss relevant information and the LLM never sees it.
- Overlapping chunking repeats the end of the previous chunk at the start of each new chunk. Because neighbouring chunks share text, their vectors tend to sit closer together in the vector database.
- The purpose is to improve retrieval by making semantically related chunks easier to find.
- A possible downside is that irrelevant information may also be retrieved because of the overlap.
Example
Suppose Paragraph 1 is about Topic A and Paragraph 2 is about Topic B. If overlapping is applied, the chunk for Paragraph 2 begins with the end of Paragraph 1, so a query about Topic B may also retrieve some information about Topic A.
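Overlapping chunking can be sketched as a sliding window over the text. The function name `overlapping_chunks` and its `size` and `overlap` parameters are assumptions made for this example; real pipelines often count tokens rather than characters.

```python
def overlapping_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of up to `size` characters, where each chunk
    starts with the last `overlap` characters of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(overlapping_chunks("abcdefghij", 4, 2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the one before it. A larger overlap preserves more shared context but also stores more duplicated text, which is the trade-off described above.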
Semantic chunking
- Two paragraphs that discuss the same topic may already be stored near each other in the vector database even if they are only loosely related, in which case overlapping is unnecessary.
- Semantic chunking groups content based on meaning rather than fixed size.
- Each sentence is compared with the previous chunk using a similarity threshold. If the similarity score is below the threshold, the sentence starts a new chunk; otherwise it is added to that chunk.
- Libraries such as NLTK can be used to implement semantic chunking, and the threshold value is configurable based on the use case.
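The threshold rule above can be sketched in a few lines. To keep the example dependency-free, it splits sentences with a simple regular expression and uses Jaccard word overlap as a stand-in similarity score; both are assumptions for illustration, as are the function names and the default threshold of 0.2.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercased set of the words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by all distinct words (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group sentences into chunks using a similarity threshold."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks:
            score = jaccard(word_set(" ".join(chunks[-1])), word_set(sentence))
        else:
            score = 0.0
        if chunks and score >= threshold:
            chunks[-1].append(sentence)  # similar enough: extend current chunk
        else:
            chunks.append([sentence])    # below threshold: start a new chunk
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Redis stores data in memory. "
    "Redis keeps data in memory for speed. "
    "Bananas are yellow fruit."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

Here the two Redis sentences land in one chunk and the unrelated sentence starts a second one. Raising the threshold produces more, smaller chunks; lowering it produces fewer, larger ones.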
Embedding‑based chunking
- Embedding‑based chunking uses embedding models instead of libraries like NLTK.
- It calculates cosine similarity between sentences and groups semantically similar sentences into chunks.
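A rough sketch of the cosine-similarity grouping follows. The `toy_embed` function is a placeholder that builds word-count vectors so the example runs without a model; in practice it would be swapped for a call to a real embedding model. All names and the 0.3 threshold are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Placeholder embedding: word counts (replace with a real model)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_chunks(
    sentences: list[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.3,
) -> list[list[str]]:
    """Group consecutive sentences whose embeddings are similar enough."""
    chunks: list[list[str]] = []
    previous: Vector = {}
    for sentence in sentences:
        vector = embed(sentence)
        if chunks and cosine(previous, vector) >= threshold:
            chunks[-1].append(sentence)  # similar to the previous sentence
        else:
            chunks.append([sentence])    # dissimilar: start a new chunk
        previous = vector
    return chunks


sentences = [
    "Python uses indentation for blocks.",
    "Python blocks rely on indentation.",
    "Java requires braces and semicolons.",
]
for chunk in embedding_chunks(sentences):
    print(chunk)
```

This version compares each sentence with the one immediately before it. Comparing against an average of the current chunk's vectors is another option and may be more stable for longer chunks.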
Choosing the right chunking method
Choosing a chunking method always involves trade‑offs; there is no single strategy that works for all datasets. The best method depends on:
- Dataset type – different applications may require different chunking strategies to achieve optimal RAG performance.