semchunk
The world’s most popular semantic chunking algorithm
Overview
Context-preserving chunks, by design.
We built semchunk, the most popular semantic chunking algorithm, to split texts into chunks that are unlikely to cut off in the middle of important sentences and paragraphs.
Capabilities
What makes semchunk different
Fast, structure-aware chunking — with an optional AI mode for the hardest documents.
- >4M monthly open source downloads
- 3s to chunk the entire Gutenberg Corpus
- Free and open source on GitHub
- Recursive splitting
  semchunk recursively splits texts at progressively finer-grained splitters until every chunk is less than or equal to a given chunk size.
- Structure-aware merging
  By prioritizing whitespace ahead of sentence terminators and clause separators, semchunk often produces chunks that keep whole paragraphs and sentences intact.
- AI chunking mode
  semchunk’s AI chunking mode uses Kanon 2 Enricher to derive structural spans, delivering 6% better RAG correctness than its non-AI mode and 15% better than chonkie.
Under the hood
How semchunk works
A closer look at the heuristics, the AI chunking mode, and where semchunk is used in production.
How chunking works
semchunk uses common typographical patterns in predominantly Latin-script documents to preserve syntactic and semantic divisions within chunks.
Using a sophisticated set of heuristics, semchunk breaks long documents into smaller chunks while preserving as much context as possible. This allows high-information chunks to be passed to AI models for processing, including models that either do not support long context windows or perform worse when overloaded with too much context.
semchunk prioritizes whitespace first, then sentence-ending punctuation such as periods and question marks, and then punctuation commonly used to separate dependent clauses. As a result, it often produces chunks that keep whole paragraphs and sentences intact while remaining relatively inexpensive to run.
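To make the idea concrete, here is a minimal, standard-library-only sketch of recursive splitting followed by greedy merging. It is not semchunk's actual implementation: the `SPLITTERS` hierarchy (line breaks, then whitespace after sentence terminators, then whitespace after clause separators, then any whitespace), the word-based `count_tokens` counter, and the function name `chunk` are all simplifying assumptions made for illustration.

```python
import re

# Hypothetical splitter hierarchy, coarsest first (an assumption for
# illustration only; the real library's list is longer and more nuanced).
SPLITTERS = [
    r"\n+",            # runs of line breaks
    r"(?<=[.?!])\s+",  # whitespace that follows a sentence terminator
    r"(?<=[,;:])\s+",  # whitespace that follows a clause separator
    r"\s+",            # any remaining whitespace
]


def count_tokens(text: str) -> int:
    """Toy token counter: one token per whitespace-delimited word."""
    return len(text.split())


def chunk(text: str, chunk_size: int) -> list[str]:
    """Recursively split `text` until every chunk has at most `chunk_size` tokens."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= chunk_size:
        return [text]

    # Pick the coarsest splitter that actually divides this text.
    parts = [text]
    for pattern in SPLITTERS:
        parts = [p for p in re.split(pattern, text) if p.strip()]
        if len(parts) > 1:
            break

    # Recurse into each part so oversized parts are split at finer splitters.
    pieces = [piece for part in parts for piece in chunk(part, chunk_size)]

    # Greedily merge neighbouring pieces that still fit within the chunk size.
    # Note: this sketch rejoins with a single space, so original whitespace is lost.
    merged: list[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1]) + count_tokens(piece) <= chunk_size:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


if __name__ == "__main__":
    sample = "One short sentence. Another sentence, with a clause.\n\nA new paragraph."
    for piece in chunk(sample, chunk_size=5):
        print(repr(piece))
```

With a generous chunk size the sketch keeps whole paragraphs together; as the chunk size shrinks it falls back to sentences, then clauses, then individual words.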
You can customize how chunking is performed by setting chunk size and chunk overlap ratio parameters in our API — or use semchunk directly in your own Python stack.
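As a rough illustration of how a chunk size and an overlap ratio interact, the sketch below slides a fixed-size window over a token list. The function name `sliding_chunks` and the parameter names `chunk_size` and `overlap_ratio` are hypothetical and are not taken from the API or the library; real semantic chunking would also respect sentence and paragraph boundaries, which this sketch deliberately ignores.

```python
def sliding_chunks(
    tokens: list[str], chunk_size: int, overlap_ratio: float
) -> list[list[str]]:
    """Return windows of up to `chunk_size` tokens that overlap by a fixed ratio."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in the range [0, 1)")

    # Number of tokens shared by consecutive windows.
    overlap = int(chunk_size * overlap_ratio)
    # Distance between window starts; kept at 1 or more so the loop advances.
    stride = max(1, chunk_size - overlap)

    windows: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return windows


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    for window in sliding_chunks(words, chunk_size=4, overlap_ratio=0.5):
        print(" ".join(window))
```

A larger overlap ratio repeats more context between neighbouring chunks at the cost of producing more chunks overall.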
AI chunking with Kanon 2 Enricher
A first-of-its-kind AI chunking mode powered by our enrichment and hierarchical segmentation model.
Heuristic chunking can struggle with higher-level document structures, such as sections within a statute, chapters within a book, or even simple headings. This is where AI chunking and Kanon 2 Enricher come in.
semchunk’s AI chunking mode uses Kanon 2 Enricher to identify spans that correspond to structural elements in a document. Those spans are then merged and decomposed as needed, producing chunks that stay within a specified chunk size.
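The merge-and-decompose step can be sketched as follows. This is an illustration under stated assumptions, not the actual pipeline: the spans are assumed to arrive as sorted, non-overlapping `(start, end)` character offsets (Kanon 2 Enricher's real output format is not described here), tokens are approximated by whitespace-delimited words, and the names `chunk_from_spans` and `split_oversized` are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Toy token counter: one token per whitespace-delimited word."""
    return len(text.split())


def split_oversized(text: str, chunk_size: int) -> list[str]:
    """Decompose an oversized span into runs of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def chunk_from_spans(
    text: str, spans: list[tuple[int, int]], chunk_size: int
) -> list[str]:
    """Merge small structural spans and decompose large ones to fit `chunk_size`."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    chunks: list[str] = []
    buffer = ""  # spans accumulated so far that together fit in one chunk
    for start, end in spans:
        segment = text[start:end].strip()
        if not segment:
            continue
        if count_tokens(segment) > chunk_size:
            # Too large for one chunk: flush what we have, then decompose.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(split_oversized(segment, chunk_size))
        elif buffer and count_tokens(buffer) + count_tokens(segment) <= chunk_size:
            # Small enough to merge with the preceding spans.
            buffer = buffer + "\n" + segment
        else:
            # Does not fit with the buffer: start a new chunk.
            if buffer:
                chunks.append(buffer)
            buffer = segment
    if buffer:
        chunks.append(buffer)
    return chunks
```

In this sketch, a heading and a short paragraph end up in the same chunk, while a long section is broken into several pieces. A fuller version would decompose oversized spans with a structure-aware splitter rather than by plain word count.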
Used in production
semchunk has become an integral component of retrieval-augmented generation pipelines around the world.
semchunk is the world’s most popular semantic chunking algorithm, with over 4 million monthly downloads, and is used in Microsoft’s Intelligence Toolkit and Docling by IBM.
Over time, Isaacus has continued to improve the accuracy and efficiency of semchunk. A standard consumer PC can now chunk the entire Gutenberg Corpus (18 books and 3 million tokens) in around three seconds.