semchunk

The world’s most popular semantic chunking algorithm

Overview

Context-preserving chunks, by design.

We built semchunk to chunk texts so that the resulting chunks are unlikely to cut off in the middle of important sentences and paragraphs.

>4M monthly open source downloads

3s to chunk the entire Gutenberg Corpus

Free and open source on GitHub

Recursive splitting

semchunk works by recursively splitting texts at increasingly granular structural splitters until every chunk is no larger than a given chunk size.

Structure-aware merging

By prioritizing whitespace ahead of sentence terminators and clause separators, semchunk often produces chunks that keep whole paragraphs and sentences intact.

AI chunking mode

semchunk’s AI chunking mode uses Kanon 2 Enricher to derive structural spans, delivering 6% better RAG correctness than its non-AI mode and 15% over chonkie.

How chunking works

semchunk uses common typographical patterns in predominantly Latin-script documents to preserve syntactic and semantic divisions within chunks.

Using a sophisticated set of heuristics, semchunk breaks long documents into smaller chunks while preserving as much context as possible. This allows high-information chunks to be passed to AI models for processing, including models that either do not support long context windows or perform worse when overloaded with too much context.
semchunk prioritizes whitespace first, then sentence-ending punctuation such as periods and question marks, and then punctuation commonly used to separate dependent clauses. As a result it often produces chunks that keep whole paragraphs and sentences intact, while remaining relatively inexpensive to run.
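
To make the splitter priority described above concrete, here is a minimal Python sketch of recursive splitting followed by greedy merging. It is an illustration only, not semchunk's actual implementation: the `SPLITTERS` list, the `split_once` and `chunk` helpers, and the use of character counts in place of a real token counter are all assumptions made for brevity.

```python
import re

# Hypothetical splitter priority, highest first. This mirrors only the order
# described in the text; it is not semchunk's real rule set.
SPLITTERS = [
    r"\n+",     # runs of newlines
    r"\s+",     # any other whitespace
    r"[.?!]+",  # sentence terminators
    r"[;,:]+",  # clause separators
]


def split_once(text: str) -> tuple[str, list[str]]:
    """Split at the highest-priority splitter that makes progress.

    Returns the string used to rejoin parts, plus the non-empty parts.
    """
    for pattern in SPLITTERS:
        runs = re.findall(pattern, text)
        if not runs:
            continue
        splitter = max(runs, key=len)  # prefer the longest run
        parts = text.split(splitter)
        if splitter.isspace():
            joiner = splitter
        else:
            # Keep punctuation attached to the part it terminates.
            parts = [p + splitter for p in parts[:-1]] + parts[-1:]
            joiner = ""
        parts = [p for p in parts if p]
        # Only accept the split if every part is strictly shorter than the input.
        if all(len(p) < len(text) for p in parts):
            return joiner, parts
    return "", list(text)  # last resort: individual characters


def chunk(text: str, chunk_size: int) -> list[str]:
    """Recursively split text, then greedily merge parts up to chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    joiner, parts = split_once(text)
    chunks: list[str] = []
    current = ""
    for part in parts:
        if len(part) > chunk_size:
            # Part is still too large: flush what we have and recurse into it.
            if current.strip():
                chunks.append(current)
            current = ""
            chunks.extend(chunk(part, chunk_size))
            continue
        candidate = current + joiner + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # merge neighbouring parts while they fit
        else:
            if current.strip():
                chunks.append(current)
            current = part
    if current.strip():
        chunks.append(current)
    return chunks


text = ("First paragraph. It has two sentences.\n\n"
        "Second paragraph, which is short.")
for piece in chunk(text, chunk_size=40):
    print(repr(piece))
```

With a limit of 40 characters, both paragraphs in the sample fit whole, so each becomes its own chunk. A real token counter would replace `len` in practice.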
You can customize how chunking is performed by setting the chunk size and chunk overlap ratio parameters in our API, or you can use semchunk directly in your own Python stack.
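
As a rough illustration of what a chunk overlap ratio means, the sketch below slides a fixed-size window over a list of tokens. It does not show how the Isaacus API or semchunk applies overlap internally: `sliding_chunks`, `chunk_size`, and `overlap_ratio` are hypothetical names, and whitespace-separated words stand in for real tokens.

```python
def sliding_chunks(tokens: list[str], chunk_size: int, overlap_ratio: float) -> list[list[str]]:
    """Return windows of up to chunk_size tokens, each sharing a fraction with the previous one."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # tokens shared between neighbours
    stride = chunk_size - overlap              # always at least 1
    windows: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


tokens = "a b c d e f g h i j".split()
windows = sliding_chunks(tokens, chunk_size=4, overlap_ratio=0.5)
for window in windows:
    print(" ".join(window))
```

Here a ratio of 0.5 with a chunk size of 4 means each window repeats the last two tokens of the one before it.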

Chunks created by the Isaacus API will often correspond to separate clauses and sections in a document.

Read the chunking docs →
View semchunk on GitHub →

AI chunking with Kanon 2 Enricher

A first-of-a-kind AI chunking mode powered by our enrichment and hierarchical segmentation model.

Heuristic chunking can struggle with higher-level document structures, such as sections within a statute, chapters within a book, or even simple headings. This is where AI chunking and Kanon 2 Enricher come in.
semchunk’s AI chunking mode uses Kanon 2 Enricher to identify spans that correspond to structural elements in a document. Those spans are then merged and decomposed as needed, producing chunks that stay within a specified chunk size.
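
The merge-and-decompose step can be pictured with a short Python sketch. It assumes structural spans arrive as sorted, non-overlapping `(start, end)` character offsets, which is a simplification for illustration and not the actual output format of Kanon 2 Enricher. The `spans_to_chunks` helper is hypothetical, sizes are measured in characters, and oversized spans are cut into fixed windows where a real system would more plausibly fall back to heuristic splitting.

```python
def spans_to_chunks(text: str, spans: list[tuple[int, int]], chunk_size: int) -> list[str]:
    """Merge small adjacent spans and decompose oversized ones so no chunk exceeds chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    bounds: list[tuple[int, int]] = []
    current = None  # the (start, end) range currently being merged
    for start, end in spans:
        if end - start > chunk_size:
            # Oversized span: flush the pending merge, then cut into fixed windows.
            if current is not None:
                bounds.append(current)
                current = None
            for s in range(start, end, chunk_size):
                bounds.append((s, min(s + chunk_size, end)))
            continue
        if current is not None and end - current[0] <= chunk_size:
            current = (current[0], end)  # merge with the preceding span(s)
        else:
            if current is not None:
                bounds.append(current)
            current = (start, end)
    if current is not None:
        bounds.append(current)
    return [text[s:e] for s, e in bounds]


# Build a toy document and the spans of its structural elements.
sections = [
    "1. Scope\n",
    "This Act applies.\n",
    "2. Definitions\n",
    "In this Act, words have their ordinary meaning unless the context requires otherwise.\n",
]
text = "".join(sections)
spans, pos = [], 0
for section in sections:
    spans.append((pos, pos + len(section)))
    pos += len(section)

chunks = spans_to_chunks(text, spans, chunk_size=40)
for piece in chunks:
    print(repr(piece))
```

In this toy example the first heading and its one-line body are merged into a single chunk, while the long definition is decomposed into several pieces that each stay within the limit.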

How semchunk compares to other algorithms

On Legal RAG QA, semchunk’s AI chunking mode delivers a 6% increase in RAG correctness over its non-AI chunking mode, 8% over LangChain’s recursive chunking algorithm, 12% over naïve fixed-size chunking, and 15% over chonkie’s recursive and embedding-powered chunking modes.
15% better RAG correctness than its closest competitor, per our AI chunking announcement.

Read "Introducing AI chunking to semchunk" →

Used in production

semchunk has become an integral component of retrieval-augmented generation pipelines around the world.

semchunk is the world’s most popular semantic chunking algorithm, with over a million monthly downloads, and is used in Microsoft’s Intelligence Toolkit and Docling by IBM.
Over time, Isaacus has continued to improve the accuracy and efficiency of semchunk. A standard consumer PC can now chunk the entire Gutenberg Corpus (18 books and 3 million tokens) in around three seconds.

Read "semchunk hits one million monthly downloads" →
