What is document chunking?
Splitting large documents into smaller pieces for AI processing. LLMs have context limits, so a 500-page PDF needs to be split into chunks before embedding or retrieval. The quality of chunking affects retrieval accuracy: chunks that cut off mid-sentence or split code blocks create problems for RAG systems.
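A two-line sketch (no library assumed) shows why naive splitting hurts: a fixed-size cut has no idea where words or sentences end.

```python
text = "Dr. Smith measured the sample. It weighed 3.2 kg. Results were conclusive."

# Naive fixed-size chunking: cut every 40 characters, structure-blind.
chunks = [text[i:i + 40] for i in range(0, len(text), 40)]
print(chunks[0])  # 'Dr. Smith measured the sample. It weighe' <- cut mid-word
```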
Why not just run LangChain locally?
You can. LangChain's RecursiveCharacterTextSplitter splits on character count and doesn't detect document structure. Common issues:
✗ Cuts code blocks mid-function
✗ Splits abbreviations like "Dr. Smith"
✗ Breaks tables into fragments
✗ Scatters list items across chunks
Our Structure tier ($0.002/page) detects these structures and avoids splitting them. Whether that's worth paying for depends on your use case.
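You can reproduce the first failure mode yourself with LangChain's splitter (import path as in recent LangChain versions); the tiny chunk_size here is deliberate, to force a split inside the fenced block:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

fence = "```"
doc = (
    "Here is the helper we use:\n\n"
    f"{fence}python\n"
    "def normalize(rows):\n"
    "    cleaned = [r.strip() for r in rows]\n"
    "    return [r for r in cleaned if r]\n"
    f"{fence}\n\n"
    "Call it before indexing."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=0)
for chunk in splitter.split_text(doc):
    print("---\n" + chunk)
# The function is split across two chunks: one ends after the
# `cleaned = ...` line with the code fence still open, the next
# starts at the bare `return` statement.
```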
What structures does CHUNKER preserve?
Code Blocks: Fenced code (``` and ~~~) stays intact with language detection
Tables: Markdown and HTML tables kept as complete units
Lists: Bullet points, numbered lists, and nested items stay together
Headers: H1-H6, ALL CAPS, and numbered sections (Chapter 1, Section 2.3) are detected and tracked
Each chunk's metadata tells you exactly what structures it contains and which section of the document it belongs to.
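As an illustration of what that metadata might look like (the field names here are hypothetical, not CHUNKER's literal schema):

```python
# Hypothetical per-chunk metadata; illustrative field names only.
chunk = {
    "text": "| name | qty |\n|------|-----|\n| bolt | 40  |",
    "structures": ["table"],               # complete units detected in the chunk
    "section": "Chapter 1 > Section 2.3",  # tracked header path
}
```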
What's the difference between tiers?
Free ($0): Basic recursive splitting. 100 pages/day limit.
Structure ($0.002/page): Detects and preserves code blocks, tables, lists. Handles abbreviations. Recommended starting point.
Semantic ($0.004/page): Uses OpenAI text-embedding-3-small to find topic boundaries.
AI ($0.01/page): Claude Haiku analyzes text to identify semantic boundaries.
AI Pro ($0.025/page): Claude Sonnet for boundaries + generates per-chunk summaries and extracts entities.
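For a rough sense of scale, here is the per-tier cost of the 500-page PDF from the first answer, computed straight from the listed rates:

```python
pages = 500  # the 500-page PDF from the first answer
for tier, per_page in [("Structure", 0.002), ("Semantic", 0.004),
                       ("AI", 0.01), ("AI Pro", 0.025)]:
    print(f"{tier}: ${pages * per_page:.2f}")
# Structure: $1.00, Semantic: $2.00, AI: $5.00, AI Pro: $12.50
```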
What is quality scoring?
Every chunk gets an A-D grade based on five factors:
Length: Optimal 400-1500 chars (not too short, not too long)
Completeness: Starts with capital letter, ends with punctuation
Coherence: Doesn't start mid-sentence ("and", "but", "however")
Structure: Code blocks are properly closed, not cut off
Density: Reasonable words-per-sentence ratio
Use quality scores to filter chunks before embedding, or flag low-quality chunks for manual review.
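CHUNKER's exact weights and thresholds aren't published, but a back-of-the-envelope version of the same five checks might look like this (an approximation, not the service's algorithm):

```python
import re

# Rough sketch of the five factors; exact weights and thresholds
# are not published, so treat this as an approximation.
def grade(chunk: str) -> str:
    text = chunk.strip()
    sentences = max(1, len(re.findall(r"[.!?]", text)))
    checks = [
        400 <= len(text) <= 1500,                        # length
        text[:1].isupper() and text[-1:] in ".!?",       # completeness
        not re.match(r"(?i)(and|but|however)\b", text),  # coherence
        text.count("```") % 2 == 0,                      # structure: fences closed
        5 <= len(text.split()) / sentences <= 40,        # density
    ]
    return "ABCD"[min(3, 5 - sum(checks))]  # 5/5 -> A, 4 -> B, 3 -> C, <=2 -> D

print(grade("However, it broke"))  # short, starts mid-thought -> D
```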
What file formats are supported?
PDF - Text-based PDFs (not scanned images)
DOCX - Microsoft Word documents (including tables, headers, footers)
HTML - Web pages (tags stripped, content extracted)
TXT - Plain text files (UTF-8, Windows-1252, Latin-1)
Pages are calculated at ~2500 characters per page.
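Since pages are derived from character count, you can estimate page count and cost before uploading; whether partial pages round up is an assumption here:

```python
import math

text = open("manual.txt", encoding="utf-8").read()
pages = math.ceil(len(text) / 2500)  # ~2500 chars/page; ceiling is an assumption
print(f"~{pages} pages, ~${pages * 0.002:.2f} at the Structure tier")
```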
How does payment work?
We accept USDC on Solana: fast, cheap, no subscriptions.
1. Web UI: Connect your Phantom wallet, select a tier, approve the USDC transfer. Done.
2. API: Call /estimate to get a price, send a USDC transfer on Solana, and include the TX signature in the X-PAYMENT header.
Pay only for what you use. No monthly fees. No credit card required.
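Sketching the start of the API path in Python: the /estimate path and X-PAYMENT header come from this FAQ, but the base URL and the request/response shapes below are assumptions.

```python
import requests

BASE = "https://api.example.com"  # hypothetical base URL

# Step 1: get a quote. Only the /estimate path is from the docs;
# the request and response fields here are assumed.
est = requests.post(
    f"{BASE}/estimate",
    files={"file": open("manual.pdf", "rb")},
    data={"tier": "structure"},
).json()
print(est)  # e.g. {"pages": 500, "usdc": 1.0, "pay_to": "<solana address>"}
```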
Can AI agents use this API programmatically?
Yes. The API uses the x402 payment protocol, which supports programmatic payments:
1. Call POST /estimate to get the exact USDC cost
2. Execute a Solana USDC transfer (requires a wallet with signing capability)
3. Call the chunking endpoint with the TX signature in the X-PAYMENT header
No API keys required. Payment verification happens on-chain. The free tier works without any payment, for testing.
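Continuing the sketch from the payment answer above (same BASE and est), here are the remaining two steps for an agent; send_usdc is a hypothetical stand-in for your wallet stack, and the /chunk path is a guess:

```python
import requests

# Step 2: pay. send_usdc is a hypothetical helper standing in for your
# wallet stack (e.g. a signer that submits an SPL token transfer) and
# returns the transaction signature.
tx_signature = send_usdc(to=est["pay_to"], amount=est["usdc"])

# Step 3: chunk. The X-PAYMENT header is from the FAQ; /chunk is a guess.
chunks = requests.post(
    f"{BASE}/chunk",
    files={"file": open("manual.pdf", "rb")},
    data={"tier": "structure"},
    headers={"X-PAYMENT": tx_signature},
).json()
```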
Is my data stored or logged?
No. Documents are processed in memory and immediately discarded. We don't store your files, chunks, or content. Only basic request metadata (timestamp, file size, wallet address) is logged for rate limiting and payment verification. Your documents never touch disk.