Chapter Detection Process
This page contains technical and proprietary information about ChapterWise's chapter detection system and is only available to administrators.
🔒 Administrative Access Required
This documentation provides deep technical insights into our proprietary chapter detection algorithms. Access is restricted to authorized administrators only.
System Overview
The chapter detection system uses a sophisticated multi-stage approach called "Cucumber Cutting" to automatically identify and extract chapters from manuscripts with high accuracy and reliability.
File Structure and Data Flow
The chapter detection process generates several key files that store different stages of the processing pipeline. All files use the manuscript UUID as a prefix for organization.
📁 File Organization
Each manuscript processing session creates a unique directory structure with UUID-based naming for complete traceability and data integrity.
Generated Files
- [uuid]_manuscript.json - Original manuscript content and metadata
- [uuid]_doctree.json - Structured document representation with indexed blocks
- [uuid]_converted.html - HTML version of the manuscript for processing
- [uuid]_boundary_raw.json - Raw LLM responses with completion IDs for debugging
- [uuid]_boundaries_merged.json - Processed and merged chapter boundaries
- [uuid]_results.json - Final chapter detection results and statistics
- metadata.json - Overall project metadata and processing information
Data Flow
- Input: Original manuscript (PDF, Word, etc.) → [uuid]_manuscript.json
- Processing: Document conversion → [uuid]_doctree.json + [uuid]_converted.html
- AI Detection: LLM boundary analysis → [uuid]_boundary_raw.json
- Merging: Duplicate removal and validation → [uuid]_boundaries_merged.json
- Cucumber Cutting: Document slicing using merged boundaries
- Output: Complete results → [uuid]_results.json + metadata.json
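As an illustrative sketch, the UUID-prefixed layout above could be assembled like this (the helper name `pipeline_paths` is hypothetical, not part of the actual implementation):

```python
from pathlib import Path

def pipeline_paths(base_dir: str, uuid: str) -> dict:
    """Build the UUID-prefixed file paths used by the detection pipeline."""
    base = Path(base_dir)
    return {
        "manuscript": base / f"{uuid}_manuscript.json",
        "doctree": base / f"{uuid}_doctree.json",
        "converted": base / f"{uuid}_converted.html",
        "boundary_raw": base / f"{uuid}_boundary_raw.json",
        "boundaries_merged": base / f"{uuid}_boundaries_merged.json",
        "results": base / f"{uuid}_results.json",
        "metadata": base / "metadata.json",  # not UUID-prefixed
    }

paths = pipeline_paths("/data/projects/abc123", "abc123")
print(paths["results"])  # e.g. /data/projects/abc123/abc123_results.json
```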
How It Works (Summary)
- Document Processing: Converts manuscripts into structured DocTree format with indexed blocks
- AI Analysis: Uses GPT models to identify chapter boundaries in overlapping document chunks
- Validation: Rigorous validation ensures all detected boundaries are accurate and within valid ranges
- Merging: Combines duplicate detections from overlapping chunks into single boundaries
- Ordering: Orders boundaries by document position to maintain proper chapter sequence
- Cutting: Slices the document at exact boundary points to create sequential chapters
- Classification: Automatically detects chapter types (prologue, chapter, epilogue, etc.)
Key Benefits
- 100% Document Coverage: Every word is included in exactly one chapter with no gaps or overlaps
- Duplicate Prevention: Advanced algorithms prevent the same content appearing in multiple chapters
- Consistent Processing: Prompt-budgeted chunking ensures predictable processing times across all document sizes
- High Accuracy: Multi-layer validation and AI reasoning achieve 90-98% accuracy on well-formatted documents
🎯 Performance Guarantee
Our system maintains consistent processing times regardless of document complexity, with intelligent load balancing and resource optimization.
Core Methodology: "Cucumber Cutting" Approach
Think of the manuscript as a cucumber that needs to be sliced. The chapter detection process:
- Identifies cut points (chapter boundaries) throughout the document
- Orders these cut points sequentially from start to finish
- Makes clean cuts at these points to create chapters
- Ensures no overlapping or duplicate sections
🔪 Precision Engineering
Our "Cucumber Cutting" methodology ensures surgical precision in chapter boundary detection, with zero tolerance for content loss or duplication.
This approach guarantees that:
- Chapters appear in the correct document order
- No content is duplicated across chapters
- No content is lost between chapters
- Each chapter represents a distinct section of the manuscript
Detailed Process Breakdown
Step 1: Document Preprocessing and Chunking
Function: create_doctree_chunks() and _create_prompt_budgeted_chunks()
The system begins by converting the manuscript into a structured DocTree format where each text block has a unique index, position, styling information, and content. This creates a numbered sequence of blocks that can be precisely referenced.
🏗️ Foundation Layer
This critical preprocessing stage establishes the architectural foundation for all subsequent operations, ensuring data integrity and processing reliability.
Intelligent Chunking Process: The document is then split into overlapping chunks using an advanced prompt-budgeted system. Unlike traditional text-based chunking, this approach:
- Measures actual prompt size: The system pre-builds formatted prompt units using _build_prompt_unit() and calculates the exact character count including formatting overhead
- Maintains consistent chunk sizes: Each chunk generates prompts of similar length (around 30-35k characters) to ensure predictable processing times
- Eliminates the "first chunk problem": Traditional chunking created oversized first chunks that took much longer to process
- Uses smart overlap: Chunks overlap by a specific number of prompt characters to ensure no chapter boundaries are missed between chunks
Why This Matters: The original problem was that the first chunk was consistently 2-3 times larger in prompt size than subsequent chunks because:
- Front matter contains many short blocks (headings, TOC entries) that add formatting overhead
- Each block requires metadata formatting regardless of text length
- Traditional character counting ignored this formatting cost
The Solution:
- _calculate_prompt_overhead() computes constant overhead for instructions and context
- _build_prompt_unit() creates formatted units with exact character counts
- _validate_chunk_consistency() ensures all chunks meet size requirements
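A minimal sketch of the prompt-budgeted chunking idea, assuming each unit is a pre-formatted (block_index, prompt_text) pair; the function name and greedy packing strategy here are illustrative, not the actual _create_prompt_budgeted_chunks() implementation:

```python
def budgeted_chunks(units, max_chars=30000, overlap_chars=4000, overhead=1500):
    """Greedily pack formatted prompt units until the character budget is hit,
    then start the next chunk by re-including ~overlap_chars of trailing units
    so no chapter boundary can fall between two chunks unseen."""
    chunks, current, current_size = [], [], overhead
    for idx, text in units:
        size = len(text)
        if current and current_size + size > max_chars:
            chunks.append(current)
            # carry trailing units back into the next chunk as overlap
            carried, carried_size = [], 0
            for unit in reversed(current):
                if carried_size + len(unit[1]) > overlap_chars:
                    break
                carried.insert(0, unit)
                carried_size += len(unit[1])
            current, current_size = carried[:], overhead + carried_size
        current.append((idx, text))
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Because the budget is measured on the formatted prompt text (including overhead), every chunk yields a prompt of similar size, which is what eliminates the "first chunk problem".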
Quality Monitoring:
The system includes comprehensive validation through _validate_chunk_consistency() that:
- Monitors chunk size variance and alerts if chunks vary too much
- Validates that the first chunk is properly normalized
- Detects empty chunks or processing issues
- Reports performance metrics for optimization
Step 2: AI-Powered Boundary Detection
Function: process_chapter_detection() and build_doctree_boundary_detection_system_prompt()
The chunks are analyzed in parallel by the AI to identify potential chapter boundaries. The system uses specialized prompts designed specifically for chapter detection.
🧠 AI Intelligence Layer
Our proprietary prompt engineering ensures optimal AI performance with context-aware boundary detection and intelligent reasoning capabilities.
How AI Analysis Works:
- Role-based instructions: The build_doctree_boundary_detection_system_prompt() function creates prompts that instruct the AI to act as a conservative chapter boundary detection specialist
- Structured input: Each chunk is presented with clearly marked DocTree block indexes through build_doctree_boundary_detection_user_prompt()
- Enhanced reasoning: Uses GPT models (configurable via CHAPTER_DETECTION_MODEL environment variable) for logical analysis
What the AI Returns: For each potential chapter boundary found, the AI provides:
- DocTree Block Indexes: Up to 4 specific block indexes marking where the chapter begins
- Chapter Title: Both the raw detected title and a cleaned/corrected version
- Confidence Score: How certain the AI is about this boundary (0.0 to 1.0)
- Detection Reasoning: Specific explanation for why this was identified as a chapter start
- Title Corrections: Automatic fixes for common issues like spacing problems or incomplete titles
Parallel Processing:
The process_chapter_detection() function uses ThreadPoolExecutor to process multiple chunks simultaneously, with configurable concurrency limits to stay within API rate limits.
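The parallel processing can be sketched with the standard library; detect_all and analyze below are hypothetical names standing in for process_chapter_detection() and the per-chunk LLM call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def detect_all(chunks, analyze, max_workers=4):
    """Run boundary analysis on every chunk concurrently, preserving
    which chunk each result came from; max_workers caps concurrency
    to stay within API rate limits."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze, chunk): i for i, chunk in enumerate(chunks)}
        for fut in as_completed(futures):
            chunk_idx = futures[fut]
            try:
                results[chunk_idx] = fut.result()
            except Exception as exc:
                # failed responses are preserved with their error details
                results[chunk_idx] = {"error": str(exc)}
    # return results in original chunk order
    return [results[i] for i in range(len(chunks))]
```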
Step 3: Index Validation and Quality Control
Function: _validate_boundary_indexes() and _sanitize_boundary_doctree_indexes()
Every AI-detected boundary undergoes rigorous validation to ensure accuracy and prevent errors from corrupted AI responses.
🛡️ Quality Assurance
Multi-layer validation protocols eliminate AI hallucinations and ensure data integrity throughout the processing pipeline.
Validation Process:
The _validate_boundary_indexes() function performs these critical checks:
- Range verification: Ensures all indexes are valid integers within the document range (0 ≤ index < total_blocks)
- Existence verification: Confirms each index corresponds to an actual block in the document structure
- Chunk boundary respect: Validates that indexes belong to the chunk being analyzed
- Quantity limits: Restricts each boundary to a maximum of 4 DocTree indexes
- Quality filtering: Removes boundaries that end up with no valid indexes after validation
Why This is Critical: AI models can sometimes "hallucinate" invalid indexes or return corrupted data. This validation step:
- Eliminates AI hallucinations and invalid responses
- Prevents cross-contamination between chunks
- Ensures all indexes can be safely used for document slicing
- Maintains data integrity throughout the pipeline
Performance Benefits: Using simple integer indexes instead of complex string IDs provides significant performance improvements:
- No expensive string-to-index lookup operations required
- Simple integer range checks are extremely fast
- Direct array indexing for document slicing
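A simplified sketch of the validation checks listed above; the boundary dict shape is an assumption for illustration, not taken from the actual code:

```python
def validate_boundary_indexes(boundaries, total_blocks, chunk_range, max_indexes=4):
    """Drop invalid indexes and any boundary left with no valid index.

    chunk_range is the (start, end) half-open block range of the chunk
    being analyzed; indexes outside it are treated as cross-contamination."""
    lo, hi = chunk_range
    valid = []
    for b in boundaries:
        keep = [i for i in b["doctree_indexes"]
                if isinstance(i, int)            # reject non-integer values
                and 0 <= i < total_blocks        # range verification
                and lo <= i < hi]                # chunk boundary respect
        keep = keep[:max_indexes]                # quantity limit (max 4)
        if keep:                                 # quality filter: drop empty
            valid.append({**b, "doctree_indexes": keep})
    return valid
```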
Step 4: Raw Response Storage and Extraction
Function: process_chapter_detection() - Raw response handling
All LLM responses are immediately saved to preserve complete debugging information, then boundaries are extracted directly from the raw responses.
Raw Response Storage Process:
- Immediate saving: Raw responses saved to _boundary_raw.json before any processing
- Complete data: Includes completion IDs, timestamps, model used, usage statistics
- JSON formatting: Raw content parsed as structured JSON objects for easy inspection
- Error preservation: Failed responses also saved with error details
Direct Boundary Extraction:
After saving raw responses, the system:
- Extracts boundaries: Directly from raw_content.boundaries_detected in each response
- Tags with source: Each boundary tagged with its source chunk for tracking
- Preserves all data: No intermediate processing that could lose boundaries
- Comprehensive logging: Detailed logging shows exactly which chunks contain boundaries
Why This Approach Works:
- No data loss: Boundaries can't be lost in complex intermediate processing
- Full traceability: Raw responses provide a complete audit trail
- Debugging capability: Exact LLM responses can be inspected when issues occur
- Reliability: Simple, direct extraction minimizes failure points
Step 5: Boundary Merging and Deduplication
Function: _ensure_unique_doctree_indexes_across_boundaries()
The system merges boundaries that represent the same chapter detected across multiple overlapping chunks.
Smart Merging Process:
- Overlap detection: Identifies boundaries sharing 2+ DocTree indexes (the same chapter detected multiple times)
- Intelligent merging: Combines all indexes from overlapping boundaries into a single boundary
- Best metadata preservation: Uses the boundary with the highest confidence for final metadata
- Complete index coverage: The merged boundary contains all indexes from all detections
Merging Logic:
Chapter 7 in chunk A: [3344, 3345, 3346, 3347] (confidence: 0.92)
Chapter 7 in chunk B: [3346, 3347, 3348, 3349] (confidence: 0.96)
Result: One Chapter 7: [3344, 3345, 3346, 3347, 3348, 3349] (confidence: 0.96)
Quality Assurance:
- 2+ index requirement: Prevents false merges from single coincidental index matches
- Confidence-based selection: Always keeps the best metadata from the highest-confidence detection
- Complete coverage: Merged boundaries have comprehensive index coverage for accurate cutting
Result: One clean boundary per actual chapter, with complete index coverage and best available metadata.
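The merging logic above can be sketched as follows; this is an illustrative reimplementation, not the actual _ensure_unique_doctree_indexes_across_boundaries() code:

```python
def merge_overlapping_boundaries(boundaries, min_shared=2):
    """Merge boundaries that share min_shared+ DocTree indexes, keeping
    the metadata of the highest-confidence detection."""
    merged = []
    # visit highest-confidence detections first so the surviving entry
    # carries the best metadata
    for b in sorted(boundaries, key=lambda x: -x["confidence"]):
        indexes = set(b["doctree_indexes"])
        for m in merged:
            if len(indexes & set(m["doctree_indexes"])) >= min_shared:
                # same chapter seen in an overlapping chunk: union the indexes
                m["doctree_indexes"] = sorted(set(m["doctree_indexes"]) | indexes)
                break
        else:
            merged.append({**b, "doctree_indexes": sorted(indexes)})
    return merged
```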
Step 6: Final Boundary Storage
Function: process_chapter_detection() - Final file creation
The merged boundaries are saved to _boundaries_merged.json with complete processing statistics and metadata.
Merged Boundaries File Structure:
{
"type": "merged_boundaries",
"manuscript_id": "uuid",
"created_at": "timestamp",
"total_raw_boundaries": 15,
"total_merged_boundaries": 10,
"merged_boundaries": [...],
"processing_stats": {
"successful_chunks": 50,
"total_chunks": 50,
"boundaries_found": 15,
"boundaries_after_merge": 10
}
}
Quality Assurance:
- Complete statistics: Shows how many raw boundaries were found vs the final merged count
- Processing metrics: Success rates and chunk processing information
- Audit trail: Full processing history for debugging and quality monitoring
Result: Clean, merged boundaries ready for cucumber cutting with complete processing transparency.
Step 7: Import Orchestrator Processing
Function: _merge_boundary_results_and_reconstruct() in import_orchestrator.py
The import orchestrator loads the merged boundaries and applies final validation before cucumber cutting.
Streamlined Loading Process:
- Load merged boundaries: Reads _boundaries_merged.json with fallback to legacy format
- Skip redundant processing: Boundaries are already merged and validated
- Direct usage: Uses merged boundaries directly without re-processing
- Sort for cutting: Orders boundaries by doctree_index for proper cucumber cutting sequence
Quality Assurance: The import orchestrator applies final validation:
- Confidence filtering: Removes boundaries below the 0.6 confidence threshold
- TOC filtering: Eliminates table of contents entries that shouldn't be chapters
- Proximity filtering: Prevents micro-chapters by enforcing a minimum 50-block gap
- Sequential ordering: Ensures proper document order for cucumber cutting
Fallback Handling:
- Primary: Uses _boundaries_merged.json when available
- Fallback: Can process legacy _boundary_results.json format if needed
- Error handling: Graceful failure with detailed error messages
Step 8: Cucumber Cutting Implementation
Function: slice_doctree_at_boundaries() in import_orchestrator.py
The actual "cucumber cutting" where the document is sliced at exact boundary points to create sequential chapters.
Document Slicing Process:
- Extract cut points: Gets exact block indexes from merged boundaries
- Sort by position: Ensures proper sequential order for cutting
- Create chapters: Slices the document between cut points (prologue + chapters)
- Content extraction: Builds chapter content from the DocTree blocks in each slice
Chapter Creation Logic:
Document: 5000 blocks, Cut points: [100, 500, 1200, 2000]
Prologue:  blocks[0:100]     // Blocks 0-99
Chapter 1: blocks[100:500]   // Blocks 100-499
Chapter 2: blocks[500:1200]  // Blocks 500-1199
Chapter 3: blocks[1200:2000] // Blocks 1200-1999
Chapter 4: blocks[2000:5000] // Blocks 2000 to end
Quality Validation:
- Complete coverage: Every block included in exactly one chapter
- No gaps: Continuous slicing ensures no content loss
- No overlaps: Clean boundaries prevent duplicate content
- Word count filtering: Ensures meaningful content in each chapter
Result: Sequential chapters with complete document coverage and no content duplication.
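The slicing logic can be sketched as below, assuming blocks is a flat list of block texts (the real DocTree blocks carry more metadata than plain strings):

```python
def slice_doctree_at_boundaries(blocks, cut_points, min_words=10):
    """Slice the block list at sorted cut points; every block lands in
    exactly one half-open slice, so there are no gaps and no overlaps."""
    # slice edges: document start, each cut point, document end
    edges = sorted(set([0, len(blocks), *cut_points]))
    chapters = []
    for start, end in zip(edges, edges[1:]):
        text = " ".join(blocks[start:end])
        # word count filter: only keep slices with meaningful content
        if len(text.split()) >= min_words:
            chapters.append({"start": start, "end": end, "text": text})
    return chapters
```

Because the edges are shared between adjacent slices, coverage is complete by construction: the end of one chapter is exactly the start of the next.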
The "Cucumber Cutting" Final Phase
Note: The boundary detection phase (Steps 1-7 above) only identifies WHERE to cut. The actual cutting happens later in the import orchestrator.
Step 9: Cucumber Cutting Implementation
Function: slice_doctree_at_boundaries()
This is the true "cucumber cutting" phase where the document is sliced at the exact boundary points identified during detection.
How Document Slicing Works:
The slice_doctree_at_boundaries() function performs the actual cutting:
- Extract cut points: Gets the exact block indexes from all validated boundaries
- Sort cut points: Ensures proper sequential order (should already be sorted from validation)
- Create chapters by slicing: Cuts the document between cut points to create sequential chapters
Chapter Creation Logic:
- Prologue: From the beginning (block 0) to the first cut point
- Regular chapters: From each cut point to the next cut point (or the document end)
- Content extraction: Builds chapter content by joining text from all blocks in the slice
- Quality filtering: Only includes slices with meaningful content (minimum word count)
Complete Document Coverage: Every single block from 0 to the total number of blocks is included in exactly one chapter, with no gaps or overlaps.
Step 10: Final Validation and Chapter Type Detection
Function: detect_chapter_type() and validation within slice_doctree_at_boundaries()
Each chapter is classified and validated to ensure proper organization and complete document coverage.
Chapter Type Classification Process:
The detect_chapter_type() function analyzes title and content patterns to classify each chapter:
- Explicit keyword detection: Looks for specific keywords in the title
- Position-based logic: Considers the chapter's position in the document
- Content analysis: Examines the actual chapter content when needed
- Default classification: Falls back to "chapter" for standard content
Chapter Types Detected:
- "toc": Table of contents sections
- "prologue": Prologue or introduction sections
- "epilogue": Epilogue or conclusion sections
- "part": Part divisions in multi-part books
- "appendix": Appendix sections
- "acknowledgments": Acknowledgments sections
- "notes": Notes or reference sections
- "chapter": Standard narrative chapters (default)
Quality Assurance Checks: The final validation process includes:
- Verifying the document coverage percentage to ensure no content is lost
- Logging the chapter word count distribution for analysis
- Validating that sequential numbering is correct
- Confirming no overlapping content between chapters
- Generating comprehensive metadata for each chapter
Key Design Principles
DocTree Index Validation
Functions: _validate_boundary_indexes(), _sanitize_boundary_doctree_indexes()
- Always verify that AI-detected indexes are within valid range (0 ≤ index < total_blocks)
- Never trust AI-generated indexes without validation
- Remove invalid indexes immediately to prevent downstream errors
- Performance advantage: Simple integer range checks vs complex string validation
Merge Before Order
Function: _merge_overlapping_boundaries()
- Merge overlapping boundaries first before ordering
- Use index intersection as the primary merge criterion
- Preserve highest confidence metadata when merging
- Direct integer operations eliminate lookup overhead
Sequential Processing
Function: _order_boundaries_by_document_position()
- Order by document position not by detection order
- Think like cutting a cucumber - cut points must be sequential
- Maintain document flow from beginning to end
- Instant ordering using direct integer comparison
TOC vs Content Distinction
Function: _apply_proximity_and_toc_filtering()
- TOC appears early in most documents (first 15%)
- TOC contains listings of chapter names, not chapter content
- Actual chapters have substantial content following the heading
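An illustrative sketch of these filters; the position and following_words fields are assumptions about the boundary shape for this example, not the actual data model of _apply_proximity_and_toc_filtering():

```python
def filter_boundaries(boundaries, total_blocks, min_confidence=0.6,
                      min_gap=50, toc_zone=0.15, min_following_words=100):
    """Apply confidence, TOC, and proximity filters to sorted boundaries.

    A boundary early in the document (first 15% of blocks) with little
    content before the next heading is treated as a TOC entry."""
    toc_limit = total_blocks * toc_zone
    kept = []
    for b in sorted(boundaries, key=lambda x: x["position"]):
        if b["confidence"] < min_confidence:
            continue  # confidence filter
        if b["position"] < toc_limit and b.get("following_words", 0) < min_following_words:
            continue  # TOC heuristic: early heading with no real content after it
        if kept and b["position"] - kept[-1]["position"] < min_gap:
            continue  # proximity filter: prevents micro-chapters
        kept.append(b)
    return kept
```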
Quality Over Quantity
Functions: Various filtering and validation functions
- Better to have fewer, accurate chapters than many incorrect ones
- Apply confidence thresholds to filter low-quality detections
- Use proximity filtering to prevent micro-chapters
System Validation and Testing
Validation Checks
The system performs comprehensive validation to ensure quality:
- No duplicate content across chapters
- Sequential chapter ordering matches document flow
- All content preserved (no gaps or missing sections)
- Proper chapter type classification (prologue, chapter, epilogue, etc.)
- TOC filtering effectiveness (no TOC entries as chapters)
Debug Information and Monitoring
The system provides extensive logging and monitoring:
- Log boundary merge operations for transparency
- Track DocTree index validation results and performance
- Monitor proximity filtering decisions
- Report final cut point positions
- Performance metrics for optimization verification
- Comprehensive error handling and retry logic
Implementation Files
The chapter detection system is implemented across several key files:
- agent_worker/tasks/chapter_detection.py: Core boundary processing logic and all validation functions
- app/services/import_orchestrator.py: High-level orchestration and coordination with other systems
- Prompt Engineering: Specialized system and user prompts optimized for boundary detection
Performance Optimization: Integer Index System
Major Performance Enhancement
The system underwent a major optimization by switching from complex string-based DocTree IDs to simple integer indexes.
Previous Approach: Used complex string-based DocTree IDs like "wiUaeu6TLhE" requiring expensive lookup operations.
New Approach: Uses simple integer indexes like 45, 46, 47, 48 representing direct block positions.
⚡ Performance Breakthrough
This architectural optimization delivers dramatic performance improvements while maintaining 100% accuracy and reliability.
Performance Benefits
Eliminated Expensive Operations
- No more ID-to-index mapping: Previously required expensive loops through all blocks
- No more string validation: Complex alphanumeric string format checks removed
- No more lookup tables: Block ID dictionaries eliminated
Direct Integer Operations
- Simple range validation: Direct integer comparison instead of string format checks
- Instant ordering: Direct integer comparison vs lookup operations
- Immediate positioning: Indexes ARE positions - no conversion needed
Simplified Code Paths
- Streamlined validation: Integer range checks vs complex string validation
- Optimized merging: Direct integer set operations vs ID matching
- Direct cucumber cutting: Use indexes immediately for array slicing
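The difference can be illustrated with a toy example (the block shape here is illustrative, not the actual data model):

```python
blocks = [{"id": f"blk{i}", "text": f"text {i}"} for i in range(5000)]

# Old approach: string IDs require a linear scan (or a prebuilt lookup table)
def position_of(block_id):
    for i, b in enumerate(blocks):
        if b["id"] == block_id:
            return i

# New approach: the integer index IS the position
chapter = blocks[100:500]            # direct array slicing for cucumber cutting

assert position_of("blk100") == 100  # O(n) scan per lookup
assert blocks[100]["id"] == "blk100" # O(1) direct access
```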
Expected Performance Gains
- CPU Usage: Significant reduction in processing time for large documents
- Memory: Eliminates ID-to-index mapping dictionaries
- Reliability: Fewer potential failure points
- Maintainability: Much simpler code to debug and maintain
This optimization maintains 100% accuracy while dramatically improving processing speed, especially for large manuscripts with many chapters.
Configuration and Monitoring
Environment Variables
The system can be configured through several environment variables:
- PROMPT_BUDGETING_ENABLED: Enables/disables prompt-budgeted chunking (default: true)
- PROMPT_MAX_INPUT_CHARS: Maximum prompt characters per chunk (default: 30000)
- PROMPT_BLOCK_TEXT_CAP: Text preview length per block (default: 200)
- PROMPT_OVERLAP_CHARS: Overlap in prompt characters (default: 4000)
- PROMPT_SAFETY_MARGIN: Safety buffer percentage (default: 0.15)
- CHAPTER_DETECTION_MODEL: AI model to use (default: gpt-5-mini)
Advanced Configuration Examples
Standard Configuration:
PROMPT_BUDGETING_ENABLED=true # Enable prompt-budgeted chunking (default)
PROMPT_MAX_INPUT_CHARS=56000 # Maximum prompt characters per chunk (~15k total tokens)
PROMPT_BLOCK_TEXT_CAP=80 # Text preview length per block (optimized)
PROMPT_OVERLAP_CHARS=4000 # Overlap in prompt characters
PROMPT_SAFETY_MARGIN=0.2 # 20% safety buffer
Optimization Examples:
# For very large documents (more content per chunk, use with caution)
PROMPT_MAX_INPUT_CHARS=70000
# For faster processing (smaller, quicker chunks)
PROMPT_MAX_INPUT_CHARS=40000
# Use legacy text-based chunking instead
PROMPT_BUDGETING_ENABLED=false
Performance Monitoring
The system logs detailed metrics during processing:
- Chunking mode: Indicates whether prompt-budgeted or legacy chunking is used
- Chunk statistics: Variance ratios and size distributions for consistency monitoring
- Validation results: Consistency checks and quality metrics
- Processing times: Per-chunk and overall processing performance
- Error rates: Success/failure statistics for reliability monitoring
Monitoring Chunking Performance: The active chunking mode can be confirmed by looking for "🎯 Using prompt-budgeted chunking" in the logs.
Expected Results
After proper implementation, the chapter detection system delivers:
- Chapters in correct order matching the original document structure
- No duplicate chapters from overlapping chunk detection
- Proper chapter types (prologue, chapter, epilogue, etc.) automatically classified
- No TOC entries mistaken for actual chapters
- Complete content coverage with no missing sections or gaps
- Clean chapter boundaries at natural break points in the narrative
- High accuracy rates (90-98% for well-formatted documents)
- Consistent processing times regardless of document size or complexity
🏆 Enterprise-Grade Reliability
Our system delivers production-ready results with enterprise-level reliability, scalability, and performance optimization.