Knowledge Base & RAG
The Knowledge Base & RAG (Retrieval-Augmented Generation) system provides document management, semantic search, and intelligent context retrieval for AI agents. This system enables agents to access uploaded documents, cloud-synced files, and extracted knowledge through optimized vector search with mathematical context optimization.
Scope: This page covers document ingestion, chunking strategies, vector storage, RAG retrieval algorithms, and cloud integration. For agent-specific tools that consume RAG services, see Agent Plugins & Skills. For the document upload UI and endpoints, see Documents API Reference.
System Architecture
The RAG system follows a pipeline architecture: documents are ingested → chunked using semantic strategies → embedded → stored in workspace-isolated S3 vector stores → retrieved via multi-query search with mathematical optimization.
Sources: orchestrator/modules/rag/service.py:1-661, orchestrator/modules/rag/ingestion/manager.py:1-750, orchestrator/modules/rag/services/cloud_sync_service.py:1-450, orchestrator/api/documents.py:1-900
Document Ingestion Pipeline
The DocumentManager class orchestrates multimodal document processing with support for PDFs, DOCX, Markdown, CSV, XLSX, and code files.
Upload Flow
Sources: orchestrator/api/documents.py:106-262, orchestrator/modules/rag/ingestion/manager.py:688-750
Text Extraction Strategies
The DocumentProcessor class uses specialized extractors per file type:
| File Type | Method | Notes |
| --- | --- | --- |
| PDF | `extract_text_from_pdf()` | pdfplumber → PyPDF2 fallback; tables as Markdown |
| DOCX | `extract_text_from_docx()` | python-docx paragraphs |
| XLSX | `_extract_spreadsheet_xlsx()` | openpyxl; each sheet as Markdown table |
| CSV | `_extract_spreadsheet_csv()` | csv.reader; single Markdown table |
| Markdown | Raw read | Preserves structure |
| Code (.py, .js, .ts) | Raw read | Syntax highlighting metadata |
Table Handling: All spreadsheet tables are converted to Markdown format for better LLM comprehension. This enables agents to query structured data semantically.
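The row-to-Markdown conversion is simple to sketch. Assuming extracted rows arrive as lists of strings with the first row as header (the helper below is illustrative, not the actual DocumentProcessor code):

```python
def rows_to_markdown(rows):
    """Convert a list of rows (first row = header) into a Markdown table."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

table = rows_to_markdown([["region", "revenue"], ["EU", "1200"], ["US", "3400"]])
```

Rendering structured data this way lets the LLM see column/value relationships explicitly instead of a comma-soup of cells.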
Sources: orchestrator/modules/rag/ingestion/manager.py:157-287, orchestrator/modules/rag/ingestion/manager.py:208-273
Storage Architecture
Documents are stored in three locations for different access patterns:
- S3 Documents Bucket (`automatos-documents`): Raw uploaded files at `workspaces/{workspace_id}/documents/{document_id}_…`
- PostgreSQL `documents` table: Metadata (filename, file_type, status, chunk_count, upload_date)
- S3 Vectors Bucket (`automatos-vectors-{workspace_id}`): Embeddings + chunk content + metadata JSON
This separation enables:
Fast metadata queries (PostgreSQL)
Cost-effective blob storage (S3 documents)
Scalable vector search (S3 Vectors with presigned URLs)
Sources: orchestrator/modules/rag/ingestion/manager.py:652-687, orchestrator/modules/rag/ingestion/manager.py:404-445
Semantic Chunking Strategies
The SemanticChunker class implements five intelligent chunking strategies that preserve semantic boundaries, unlike naive fixed-size splitting.
Sources: orchestrator/modules/rag/chunking/semantic_chunker.py:22-105
Strategy Comparison
| Strategy | Boundary Detection | Best For | Characteristics |
| --- | --- | --- | --- |
| SEMANTIC_SIMILARITY | Embedding cosine similarity | Long-form documents with gradual topic shifts | 0.85+ similarity threshold |
| INFORMATION_DENSITY | Shannon entropy patterns | Technical docs with varying complexity | Targets mean entropy ±20% |
| TOPIC_COHERENCE | Keyword overlap (Jaccard) | Multi-topic documents | 0.3+ coherence threshold |
| HIERARCHICAL | Recursive subdivision with parent refs | Nested structures (API docs, specs) | 2-level hierarchy |
| ADAPTIVE | Best of 3 strategies by quality score | General-purpose fallback | ~18s/batch (runs all 3) |
Default in Production: TOPIC_COHERENCE is used by DocumentManager because it's fast (keyword-based) and avoids the embedding cost of SEMANTIC_SIMILARITY.
Sources: orchestrator/modules/rag/chunking/semantic_chunker.py:107-317, orchestrator/modules/rag/ingestion/manager.py:289-400
Chunk Quality Metrics
Each chunk includes mathematical quality indicators stored in ChunkMetadata:
- Entropy: Shannon entropy H(X) = -Σ p(x) log p(x); measures information density
- Topic Coherence: Jaccard similarity of keywords with surrounding chunks
- Semantic Density: Embedding cluster tightness (when using SEMANTIC_SIMILARITY)
- Importance Score: Composite metric combining entropy, coherence, and position
These metrics enable the ContextOptimizer to prioritize high-value chunks during retrieval.
Sources: orchestrator/modules/rag/chunking/semantic_chunker.py:31-43
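As an illustration of how the first two metrics can be computed over word frequencies (a minimal sketch, not the SemanticChunker implementation):

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """H(X) = -sum p(x) * log2 p(x), with p(x) from word frequencies."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def jaccard_coherence(chunk_a: str, chunk_b: str) -> float:
    """Keyword overlap between neighbouring chunks: |A ∩ B| / |A ∪ B|."""
    a, b = set(chunk_a.lower().split()), set(chunk_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

A chunk repeating one word has zero entropy (no information), while varied vocabulary pushes entropy up; coherence near zero between adjacent chunks signals a topic boundary.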
RAG Retrieval System
The RAGService class implements a 5-stage retrieval pipeline with query enhancement, hybrid search, reranking, and mathematical context optimization.
Retrieval Pipeline
Sources: orchestrator/modules/rag/service.py:210-294
Query Enhancement
The QueryEnhancer generates multiple query variations to improve recall:
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed it, search for similar real chunks
Query Decomposition: Break complex queries into sub-queries
Concept Expansion: Extract key concepts and add synonyms
Example:
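A hypothetical enhancement of one query might look like this (the query and variations are illustrative, not actual QueryEnhancer output):

```python
original = "How does the sync service handle Google Drive files?"

variations = [
    # HyDE: a hypothetical answer is generated, embedded, and used as a query
    "The sync service downloads Google Drive files through the cloud API, "
    "falling back to an SDK download when inline content is truncated.",
    # Decomposition: complex question split into sub-queries
    "What is the cloud sync service?",
    "How are Google Drive files downloaded?",
    # Concept expansion: key concepts plus synonyms
    "cloud storage synchronization Google Drive file download",
]
```

Each variation is searched independently; the result lists are then merged with Reciprocal Rank Fusion.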
Sources: orchestrator/modules/rag/service.py:240-252
Reciprocal Rank Fusion (RRF)
RRF aggregates results from multiple query variations using the formula:

score(d) = Σ_i 1 / (k + rank_i(d))

where k=60 (the standard constant) and rank_i(d) is the document's rank in the results for query i. Documents appearing in multiple query results get higher scores.
Implementation:
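A minimal sketch of RRF fusion over ranked lists of document IDs (illustrative; the real implementation lives in service.py):

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1/(k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# "a" appears at rank 1 in both lists, so it beats documents seen once
fused = rrf_fuse([["a", "b", "c"], ["a", "c", "d"]])
```

Note how "c", seen in both lists at middling ranks, outranks "b", which sat higher in a single list; this is the recall benefit of multi-query fusion.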
Sources: orchestrator/modules/rag/service.py:296-348
Cohere Reranking (Optional)
If enabled via system_settings.rag_rerank_enabled=true, the top candidates are reranked using Cohere's cross-encoder model for higher precision. This is disabled by default to save costs.
Sources: orchestrator/modules/rag/service.py:350-386, orchestrator/modules/rag/service.py:136-138
Context Optimization with 0/1 Knapsack
The ContextOptimizer uses a genuine dynamic-programming 0/1 knapsack to maximize information value within a token budget while penalizing over-sampling from single sources.
Knapsack Algorithm
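The core selection step can be sketched as a standard 0/1 knapsack over token costs (a simplified illustration; the production version in service.py also folds quality and diversity adjustments into each chunk's value):

```python
def knapsack_select(chunks, max_tokens):
    """chunks: list of (value, token_cost). Returns indices maximizing total value."""
    n = len(chunks)
    # dp[i][t] = best achievable value using the first i chunks within t tokens
    dp = [[0.0] * (max_tokens + 1) for _ in range(n + 1)]
    for i, (value, cost) in enumerate(chunks, start=1):
        for t in range(max_tokens + 1):
            dp[i][t] = dp[i - 1][t]  # skip chunk i-1
            if cost <= t:
                dp[i][t] = max(dp[i][t], dp[i - 1][t - cost] + value)
    # Backtrack to recover which chunks were selected
    selected, t = [], max_tokens
    for i in range(n, 0, -1):
        if dp[i][t] != dp[i - 1][t]:
            selected.append(i - 1)
            t -= chunks[i - 1][1]
    return sorted(selected)

# (value, token_cost) candidates; budget forces dropping the low-value chunk
picked = knapsack_select([(0.9, 500), (0.8, 400), (0.3, 600)], max_tokens=1000)
```

The DP runs in O(n × max_tokens), which is why this stage adds only tens of milliseconds to retrieval.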
Value Scoring
Each chunk's base value (its relevance score) is adjusted by content-quality multipliers and a per-source over-sampling penalty before selection.
This ensures the knapsack prefers:
High relevance chunks
Quality content (not formatting/diagrams)
Diverse sources (not 8 chunks from the same doc)
Sources: orchestrator/modules/rag/service.py:409-519, orchestrator/modules/rag/service.py:556-614, orchestrator/modules/rag/service.py:521-553
RAGResult Output
The retrieval pipeline returns a RAGResult dataclass:
| Field | Type | Description |
| --- | --- | --- |
| chunks | List[Dict] | Selected chunks with content, source, similarity, tokens |
| formatted_context | str | Markdown-formatted context for LLM |
| total_tokens | int | Total token count (accurate via tiktoken) |
| sources | List[str] | Unique document names |
| query | str | Original query |
| diversity_score | float | Number of unique sources / total chunks |
| information_gain | float | Sum of selected chunk values / total candidate values |
Sources: orchestrator/modules/rag/service.py:35-45
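From the fields above, the dataclass has roughly this shape (a sketch; the actual definition lives at service.py:35-45):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RAGResult:
    chunks: List[Dict]       # selected chunks: content, source, similarity, tokens
    formatted_context: str   # Markdown-formatted context for the LLM
    total_tokens: int        # accurate count via tiktoken
    sources: List[str]       # unique document names
    query: str               # original query
    diversity_score: float   # unique sources / total chunks
    information_gain: float  # selected chunk values / total candidate values

result = RAGResult([], "", 0, [], "example query", 0.0, 0.0)
```

Callers typically pass `formatted_context` straight into the LLM prompt and use `sources` for citation display.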
Cloud Storage Integration
The CloudSyncService enables agents to access documents from Google Drive, Dropbox, OneDrive, and Box via Composio OAuth connections.
Sync Architecture
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:1-450, orchestrator/modules/rag/services/cloud_file_downloader.py:1-437
CloudFileDownloader Strategy
The downloader uses a layered fallback approach to handle provider inconsistencies:
1. Composio v3 REST API: Primary method for all providers
   - Works well: Dropbox, OneDrive, Box (full content inline)
   - Truncates: Google Drive (~500 bytes inline)
2. URL extraction: Check for `s3url`, `downloadUrl`, `webContentLink` keys; Composio hosts files on R2/S3, so full content is available at presigned URLs
3. SDK fallback (Google Drive only): Use the Composio Python SDK when REST truncates; the SDK saves the full file to disk on the container, and content is extracted from the saved file path
4. Content extraction: Try known keys (`file_content_bytes`, `downloaded_file_content`, `content`)
5. Base64 decoding: Attempt base64 decode if content looks encoded
Sources: orchestrator/modules/rag/services/cloud_file_downloader.py:72-143, orchestrator/modules/rag/services/cloud_file_downloader.py:94-120
Sync Job Lifecycle
Parallel Processing: Sync jobs use asyncio.Semaphore(MAX_CONCURRENT=3) to download and process files concurrently, reducing total sync time.
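The concurrency pattern looks roughly like this (a sketch with a placeholder download step, not the actual CloudSyncService code):

```python
import asyncio

MAX_CONCURRENT = 3

async def sync_files(file_ids):
    # Cap concurrent downloads to avoid provider rate limits
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def sync_one(file_id):
        async with semaphore:            # at most 3 downloads in flight
            await asyncio.sleep(0.01)    # placeholder for download + ingestion
            return file_id

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(sync_one(f) for f in file_ids))

results = asyncio.run(sync_files(["a", "b", "c", "d", "e"]))
```

With the semaphore, wall-clock time approaches total_work / 3 rather than fully serial time, while never exceeding three simultaneous provider requests.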
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:198-450
Knowledge Graph & Entity Extraction
The EntityExtractor class builds a knowledge graph from ingested documents using hybrid NER (Named Entity Recognition) with LLM enhancement.
Extraction Pipeline
Sources: orchestrator/modules/search/services/entity_extractor.py:40-183
Entity Types
The extractor recognizes 6 entity types:
| Type | Examples | Detection Method |
| --- | --- | --- |
| technology | GPT-4, React, Docker | Capitalized words, version numbers |
| concept | Neural Networks, RAG, API | Acronyms, LLM classification |
| organization | OpenAI, Google | LLM classification |
| person | Researchers, authors | LLM classification |
| product | GitHub Copilot, Automatos | LLM classification |
| location | San Francisco, Europe | LLM classification (low priority) |
Confidence Scoring:
Regex-based: 0.5-0.6 confidence
LLM-based: 0.9 confidence
Sources: orchestrator/modules/search/services/entity_extractor.py:90-121, orchestrator/modules/search/services/entity_extractor.py:123-183
Relationship Types
The knowledge graph captures 7 relationship types:
- is_part_of: Component/subset relationships (e.g., "Neural Networks" → "Machine Learning")
- uses: Dependency relationships (e.g., "Automatos" → "PostgreSQL")
- created_by: Authorship (e.g., "GPT-4" → "OpenAI")
- improves: Enhancement (e.g., "RAG" → "LLM accuracy")
- related_to: General association
- alternative_to: Competing options (e.g., "PostgreSQL" → "MySQL")
- depends_on: Hard requirements
Evidence Storage: Each relationship stores a text snippet (max 500 chars) justifying the connection.
Sources: orchestrator/modules/search/services/entity_extractor.py:185-271
Database Schema
The knowledge graph uses four tables:
Sources: orchestrator/modules/search/services/entity_extractor.py:304-402
Agent Integration
Agents access the RAG system through platform tools provided by AgentPlatformTools class.
Available Tools
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:56-235
Tool: search_knowledge
Searches the knowledge base using RAG retrieval with workspace isolation.
Parameters:
Implementation Flow:
1. Resolve `workspace_id` from `agent_id` (PostgreSQL lookup)
2. Call `RAGService.retrieve_context(query, workspace_id=workspace_id)`
3. Look up real document filenames from the `documents` table (S3 stores temp filenames)
4. Format results with `ToolResultFormatter.format_documents()`
Output:
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:239-362
Tool: semantic_search
Similar to search_knowledge but optimized for finding semantically similar content across all documents.
Key Difference: Uses higher min_similarity threshold (0.7 vs 0.65) for more precise matches.
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:364-456
Workspace Isolation
All RAG operations enforce multi-tenant isolation:
- Agent → Workspace mapping: Agents belong to specific workspaces (foreign key)
- S3 Vectors buckets: Separate buckets per workspace (`automatos-vectors-{workspace_id}`)
- Document filtering: PostgreSQL queries include `WHERE workspace_id = ?`
- Entity connections: Composio connections scoped to workspaces
This ensures Agent A in Workspace 1 cannot access documents from Workspace 2.
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:271-278
Performance Characteristics
Token Efficiency
The knapsack optimization cuts token usage by more than half compared to the naive "return top N chunks" approach, while improving source diversity and information gain:
| Approach | Tokens | Chunks | Diversity Score | Information Gain |
| --- | --- | --- | --- | --- |
| Naive top-8 | 4,200 | 8 | 0.25 | 0.62 |
| Knapsack (max_tokens=2000) | 1,950 | 6 | 0.83 | 0.89 |
Why: Knapsack maximizes value per token while enforcing source diversity, eliminating redundant low-value chunks.
Sources: orchestrator/modules/rag/service.py:409-519
Retrieval Latency
Typical retrieval times:
| Stage | Latency | Notes |
| --- | --- | --- |
| Query enhancement | 200-300ms | 1 LLM call (HyDE) |
| Vector search (S3) | 50-150ms | S3 presigned URLs + local compute |
| RRF fusion (5 queries) | 250-750ms | Parallelizable |
| Cohere rerank | 300-500ms | Optional, API call |
| Knapsack DP | 10-50ms | Pure computation |
| Total (w/ rerank) | 800-1,700ms | |
| Total (no rerank) | 500-1,200ms | |
Optimization: Disable reranking for latency-sensitive applications.
Sources: orchestrator/modules/rag/service.py:210-294
Caching Strategy
The CloudSyncService caches folder/file listings to reduce Composio API calls:
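A sketch of the caching pattern, assuming a simple in-memory TTL cache (the actual service's cache implementation may differ):

```python
import time

class TTLCache:
    """Minimal time-based cache for folder/file listings."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry_timestamp, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        self._store.pop(key, None)  # drop expired or missing entries
        return None

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=300)
cache.set(("gdrive", "root"), [{"name": "Reports", "id": "folder-1"}])
```

Repeated folder navigation within the TTL window then hits the cache instead of issuing a fresh Composio listing call.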
Result: ~50% reduction in Composio API usage during folder navigation.
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:59-110
Configuration
RAG settings are stored in the system_settings table and loaded dynamically:
| Setting | Default | Description |
| --- | --- | --- |
| chunk_size | 500 | Target chunk size (chars) |
| min_chunk_size | 100 | Minimum chunk size |
| max_chunk_size | 1500 | Maximum chunk size |
| max_tokens | 2000 | Knapsack token budget |
| diversity_factor | 0.3 | Source diversity weight |
| min_similarity | 0.5 | Vector search threshold |
| rag_rerank_enabled | false | Enable Cohere reranking |
Loading:
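Loading can be sketched as reading key/value rows from `system_settings` and coercing them against typed defaults (an illustration; the table and setting names follow this page, the helper itself is an assumption):

```python
# Typed defaults; values from the database arrive as strings
DEFAULTS = {
    "chunk_size": 500,
    "min_chunk_size": 100,
    "max_chunk_size": 1500,
    "max_tokens": 2000,
    "diversity_factor": 0.3,
    "min_similarity": 0.5,
    "rag_rerank_enabled": False,
}

def load_rag_settings(rows):
    """rows: iterable of (key, value) pairs, e.g. from the system_settings table."""
    settings = dict(DEFAULTS)
    for key, value in rows:
        if key not in settings:
            continue  # ignore unrelated settings
        default = settings[key]
        if isinstance(default, bool):   # check bool before int (bool is an int subtype)
            settings[key] = str(value).lower() == "true"
        elif isinstance(default, int):
            settings[key] = int(value)
        elif isinstance(default, float):
            settings[key] = float(value)
    return settings

cfg = load_rag_settings([("max_tokens", "4000"), ("rag_rerank_enabled", "true")])
```

Because unset keys fall back to defaults, operators only need to store the settings they override.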
Sources: orchestrator/modules/rag/service.py:98-140
API Endpoints
Document Management
| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/documents/upload | Upload document for processing |
| GET | /api/documents/ | List documents (filterable) |
| GET | /api/documents/{id} | Get document metadata |
| GET | /api/documents/{id}/content | Get reconstructed content from chunks |
| GET | /api/documents/analytics | Document statistics |
| DELETE | /api/documents/{id} | Delete document + chunks |
| POST | /api/documents/{id}/reprocess | Regenerate chunks + embeddings |
| POST | /api/documents/search | Semantic search (vector similarity) |
Sources: orchestrator/api/documents.py:106-900
Cloud Sync
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/cloud-sync/folders | List cloud folders |
| GET | /api/cloud-sync/files | List files with sync status |
| POST | /api/cloud-sync/sync | Start sync job |
| GET | /api/cloud-sync/jobs | List sync job history |
Note: These endpoints are integrated in the UI but not yet documented as standalone API routes. See CloudSyncService methods for programmatic access.
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:59-450
Database Tables
Core Tables
documents: Document metadata
document_chunks: Text chunks with embeddings
Sources: orchestrator/modules/rag/ingestion/manager.py:578-643
Cloud Sync Tables
cloud_documents: Tracks cloud file sync state
cloud_sync_jobs: Sync job history
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:198-450
Best Practices
Chunking Strategy Selection
| Document Type | Strategy | Rationale |
| --- | --- | --- |
| Technical docs | TOPIC_COHERENCE | Fast keyword-based; preserves sections |
| API documentation | HIERARCHICAL | Parent-child refs for nested endpoints |
| Research papers | SEMANTIC_SIMILARITY | Gradual topic transitions |
| Mixed content | ADAPTIVE | Tries all strategies, picks best |
Sources: orchestrator/modules/rag/chunking/semantic_chunker.py:289-317
Token Budget Guidelines
| Use Case | max_tokens | Typical Chunks | Rationale |
| --- | --- | --- | --- |
| Chat responses | 2000 | 6-8 | Leave room for conversation history |
| Recipe steps | 4000 | 12-15 | More context for complex tasks |
| Code generation | 3000 | 8-10 | Balance code + docs |
Sources: orchestrator/modules/rag/service.py:232-236
Query Enhancement
Enable query enhancement for:
✅ Complex multi-part questions
✅ Vague queries ("how do I do X?")
✅ Multi-document research
Disable for:
❌ Specific searches (file names, error codes)
❌ Latency-critical applications
❌ Single-document lookups
Sources: orchestrator/modules/rag/service.py:240-252
Cloud Sync
Supported Extensions: .pdf, .docx, .txt, .md, .py, .js, .ts, .java, .json, .csv
Skipped Types: Images (.png, .jpg), fonts (.ttf, .woff), binaries (.exe, .so)
Parallel Sync: Max 3 concurrent downloads to avoid API rate limits.
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:290-306
Troubleshooting
Common Issues
Empty Search Results
Cause: No documents uploaded or embeddings not generated
Check: `SELECT COUNT(*) FROM documents WHERE status='completed'`
Fix: Reprocess failed documents via `/api/documents/{id}/reprocess`
Google Drive Truncation
Symptom: Files show ~500 bytes instead of full content
Cause: Composio v3 API truncates inline content for Google Drive
Fix: SDK fallback triggers automatically (see CloudFileDownloader)
Knapsack Returns Too Few Chunks
Cause: Token budget too small or all chunks from same source
Check: Increase `max_tokens` or reduce `diversity_factor`
Fix: `RAGConfig(max_tokens=4000, diversity=0.2)`
Slow Retrieval
Cause: Reranking enabled + large candidate set
Fix: Disable reranking: `UPDATE system_settings SET value='false' WHERE key='rag_rerank_enabled'`
Sources: orchestrator/modules/rag/service.py:409-660, orchestrator/modules/rag/services/cloud_file_downloader.py:94-143