RAG Retrieval System
This document describes the RAG (Retrieval-Augmented Generation) retrieval pipeline, which transforms user queries into optimized context for LLM consumption. The system implements a multi-stage retrieval process: query enhancement, vector search, rank fusion, reranking, context expansion, and mathematical context optimization.
Scope: This page covers the retrieval pipeline only. For document ingestion and processing, see Document Ingestion Pipeline. For chunking strategies, see Semantic Chunking Strategies. For the API surface, see Documents API Reference.
Architecture Overview
The RAG retrieval system follows a six-stage pipeline that progressively refines search results to maximize information value within token constraints.
Sources: orchestrator/modules/rag/service.py:142-660
RAGService Class
The RAGService class orchestrates the entire retrieval pipeline. It integrates with existing optimization components rather than reimplementing them.
Key Methods:
Sources: orchestrator/modules/rag/service.py:142-208
Configuration: RAGConfig
Configuration is loaded from the system_settings table rather than hardcoded. This allows runtime tuning without code deployment.
| Setting | Default | Description |
|---|---|---|
| chunk_size | 500 | Target chunk size in characters |
| max_tokens | 2000 | Maximum tokens in the final context |
| diversity_factor | 0.3 | MMR diversity parameter |
| min_similarity | 0.5 | Minimum cosine similarity threshold |
| rag_rerank_enabled | "false" | Enable Cohere reranking (requires API key) |
Sources: orchestrator/modules/rag/service.py:98-140
Stage 1: Query Enhancement
Query enhancement generates multiple query variations to improve recall. The system uses three techniques: HyDE (Hypothetical Document Embeddings), query decomposition, and concept expansion.
HyDE (Hypothetical Document Embeddings)
HyDE generates hypothetical documents that would answer the query, then uses their embeddings for search. This bridges the semantic gap between short queries and longer documents.
Example:
Query:
"What is RAG?"HyDE Doc:
"Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language model generation. It works by first retrieving relevant documents..."
Query Decomposition
Complex queries are decomposed into simpler sub-queries that can be searched independently.
Example:
Query:
"How do agents use tools and workflows together?"Sub-queries:
"How do agents use tools?""How do workflows work?""Agent and workflow integration"
Concept Expansion
Adds related terms and synonyms to improve coverage.
Example:
Query:
"LLM models"Expanded:
"LLM models language models GPT Claude AI models"
Sources: orchestrator/modules/rag/service.py:241-250
Stage 2: Vector Search with S3 Vectors
Vector search is performed against the S3 Vectors backend, which stores embeddings in workspace-isolated S3 buckets rather than PostgreSQL.
Implementation:
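A minimal sketch of the search call, assuming an injected backend object whose `query` method stands in for the real S3 Vectors backend (the actual signature lives in s3_vectors_backend.py):

```python
def workspace_bucket(workspace_id: str) -> str:
    """Per-workspace bucket name used for vector isolation."""
    return f"automatos-vectors-{workspace_id}"

def vector_search(backend, workspace_id: str, query_embedding,
                  top_k: int = 20, min_similarity: float = 0.5):
    """Query the workspace's vector bucket and drop low-similarity hits.

    `backend.query` is a hypothetical stand-in for the S3 Vectors call.
    """
    hits = backend.query(bucket=workspace_bucket(workspace_id),
                         embedding=query_embedding, top_k=top_k)
    # Enforce the min_similarity threshold from RAGConfig.
    return [h for h in hits if h["similarity"] >= min_similarity]
```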
Multi-Tenant Isolation
Each workspace has its own S3 bucket: automatos-vectors-{workspace_id}. This ensures complete data isolation between workspaces.
Sources: orchestrator/modules/rag/service.py:661-731, modules/search/vector_store/backends/s3_vectors_backend.py
Stage 3: Reciprocal Rank Fusion (RRF)
When using query enhancement, multiple query variations produce overlapping results. RRF aggregates these results by scoring documents based on their ranks across all queries.
RRF Algorithm
score(d) = Σ_i 1 / (k + rank_i(d))

Where:
- k = 60 (standard constant from the literature)
- rank_i(d) = rank of document d in the result list for query i (0-indexed)
Documents appearing in multiple query results receive higher scores, indicating higher relevance.
Implementation:
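The formula above maps directly onto a short fusion routine. This is a generic RRF sketch over lists of document ids, not the service's exact code:

```python
def rrf_fuse(result_lists, k: int = 60):
    """Fuse ranked result lists with Reciprocal Rank Fusion.

    Each inner list holds document ids ordered best-first. A document's
    score is the sum of 1/(k + rank) over every list it appears in, so
    documents returned by several query variations rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):  # rank is 0-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```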
Sources: orchestrator/modules/rag/service.py:296-348
Stage 4: Reranking with Cohere
Optional precision reranking using Cohere's cross-encoder model. This stage is disabled by default (requires Cohere API key) and can be enabled via system_settings.rag_rerank_enabled = "true".
Cross-Encoder vs Bi-Encoder
| Approach | Speed | Quality | Used For |
|---|---|---|---|
| Bi-Encoder (embeddings) | Fast | Good | Initial retrieval (Stage 2) |
| Cross-Encoder (Cohere) | Slow | Excellent | Final reranking (Stage 4) |
Why Rerank?
Bi-encoders encode query and documents independently, missing interaction signals. Cross-encoders process query+document pairs jointly, capturing fine-grained relevance signals.
Implementation:
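A sketch of the rerank step with an injected client whose `rerank` method mirrors the shape of Cohere's rerank endpoint (query plus documents in, scored indices out); the real model selection and client wiring live in rerank_manager.py:

```python
def rerank(client, query: str, chunks, top_n: int = 10):
    """Rerank retrieved chunks with a cross-encoder service.

    `client.rerank` is a stand-in returning items with an `index` into
    the submitted documents and a `relevance_score`.
    """
    response = client.rerank(query=query,
                             documents=[c["content"] for c in chunks],
                             top_n=top_n)
    # Reorder the original chunks by the returned relevance scores.
    return [{**chunks[r["index"]], "rerank_score": r["relevance_score"]}
            for r in response]
```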
Sources: orchestrator/modules/rag/service.py:350-386, core/llm/rerank_manager.py
Stage 5: Parent-Child Context Expansion
After reranking, the system expands each chunk by retrieving adjacent chunks from the same document. This provides surrounding context without over-fetching.
Configuration:
Implementation:
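A sketch of the expansion step, assuming chunks carry a `chunk_index` within their document and a `fetch_chunks(document_id, indices)` lookup (a hypothetical stand-in for the PostgreSQL query):

```python
def expand_chunk(fetch_chunks, chunk, window: int = 1) -> str:
    """Pull adjacent chunks (by chunk_index) from the same document
    and merge them around the hit, giving the LLM surrounding context
    without fetching the whole document.
    """
    idx = chunk["chunk_index"]
    wanted = [i for i in range(idx - window, idx + window + 1) if i >= 0]
    # fetch_chunks returns {index: text} for the indices that exist.
    neighbors = fetch_chunks(chunk["document_id"], wanted)
    return " ".join(neighbors[i] for i in sorted(neighbors))
```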
Sources: orchestrator/modules/rag/service.py:661-689
Stage 6: Context Optimization (0/1 Knapsack)
The final stage uses an exact 0/1 knapsack dynamic programming algorithm (rather than a greedy approximation) to select chunks that maximize information value within the token budget.
Objective Function
Content Quality Scoring
Chunks are scored on quality to penalize low-information content (ASCII art, separators):
Source Diversity Penalty
Prevents over-sampling from a single document:
Knapsack DP Algorithm
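A sketch of the selection step as textbook 0/1 knapsack DP. Each candidate carries a token cost and a value (in the service, value would combine similarity, quality, and diversity scores); the chunk dict shape here is illustrative:

```python
def knapsack_select(chunks, max_tokens: int):
    """Select chunk indices maximizing summed value within the token
    budget. Each chunk needs "tokens" (weight) and "value" keys.
    """
    n = len(chunks)
    # dp[i][c] = best value using the first i chunks within capacity c
    dp = [[0.0] * (max_tokens + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = chunks[i - 1]["tokens"], chunks[i - 1]["value"]
        for c in range(max_tokens + 1):
            dp[i][c] = dp[i - 1][c]                      # skip chunk i-1
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v          # take chunk i-1
    # Backtrack to recover which chunks were taken.
    selected, c = [], max_tokens
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= chunks[i - 1]["tokens"]
    return selected[::-1]
```

Unlike greedy first-N selection, the DP can skip an early high-cost chunk in favor of several smaller, higher-value ones.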
Complexity Analysis
| Resource | Bound |
|---|---|
| Time | O(n × capacity × max_items) |
| Space | O(n × capacity) |
| n | Number of candidate chunks (~50) |
| capacity | max_tokens (typically 2000) |
Sources: orchestrator/modules/rag/service.py:409-614
RAG Result Format
The retrieval pipeline returns a structured RAGResult object:
Chunk Schema
Each chunk in RAGResult.chunks contains:
Sources: orchestrator/modules/rag/service.py:35-45, orchestrator/modules/rag/service.py:511-519
Platform Tool Integration
Agents access the RAG system through two platform tools exposed by AgentPlatformTools:
Tool Definitions
search_knowledge
semantic_search
Execution Flow
Document Name Resolution
S3 Vectors stores temporary filenames, but agents need the real document names. The system queries PostgreSQL to map document_id → filename:
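A sketch of the lookup, assuming a `documents` table with `id` and `filename` columns (the actual schema lives in the orchestrator; the sketch uses sqlite-style `?` placeholders, whereas the real service talks to PostgreSQL):

```python
def resolve_document_names(conn, document_ids):
    """Map vector-store document ids to their real filenames via a
    single IN-clause query, instead of one round-trip per chunk.
    """
    placeholders = ",".join("?" * len(document_ids))
    cur = conn.execute(
        f"SELECT id, filename FROM documents WHERE id IN ({placeholders})",
        document_ids,
    )
    return {row[0]: row[1] for row in cur.fetchall()}
```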
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:26-456
Workspace Isolation
All retrieval operations respect workspace boundaries through multiple isolation layers:
| Layer | Mechanism |
|---|---|
| Vector Storage | Separate S3 buckets: automatos-vectors-{workspace_id} |
| Document Metadata | PostgreSQL workspace_id filtering |
| Agent Permissions | Agent → Workspace ownership check |
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:271-277
Performance Characteristics
Token Efficiency
The 0/1 knapsack algorithm ensures optimal token utilization:
| Metric | First-N selection | Knapsack selection |
|---|---|---|
| Token Budget | 2000 | 2000 |
| Chunks Selected | 8 (first 8) | 6 (highest value) |
| Token Usage | 78% | 98% |
| Information Gain | 0.62 | 0.87 |
| Source Diversity | 0.25 | 0.83 |
Query Latency
| Stage | Latency | Notes |
|---|---|---|
| Query Enhancement | 200-500ms | Optional, parallel LLM calls |
| Vector Search | 50-200ms | S3 round-trip |
| RRF Fusion | 1-5ms | In-memory aggregation |
| Reranking | 300-800ms | Optional, Cohere API |
| Context Expansion | 10-50ms | PostgreSQL query |
| Knapsack DP | 1-10ms | O(n × capacity) |
| Total | 260-1550ms | Depends on enabled stages |
Cache Hit Rates
Query enhancement and reranking results can be cached to reduce latency:
| Cached Item | TTL | Hit Rate |
|---|---|---|
| Enhanced Queries | 1 hour | 40-60% |
| Vector Search | 5 minutes | 20-30% |
| Rerank Results | 15 minutes | 50-70% |
Sources: orchestrator/modules/rag/service.py:210-519
Configuration Examples
High Precision (Reranking Enabled)
High Recall (Lower Threshold)
Fast Retrieval (Minimal Processing)
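Illustrative system_settings values for the three profiles above. The keys come from the RAGConfig table; the specific values are assumptions chosen to show the direction of each tuning, not recommended defaults:

```python
# High precision: pay the reranking latency for better ordering,
# and raise the similarity floor to cut marginal matches.
HIGH_PRECISION = {
    "rag_rerank_enabled": "true",
    "min_similarity": "0.6",
}

# High recall: lower the similarity floor to admit weaker matches,
# and push MMR diversity up so more sources are represented.
HIGH_RECALL = {
    "min_similarity": "0.35",
    "diversity_factor": "0.4",
}

# Fast retrieval: skip the slow optional stages and shrink the
# token budget, which also shrinks the knapsack DP table.
FAST_RETRIEVAL = {
    "rag_rerank_enabled": "false",
    "max_tokens": "1500",
}
```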
Sources: orchestrator/modules/rag/service.py:98-140
Error Handling
The retrieval pipeline includes fallbacks at each stage:
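The general shape of a per-stage fallback can be sketched as a pass-through wrapper (an illustrative pattern only; the actual per-stage behavior is in service.py):

```python
def run_stage(stage_fn, inputs, logger=print):
    """Run one pipeline stage; on failure, log the error and pass the
    input through unchanged so retrieval degrades instead of failing.
    """
    try:
        return stage_fn(inputs)
    except Exception as exc:
        name = getattr(stage_fn, "__name__", "stage")
        logger(f"{name} failed ({exc}); skipping stage")
        return inputs
```

With this pattern, a Cohere outage, for example, degrades reranking to the RRF ordering rather than aborting the query.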
Sources: orchestrator/modules/rag/service.py:210-294, orchestrator/modules/rag/service.py:617-659
Related Systems
Document Ingestion: Document Ingestion Pipeline
Chunking: Semantic Chunking Strategies
Cloud Sync: Cloud Storage Integration
Knowledge Graph: Knowledge Graph & Entity Extraction
Agent Tools: For tool execution context, see Tool Router & Execution