RAG Retrieval System


This document describes the RAG (Retrieval-Augmented Generation) retrieval pipeline, which transforms user queries into optimized context for LLM consumption. The system implements a multi-stage retrieval process with query enhancement, vector search, fusion, reranking, and mathematical optimization.

Scope: This page covers the retrieval pipeline only. For document ingestion and processing, see Document Ingestion Pipeline. For chunking strategies, see Semantic Chunking Strategies. For the API surface, see Documents API Reference.


Architecture Overview

The RAG retrieval system follows a six-stage pipeline that progressively refines search results to maximize information value within token constraints.


Sources: orchestrator/modules/rag/service.py:142-660


RAGService Class

The RAGService class orchestrates the entire retrieval pipeline. It integrates with existing optimization components rather than reimplementing them.

| Component | Purpose |
| --- | --- |
| `ContextOptimizer` | 0/1 knapsack, MMR, entropy |
| `SemanticChunker` | 5 chunking strategies |
| `QueryEnhancer` | HyDE, decomposition, expansion |
| `EmbeddingManager` | Centralized embeddings |

Key Methods:
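A hypothetical sketch of the orchestration surface follows; method names and signatures are assumptions, and the authoritative definitions live in orchestrator/modules/rag/service.py.

```python
# Hypothetical sketch of RAGService; method names are assumptions,
# not the verbatim API from service.py.
class RAGService:
    def __init__(self, config: dict):
        self.config = config  # RAGConfig values loaded from system_settings

    async def retrieve(self, query: str, workspace_id: str) -> dict:
        """Run all six pipeline stages and return optimized context."""
        raise NotImplementedError

    async def search(self, query: str, workspace_id: str, top_k: int = 10) -> list:
        """Vector search only (Stage 2), without fusion or optimization."""
        raise NotImplementedError
```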

Sources: orchestrator/modules/rag/service.py:142-208


Configuration: RAGConfig

Configuration is loaded from the system_settings table rather than hardcoded. This allows runtime tuning without code deployment.

| Setting Key | Default | Description |
| --- | --- | --- |
| `chunk_size` | 500 | Target chunk size in characters |
| `max_tokens` | 2000 | Maximum tokens in final context |
| `diversity_factor` | 0.3 | MMR diversity parameter |
| `min_similarity` | 0.5 | Minimum cosine similarity threshold |
| `rag_rerank_enabled` | "false" | Enable Cohere reranking (requires API key) |

Sources: orchestrator/modules/rag/service.py:98-140


Stage 1: Query Enhancement

Query enhancement generates multiple query variations to improve recall. The system uses three techniques: HyDE (Hypothetical Document Embeddings), query decomposition, and concept expansion.


HyDE (Hypothetical Document Embeddings)

HyDE generates hypothetical documents that would answer the query, then uses their embeddings for search. This bridges the semantic gap between short queries and longer documents.

Example:

  • Query: "What is RAG?"

  • HyDE Doc: "Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language model generation. It works by first retrieving relevant documents..."

Query Decomposition

Complex queries are decomposed into simpler sub-queries that can be searched independently.

Example:

  • Query: "How do agents use tools and workflows together?"

  • Sub-queries:

    • "How do agents use tools?"

    • "How do workflows work?"

    • "Agent and workflow integration"

Concept Expansion

Adds related terms and synonyms to improve coverage.

Example:

  • Query: "LLM models"

  • Expanded: "LLM models language models GPT Claude AI models"
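Stage 1 can be sketched as a function that fans a query out into variations. The helpers below are toy stand-ins (the real QueryEnhancer uses LLM calls), and all names are assumptions:

```python
def hyde_document(query: str) -> str:
    # HyDE: in production this is an LLM call drafting a hypothetical answer.
    return f"A document that answers: {query}"

def decompose(query: str) -> list[str]:
    # Naive split on "and" as a stand-in for LLM-based decomposition.
    parts = [p.strip() for p in query.split(" and ")]
    return parts if len(parts) > 1 else []

def expand_concepts(query: str) -> str:
    # Toy synonym table standing in for concept expansion.
    synonyms = {"LLM": "language model"}
    extras = " ".join(v for k, v in synonyms.items() if k in query)
    return f"{query} {extras}".strip()

def enhance_query(query: str) -> list[str]:
    variations = [query, hyde_document(query), *decompose(query), expand_concepts(query)]
    return list(dict.fromkeys(variations))  # dedupe, keep order
```

Each variation is then embedded and searched independently in Stage 2, and the overlapping results are fused in Stage 3.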

Sources: orchestrator/modules/rag/service.py:241-250


Stage 2: Vector Search with S3 Vectors

Vector search is performed against the S3 Vectors backend, which stores embeddings in workspace-isolated S3 buckets rather than PostgreSQL.


Implementation:

Multi-Tenant Isolation

Each workspace has its own S3 bucket: automatos-vectors-{workspace_id}. This ensures complete data isolation between workspaces.
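A minimal sketch of the bucket-naming convention (the helper name is an assumption):

```python
def vector_bucket(workspace_id: str) -> str:
    # One bucket per workspace guarantees physical isolation of embeddings.
    return f"automatos-vectors-{workspace_id}"
```

Every search request resolves the bucket from the caller's workspace, so a query can never read another workspace's vectors.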

Sources: orchestrator/modules/rag/service.py:661-731, modules/search/vector_store/backends/s3_vectors_backend.py


Stage 3: Reciprocal Rank Fusion (RRF)

When using query enhancement, multiple query variations produce overlapping results. RRF aggregates these results by scoring documents based on their ranks across all queries.

RRF Algorithm

RRF_score(d) = Σᵢ 1 / (k + rankᵢ(d) + 1)

Where:

  • k = 60 (standard constant from the literature)

  • rankᵢ(d) = rank of document d in the results for query i (0-indexed)

Documents appearing in multiple query results receive higher scores, indicating higher relevance.


Implementation:
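A self-contained sketch of the fusion step, using 0-indexed ranks and k = 60 as defined above (the function name is an assumption):

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked doc-ID lists from multiple query variations via RRF."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):  # rank is 0-indexed
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Highest aggregate score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```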

Sources: orchestrator/modules/rag/service.py:296-348


Stage 4: Reranking with Cohere

Optional precision reranking using Cohere's cross-encoder model. This stage is disabled by default (requires Cohere API key) and can be enabled via system_settings.rag_rerank_enabled = "true".

Cross-Encoder vs Bi-Encoder

| Approach | Speed | Accuracy | Use Case |
| --- | --- | --- | --- |
| Bi-Encoder (embeddings) | Fast | Good | Initial retrieval (Stage 2) |
| Cross-Encoder (Cohere) | Slow | Excellent | Final reranking (Stage 4) |

Why Rerank?

Bi-encoders encode query and documents independently, missing interaction signals. Cross-encoders process query+document pairs jointly, capturing fine-grained relevance signals.


Implementation:
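A sketch of the gate-and-fallback control flow; the scorer is injected so the logic is testable without an API key (the real implementation calls Cohere through core/llm/rerank_manager.py):

```python
from typing import Callable

def maybe_rerank(query: str, chunks: list[dict], enabled: bool,
                 scorer: Callable[[str, str], float]) -> list[dict]:
    if not enabled:
        return chunks  # keep the RRF ordering from Stage 3
    # Cross-encoder style: score each (query, chunk) pair jointly
    scored = [(scorer(query, c["text"]), c) for c in chunks]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored]
```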

Sources: orchestrator/modules/rag/service.py:350-386, core/llm/rerank_manager.py


Stage 5: Parent-Child Context Expansion

After reranking, the system expands each chunk by retrieving adjacent chunks from the same document. This provides surrounding context without over-fetching.


Configuration:

Implementation:
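An illustrative expansion over an in-memory chunk index keyed by (document_id, chunk_index); the real code fetches neighbors with a PostgreSQL query, and the window parameter is an assumption:

```python
def expand_with_neighbors(hits: list[dict], all_chunks: list[dict],
                          window: int = 1) -> list[dict]:
    index = {(c["document_id"], c["chunk_index"]): c for c in all_chunks}
    expanded, seen = [], set()
    for hit in hits:
        doc, i = hit["document_id"], hit["chunk_index"]
        for j in range(i - window, i + window + 1):  # neighbors ± window
            chunk = index.get((doc, j))
            if chunk and (doc, j) not in seen:
                seen.add((doc, j))
                expanded.append(chunk)
    return expanded
```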

Sources: orchestrator/modules/rag/service.py:661-689


Stage 6: Context Optimization (0/1 Knapsack)

The final stage runs a genuine 0/1 knapsack dynamic-programming algorithm (rather than a greedy heuristic) to select the subset of chunks that maximizes information value within the token budget.

Objective Function
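The selection can be stated as a standard 0/1 knapsack problem (the exact composition of the value term is an assumption; the real weighting lives in ContextOptimizer):

```
maximize    Σᵢ vᵢ·xᵢ
subject to  Σᵢ tᵢ·xᵢ ≤ max_tokens
            xᵢ ∈ {0, 1}
```

where vᵢ combines similarity with the quality and diversity terms described in this section, and tᵢ is the token count of chunk i.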

Content Quality Scoring

Chunks are scored on quality to penalize low-information content (ASCII art, separators):
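A toy version of the quality heuristic (the real scoring in service.py differs in detail; the 1.5 boost factor is an assumption):

```python
def content_quality(text: str) -> float:
    """Score 0.0-1.0; ASCII art and separator runs score near zero."""
    if not text.strip():
        return 0.0
    alnum = sum(ch.isalnum() for ch in text)
    return min(1.0, (alnum / len(text)) * 1.5)
```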

Source Diversity Penalty

Prevents over-sampling from a single document:
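A toy penalty: each chunk already selected from a document shrinks the value of the next one from the same document (the decay factor mirrors diversity_factor; the exact form is an assumption):

```python
def diversity_weight(doc_id: str, selected_counts: dict[str, int],
                     factor: float = 0.3) -> float:
    """Multiplier applied to a chunk's value; decays per prior pick."""
    return (1.0 - factor) ** selected_counts.get(doc_id, 0)
```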

Knapsack DP Algorithm
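A plain 0/1 knapsack over (value, token_cost) pairs, showing the selection step; the production version layers the quality and diversity terms on top, and names here are assumptions:

```python
def knapsack_select(chunks: list[tuple[float, int]], capacity: int) -> list[int]:
    """chunks: (value, tokens) per candidate; returns selected indices."""
    n = len(chunks)
    # dp[i][c] = best value using the first i chunks within c tokens
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, tokens = chunks[i - 1]
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]  # option: skip chunk i-1
            if tokens <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - tokens] + value)
    # Trace back which chunks were taken
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= chunks[i - 1][1]
    return selected[::-1]
```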

Complexity Analysis

| Metric | Value |
| --- | --- |
| Time | O(n × capacity × max_items) |
| Space | O(n × capacity) |
| n | Number of candidate chunks (~50) |
| capacity | max_tokens (typically 2000) |

Sources: orchestrator/modules/rag/service.py:409-614


RAG Result Format

The retrieval pipeline returns a structured RAGResult object:

Chunk Schema

Each chunk in RAGResult.chunks contains:
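A hypothetical shape for the result object and its chunk entries; field names are inferred from the pipeline stages, not the exact schema in service.py:

```python
from dataclasses import dataclass, field

@dataclass
class RAGResult:
    context: str                                     # final assembled context
    chunks: list[dict] = field(default_factory=list)
    total_tokens: int = 0
    queries_used: list[str] = field(default_factory=list)

# Each entry in RAGResult.chunks might look like:
example_chunk = {
    "document_id": "doc-123",
    "document_name": "architecture.md",  # resolved via PostgreSQL
    "chunk_index": 4,
    "text": "...",
    "similarity": 0.82,                  # Stage 2 vector score
    "rrf_score": 0.031,                  # Stage 3 fusion score
}
```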

Sources: orchestrator/modules/rag/service.py:35-45, orchestrator/modules/rag/service.py:511-519


Platform Tool Integration

Agents access the RAG system through two platform tools exposed by AgentPlatformTools:


Tool Definitions

search_knowledge
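A hedged sketch of what the tool schema could look like; the authoritative definition is in agent_platform_tools.py, and parameter names here are assumptions:

```python
search_knowledge_tool = {
    "name": "search_knowledge",
    "description": "Search workspace documents through the RAG pipeline",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language query"},
            "max_results": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}
```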

Execution Flow

Document Name Resolution

S3 Vectors stores temporary filenames, but agents need the real document names. The system queries PostgreSQL to map document_id → filename:
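A sketch of the lookup, using SQLite as a stand-in for the PostgreSQL connection (table and column names are assumptions):

```python
import sqlite3  # stand-in for the PostgreSQL driver

def resolve_names(conn, document_ids: list[str]) -> dict[str, str]:
    """Map document_id -> stored filename for display to the agent."""
    if not document_ids:
        return {}
    placeholders = ",".join("?" for _ in document_ids)
    rows = conn.execute(
        f"SELECT id, filename FROM documents WHERE id IN ({placeholders})",
        document_ids,
    ).fetchall()
    return {doc_id: name for doc_id, name in rows}
```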

Sources: orchestrator/modules/agents/services/agent_platform_tools.py:26-456


Workspace Isolation

All retrieval operations respect workspace boundaries through multiple isolation layers:

| Layer | Mechanism |
| --- | --- |
| Vector Storage | Separate S3 buckets: `automatos-vectors-{workspace_id}` |
| Document Metadata | PostgreSQL `workspace_id` filtering |
| Agent Permissions | Agent → Workspace ownership check |


Sources: orchestrator/modules/agents/services/agent_platform_tools.py:271-277


Performance Characteristics

Token Efficiency

The 0/1 knapsack algorithm ensures optimal token utilization:

| Metric | Naive Approach | Knapsack DP |
| --- | --- | --- |
| Token Budget | 2000 | 2000 |
| Chunks Selected | 8 (first 8) | 6 (highest value) |
| Token Usage | 78% | 98% |
| Information Gain | 0.62 | 0.87 |
| Source Diversity | 0.25 | 0.83 |

Query Latency

| Stage | Latency | Notes |
| --- | --- | --- |
| Query Enhancement | 200-500ms | Optional, parallel LLM calls |
| Vector Search | 50-200ms | S3 round-trip |
| RRF Fusion | 1-5ms | In-memory aggregation |
| Reranking | 300-800ms | Optional, Cohere API |
| Context Expansion | 10-50ms | PostgreSQL query |
| Knapsack DP | 1-10ms | O(n × capacity) |
| Total | 260-1550ms | Depends on enabled stages |

Cache Hit Rates

Query enhancement and reranking results can be cached to reduce latency:

| Cache Type | TTL | Hit Rate |
| --- | --- | --- |
| Enhanced Queries | 1 hour | 40-60% |
| Vector Search | 5 minutes | 20-30% |
| Rerank Results | 15 minutes | 50-70% |

Sources: orchestrator/modules/rag/service.py:210-519


Configuration Examples

High Precision (Reranking Enabled)

High Recall (Lower Threshold)

Fast Retrieval (Minimal Processing)
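The three profiles above might map to system_settings values like the following; the keys follow the Configuration table earlier, and the specific numbers are illustrative assumptions:

```python
high_precision = {"rag_rerank_enabled": "true",  "min_similarity": "0.6", "max_tokens": "2000"}
high_recall    = {"rag_rerank_enabled": "false", "min_similarity": "0.3", "max_tokens": "3000"}
fast_retrieval = {"rag_rerank_enabled": "false", "min_similarity": "0.5", "max_tokens": "1500"}
```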

Sources: orchestrator/modules/rag/service.py:98-140


Error Handling

The retrieval pipeline includes fallbacks at each stage:
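The fallback pattern can be sketched as a wrapper that logs and degrades to a safe default when an optional stage fails (structure is an assumption; the actual handling is in service.py):

```python
import logging

log = logging.getLogger("rag")

def run_stage(name, fn, fallback):
    """Run one pipeline stage; on failure, log and return the fallback."""
    try:
        return fn()
    except Exception as exc:  # e.g. LLM timeout, Cohere API error
        log.warning("stage %s failed, using fallback: %s", name, exc)
        return fallback
```

For example, a failed rerank call falls back to the RRF ordering, and a failed query enhancement falls back to the raw query.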

Sources: orchestrator/modules/rag/service.py:210-294, orchestrator/modules/rag/service.py:617-659


