Document Management
Document Management provides the interface for uploading, processing, and managing documents that feed into the RAG system. It handles file uploads via REST API, validates content types, stores documents in S3, and tracks metadata in PostgreSQL. Documents can be uploaded directly or synced automatically from cloud storage providers like Google Drive and Dropbox.
Scope: This page covers document upload, storage, metadata management, and cloud sync orchestration. For details on how documents are processed and chunked, see Document Ingestion Pipeline. For chunking algorithms, see Semantic Chunking Strategies. For using documents in retrieval, see RAG Retrieval System.
Document Lifecycle
Documents move through a defined lifecycle from upload to completion. The DocumentManager and API endpoints coordinate state transitions.
Document States
Sources: orchestrator/api/documents.py:106-262, orchestrator/modules/rag/ingestion/manager.py:56-61
| State | Description | Database Value |
| --- | --- | --- |
| uploaded | File received, awaiting processing | status='uploaded' |
| processing | Extraction and chunking in progress | status='processing' |
| completed | Successfully processed and indexed | status='completed' |
| failed | Processing encountered an error | status='failed' |
The Document model in PostgreSQL tracks this lifecycle with fields: id, filename, file_type, file_size, upload_date, processed_date, status, chunk_count, content_hash, and workspace_id.
Sources: orchestrator/modules/rag/ingestion/manager.py:585-601
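The lifecycle and its fields can be sketched as a plain dataclass with a small transition guard. This is a simplified stand-in, not the actual SQLAlchemy model; the set of allowed transitions is an assumption inferred from the states and the reprocess endpoint described on this page.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Document:
    """Simplified sketch of the Document metadata record."""
    id: int
    workspace_id: str
    filename: str
    file_type: str             # detected type: pdf, docx, ...
    file_size: int             # bytes
    upload_date: datetime
    status: str = "uploaded"   # uploaded | processing | completed | failed
    processed_date: Optional[datetime] = None
    chunk_count: int = 0
    content_hash: str = ""     # SHA-256 of the file bytes

# Assumed transition map: reprocessing re-enters 'processing' from
# 'failed' or 'completed'.
VALID_TRANSITIONS = {
    "uploaded": {"processing"},
    "processing": {"completed", "failed"},
    "failed": {"processing"},
    "completed": {"processing"},
}

def transition(doc: Document, new_status: str) -> Document:
    """Move a document to a new lifecycle state, rejecting invalid jumps."""
    if new_status not in VALID_TRANSITIONS[doc.status]:
        raise ValueError(f"cannot go from {doc.status} to {new_status}")
    doc.status = new_status
    if new_status == "completed":
        doc.processed_date = datetime.utcnow()
    return doc
```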
Upload Methods
Direct Upload via API
The primary upload endpoint accepts multipart form data with file validation.
Sources: orchestrator/api/documents.py:106-262
Key Validations:
- File Size: Maximum 50MB, enforced at orchestrator/api/documents.py:121-127
- MIME Type Detection: Uses python-magic for content-based detection (not just extension) at orchestrator/api/documents.py:130-148
- Deduplication: SHA-256 hash prevents duplicate uploads at orchestrator/api/documents.py:154-164
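These checks can be illustrated with a minimal pre-flight validator. The 50MB limit comes from the source; the function shape, error strings, and the exact supported-type set are assumptions.

```python
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB, per documents.py:121-127
SUPPORTED_TYPES = {"pdf", "docx", "markdown", "text"}  # assumed set

def validate_upload(filename: str, data: bytes, detected_type: str) -> list:
    """Return a list of validation errors; an empty list means accepted."""
    errors = []
    if len(data) > MAX_FILE_SIZE:
        errors.append(f"{filename}: exceeds 50MB limit ({len(data)} bytes)")
    if detected_type not in SUPPORTED_TYPES:
        errors.append(f"{filename}: unsupported type '{detected_type}'")
    return errors
```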
Supported File Types:
Sources: orchestrator/api/documents.py:89-104
Cloud Sync
Documents can be automatically synced from connected cloud storage providers. The CloudSyncService orchestrates folder scanning and batch processing.
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:38-48, orchestrator/modules/rag/services/cloud_file_downloader.py:60-143
For details on cloud sync implementation, see Cloud Storage Integration.
Storage Architecture
Documents use a dual-storage model: metadata in PostgreSQL for queries, files in S3 for cost-effective bulk storage.
Storage Components
Sources: orchestrator/modules/rag/ingestion/manager.py:652-677
Table: documents

| Column | Type | Description |
| --- | --- | --- |
| id | SERIAL | Primary key |
| workspace_id | TEXT | Multi-tenant isolation |
| filename | VARCHAR(255) | Original filename |
| file_type | VARCHAR(50) | Detected type (pdf, docx, etc.) |
| file_size | INTEGER | Bytes |
| upload_date | TIMESTAMP | Upload time |
| processed_date | TIMESTAMP | Processing completion time |
| status | VARCHAR(50) | uploaded/processing/completed/failed |
| chunk_count | INTEGER | Number of chunks generated |
| content_hash | VARCHAR(64) | SHA-256 for deduplication |
| file_path | TEXT | S3 key or local path (deprecated) |
Sources: orchestrator/modules/rag/ingestion/manager.py:585-601
S3 Document Upload Flow
The DocumentManager._upload_to_s3() method handles workspace-isolated storage:
Sources: orchestrator/modules/rag/ingestion/manager.py:652-677
Bucket Configuration:
- Document storage: S3_DOCUMENTS_BUCKET (default: automatos-documents)
- Vector storage: automatos-vectors-{workspace_id} (per-workspace buckets)
- Region: configured via AWS_REGION
Sources: orchestrator/modules/rag/ingestion/manager.py:420-434
Document API Endpoints
POST /api/documents/upload
Upload a document with validation and automatic processing.
Request:
Response:
Sources: orchestrator/api/documents.py:106-262
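A hypothetical success response is shown below. The field names are inferred from the Document model described on this page; this is an illustrative shape, not the verbatim API contract.

```python
import json

# Hypothetical 200 response body from POST /api/documents/upload
raw = """{
  "id": 42,
  "filename": "report.pdf",
  "file_type": "pdf",
  "file_size": 1048576,
  "status": "uploaded",
  "message": "Document accepted and queued for processing"
}"""

response = json.loads(raw)
```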
GET /api/documents/
List documents with filtering and pagination.
Query Parameters:
- skip: Pagination offset (default: 0)
- limit: Max results (default: 100, max: 1000)
- status: Filter by status (uploaded/processing/completed/failed)
- file_type: Filter by type (pdf/docx/markdown/text)
- search: Search filename and description (case-insensitive)
Example:
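For instance, listing completed PDFs that match a search term. The parameter names come from the endpoint description; the base path is taken from the heading above.

```python
from urllib.parse import urlencode

base = "/api/documents/"
params = {"skip": 0, "limit": 50, "status": "completed",
          "file_type": "pdf", "search": "invoice"}
url = base + "?" + urlencode(params)
# e.g. /api/documents/?skip=0&limit=50&status=completed&file_type=pdf&search=invoice
```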
Sources: orchestrator/api/documents.py:442-491
GET /api/documents/{document_id}
Retrieve document metadata.
Response:
Sources: orchestrator/api/documents.py:493-524
GET /api/documents/{document_id}/content
Retrieve reconstructed document content from chunks with optional highlighting.
Query Parameters:
- highlight_chunk_ids: Comma-separated chunk IDs to wrap in <mark> tags
Response:
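The reconstruction-plus-highlighting behaviour can be sketched as follows. The chunk record shape (id, index, text) is an assumption; the real endpoint reads chunks from the database.

```python
def reconstruct(chunks: list, highlight_ids: set) -> str:
    """Join chunk texts in order, wrapping highlighted chunks in <mark> tags."""
    parts = []
    for chunk in sorted(chunks, key=lambda c: c["index"]):
        text = chunk["text"]
        if chunk["id"] in highlight_ids:
            text = f"<mark>{text}</mark>"
        parts.append(text)
    return "\n".join(parts)
```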
Sources: orchestrator/api/documents.py:699-768
POST /api/documents/search
Semantic search across all document chunks using vector similarity.
Request:
Process:
1. Generate query embedding via EmbeddingManager
2. Search S3 Vectors backend with S3VectorsBackend.search()
3. Look up document metadata from PostgreSQL
4. Group results by document, rank by similarity
5. Fetch preview chunks with context window
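Grouping chunk-level hits by document and ranking documents by their best similarity score might look like this sketch (the hit shape with document_id and score keys is an assumption):

```python
from collections import defaultdict

def group_by_document(hits: list) -> list:
    """Group chunk hits by document_id; rank documents by best chunk score."""
    by_doc = defaultdict(list)
    for hit in hits:
        by_doc[hit["document_id"]].append(hit)
    ranked = [
        {"document_id": doc_id,
         "score": max(h["score"] for h in doc_hits),
         "chunks": sorted(doc_hits, key=lambda h: h["score"], reverse=True)}
        for doc_id, doc_hits in by_doc.items()
    ]
    return sorted(ranked, key=lambda d: d["score"], reverse=True)
```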
Sources: orchestrator/api/documents.py:776-895
DELETE /api/documents/{document_id}
Delete document and all associated chunks. Removes file from S3 and all database records.
Impact Analysis:
Returns:
Sources: orchestrator/api/documents.py:526-622
POST /api/documents/{document_id}/reprocess
Reprocess a failed or outdated document. Re-runs extraction and chunking with current settings.
Requirements:
- Original file must exist (in S3 or at the local path)
- Status temporarily changes to processing
Sources: orchestrator/api/documents.py:624-697
GET /api/documents/analytics
Aggregate statistics across workspace documents.
Response:
Sources: orchestrator/api/documents.py:362-440
Cloud Integration
Supported Providers
The CloudFileDownloader and CloudSyncService support multiple cloud storage providers via Composio actions.
| Provider | List Action | Download Action | Notes |
| --- | --- | --- | --- |
| Google Drive | GOOGLEDRIVE_LIST_FILES | GOOGLEDRIVE_DOWNLOAD_FILE | Uses SDK fallback for large files |
| Dropbox | DROPBOX_LIST_FILES_IN_FOLDER | DROPBOX_READ_FILE | Full content in API response |
| OneDrive | ONEDRIVE_LIST_FILES | ONEDRIVE_DOWNLOAD_FILE | Standard REST API |
| Box | BOX_LIST_FOLDER_ITEMS | BOX_DOWNLOAD_FILE | Standard REST API |
Sources: orchestrator/modules/rag/services/cloud_file_downloader.py:29-35, orchestrator/modules/rag/services/cloud_sync_service.py:30-35
Download Strategy (PRD-42)
Cloud file downloads use a layered fallback approach to handle provider-specific quirks:
Why Two Layers?
Google Drive's v3 REST API truncates inline content to ~500 bytes. The SDK saves full files to disk on the container, which the downloader then reads.
Content Extraction Priority (orchestrator/modules/rag/services/cloud_file_downloader.py:264-303):
1. URL keys (s3url, downloadUrl, webContentLink): Composio hosts the full file on R2/S3
2. Content keys (file_content_bytes, downloaded_file_content, content): inline content
3. Deep search: any large string/bytes value in the response
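The priority order above can be sketched as a single lookup function. The key names come from this page; the size threshold for the deep search is an assumption.

```python
URL_KEYS = ("s3url", "downloadUrl", "webContentLink")
CONTENT_KEYS = ("file_content_bytes", "downloaded_file_content", "content")

def extract_content_source(response: dict):
    """Return ('url', value) or ('inline', value) per the fallback priority."""
    for key in URL_KEYS:                 # 1. hosted-file URLs first
        if response.get(key):
            return ("url", response[key])
    for key in CONTENT_KEYS:             # 2. inline content keys
        if response.get(key):
            return ("inline", response[key])
    for value in response.values():      # 3. deep search for large blobs
        if isinstance(value, (str, bytes)) and len(value) > 1024:
            return ("inline", value)
    return (None, None)
```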
Sources: orchestrator/modules/rag/services/cloud_file_downloader.py:72-143
Sync Orchestration
The CloudSyncService.sync_folder() method coordinates batch syncing:
Workflow:
1. Query cloud_sync_configs for the root folder path
2. Create CloudSyncJob (status='running')
3. List all files recursively via list_files()
4. Filter by supported extensions (.pdf, .docx, .md, etc.)
5. Check existing cloud_documents records for modification timestamps
6. Download changed files in parallel (max 3 concurrent)
7. Process via DocumentManager.upload_document()
8. Update cloud_documents with sync status and chunk count
9. Mark CloudSyncJob as completed with statistics
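The bounded parallel download (max 3 concurrent) can be sketched with an asyncio semaphore; the downloader coroutine here is a stand-in for the real per-file download.

```python
import asyncio

MAX_CONCURRENT_DOWNLOADS = 3  # limit stated in the sync workflow

async def download_all(files, download_one):
    """Download files in parallel with at most 3 in flight at once."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_DOWNLOADS)

    async def bounded(f):
        async with sem:
            return await download_one(f)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(f) for f in files))
```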
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:198-402
Security & Validation
MIME Type Detection
The upload endpoint uses python-magic to detect actual file content, not just the extension:
This prevents attackers from uploading malicious files with fake extensions (e.g., malware.exe renamed to document.pdf).
Sources: orchestrator/api/documents.py:129-148
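The idea behind content-based detection can be illustrated with a few magic-byte checks. The real endpoint uses python-magic; this function is a simplified stand-in, though the signatures themselves are standard.

```python
def sniff_type(data: bytes) -> str:
    """Guess file type from leading magic bytes rather than the filename."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):   # ZIP container: docx, xlsx, ...
        return "zip-container"
    if data.startswith(b"MZ"):           # Windows executable
        return "exe"
    return "text"
```

A file named document.pdf whose bytes begin with MZ would be classified as an executable and rejected, regardless of its extension.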
Path Safety
Document download endpoints validate paths to prevent directory traversal attacks:
Sources: orchestrator/api/documents.py:274-282
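A common form of this check resolves the requested path and verifies it stays inside the storage root. This is a sketch of the technique, not the actual validation code.

```python
from pathlib import Path

def is_safe_path(storage_root: str, requested: str) -> bool:
    """Reject paths that escape the storage root via ../ segments."""
    root = Path(storage_root).resolve()
    target = (root / requested).resolve()
    return target == root or root in target.parents
```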
Workspace Isolation
All API endpoints enforce workspace context via the get_request_context_hybrid dependency, which extracts workspace_id from the X-Workspace-ID header or Clerk JWT.
Database queries filter by workspace_id:
S3 keys include workspace prefix:
Sources: orchestrator/api/documents.py:107-113, orchestrator/modules/rag/ingestion/manager.py:663
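Both halves of the isolation can be sketched together. The exact key layout and SQL below are illustrative assumptions based on the workspace-prefix scheme described here, not the verbatim implementation.

```python
def s3_key(workspace_id: str, document_id: int, filename: str) -> str:
    """Hypothetical workspace-prefixed S3 key layout."""
    return f"{workspace_id}/{document_id}/{filename}"

def list_documents_query(workspace_id: str):
    """Parameterised query: every read is scoped to the caller's workspace."""
    return ("SELECT * FROM documents WHERE workspace_id = %s", (workspace_id,))
```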
Document Processing Integration
Once uploaded, documents flow through the ingestion pipeline for text extraction and chunking. The DocumentManager._process_document() method coordinates this:
Key Classes:
- DocumentProcessor: detects file type, extracts text from PDF/DOCX/XLSX/CSV
- SemanticChunker: splits text into chunks using information-theoretic metrics
- EmbeddingManager: generates OpenAI embeddings for each chunk
- S3VectorsBackend: stores embeddings with metadata for retrieval
For implementation details, see Document Ingestion Pipeline and Semantic Chunking Strategies.
Sources: orchestrator/modules/rag/ingestion/manager.py:688-769
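The handoff can be sketched as a four-stage pipeline with the classes above reduced to stub callables; the stage signatures are assumptions, not the real interfaces.

```python
def process_document(raw: bytes, extract, chunk, embed, store) -> int:
    """Extract -> chunk -> embed -> store; returns the resulting chunk count."""
    text = extract(raw)                   # DocumentProcessor
    chunks = chunk(text)                  # SemanticChunker
    vectors = [embed(c) for c in chunks]  # EmbeddingManager
    store(chunks, vectors)                # S3VectorsBackend
    return len(chunks)
```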
Error Handling
Failed Processing
When processing fails, the document status is set to failed and the error is logged. Users can retry via the reprocess endpoint.
Common failure scenarios:
- Corrupted files: PDF extraction fails
- Unsupported content: scanned PDFs without OCR
- Empty documents: no text extracted
- Timeout: processing exceeds the time limit
Sources: orchestrator/api/documents.py:245-248
Duplicate Detection
The upload endpoint computes SHA-256 hash and checks content_hash before processing:
Sources: orchestrator/api/documents.py:154-164
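The deduplication check can be sketched as below, with an in-memory set standing in for the PostgreSQL content_hash column.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 hex digest used as the deduplication key."""
    return hashlib.sha256(data).hexdigest()

def is_duplicate(data: bytes, existing_hashes: set) -> bool:
    """True if an identical file has already been uploaded."""
    return content_hash(data) in existing_hashes
```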
Configuration
Document management behavior is controlled by environment variables and system settings:
| Variable | Default | Description |
| --- | --- | --- |
| S3_DOCUMENTS_BUCKET | automatos-documents | S3 bucket for document storage |
| S3_VECTORS_ENABLED | false | Use S3 for vector storage instead of PostgreSQL |
| AWS_REGION | us-east-1 | S3 region |
| AWS_ACCESS_KEY_ID | (required) | S3 credentials |
| AWS_SECRET_ACCESS_KEY | (required) | S3 credentials |
| POSTGRES_* | various | Database connection |
Sources: orchestrator/modules/rag/ingestion/manager.py:405-445
Usage Examples
Upload and Process Document
Sources: orchestrator/modules/rag/ingestion/manager.py:688-769
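A usage sketch, assuming an async upload_document(filename, data, workspace_id) coroutine on DocumentManager; the call signature is an assumption, and the stub class below stands in for the real manager so the flow is runnable.

```python
import asyncio

class DocumentManager:
    """Stub standing in for orchestrator's real DocumentManager."""

    async def upload_document(self, filename: str, data: bytes,
                              workspace_id: str) -> dict:
        # The real method validates, hashes, uploads to S3, records metadata
        # in PostgreSQL, then triggers _process_document() for chunking.
        return {"filename": filename, "workspace_id": workspace_id,
                "status": "uploaded", "file_size": len(data)}

async def upload_example() -> dict:
    manager = DocumentManager()
    return await manager.upload_document("report.pdf", b"%PDF-1.7 ...", "ws_123")

result = asyncio.run(upload_example())
```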
Sync Cloud Documents
Sources: orchestrator/modules/rag/services/cloud_sync_service.py:59-193
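A similarly hedged sketch for triggering a folder sync. The sync_folder name comes from this page, but its parameters and return shape here are assumptions; the stub replaces the real CloudSyncService.

```python
import asyncio

class CloudSyncService:
    """Stub standing in for the real CloudSyncService."""

    async def sync_folder(self, provider: str, workspace_id: str) -> dict:
        # The real method lists files, filters by extension, downloads
        # changed files (max 3 concurrent), processes each through
        # DocumentManager, and records a CloudSyncJob with statistics.
        return {"provider": provider, "workspace_id": workspace_id,
                "status": "completed", "files_synced": 0, "files_skipped": 0}

stats = asyncio.run(CloudSyncService().sync_folder("google_drive", "ws_123"))
```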
Summary: Document Management provides the entry point for knowledge into the RAG system, handling uploads, cloud sync, storage, and metadata tracking. Documents flow through validation, storage in S3, and handoff to the ingestion pipeline for processing. The REST API enables programmatic document management with workspace isolation and comprehensive analytics.