Document Management


Document Management provides the interface for uploading, processing, and managing documents that feed into the RAG system. It handles file uploads via REST API, validates content types, stores documents in S3, and tracks metadata in PostgreSQL. Documents can be uploaded directly or synced automatically from cloud storage providers like Google Drive and Dropbox.

Scope: This page covers document upload, storage, metadata management, and cloud sync orchestration. For details on how documents are processed and chunked, see Document Ingestion Pipeline. For chunking algorithms, see Semantic Chunking Strategies. For using documents in retrieval, see RAG Retrieval System.


Document Lifecycle

Documents move through a defined lifecycle from upload to completion. The DocumentManager and API endpoints coordinate state transitions.

Document States


Sources: orchestrator/api/documents.py:106-262, orchestrator/modules/rag/ingestion/manager.py:56-61

| State | Description | Database Field |
|-------|-------------|----------------|
| uploaded | File received, awaiting processing | status='uploaded' |
| processing | Extraction and chunking in progress | status='processing' |
| completed | Successfully processed and indexed | status='completed' |
| failed | Processing encountered an error | status='failed' |

The Document model in PostgreSQL tracks this lifecycle with fields: id, filename, file_type, file_size, upload_date, processed_date, status, chunk_count, content_hash, and workspace_id.

Sources: orchestrator/modules/rag/ingestion/manager.py:585-601
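The record described above can be sketched as a plain dataclass. Field names follow the documented schema; the types, defaults, and constructor are assumptions for illustration (the real model is a PostgreSQL-backed table):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Illustrative sketch of the documented Document record. Types and
# defaults are assumptions; only the field names come from the docs.
@dataclass
class Document:
    id: int
    workspace_id: str
    filename: str
    file_type: str
    file_size: int
    upload_date: datetime
    status: str = "uploaded"            # uploaded | processing | completed | failed
    processed_date: Optional[datetime] = None
    chunk_count: int = 0
    content_hash: str = ""              # SHA-256 hex digest

doc = Document(
    id=1, workspace_id="ws_demo", filename="report.pdf",
    file_type="pdf", file_size=2048, upload_date=datetime.now(),
)
print(doc.status)  # a new record starts in the 'uploaded' state
```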


Upload Methods

Direct Upload via API

The primary upload endpoint accepts multipart form data with file validation.


Sources: orchestrator/api/documents.py:106-262

Key validations include content-type detection (via python-magic), duplicate detection (via the SHA-256 content hash), and a supported-extension check.

Supported file types include PDF, DOCX, Markdown, and plain text.

Sources: orchestrator/api/documents.py:89-104
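A minimal sketch of the upload-time checks, assuming an extension whitelist and a size ceiling; the exact whitelist and the 50 MB limit are illustrative assumptions, not values confirmed by the source:

```python
import os

# Assumed extension whitelist and size limit for illustration only.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".md", ".txt"}
MAX_FILE_SIZE = 50 * 1024 * 1024  # hypothetical 50 MB ceiling

def validate_upload(filename: str, size: int) -> tuple[bool, str]:
    """Return (ok, reason) for an incoming upload."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported file type: {ext or '(none)'}"
    if size > MAX_FILE_SIZE:
        return False, "file too large"
    return True, "ok"

print(validate_upload("report.pdf", 2048))   # accepted
print(validate_upload("malware.exe", 2048))  # rejected by extension check
```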

Cloud Sync

Documents can be automatically synced from connected cloud storage providers. The CloudSyncService orchestrates folder scanning and batch processing.


Sources: orchestrator/modules/rag/services/cloud_sync_service.py:38-48, orchestrator/modules/rag/services/cloud_file_downloader.py:60-143

For details on cloud sync implementation, see Cloud Storage Integration.


Storage Architecture

Documents use a dual-storage model: metadata in PostgreSQL for queries, files in S3 for cost-effective bulk storage.

Storage Components


Sources: orchestrator/modules/rag/ingestion/manager.py:652-677

Table: documents

| Column | Type | Purpose |
|--------|------|---------|
| id | SERIAL | Primary key |
| workspace_id | TEXT | Multi-tenant isolation |
| filename | VARCHAR(255) | Original filename |
| file_type | VARCHAR(50) | Detected type (pdf, docx, etc.) |
| file_size | INTEGER | Size in bytes |
| upload_date | TIMESTAMP | Upload time |
| processed_date | TIMESTAMP | Processing completion time |
| status | VARCHAR(50) | uploaded / processing / completed / failed |
| chunk_count | INTEGER | Number of chunks generated |
| content_hash | VARCHAR(64) | SHA-256 for deduplication |
| file_path | TEXT | S3 key or local path (deprecated) |

Sources: orchestrator/modules/rag/ingestion/manager.py:585-601

S3 Document Upload Flow

The DocumentManager._upload_to_s3() method handles workspace-isolated storage.

Sources: orchestrator/modules/rag/ingestion/manager.py:652-677
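A sketch of workspace-isolated key construction. The docs confirm keys carry a workspace prefix; the `{workspace_id}/{hash}_{filename}` layout used here is an assumed convention, not the verified one from `_upload_to_s3()`:

```python
import hashlib

# Assumed key layout: "{workspace_id}/{short-hash}_{filename}".
# The real method may use a different scheme; only the workspace
# prefix is confirmed by the documentation.
def build_s3_key(workspace_id: str, filename: str, content: bytes) -> str:
    digest = hashlib.sha256(content).hexdigest()[:12]  # short content hash
    return f"{workspace_id}/{digest}_{filename}"

key = build_s3_key("ws_demo", "report.pdf", b"%PDF-1.4 ...")
print(key)  # e.g. ws_demo/<12 hex chars>_report.pdf
```

Keying by content hash makes re-uploads of identical bytes land on the same object, which pairs naturally with the deduplication check described later.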

Bucket Configuration:

  • Document storage: S3_DOCUMENTS_BUCKET (default: automatos-documents)

  • Vector storage: automatos-vectors-{workspace_id} (per-workspace buckets)

  • Region: Configured via AWS_REGION

Sources: orchestrator/modules/rag/ingestion/manager.py:420-434


Document API Endpoints

POST /api/documents/upload

Upload a document with validation and automatic processing.

Request:

Response:

Sources: orchestrator/api/documents.py:106-262
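The request can be built as ordinary multipart/form-data. The X-Workspace-ID header comes from the workspace-isolation docs below; the form field name "file" and everything else in this sketch are illustrative assumptions:

```python
import io
import uuid

# Hand-rolled multipart/form-data body for POST /api/documents/upload.
# Field name "file" is an assumption; X-Workspace-ID is documented.
def build_upload_request(filename: str, data: bytes):
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        (
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n"
            "\r\n"
        ).encode()
    )
    body.write(data)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "X-Workspace-ID": "ws_demo",  # workspace context header
    }
    return headers, body.getvalue()

headers, payload = build_upload_request("report.pdf", b"%PDF-1.4 ...")
print(headers["Content-Type"])
```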

GET /api/documents/

List documents with filtering and pagination.

Query Parameters:

  • skip: Pagination offset (default: 0)

  • limit: Max results (default: 100, max: 1000)

  • status: Filter by status (uploaded/processing/completed/failed)

  • file_type: Filter by type (pdf/docx/markdown/text)

  • search: Search filename and description (case-insensitive)

Example:

Sources: orchestrator/api/documents.py:442-491
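Using the documented query parameters, a request URL can be assembled with the standard library:

```python
from urllib.parse import urlencode

# Documented query parameters for GET /api/documents/
params = {"skip": 0, "limit": 25, "status": "completed", "search": "quarterly"}
url = "/api/documents/?" + urlencode(params)
print(url)
```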

GET /api/documents/{document_id}

Retrieve document metadata.

Response:

Sources: orchestrator/api/documents.py:493-524

GET /api/documents/{document_id}/content

Retrieve reconstructed document content from chunks with optional highlighting.

Query Parameters:

  • highlight_chunk_ids: Comma-separated chunk IDs to wrap in <mark> tags

Response:

Sources: orchestrator/api/documents.py:699-768
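The highlighting behavior can be sketched as follows; the chunk dictionary shape is an assumption, but the `<mark>` wrapping matches the documented parameter:

```python
# Sketch of content reconstruction with optional highlighting: chunks
# whose IDs appear in highlight_chunk_ids are wrapped in <mark> tags.
# The chunk structure here is an assumed shape for illustration.
def reconstruct(chunks: list[dict], highlight_ids: set[int]) -> str:
    parts = []
    for chunk in chunks:
        text = chunk["text"]
        if chunk["id"] in highlight_ids:
            text = f"<mark>{text}</mark>"
        parts.append(text)
    return " ".join(parts)

chunks = [{"id": 1, "text": "Alpha."}, {"id": 2, "text": "Beta."}]
print(reconstruct(chunks, {2}))  # Alpha. <mark>Beta.</mark>
```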

POST /api/documents/search

Semantic search across all document chunks using vector similarity.

Request:

Process:

  1. Generate query embedding via EmbeddingManager

  2. Search S3 Vectors backend with S3VectorsBackend.search()

  3. Look up document metadata from PostgreSQL

  4. Group results by document, rank by similarity

  5. Fetch preview chunks with context window

Sources: orchestrator/api/documents.py:776-895
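Steps 3-4 above can be sketched as a group-and-rank pass over chunk-level hits; the hit structure and "rank by best chunk score" choice are assumptions for illustration:

```python
from collections import defaultdict

# Group chunk hits by document, rank documents by their best similarity.
# Hit shape ({"document_id", "score"}) is an assumed structure.
def group_and_rank(hits: list[dict]) -> list[dict]:
    by_doc: dict[int, list[dict]] = defaultdict(list)
    for hit in hits:
        by_doc[hit["document_id"]].append(hit)
    grouped = [
        {
            "document_id": doc_id,
            "best_score": max(h["score"] for h in doc_hits),
            "chunks": sorted(doc_hits, key=lambda h: h["score"], reverse=True),
        }
        for doc_id, doc_hits in by_doc.items()
    ]
    return sorted(grouped, key=lambda g: g["best_score"], reverse=True)

hits = [
    {"document_id": 7, "score": 0.82},
    {"document_id": 3, "score": 0.91},
    {"document_id": 7, "score": 0.67},
]
ranked = group_and_rank(hits)
print([g["document_id"] for g in ranked])  # [3, 7]
```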

DELETE /api/documents/{document_id}

Delete document and all associated chunks. Removes file from S3 and all database records.

Impact Analysis:

Returns:

Sources: orchestrator/api/documents.py:526-622

POST /api/documents/{document_id}/reprocess

Reprocess a failed or outdated document. Re-runs extraction and chunking with current settings.

Requirements:

  • Original file must exist (either in S3 or local path)

  • Status will temporarily change to processing

Sources: orchestrator/api/documents.py:624-697

GET /api/documents/analytics

Aggregate statistics across workspace documents.

Response:

Sources: orchestrator/api/documents.py:362-440


Cloud Integration

Supported Providers

The CloudFileDownloader and CloudSyncService support multiple cloud storage providers via Composio actions.

| Provider | List Action | Download Action | Notes |
|----------|-------------|-----------------|-------|
| Google Drive | GOOGLEDRIVE_LIST_FILES | GOOGLEDRIVE_DOWNLOAD_FILE | Uses SDK fallback for large files |
| Dropbox | DROPBOX_LIST_FILES_IN_FOLDER | DROPBOX_READ_FILE | Full content in API response |
| OneDrive | ONEDRIVE_LIST_FILES | ONEDRIVE_DOWNLOAD_FILE | Standard REST API |
| Box | BOX_LIST_FOLDER_ITEMS | BOX_DOWNLOAD_FILE | Standard REST API |

Sources: orchestrator/modules/rag/services/cloud_file_downloader.py:29-35, orchestrator/modules/rag/services/cloud_sync_service.py:30-35

Download Strategy (PRD-42)

Cloud file downloads use a layered fallback approach to handle provider-specific quirks.

Why Two Layers?

Google Drive's v3 REST API truncates inline content to ~500 bytes. The SDK saves full files to disk on the container, which the downloader then reads.

Content Extraction Priority (orchestrator/modules/rag/services/cloud_file_downloader.py:264-303):

  1. URL keys (s3url, downloadUrl, webContentLink) — Composio hosts full file on R2/S3

  2. Content keys (file_content_bytes, downloaded_file_content, content) — inline content

  3. Deep search — any large string/bytes value in response

Sources: orchestrator/modules/rag/services/cloud_file_downloader.py:72-143
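The three-tier priority above can be sketched as a simple fallback chain; the key names come from the list above, while the response shape and the 1 KB "large value" threshold are assumptions:

```python
# Sketch of the documented extraction priority: URL keys first, then
# inline content keys, then a deep search for any large string/bytes.
URL_KEYS = ("s3url", "downloadUrl", "webContentLink")
CONTENT_KEYS = ("file_content_bytes", "downloaded_file_content", "content")

def extract_content(response: dict, min_size: int = 1024):
    """Return (kind, value) for the first matching tier, else (None, None)."""
    for key in URL_KEYS:
        if response.get(key):
            return ("url", response[key])
    for key in CONTENT_KEYS:
        if response.get(key):
            return ("inline", response[key])
    for value in response.values():  # deep search fallback
        if isinstance(value, (str, bytes)) and len(value) >= min_size:
            return ("deep", value)
    return (None, None)

print(extract_content({"downloadUrl": "https://example.com/f.pdf"}))
```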

Sync Orchestration

The CloudSyncService.sync_folder() method coordinates batch syncing:

Workflow:

  1. Query cloud_sync_configs for root folder path

  2. Create CloudSyncJob (status='running')

  3. List all files recursively via list_files()

  4. Filter by supported extensions (.pdf, .docx, .md, etc.)

  5. Check existing cloud_documents records for modification timestamps

  6. Download changed files in parallel (max 3 concurrent)

  7. Process via DocumentManager.upload_document()

  8. Update cloud_documents with sync status and chunk count

  9. Mark CloudSyncJob as completed with statistics

Sources: orchestrator/modules/rag/services/cloud_sync_service.py:198-402
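Step 6 above (parallel downloads capped at three) can be sketched with an asyncio semaphore; the download coroutine here is a stand-in for the real downloader:

```python
import asyncio

# Sketch of concurrency-capped downloads: a semaphore limits work to
# three simultaneous tasks. download() stands in for real network I/O.
async def sync_files(names: list[str]):
    sem = asyncio.Semaphore(3)
    active = peak = 0

    async def download(name: str) -> str:
        nonlocal active, peak
        async with sem:                 # at most 3 downloads run at once
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # placeholder for the real transfer
            active -= 1
            return name

    results = await asyncio.gather(*(download(n) for n in names))
    return list(results), peak

files, peak = asyncio.run(sync_files([f"doc{i}.pdf" for i in range(8)]))
print(f"synced {len(files)} files, peak concurrency {peak}")
```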


Security & Validation

MIME Type Detection

The upload endpoint uses python-magic to detect the actual content type of the uploaded bytes rather than trusting the file extension.

This prevents attackers from uploading malicious files with fake extensions (e.g., malware.exe renamed to document.pdf).

Sources: orchestrator/api/documents.py:129-148
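The service uses python-magic for this; the stdlib-only sketch below is a simplified stand-in that checks leading magic bytes directly, just to show why the extension alone cannot be trusted:

```python
# Simplified stand-in for python-magic: inspect leading magic bytes.
# The signature table is a small illustrative subset.
MAGIC_SIGNATURES = {
    b"%PDF": "application/pdf",
    b"PK\x03\x04": "application/zip",  # DOCX/XLSX are ZIP containers
    b"MZ": "application/x-dosexec",    # Windows executable
}

def sniff_mime(data: bytes) -> str:
    for signature, mime in MAGIC_SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return "application/octet-stream"

# An .exe renamed to .pdf is still detected as an executable:
print(sniff_mime(b"MZ\x90\x00..."))  # application/x-dosexec
```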

Path Safety

Document download endpoints validate paths to prevent directory traversal attacks.

Sources: orchestrator/api/documents.py:274-282
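A minimal sketch of such a guard: resolve the requested path and confirm it stays inside the allowed base directory. The base directory here is an assumed example:

```python
import os

# Traversal guard: the resolved target must remain under base_dir.
# "/var/documents" is an assumed base path for illustration.
def is_safe_path(base_dir: str, user_path: str) -> bool:
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    return os.path.commonpath([base, target]) == base

print(is_safe_path("/var/documents", "report.pdf"))        # True
print(is_safe_path("/var/documents", "../../etc/passwd"))  # False
```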

Workspace Isolation

All API endpoints enforce workspace context via the get_request_context_hybrid dependency, which extracts workspace_id from the X-Workspace-ID header or Clerk JWT.

Database queries filter by workspace_id, and S3 object keys are prefixed with the workspace ID.

Sources: orchestrator/api/documents.py:107-113, orchestrator/modules/rag/ingestion/manager.py:663
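The query-side filtering can be demonstrated with an in-memory SQLite stand-in for the PostgreSQL documents table (the schema here is trimmed to three columns for brevity):

```python
import sqlite3

# In-memory stand-in for the documents table, to show the
# workspace_id filter that every query applies.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER, workspace_id TEXT, filename TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, "ws_a", "a.pdf"), (2, "ws_b", "b.pdf"), (3, "ws_a", "c.pdf")],
)

rows = conn.execute(
    "SELECT filename FROM documents WHERE workspace_id = ? ORDER BY id",
    ("ws_a",),
).fetchall()
print(rows)  # only ws_a documents are visible
```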


Document Processing Integration

Once uploaded, documents flow through the ingestion pipeline for text extraction and chunking. The DocumentManager._process_document() method coordinates this.

Key Classes:

  • DocumentProcessor: Detects file type, extracts text from PDF/DOCX/XLSX/CSV

  • SemanticChunker: Splits text into chunks using information-theoretic metrics

  • EmbeddingManager: Generates OpenAI embeddings for each chunk

  • S3VectorsBackend: Stores embeddings with metadata for retrieval

For implementation details, see Document Ingestion Pipeline and Semantic Chunking Strategies.

Sources: orchestrator/modules/rag/ingestion/manager.py:688-769


Error Handling

Failed Processing

When processing fails, the document status is set to failed and the error is logged. Users can retry via the reprocess endpoint.

Common failure scenarios:

  • Corrupted files: PDF extraction fails

  • Unsupported content: Scanned PDFs without OCR

  • Empty documents: No text extracted

  • Timeout: Processing exceeds time limit

Sources: orchestrator/api/documents.py:245-248

Duplicate Detection

The upload endpoint computes the file's SHA-256 hash and checks it against existing content_hash values before processing.

Sources: orchestrator/api/documents.py:154-164
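The check can be sketched as follows; the in-memory set stands in for the real content_hash lookup in PostgreSQL:

```python
import hashlib

# Sketch of duplicate detection: hash the upload and compare against
# known content hashes. The set is a stand-in for a database lookup.
seen_hashes: set[str] = set()

def is_duplicate(content: bytes) -> tuple[bool, str]:
    digest = hashlib.sha256(content).hexdigest()
    return digest in seen_hashes, digest

dup, digest = is_duplicate(b"report body")   # first upload: not a duplicate
seen_hashes.add(digest)                      # record it, as the DB would
print(is_duplicate(b"report body")[0])       # second upload: True
```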


Configuration

Document management behavior is controlled by environment variables and system settings:

| Variable | Default | Purpose |
|----------|---------|---------|
| S3_DOCUMENTS_BUCKET | automatos-documents | S3 bucket for document storage |
| S3_VECTORS_ENABLED | false | Use S3 for vector storage instead of PostgreSQL |
| AWS_REGION | us-east-1 | S3 region |
| AWS_ACCESS_KEY_ID | (required) | S3 credentials |
| AWS_SECRET_ACCESS_KEY | (required) | S3 credentials |
| POSTGRES_* | various | Database connection settings |

Sources: orchestrator/modules/rag/ingestion/manager.py:405-445


Usage Examples

Upload and Process Document

Sources: orchestrator/modules/rag/ingestion/manager.py:688-769

Sync Cloud Documents

Sources: orchestrator/modules/rag/services/cloud_sync_service.py:59-193


Summary: Document Management provides the entry point for knowledge into the RAG system, handling uploads, cloud sync, storage, and metadata tracking. Documents flow through validation, storage in S3, and handoff to the ingestion pipeline for processing. The REST API enables programmatic document management with workspace isolation and comprehensive analytics.

