Documents API Reference
Purpose and Scope
This page documents the REST API endpoints for document management in Automatos AI. The Documents API provides capabilities for uploading, processing, searching, and managing documents across the platform's knowledge base. Documents are ingested through a multi-stage pipeline that extracts text, generates semantic chunks, creates embeddings, and stores vectors for RAG retrieval.
For information about the RAG retrieval system that queries these documents, see RAG Retrieval System. For details on semantic chunking strategies used during ingestion, see Semantic Chunking Strategies. For cloud storage integration (Google Drive, Dropbox, etc.), see Cloud Storage Integration.
Sources: orchestrator/api/documents.py:1-50, orchestrator/main.py:36-42
API Endpoint Overview
The Documents API is mounted at /api/documents and provides the following endpoint groups:
| Group | Description | Authentication |
|---|---|---|
| Upload | Upload and process documents | Required (Clerk JWT + Workspace) |
| Retrieval | Download and view documents | Required |
| Search | Semantic and keyword search | Required |
| Analytics | Document statistics and metrics | Required |
| Management | Delete, reprocess, list | Required |
| Cloud Sync | Sync from Google Drive, Dropbox, etc. | Required |
All endpoints require workspace isolation via the X-Workspace-ID header, which is automatically injected by the get_request_context_hybrid dependency.
Sources: orchestrator/api/documents.py:29-30, orchestrator/core/auth/hybrid.py:1-50
Document Upload Flow
Sources: orchestrator/api/documents.py:106-261, orchestrator/modules/rag/ingestion/manager.py:402-600
Upload Endpoint
POST /api/documents/upload
Uploads a document, validates file type, checks for duplicates, and triggers asynchronous processing.
Parameters:
| Name | Type | In | Required | Description |
|---|---|---|---|---|
| file | UploadFile | Form | Yes | Document file (PDF, DOCX, TXT, MD, CSV, XLSX, JSON) |
| description | string | Form | No | Human-readable description |
| tags | string | Form | No | Comma-separated tags (temporarily disabled due to SQLAlchemy bug) |
Request Example:
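The request example from the original page is not reproduced here. As a stand-in, the sketch below builds the multipart upload request using only the standard library; the URL, token, and workspace ID are placeholders, not real values.

```python
# Hypothetical client sketch: builds the multipart/form-data upload request
# by hand with the standard library. URL, token, and workspace ID are
# placeholders; the real API expects a Clerk JWT and an X-Workspace-ID header.
import urllib.request
import uuid

def build_upload_request(url, token, workspace_id, filename, content, description=""):
    boundary = uuid.uuid4().hex
    parts = [
        # "file" form field carrying the document bytes
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
         ).encode() + content + b"\r\n",
        # optional "description" form field
        (f'--{boundary}\r\nContent-Disposition: form-data; name="description"\r\n\r\n'
         f"{description}\r\n").encode(),
    ]
    body = b"".join(parts) + f"--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "X-Workspace-ID": workspace_id,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )

req = build_upload_request(
    "https://example.invalid/api/documents/upload",
    "<clerk-jwt>", "ws-123", "notes.md", b"# Notes\n", "meeting notes",
)
```

Calling `urllib.request.urlopen(req)` would send the request; the sketch stops at construction so it is inspectable offline.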
Security Validations:
- MIME Type Detection: Uses `python-magic` to detect actual content type (not just extension)
- Extension Matching: Ensures file extension matches detected MIME type
- Allowlist Enforcement: Only permits predefined MIME types
- Size Limit: Maximum 50MB per file
- Duplicate Detection: SHA256 hash check prevents redundant storage
The allowlist is defined in ALLOWED_MIME_TYPES dictionary, which maps MIME types to valid extensions:
| MIME Type | Allowed Extensions |
|---|---|
| application/pdf | .pdf |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | .docx |
| text/plain | .txt, .md, .csv |
| text/markdown | .md |
| text/csv | .csv |
| application/json | .json |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | .xlsx |
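The combined allowlist and extension-matching checks can be sketched as follows. The dictionary mirrors the table above; the validation function name is illustrative, not the real implementation in `orchestrator/api/documents.py`.

```python
# Illustrative sketch of the allowlist + extension-match check. The real
# ALLOWED_MIME_TYPES dictionary lives in orchestrator/api/documents.py.
from pathlib import Path

ALLOWED_MIME_TYPES = {
    "application/pdf": {".pdf"},
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": {".docx"},
    "text/plain": {".txt", ".md", ".csv"},
    "text/markdown": {".md"},
    "text/csv": {".csv"},
    "application/json": {".json"},
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": {".xlsx"},
}

def validate_upload(filename: str, detected_mime: str) -> bool:
    """Allow only files whose *detected* MIME type is allowlisted AND whose
    extension matches that MIME type (extension alone is never trusted)."""
    ext = Path(filename).suffix.lower()
    return ext in ALLOWED_MIME_TYPES.get(detected_mime, set())
```

Because the MIME type comes from content sniffing (`python-magic`), a renamed executable fails the check even with a `.pdf` extension.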
Response:
Processing Flow:
1. Client uploads file → API validates MIME type
2. Hash calculated (SHA256) → Check for duplicates in DB
3. File saved to `/tmp/automotas_uploads/{uuid}.ext`
4. Document record created in `documents` table with `status=processing`
5. Background task triggers `DocumentManager._process_document()`
6. Text extracted → Semantic chunking → Embeddings generated
7. Chunks stored in `document_chunks` table
8. Vectors stored in S3 Vectors or pgvector
9. Document status updated to `processed` with `chunk_count`
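Step 2 (hash-based duplicate detection) amounts to the following sketch, where `seen_hashes` stands in for the database lookup against the `content_hash` column:

```python
# Sketch of SHA256 duplicate detection. `seen_hashes` stands in for the
# query against the documents table's content_hash column.
import hashlib

def is_duplicate(content: bytes, seen_hashes: set) -> tuple:
    digest = hashlib.sha256(content).hexdigest()  # 64 hex chars
    return digest in seen_hashes, digest

seen = set()
dup, h = is_duplicate(b"hello world", seen)   # first upload: not a duplicate
seen.add(h)                                   # persisted on successful upload
dup_again, _ = is_duplicate(b"hello world", seen)  # same bytes: duplicate
```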
Sources: orchestrator/api/documents.py:106-261, orchestrator/api/documents.py:88-104
Document Processing Pipeline
Text Extraction Methods:
- PDF: Primary extraction with `pdfplumber`, fallback to `PyPDF2` for difficult PDFs
- DOCX: `python-docx` paragraph extraction
- XLSX: `openpyxl` → Markdown tables for LLM consumption
- CSV: Standard `csv` module → Markdown tables
- TXT/MD/JSON: Direct UTF-8 read with `latin-1` fallback
Chunking Strategy:
The system uses SemanticChunker from modules/rag/chunking/semantic_chunker.py:52-91 with TOPIC_COHERENCE strategy by default:
- Target chunk size: 500 characters
- Min/Max bounds: 100-1500 characters
- Overlap ratio: 10%
- Post-processing: Filters out separators, code fences, and chunks <50 chars
- Metrics preserved: Entropy, topic coherence, semantic density, importance score
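The post-processing step can be approximated as a simple filter over the raw chunks. This is a simplified stand-in for the real logic in `SemanticChunker`; the function name and exact heuristics are assumptions.

```python
# Hypothetical post-filter matching the rules above: drop separator-only
# chunks, stray code-fence chunks, and anything under 50 characters.
def filter_chunks(chunks, min_chars=50):
    kept = []
    for chunk in chunks:
        text = chunk.strip()
        if len(text) < min_chars:
            continue  # too short to carry retrievable meaning
        if set(text) <= {"-", "=", "*", "_", " "}:
            continue  # separator-only chunk (e.g. a horizontal rule)
        if text.startswith("```"):
            continue  # stray code-fence chunk (simplified heuristic)
        kept.append(text)
    return kept
```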
Sources: orchestrator/modules/rag/ingestion/manager.py:113-194, orchestrator/modules/rag/ingestion/manager.py:289-400
Document Retrieval Endpoints
GET /api/documents/
List documents with filtering and pagination.
Query Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| skip | integer | 0 | Number of records to skip |
| limit | integer | 100 | Max records to return (1-1000) |
| status | string | null | Filter by status: uploaded, processing, processed, failed |
| file_type | string | null | Filter by type: pdf, docx, markdown, text, json |
| search | string | null | Search in filename or description (case-insensitive) |
Response:
Sources: orchestrator/api/documents.py:442-491
GET /api/documents/{document_id}
Get detailed information about a single document.
Path Parameters:
| Name | Type | Description |
|---|---|---|
| document_id | integer | Document ID |
Response:
Sources: orchestrator/api/documents.py:493-531
GET /api/documents/download
Download a document file from storage.
Query Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Full path to document (e.g., /var/automatos/documents/filename.md) |
Security:
- Path safety validation using `pathlib.resolve()`
- Restricted to the `/var/automatos/documents/` directory
- Prevents directory traversal attacks (`..`, symlinks)
Response: Binary file with appropriate Content-Type header
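The traversal guard described above can be sketched with `pathlib`. The function name is hypothetical; the root directory comes from the parameter table.

```python
# Sketch of the directory-traversal guard: resolve() collapses ".." and
# follows symlinks, then we require the result to stay under the root.
from pathlib import Path

DOCUMENTS_ROOT = Path("/var/automatos/documents")

def is_safe_path(requested: str, root: Path = DOCUMENTS_ROOT) -> bool:
    root = root.resolve()                 # normalize the root the same way
    resolved = Path(requested).resolve()  # collapse "..", follow symlinks
    return resolved == root or root in resolved.parents
```

A request for `/var/automatos/documents/../../etc/passwd` resolves to a path outside the root and is rejected before any file I/O happens.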
Sources: orchestrator/api/documents.py:263-314
GET /api/documents/content
Get document content as text (for artifact viewer).
Query Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Full path to document |
Response:
Sources: orchestrator/api/documents.py:316-358
Document Search Endpoints
GET /api/documents/search
Semantic search across all documents in workspace.
Query Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| query | string | Required | Search query |
| limit | integer | 5 | Max results to return |
| min_similarity | float | 0.5 | Minimum cosine similarity threshold |
Search Flow:
Response:
The search uses RAGService.retrieve(), which implements:
- Query Enhancement: HyDE (Hypothetical Document Embeddings) + query decomposition
- Multi-query Retrieval: Searches with multiple query variations
- RRF Fusion: Reciprocal Rank Fusion to merge results (k=60)
- Optional Reranking: Cohere cross-encoder for precision
- Context Optimization: 0/1 knapsack DP algorithm for token budget
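Reciprocal Rank Fusion with k=60 can be sketched in a few lines; each ranked list contributes 1/(k + rank) per document, and the summed scores decide the merged order. The function name is illustrative.

```python
# Sketch of Reciprocal Rank Fusion (k=60): each ranked list contributes
# 1 / (k + rank) per document; summed scores decide the merged ordering.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

results_vector = ["a", "b", "c"]   # ranking from one query variation
results_keyword = ["b", "c", "a"]  # ranking from another variation
merged = rrf_fuse([results_vector, results_keyword])
```

A document that appears near the top of several lists outranks one that is first in only a single list, which is why RRF is a robust default for merging multi-query results.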
Sources: orchestrator/api/documents.py:533-610, orchestrator/modules/rag/service.py:210-294
Analytics Endpoints
GET /api/documents/analytics
Get comprehensive document analytics for the workspace.
Response:
Metrics Calculated:
- Total documents count by status
- Total storage used (bytes and MB)
- File type distribution
- Chunk statistics (total, average per document)
- Recent activity (last 24 hours)
- Processing success rate (processed / (processed + failed) * 100)
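The success-rate formula is straightforward, but the zero-denominator case (no documents finished yet) needs a guard. The function name here is hypothetical:

```python
# Success rate = processed / (processed + failed) * 100, guarding against
# division by zero when nothing has finished processing yet.
def processing_success_rate(processed: int, failed: int) -> float:
    total = processed + failed
    return 0.0 if total == 0 else processed / total * 100
```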
Sources: orchestrator/api/documents.py:362-440
Document Management Endpoints
DELETE /api/documents/{document_id}
Delete a document and all associated chunks/embeddings.
Deletion Flow:
Response: 204 No Content on success
Sources: orchestrator/api/documents.py:612-651
POST /api/documents/{document_id}/reprocess
Reprocess a document (re-extract text, re-chunk, re-embed).
Use Cases:
- Document failed processing initially
- Chunking strategy changed
- Embedding model upgraded
- Document metadata needs updating
Process:
1. Fetch document record from DB
2. Verify file still exists at `file_path`
3. Delete existing chunks and vectors
4. Trigger `DocumentManager._process_document()` again
5. Update status: `processing` → `processed`
Response:
Sources: orchestrator/api/documents.py:653-710
Cloud Sync Integration
Cloud Document Sync Flow
The cloud sync system uses CloudFileDownloader, which implements a three-tier download strategy:
1. Composio v3 REST API (primary)
2. Composio Python SDK (fallback for Google Drive truncation)
3. Direct URL download (when the API returns S3 presigned URLs)
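The three tiers amount to a fallback chain: try each downloader in order, moving on when one raises or returns nothing. This sketch uses plain callables standing in for the real Composio REST client, Python SDK, and presigned-URL fetch.

```python
# Sketch of the three-tier download chain. The callables stand in for the
# real Composio REST client, Python SDK, and a presigned-URL fetch.
def download_with_fallback(tiers):
    for tier in tiers:
        try:
            content = tier()
        except Exception:
            continue          # tier failed outright; try the next one
        if content:           # empty/truncated results fall through too
            return content
    return None

def rest_api():  # tier 1: fails in this example
    raise ConnectionError("truncated response")

def sdk():       # tier 2: succeeds with the full file
    return b"full file content"

result = download_with_fallback([rest_api, sdk])
```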
Supported Providers:
| Provider | Composio Action | Download Method |
|---|---|---|
| Google Drive | GOOGLEDRIVE_DOWNLOAD_FILE | REST → SDK fallback (truncation fix) |
| Dropbox | DROPBOX_READ_FILE | REST API (full content) |
| OneDrive | ONEDRIVE_DOWNLOAD_FILE | REST API |
| Box | BOX_DOWNLOAD_FILE | REST API |
Cloud Sync Tables:
- `cloud_documents`: Tracks synced files (external_file_id, sync_status, chunk_count)
- `cloud_sync_jobs`: Sync job history (connection_id, status, files_synced)
- `cloud_sync_config`: Per-connection sync settings (root_folder, schedule)
Sources: orchestrator/modules/rag/services/cloud_file_downloader.py:60-143, orchestrator/modules/rag/services/cloud_sync_service.py:38-48
Storage Architecture
Storage Decision Logic:
The DocumentManager constructor accepts a use_s3_vectors parameter:
- `True`: Store embeddings in S3 Vectors (separate bucket per workspace)
- `False`: Store embeddings in the PostgreSQL `pgvector` extension
The choice is controlled by S3_VECTORS_ENABLED environment variable and passed through the get_document_manager() factory function.
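That selection can be sketched as a small env-var check; the function below is a simplified stand-in for what `get_document_manager()` does with `S3_VECTORS_ENABLED`, not the real factory.

```python
# Simplified stand-in for the env-var check inside get_document_manager():
# S3_VECTORS_ENABLED decides between S3 Vectors and pgvector storage.
import os

def use_s3_vectors_from_env(env=None) -> bool:
    env = os.environ if env is None else env
    return env.get("S3_VECTORS_ENABLED", "false").lower() == "true"
```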
Workspace Isolation:
All document operations are scoped to workspace_id:
- Database queries filter by the `workspace_id` column
- S3 paths use the prefix `workspaces/{workspace_id}/`
- Vector buckets: `automatos-vectors-{workspace_id}`
Sources: orchestrator/api/documents.py:76-86, orchestrator/modules/rag/ingestion/manager.py:402-420
Configuration Reference
Environment Variables
| Variable | Default | Description |
|---|---|---|
| S3_VECTORS_ENABLED | false | Use S3 for vector storage (vs pgvector) |
| S3_DOCUMENTS_BUCKET | automatos-ai | S3 bucket for document files |
| S3_VECTORS_BUCKET | None | S3 bucket for embeddings (if enabled) |
| S3_VECTORS_DIMENSION | 2048 | Embedding vector dimension |
| S3_VECTORS_METRIC | cosine | Distance metric for similarity |
| DOCUMENT_STORAGE_DIR | documents | Local filesystem directory |
| COMPOSIO_API_KEY | None | Composio API key for cloud sync |
| AWS_ACCESS_KEY_ID | None | AWS credentials for S3 |
| AWS_SECRET_ACCESS_KEY | None | AWS credentials for S3 |
| AWS_REGION | us-east-1 | AWS region |
Database Configuration
The documents API uses a db_config dictionary constructed from environment variables:
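The original code listing is not reproduced here; a plausible construction is sketched below. The key names and defaults are assumptions, the real dictionary is built in `orchestrator/api/documents.py`.

```python
# Assumed shape of db_config; key names and defaults are illustrative,
# not copied from orchestrator/api/documents.py.
import os

def build_db_config(env=None) -> dict:
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "database": env.get("DB_NAME", "automatos"),
        "user": env.get("DB_USER", "postgres"),
        "password": env.get("DB_PASSWORD", ""),
    }
```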
This config is passed to DocumentManager for direct database access (bypasses SQLAlchemy for performance).
Sources: orchestrator/api/documents.py:38-74, orchestrator/config.py:245-260
Error Handling
Upload Errors
| Error Message | Status | Cause |
|---|---|---|
| File too large (max 50MB) | 400 | File exceeds 50MB limit |
| File type not allowed | 400 | MIME type not in allowlist |
| File extension does not match detected content type | 400 | Extension mismatch (security) |
| Document already exists | 200 (duplicate) | SHA256 hash matches an existing document |
| Internal server error | 500 | Processing failure |
Processing Errors
When document processing fails:
- Status set to `failed` in database
- Error logged with document ID
- Reprocessing endpoint available for retry
Graceful Degradation:
- If Redis is unavailable → processing continues (no cache)
- If S3 is unavailable → falls back to local filesystem
- If embedding fails → document stored without vectors (searchable by metadata)
Sources: orchestrator/api/documents.py:257-261, orchestrator/modules/rag/ingestion/manager.py:582-600
Integration with Other Systems
RAG Service Integration
Documents are consumed by RAGService for semantic search:
The RAG service:
1. Enhances query with HyDE + decomposition
2. Searches document chunks by vector similarity
3. Re-ranks results with Cohere
4. Optimizes context with 0/1 knapsack algorithm
5. Returns formatted chunks with source attribution
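Step 4 (context optimization under a token budget) is a classic 0/1 knapsack problem: pick the subset of chunks that maximizes total relevance without exceeding the budget. A minimal DP sketch, with assumed inputs of (token_cost, relevance_score) pairs:

```python
# Sketch of the 0/1 knapsack context optimizer: choose chunks maximizing
# total relevance score without exceeding the token budget.
def select_chunks(chunks, budget):
    """chunks: list of (token_cost, relevance_score); returns chosen indices."""
    # dp[b] = (best_score, chosen_indices) achievable within budget b
    dp = [(0.0, [])] * (budget + 1)
    for i, (cost, score) in enumerate(chunks):
        new_dp = dp[:]  # copy so each chunk is used at most once (0/1)
        for b in range(cost, budget + 1):
            cand = dp[b - cost][0] + score
            if cand > new_dp[b][0]:
                new_dp[b] = (cand, dp[b - cost][1] + [i])
        dp = new_dp
    return dp[budget][1]
```

With chunks costing 300, 500, and 400 tokens (scores 0.9, 0.8, 0.7) and an 800-token budget, the optimizer picks the first two, the highest-scoring combination that fits.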
Agent Tool Integration
Agents can search documents via AgentPlatformTools:
This provides agents with access to workspace knowledge during execution.
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:56-95, orchestrator/modules/rag/service.py:142-294
Database Schema
documents Table
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY | Unique document ID |
| workspace_id | UUID | NOT NULL, FK | Workspace isolation |
| filename | VARCHAR(255) | NOT NULL | Display filename |
| original_filename | VARCHAR(255) | - | Original upload name |
| file_type | VARCHAR(50) | NOT NULL | Type: pdf, docx, etc. |
| file_size | BIGINT | NOT NULL | Size in bytes |
| file_path | TEXT | NOT NULL | Filesystem path |
| content_hash | VARCHAR(64) | NOT NULL | SHA256 hash |
| status | VARCHAR(50) | NOT NULL | Status: uploaded, processing, processed, failed |
| chunk_count | INTEGER | DEFAULT 0 | Number of chunks created |
| description | TEXT | - | User description |
| upload_date | TIMESTAMP | DEFAULT NOW() | Upload timestamp |
| processed_date | TIMESTAMP | - | Processing completion |
| created_by | VARCHAR(100) | - | User identifier |
document_chunks Table
| Column | Type | Constraints | Description |
|---|---|---|---|
| id | INTEGER | PRIMARY KEY | Unique chunk ID |
| document_id | INTEGER | NOT NULL, FK | Parent document |
| chunk_index | INTEGER | NOT NULL | Position in document |
| content | TEXT | NOT NULL | Chunk text content |
| metadata | JSONB | - | Entropy, coherence, etc. |
| parent_content | TEXT | - | Parent section context |
| headers | JSONB | - | Header hierarchy (h1, h2, h3) |
| embedding | VECTOR(2048) | - | Vector embedding (if pgvector) |
Indexes:
- `idx_documents_workspace_id` on `workspace_id`
- `idx_documents_status` on `status`
- `idx_documents_hash` on `content_hash`
- `idx_chunks_document_id` on `document_id`
Sources: orchestrator/core/models/core.py:1-100 (Document model), orchestrator/modules/rag/ingestion/manager.py:93-111 (DocumentChunk dataclass)