PRD-42: Cloud Document Sync with S3 Vectors
Version: 1.0
Status: 🟡 Planning
Date: 2026-01-31
Author: System Architecture Team
Prerequisites: PRD-36 (Composio Integration), PRD-11 (Document Management)
Blocks: None
Executive Summary
Transform the Automatos AI Platform document management system from a local upload-based model to a hybrid cloud-native architecture where users connect their existing cloud storage providers (Google Drive, Dropbox, OneDrive, Box, SharePoint, S3) via Composio-managed OAuth. Instead of storing duplicate copies of user documents, the system will:
Sync document metadata from cloud storage providers
Download, chunk, and embed documents using existing semantic chunking pipeline
Store only vector embeddings in AWS S3 Vectors (not original files)
Enable RAG queries across all connected storage providers
Maintain workspace isolation and GDPR compliance
Key Benefits:
✅ 90-95% cost reduction - pay only for vector storage (~$6 per 100GB of vectors vs $730+ for 10TB of full document storage)
✅ Zero document liability - users keep documents in their own cloud, we never store originals
✅ GDPR/compliance friendly - data stays in user's control
✅ OAuth handled by Composio - no custom authentication code needed
✅ Unified RAG - query across Google Drive, Dropbox, etc. from single interface
✅ Scales to 2 billion vectors per index with S3 Vectors
Target Users: Single users and small businesses (1-10 people) with 1-2 cloud storage providers
Problem Statement
Current System Limitations
The existing document management system (PRD-11) has several constraints that limit scalability and increase operational costs:
Storage Costs:
System stores full document copies locally (filesystem or S3)
Vector embeddings stored separately in PostgreSQL with pgvector
For 1000 users with 10GB documents each: ~$730-1230/month storage costs
Duplicate storage: original file + chunks + vectors
User Experience:
Users must manually upload documents to Automatos
No sync with existing document repositories (Google Drive, Dropbox, etc.)
Documents live in isolated system, not user's primary workspace
No automatic updates when cloud documents change
Compliance & Liability:
We store and are responsible for user documents
GDPR data retention requirements fall on us
Security breach exposes user documents
Users lose control over their own data
Current Architecture:
What Doesn't Work
Currently, the system:
❌ Does NOT sync from cloud storage providers
❌ Does NOT use Composio's 500+ app integrations for documents
❌ Does NOT leverage AWS S3 Vectors (new 2026 service)
❌ Does NOT support folder navigation in cloud storage
❌ Does NOT handle real-time document updates via webhooks
❌ Stores 60 test documents that can be deleted (fresh start opportunity)
Success Criteria
Phase 1: MVP (Pilot Launch)
Phase 2: Production (Auto-Sync)
Phase 3: Scale (Future)
Goals
Reduce Storage Costs by 90%+
Eliminate local document storage
Store only vector embeddings (100x smaller than originals)
Target: $56/month for 1000 users vs $730+ current
Enable Cloud-Native Document Access
Users connect existing Google Drive, Dropbox, OneDrive, Box, SharePoint, S3
OAuth managed entirely by Composio (no custom auth code)
Folder navigation and file selection UI
Maintain RAG Quality
Preserve existing semantic chunking quality (SemanticChunker with adaptive strategy)
Same or better retrieval accuracy with S3 Vectors vs pgvector
Support 10+ chunks per query with context optimization
GDPR Compliance & Data Sovereignty
User documents never leave their cloud storage
We store only vector embeddings (not PII/document content)
Users maintain full control and ownership
Workspace Isolation
Each workspace has independent S3 vector bucket
Vectors scoped by workspace_id
No cross-workspace data leakage
Functional Requirements
FR-1: Cloud Storage Connection Management
FR-1.1: Users can connect cloud storage providers via Composio OAuth
Supported providers: Google Drive, Dropbox, OneDrive, Box, SharePoint, Amazon S3
OAuth flow handled by Composio (hosted auth UI)
Connection status tracked in `composio_connections` table
UI shows: app logo, connection status (active/pending/error), connected date
FR-1.2: Users can disconnect cloud storage providers
Disconnect triggers confirmation modal: "Delete synced vectors or keep for RAG queries?"
If "Delete": Remove all `cloud_documents` rows and S3 vector entries for that connection
If "Keep": Mark connection as disconnected but preserve vectors (read-only)
Update `composio_connections.status = 'disconnected'`
FR-1.3: System validates connection health on page load
Check if Composio connection still active (token not expired)
Display warning badge if connection needs re-authorization
Auto-refresh expired tokens via Composio SDK
FR-2: Folder Navigation & Selection
FR-2.1: After connecting a storage provider, users see folder tree interface
Root folder displays top-level folders from cloud storage
Click folder to expand children (lazy loading)
Display folder metadata: name, path, file count, last modified
FR-2.2: Users select ONE root folder per app for syncing
Suggested: Create shared "Automatos" folder in cloud storage
UI shows: "Select a folder to sync. We recommend creating a dedicated 'Automatos' folder."
Once selected, sync all files and subfolders recursively underneath
Store selection in `cloud_sync_config` table: `(connection_id, root_folder_path)`
FR-2.3: Folder tree displays file type icons and sync status
Icons: PDF, Word, Excel, Text, Markdown, etc.
Sync badges: ✅ Synced, ⏳ Pending, ❌ Error, ⊘ Excluded
Filter by file type (checkboxes: PDFs, Docs, Sheets, Text)
FR-3: Manual Document Sync (Pilot)
FR-3.1: Users trigger sync via "Sync Now" button
Button appears next to each connected storage provider
Initiates background job: `sync_cloud_documents_task(connection_id, root_folder_path)`
Shows progress modal: "Syncing 47 files from Google Drive..."
Real-time progress updates via WebSocket or polling
FR-3.2: Sync process workflow
FR-3.3: Sync respects file size and type limits
Max file size: 50MB (configurable via system settings)
Supported types: PDF, DOCX, TXT, MD, JSON, XLSX, PPTX
Unsupported files logged with reason: "File type not supported"
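The elided FR-3.2 workflow diagram and the FR-3.3 limits can be sketched together as one orchestration function. The injected callables (`download`, `chunk`, `embed`, `store`) are stand-ins for the Composio download call, the SemanticChunker, the EmbeddingManager, and the S3 Vectors write — their names and signatures are assumptions for illustration, not the actual service API:

```python
SUPPORTED_TYPES = {"pdf", "docx", "txt", "md", "json", "xlsx", "pptx"}
MAX_FILE_BYTES = 50 * 1024 * 1024  # 50MB, configurable via system settings

def sync_folder(files, download, chunk, embed, store):
    """Sync one folder listing: filter, download, chunk, embed, store vectors.

    `files` is a list of dicts with keys: id, name, size (bytes).
    The four callables are injected stand-ins for the real services.
    """
    synced, skipped = [], []
    for f in files:
        ext = f["name"].rsplit(".", 1)[-1].lower()
        if ext not in SUPPORTED_TYPES:
            skipped.append((f["id"], "File type not supported"))
            continue
        if f["size"] > MAX_FILE_BYTES:
            skipped.append((f["id"], "File exceeds 50MB limit"))
            continue
        text = download(f["id"])                   # Composio file download
        vectors = [embed(c) for c in chunk(text)]  # semantic chunks -> embeddings
        store(f["id"], vectors)                    # write to S3 Vectors
        synced.append(f["id"])
    return synced, skipped
```

Unsupported and oversized files are recorded with the reason strings the UI surfaces, rather than aborting the whole sync job.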
FR-4: AWS S3 Vectors Integration
FR-4.1: Create workspace-scoped S3 vector buckets
Bucket naming: `automatos-vectors-{workspace_id}`
Index naming: `documents-index`
Dimension: 1024 (matches BAAI/bge-large-en-v1.5)
Distance metric: COSINE
FR-4.2: Vector insertion flow
FR-4.3: Vector querying flow
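The insertion and querying flows (FR-4.2/FR-4.3) can be sketched against the boto3 `s3vectors` client. The operation and parameter names below follow the service's published API surface but should be verified against your boto3 version (>= 1.34 per the dependency list); treat them as assumptions. The client is injected so the helpers stay testable:

```python
EMBED_DIM = 1024  # BAAI/bge-large-en-v1.5

def vector_bucket(workspace_id: str) -> str:
    # Workspace-scoped bucket per FR-4.1
    return f"automatos-vectors-{workspace_id}"

def put_chunk_vectors(client, workspace_id, document_id, embeddings):
    # One vector per chunk; metadata keeps vectors scoped and attributable
    client.put_vectors(
        vectorBucketName=vector_bucket(workspace_id),
        indexName="documents-index",
        vectors=[
            {
                "key": f"{document_id}#chunk-{i}",
                "data": {"float32": emb},
                "metadata": {"workspace_id": workspace_id, "document_id": document_id},
            }
            for i, emb in enumerate(embeddings)
        ],
    )

def query_chunks(client, workspace_id, query_embedding, top_k=10):
    # Cosine-distance nearest neighbors within the workspace's own index
    resp = client.query_vectors(
        vectorBucketName=vector_bucket(workspace_id),
        indexName="documents-index",
        queryVector={"float32": query_embedding},
        topK=top_k,
        returnMetadata=True,
    )
    return resp["vectors"]
```

In production the client would be `boto3.client("s3vectors")`; keying vectors as `{document_id}#chunk-{i}` makes per-document deletion (FR-5/FR-8) a prefix operation.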
FR-5: Document Viewing & Download
FR-5.1: Users can view document content
Click document in file list → Opens preview modal
Download original file from cloud storage via Composio API
Display in browser using appropriate viewer:
PDF: Embedded PDF viewer
Text/Markdown: Syntax-highlighted text area
Images: Image preview
Office docs: Download prompt (no inline preview for MVP)
FR-5.2: Download flow
FR-5.3: Temporary caching for performance
Cache downloaded files for 1 hour in `/tmp/automatos_previews/{hash}`
Serve from cache if requested again within 1 hour
Background job cleans up cache files older than 1 hour
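The FR-5.3 cache can be sketched as three small helpers plus the cleanup job; the hashing scheme and function names are illustrative assumptions:

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path("/tmp/automatos_previews")
TTL_SECONDS = 3600  # 1 hour, per FR-5.3

def cache_path(document_id: str) -> Path:
    # Hash the id so the filename is filesystem-safe
    return CACHE_DIR / hashlib.sha256(document_id.encode()).hexdigest()

def put_cached(document_id: str, content: bytes) -> None:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_path(document_id).write_bytes(content)

def get_cached(document_id: str):
    # Serve from cache only while the file is younger than the TTL
    p = cache_path(document_id)
    if p.exists() and time.time() - p.stat().st_mtime < TTL_SECONDS:
        return p.read_bytes()
    return None

def cleanup_expired() -> int:
    # Background job: delete cache files older than the TTL
    removed = 0
    for p in CACHE_DIR.glob("*"):
        if time.time() - p.stat().st_mtime >= TTL_SECONDS:
            p.unlink()
            removed += 1
    return removed
```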
FR-6: Unified RAG Queries
FR-6.1: RAG endpoint queries all connected storage providers by default
Endpoint: `POST /api/cloud-documents/rag/query`
Parameters:
FR-6.2: Query process
FR-6.3: Result formatting includes source attribution
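The request/response examples for FR-6 were elided in this copy; a response with source attribution might look like the following, where every field name is an assumption for illustration only:

```python
# Hypothetical response shape for POST /api/cloud-documents/rag/query,
# showing per-chunk attribution back to the cloud provider file.
example_response = {
    "answer_context": [
        {
            "chunk_text": "…",
            "similarity": 0.82,
            "source": {
                "provider": "GOOGLEDRIVE",
                "file_name": "roadmap.pdf",       # illustrative
                "external_file_id": "…",          # provider's file id
                "path": "/Automatos/Planning",    # illustrative
            },
        }
    ],
    "chunks_used": 10,
    "providers_queried": ["GOOGLEDRIVE", "DROPBOX"],
}
```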
FR-7: Migration from Existing System
FR-7.1: Deprecate local document upload (phased approach)
Phase 1: Add cloud sync (keep upload)
Add "Connect Cloud Storage" tab to document management UI
Keep existing "Upload Files" tab functional
Both systems work in parallel during pilot
Phase 2: Migrate test documents
Delete 60 existing test documents from PostgreSQL + local storage
Clean up `documents` and `document_chunks` tables
Drop pgvector extension (optional, can keep for other features)
Phase 3: Remove upload UI (post-pilot)
Remove "Upload Files" tab
Redirect users to "Connect Cloud Storage"
Update documentation and help text
FR-7.2: Database migration script
FR-8: Error Handling & Resilience
FR-8.1: OAuth token expiration
Detect expired tokens on Composio API calls (401 errors)
Display banner: "Your Google Drive connection needs re-authorization"
Clicking banner triggers OAuth refresh flow
Auto-retry failed sync after re-auth
FR-8.2: File download failures
If Composio API fails (file deleted, permissions changed), log error
Mark document as `sync_status: 'error'` in `cloud_documents`
Display error in UI with actionable message: "File no longer accessible in Google Drive"
Retry logic: 3 attempts with exponential backoff
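The FR-8.2 retry policy (3 attempts, exponential backoff) can be sketched as a small wrapper; the `sleep` parameter is injectable so tests don't actually wait:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run fn(), retrying up to `attempts` times with exponential backoff
    (1s, 2s delays by default) per FR-8.2."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, narrow to Composio/network errors
            last_exc = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_exc
```

A real implementation would catch only transient errors (timeouts, 5xx) and let permanent failures (file deleted, permissions revoked) mark the document as errored immediately.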
FR-8.3: S3 Vectors quota limits
Monitor S3 Vectors bucket size and index capacity
Alert when approaching 2 billion vector limit per index
Graceful degradation: If S3 Vectors unavailable, fallback to "View Document" only (no RAG)
Technical Architecture
System Overview
Data Flow: Document Sync
Data Flow: RAG Query
Database Schema
New Tables
Modified Tables
Indexes for Performance
Migration Script
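The table, index, and migration stubs above were elided in this copy. A minimal sketch of the two new tables follows, with column names inferred from the FRs elsewhere in this PRD; the types are SQLite-flavored assumptions (production would use PostgreSQL types such as UUID and TIMESTAMPTZ), and the exact columns should be taken from the real migration:

```python
import sqlite3

# Hypothetical DDL; columns inferred from FR-2.2, FR-3, and FR-8.2.
DDL = """
CREATE TABLE cloud_sync_config (
    id INTEGER PRIMARY KEY,
    connection_id TEXT NOT NULL,
    root_folder_path TEXT NOT NULL,
    enabled INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE cloud_documents (
    id INTEGER PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    connection_id TEXT NOT NULL,
    external_file_id TEXT NOT NULL,
    file_name TEXT NOT NULL,
    mime_type TEXT,
    file_size INTEGER,
    sync_status TEXT NOT NULL DEFAULT 'pending',  -- pending/syncing/synced/error
    last_synced_at TEXT,
    UNIQUE (connection_id, external_file_id)
);
CREATE INDEX idx_cloud_documents_workspace ON cloud_documents (workspace_id);
CREATE INDEX idx_cloud_documents_status ON cloud_documents (sync_status);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The `(connection_id, external_file_id)` uniqueness constraint is what lets repeated syncs upsert rather than duplicate documents.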
API Endpoints
Cloud Storage Connections
GET /api/cloud-documents/connections
List all connected cloud storage providers for workspace.
Response:
POST /api/cloud-documents/connect/{app_name}
Initiate OAuth connection to cloud storage provider.
Parameters:
app_name: GOOGLEDRIVE | DROPBOX | ONEDRIVE | BOX | SHAREPOINT | AMAZONS3
Request:
Response:
DELETE /api/cloud-documents/connections/{connection_id}
Disconnect cloud storage provider (with confirmation).
Query Params:
delete_vectors: boolean (if true, delete all vectors; if false, mark disconnected but keep vectors)
Response:
Folder Navigation & Selection
GET /api/cloud-documents/connections/{connection_id}/folders
List folders in cloud storage (lazy loading for tree navigation).
Query Params:
path: string (optional, default: root) - e.g., "/Automatos/Projects"
Response:
GET /api/cloud-documents/connections/{connection_id}/files
List files in a specific folder.
Query Params:
path: string (optional, default: root)
file_types: string[] (optional) - e.g., ["pdf", "docx", "txt"]
recursive: boolean (optional, default: false)
Response:
POST /api/cloud-documents/connections/{connection_id}/select-folder
Set root folder for syncing.
Request:
Response:
Document Sync
POST /api/cloud-documents/connections/{connection_id}/sync
Manually trigger sync for a connection.
Request:
Response:
GET /api/cloud-documents/sync-jobs/{job_id}
Get sync job status.
Response:
GET /api/cloud-documents/connections/{connection_id}/sync-status
Get current sync status summary.
Response:
Document Operations
GET /api/cloud-documents/{document_id}
Get cloud document metadata.
Response:
GET /api/cloud-documents/{document_id}/download
Download original file from cloud storage.
Response:
Headers:
Content-Type: {mime_type}
Content-Disposition: inline; filename="{file_name}"
Body: File content (binary)
GET /api/cloud-documents/{document_id}/preview
Get document preview (first 5 chunks).
Response:
DELETE /api/cloud-documents/{document_id}
Delete cloud document (removes vectors, keeps file in cloud storage).
Response:
RAG Queries
POST /api/cloud-documents/rag/query
Query documents using RAG across all connected storage providers.
Request:
Response:
POST /api/cloud-documents/search
Semantic search across cloud documents (simpler than RAG).
Request:
Response:
Backend Implementation
File Structure
Key Services
CloudDocumentSyncService
S3VectorsClient
Frontend Implementation
Component Structure
Key Components
CloudStorageConnections.tsx
FolderNavigator.tsx
Implementation Roadmap
Phase 1: MVP - Manual Sync (Weeks 1-3)
Week 1: Backend Foundation
Test S3 Vectors integration with sample data
Create CloudDocument, CloudSyncConfig SQLAlchemy models
Week 2: Sync Service
Implement CloudDocumentSyncService
list_folders() - Composio folder listing
list_files() - Composio file listing with sync status
sync_folder() - Main sync orchestration
_download_and_process_file() - Download, chunk, embed, store
Create API endpoints in /api/cloud_documents.py
GET /connections
POST /connect/{app_name}
DELETE /connections/{connection_id}
GET /connections/{id}/folders
GET /connections/{id}/files
POST /connections/{id}/select-folder
POST /connections/{id}/sync
GET /sync-jobs/{job_id}
Test end-to-end sync with Google Drive (10 sample PDFs)
Verify vectors in S3 Vectors bucket via AWS Console
Week 3: Frontend + RAG
Create CloudStorageConnections.tsx component
Display supported apps grid
Connect/disconnect buttons
OAuth popup flow
Create FolderNavigator.tsx component
Tree view with lazy loading
Folder selection
Create SyncButton.tsx + SyncProgressModal.tsx
Manual "Sync Now" trigger
Real-time progress display
Implement query_rag() in CloudDocumentSyncService
Reuse EmbeddingManager for query embedding
Call S3 Vectors query API
Reuse ContextOptimizer for chunk selection
Create POST /api/cloud-documents/rag/query endpoint
Test RAG queries across Google Drive + Dropbox
Add cloud storage tab to DocumentManagement.tsx
Phase 2: Auto-Sync via Webhooks (Weeks 4-5)
Week 4: Webhook Integration
Review Composio webhook documentation
Implement webhook signature verification (HMAC-SHA256)
Create POST /api/cloud-documents/webhook endpoint
Handle trigger types:
GOOGLEDRIVE_FILE_CREATED
GOOGLEDRIVE_FILE_UPDATED
GOOGLEDRIVE_FILE_DELETED
DROPBOX_FILE_ADDED, DROPBOX_FILE_MODIFIED, DROPBOX_FILE_DELETED
ONEDRIVE_FILE_CREATED, ONEDRIVE_FILE_UPDATED
Subscribe to triggers when user connects app
Unsubscribe when user disconnects app
Test webhook flow with ngrok (local dev)
Deploy webhook endpoint to production
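The HMAC-SHA256 signature verification listed above can be sketched as follows; the encoding (hex) and how the signature reaches the endpoint are assumptions to check against Composio's webhook documentation:

```python
import hashlib
import hmac

def verify_webhook_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Verify an incoming webhook payload against its HMAC-SHA256 signature.

    The raw request body must be used (before JSON parsing), and the
    comparison is constant-time to avoid timing attacks.
    """
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

Requests failing verification should be rejected with 401 before any trigger handling runs.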
Week 5: Background Sync Jobs
Set up Celery or RQ for background tasks
Create cloud_sync_task.py with @task decorator
Implement periodic sync (every 30 mins) as fallback
Cron job or Celery beat schedule
Query cloud_sync_config for enabled connections
Trigger sync_folder() for each
Add sync status dashboard to frontend
Last sync timestamp
Next scheduled sync
Sync history (last 10 jobs)
Error notification system
Email alerts for failed syncs
In-app notifications
Test auto-sync: upload file to Google Drive, verify appears in Automatos within 30 seconds
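If Celery is chosen for the background tasks, the 30-minute fallback sync above reduces to a beat schedule entry; the task path assumes a hypothetical `cloud_sync_task` module and is illustrative only:

```python
# Hypothetical Celery beat entry for the 30-minute fallback sync.
# Attach via: app.conf.beat_schedule.update(CLOUD_SYNC_BEAT_SCHEDULE)
CLOUD_SYNC_BEAT_SCHEDULE = {
    "periodic-cloud-sync": {
        "task": "tasks.cloud_sync_task.sync_all_enabled_connections",
        "schedule": 30 * 60.0,  # seconds; every 30 minutes
    },
}
```

The task itself would query `cloud_sync_config` for enabled connections and call `sync_folder()` for each, exactly as the webhook path does.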
Phase 3: Polish & Scale (Week 6)
Week 6: Production Readiness
Add document preview modal
Download file from cloud via Composio
Display PDF/text/image preview
Cache for 1 hour
Implement document download endpoint
Add file type filters to folder navigator
Implement exclude patterns (e.g., skip .tmp files)
Performance optimization:
Batch embedding generation (20 chunks at a time)
Parallel file processing (max 5 concurrent)
Error handling improvements:
Retry logic with exponential backoff
Token expiration handling
File size validation
Monitoring & logging:
CloudWatch metrics for S3 Vectors usage
Sync job success/failure rates
Average sync duration
Documentation:
User guide: "How to connect Google Drive"
Admin guide: S3 Vectors bucket management
API documentation (Swagger/OpenAPI)
Load testing:
100 concurrent sync jobs
1000 documents per workspace
RAG query latency under load
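The two performance targets above (batches of 20 chunks for embedding, max 5 files in flight) can be sketched with a batching helper and a semaphore-bounded gather; `process_one` is an async stand-in for download+chunk+embed+store:

```python
import asyncio

def batches(items, size=20):
    """Yield fixed-size batches, e.g. for batched embedding generation."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

async def process_files(file_ids, process_one, max_concurrent=5):
    """Process files with at most `max_concurrent` in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(fid):
        async with sem:
            return await process_one(fid)

    return await asyncio.gather(*(guarded(f) for f in file_ids))
```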
Migration Strategy
Step 1: Backup Existing Data (Day 1)
Step 2: Run Database Migration (Day 1)
Step 3: Delete Test Documents (Day 1)
Step 4: Deploy Backend Changes (Week 1)
Step 5: Deploy Frontend Changes (Week 3)
Step 6: User Migration Plan (Post-Launch)
For Existing Users (if any beyond test):
Send email: "We've upgraded document management - connect your cloud storage"
Keep old upload flow available for 30 days (parallel mode)
Provide migration tool: "Upload your existing documents to Google Drive, then sync"
After 30 days, remove upload UI, redirect to cloud storage
For New Users:
Onboarding flow shows cloud storage connection first
Guided setup: "Create an 'Automatos' folder in Google Drive"
Sample data: Pre-populate with demo documents for testing
Success Metrics
Quantitative Metrics
| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| Storage Cost | $730/month (1000 users, 10GB each) | $56/month (90%+ reduction) | AWS billing dashboard |
| Documents Synced | 0 (manual upload only) | 500+ per workspace | Database count |
| RAG Query Latency | <200ms (pgvector) | <500ms (S3 Vectors cold), <100ms (warm) | Application metrics |
| Sync Success Rate | N/A | >95% | cloud_sync_jobs.status = 'completed' / total |
| User Adoption | N/A (no cloud sync) | 70% of active users connect at least 1 app | composio_connections count |
| OAuth Success Rate | N/A | >99% | composio_connections.status = 'active' / total |
| Vector Storage Size | ~10GB per 1000 docs | ~100MB per 1000 docs (100x smaller) | S3 Vectors bucket size |
| Chunk Quality (Similarity) | 0.75 avg similarity | 0.75+ avg similarity (maintain quality) | RAG query results |
Qualitative Metrics
Users can connect Google Drive in <60 seconds
Users understand that "we only store vectors, not files" (onboarding messaging)
Sync errors are clear and actionable (e.g., "File deleted from Google Drive")
RAG queries return relevant results from cloud documents
Support tickets related to document storage decrease by 50%
Users report satisfaction with "always in sync" experience (webhooks)
Dependencies
Internal Dependencies
PRD-36: Composio Integration (COMPLETE) - OAuth and app connections
PRD-11: Document Management (COMPLETE) - Reuse DocumentProcessor, SemanticChunker
Existing RAGService and ContextOptimizer
External Dependencies
AWS S3 Vectors - Service availability (GA as of Jan 2026)
Composio Platform - API stability for 500+ apps
Cloud Provider APIs - Google Drive, Dropbox, OneDrive uptime
Technical Dependencies
boto3 >= 1.34.0 (S3 Vectors support)
composio-python >= 0.5.0
PostgreSQL >= 14 (for new tables)
React >= 18, TypeScript >= 5
Testing Strategy
Unit Tests
Integration Tests
Frontend Tests
Timeline
| Phase | Duration | Effort | Start | End |
|---|---|---|---|---|
| Phase 1: MVP (Manual Sync) | 3 weeks | 120 hours | Week 1 | Week 3 |
| - Backend foundation | 1 week | 40 hours | Week 1 | Week 1 |
| - Sync service + API | 1 week | 40 hours | Week 2 | Week 2 |
| - Frontend + RAG | 1 week | 40 hours | Week 3 | Week 3 |
| Phase 2: Auto-Sync (Webhooks) | 2 weeks | 80 hours | Week 4 | Week 5 |
| - Webhook integration | 1 week | 40 hours | Week 4 | Week 4 |
| - Background jobs | 1 week | 40 hours | Week 5 | Week 5 |
| Phase 3: Polish & Scale | 1 week | 40 hours | Week 6 | Week 6 |
| Total | 6 weeks | 240 hours | | |
Milestones:
✅ Week 1: Database migration complete, S3 Vectors integration working
✅ Week 2: Manual sync working end-to-end for Google Drive
✅ Week 3: Frontend complete, RAG queries working across cloud storage
✅ Week 4: Webhooks live, real-time sync operational
✅ Week 5: Background jobs running, periodic sync as fallback
✅ Week 6: Production-ready, monitoring dashboard live
Appendix
Environment Variables
Composio Actions Reference
Google Drive:
GOOGLEDRIVE_LIST_FILES - List files in a folder
GOOGLEDRIVE_DOWNLOAD_FILE - Download file content
GOOGLEDRIVE_GET_FILE_METADATA - Get file metadata
GOOGLEDRIVE_SEARCH_FILES - Search files by query
Dropbox:
DROPBOX_LIST_FOLDER - List files in a folder
DROPBOX_DOWNLOAD_FILE - Download file content
DROPBOX_GET_FILE_METADATA - Get file metadata
OneDrive:
ONEDRIVE_LIST_FILES - List files
ONEDRIVE_DOWNLOAD_FILE - Download file
ONEDRIVE_GET_FILE_METADATA - Get metadata
Webhooks/Triggers:
GOOGLEDRIVE_FILE_CREATED
GOOGLEDRIVE_FILE_UPDATED
GOOGLEDRIVE_FILE_DELETED
DROPBOX_FILE_ADDED
DROPBOX_FILE_MODIFIED
DROPBOX_FILE_DELETED
ONEDRIVE_FILE_CREATED
ONEDRIVE_FILE_UPDATED
AWS S3 Vectors API Reference
Cost Calculator
Assumptions:
1000 active workspaces
Average 10GB documents per workspace (100 documents @ 100MB each)
Embedding dimension: 1024 (BAAI/bge-large-en-v1.5)
Average chunks per document: 20
Vector size: 1024 floats × 4 bytes = 4KB per vector
Storage:
Total documents: 1000 workspaces × 100 docs = 100,000 docs
Total chunks: 100,000 docs × 20 chunks = 2,000,000 chunks
Vector storage: 2M chunks × 4KB = 8GB
S3 Vectors cost: 8GB × $0.06/GB = $0.48/month
Metadata storage (PostgreSQL): ~100MB = $5/month
Operations:
Ingestion (one-time): 8GB × $0.20/GB = $1.60
Queries: 1M queries/month × $0.0025 per 1K queries = $2.50/month
Total: ~$8/month for 1000 workspaces (vs $730+ with local storage!)
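The arithmetic above can be reproduced in a few lines; the unit prices are this PRD's assumptions, not quoted AWS pricing, and the $0.0025 rate is read as per 1,000 queries so that the $2.50/month line holds:

```python
# Cost calculator per the assumptions above
chunks = 1000 * 100 * 20                      # workspaces * docs/workspace * chunks/doc
storage_gb = chunks * 4 / 1024 / 1024         # 4 KB per 1024-dim float32 vector
storage_monthly = storage_gb * 0.06           # $0.06 per GB-month
ingest_once = storage_gb * 0.20               # $0.20 per GB, one-time
queries_monthly = 1_000_000 / 1_000 * 0.0025  # $0.0025 per 1K queries
total_monthly = storage_monthly + queries_monthly + 5.0  # + $5 Postgres metadata
```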
Glossary
S3 Vectors: AWS service for storing and querying vector embeddings (GA Jan 2026)
Composio: Integration platform managing OAuth for 500+ apps (Google Drive, Dropbox, etc.)
RAG: Retrieval-Augmented Generation - using document chunks to augment LLM prompts
Semantic Chunking: Intelligent text splitting based on semantic meaning (vs fixed-size)
Vector Embedding: Numerical representation of text (1024-dimensional array)
Cosine Similarity: Distance metric for comparing vector embeddings (0-1 scale)
Entity: Composio's workspace-scoped authentication container
External File ID: Cloud provider's unique identifier for a file (Google Drive file ID, Dropbox path, etc.)
Sync Status: pending → syncing → synced → error
Knapsack Algorithm: Dynamic programming optimization for selecting chunks within token budget
MMR: Maximal Marginal Relevance - diversity scoring algorithm
Context Optimizer: System for selecting optimal chunks for LLM context window
END OF PRD