PRD-42: Cloud Document Sync with S3 Vectors

Version: 1.0
Status: 🟡 Planning
Date: 2026-01-31
Author: System Architecture Team
Prerequisites: PRD-36 (Composio Integration), PRD-11 (Document Management)
Blocks: None


Executive Summary

Transform the Automatos AI Platform document management system from a local upload-based model to a hybrid cloud-native architecture where users connect their existing cloud storage providers (Google Drive, Dropbox, OneDrive, Box, SharePoint, S3) via Composio-managed OAuth. Instead of storing duplicate copies of user documents, the system will:

  1. Sync document metadata from cloud storage providers

  2. Download, chunk, and embed documents using the existing semantic chunking pipeline

  3. Store only vector embeddings in AWS S3 Vectors (not original files)

  4. Enable RAG queries across all connected storage providers

  5. Maintain workspace isolation and GDPR compliance

Key Benefits:

  • 90-95% cost reduction - pay only for vector storage (~$6/month for ~100GB of vectors vs $730+/month for the 10TB of original documents they replace)

  • Zero document liability - users keep documents in their own cloud, we never store originals

  • GDPR/compliance friendly - data stays in user's control

  • OAuth handled by Composio - no custom authentication code needed

  • Unified RAG - query across Google Drive, Dropbox, etc. from single interface

  • Scales to 2 billion vectors per index with S3 Vectors

Target Users: Single users and small businesses (1-10 people) with 1-2 cloud storage providers


Problem Statement

Current System Limitations

The existing document management system (PRD-11) has several constraints that limit scalability and increase operational costs:

Storage Costs:

  • System stores full document copies locally (filesystem or S3)

  • Vector embeddings stored separately in PostgreSQL with pgvector

  • For 1000 users with 10GB of documents each: ~$730-1230/month in storage costs

  • Duplicate storage: original file + chunks + vectors

User Experience:

  • Users must manually upload documents to Automatos

  • No sync with existing document repositories (Google Drive, Dropbox, etc.)

  • Documents live in isolated system, not user's primary workspace

  • No automatic updates when cloud documents change

Compliance & Liability:

  • We store and are responsible for user documents

  • GDPR data retention requirements fall on us

  • Security breach exposes user documents

  • Users lose control over their own data

Current Architecture:

What Doesn't Work

Currently, the system:

  • ❌ Does NOT sync from cloud storage providers

  • ❌ Does NOT use Composio's 500+ app integrations for documents

  • ❌ Does NOT leverage AWS S3 Vectors (generally available as of Jan 2026)

  • ❌ Does NOT support folder navigation in cloud storage

  • ❌ Does NOT handle real-time document updates via webhooks

  • Note: the system currently stores only 60 test documents, all of which can be deleted (fresh start opportunity)


Success Criteria

Phase 1: MVP (Pilot Launch)

Phase 2: Production (Auto-Sync)

Phase 3: Scale (Future)


Goals

  1. Reduce Storage Costs by 90%+

    • Eliminate local document storage

    • Store only vector embeddings (100x smaller than originals)

    • Target: $56/month for 1000 users vs $730+ current

  2. Enable Cloud-Native Document Access

    • Users connect existing Google Drive, Dropbox, OneDrive, Box, SharePoint, S3

    • OAuth managed entirely by Composio (no custom auth code)

    • Folder navigation and file selection UI

  3. Maintain RAG Quality

    • Preserve existing semantic chunking quality (SemanticChunker with adaptive strategy)

    • Same or better retrieval accuracy with S3 Vectors vs pgvector

    • Support 10+ chunks per query with context optimization

  4. GDPR Compliance & Data Sovereignty

    • Original documents are never persisted outside the user's cloud storage; they are downloaded only transiently for chunking and embedding

    • We store only vector embeddings (not PII/document content)

    • Users maintain full control and ownership

  5. Workspace Isolation

    • Each workspace has an independent S3 vector bucket

    • Vectors scoped by workspace_id

    • No cross-workspace data leakage


Functional Requirements

FR-1: Cloud Storage Connection Management

FR-1.1: Users can connect cloud storage providers via Composio OAuth

  • Supported providers: Google Drive, Dropbox, OneDrive, Box, SharePoint, Amazon S3

  • OAuth flow handled by Composio (hosted auth UI)

  • Connection status tracked in composio_connections table

  • UI shows: app logo, connection status (active/pending/error), connected date

FR-1.2: Users can disconnect cloud storage providers

  • Disconnect triggers confirmation modal: "Delete synced vectors or keep for RAG queries?"

  • If "Delete": Remove all cloud_documents and S3 vector entries for that connection

  • If "Keep": Mark connection as disconnected but preserve vectors (read-only)

  • Update composio_connections.status = 'disconnected'
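The two disconnect branches in FR-1.2 can be sketched as below. This is a minimal in-memory illustration, not the real persistence layer: the Connection and Workspace classes, the read_only flag, and the disconnect function are hypothetical stand-ins for the composio_connections and cloud_documents tables and the S3 vector entries.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """In-memory stand-in for a composio_connections row (illustrative only)."""
    connection_id: str
    status: str = "active"
    read_only: bool = False  # hypothetical flag for the "Keep" branch


@dataclass
class Workspace:
    """Hypothetical container for one workspace's sync state."""
    connections: dict = field(default_factory=dict)      # connection_id -> Connection
    cloud_documents: dict = field(default_factory=dict)  # document_id -> connection_id
    vector_keys: dict = field(default_factory=dict)      # vector key -> document_id


def disconnect(ws: Workspace, connection_id: str, delete_vectors: bool) -> int:
    """Apply FR-1.2 and return how many vector entries were removed."""
    conn = ws.connections[connection_id]
    removed = 0
    if delete_vectors:
        # "Delete": drop every document and vector that belongs to this connection.
        doc_ids = {d for d, c in ws.cloud_documents.items() if c == connection_id}
        stale = [k for k, d in ws.vector_keys.items() if d in doc_ids]
        for key in stale:
            del ws.vector_keys[key]
        for doc_id in doc_ids:
            del ws.cloud_documents[doc_id]
        removed = len(stale)
    else:
        # "Keep": vectors stay queryable but are no longer synced.
        conn.read_only = True
    conn.status = "disconnected"
    return removed
```

In both branches the connection ends up with status 'disconnected'; only the "Delete" branch touches documents and vectors, which matches the delete_vectors query parameter on the DELETE endpoint.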

FR-1.3: System validates connection health on page load

  • Check if Composio connection still active (token not expired)

  • Display warning badge if connection needs re-authorization

  • Auto-refresh expired tokens via Composio SDK

FR-2: Folder Navigation & Selection

FR-2.1: After connecting a storage provider, users see folder tree interface

  • Root folder displays top-level folders from cloud storage

  • Click folder to expand children (lazy loading)

  • Display folder metadata: name, path, file count, last modified

FR-2.2: Users select ONE root folder per app for syncing

  • Suggested: Create shared "Automatos" folder in cloud storage

  • UI shows: "Select a folder to sync. We recommend creating a dedicated 'Automatos' folder."

  • Once selected, sync all files and subfolders recursively underneath

  • Store selection in cloud_sync_config table: (connection_id, root_folder_path)
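Recursive syncing under a single root folder depends on deciding whether a given file path lies beneath root_folder_path. The sketch below shows one way to do that for path-based providers; is_under_root and files_to_sync are hypothetical helper names, and providers that identify folders by ID rather than path (Google Drive, for example) would need a different check.

```python
from pathlib import PurePosixPath


def is_under_root(file_path: str, root_folder_path: str) -> bool:
    """Return True if file_path lies inside root_folder_path at any depth."""
    # Comparing path components avoids the prefix pitfall where
    # "/AutomatosOld/x.pdf" would match a naive startswith("/Automatos").
    return PurePosixPath(root_folder_path) in PurePosixPath(file_path).parents


def files_to_sync(file_paths, root_folder_path: str):
    """Keep only the files that fall under the selected root folder."""
    return [p for p in file_paths if is_under_root(p, root_folder_path)]
```

A component-wise comparison is used instead of string prefix matching so that sibling folders with a shared name prefix are not synced by accident.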

FR-2.3: Folder tree displays file type icons and sync status

  • Icons: PDF, Word, Excel, Text, Markdown, etc.

  • Sync badges: ✅ Synced, ⏳ Pending, ❌ Error, ⊘ Excluded

  • Filter by file type (checkboxes: PDFs, Docs, Sheets, Text)

FR-3: Manual Document Sync (Pilot)

FR-3.1: Users trigger sync via "Sync Now" button

  • Button appears next to each connected storage provider

  • Initiates background job: sync_cloud_documents_task(connection_id, root_folder_path)

  • Shows progress modal: "Syncing 47 files from Google Drive..."

  • Real-time progress updates via WebSocket or polling

FR-3.2: Sync process workflow

FR-3.3: Sync respects file size and type limits

  • Max file size: 50MB (configurable via system settings)

  • Supported types: PDF, DOCX, TXT, MD, JSON, XLSX, PPTX

  • Unsupported files logged with reason: "File type not supported"
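The limits in FR-3.3 amount to a small pre-sync filter. The following is a sketch under stated assumptions: check_syncable is a hypothetical helper, the file type is inferred from the file extension (a real implementation might use the provider's MIME type instead), and the size-limit message is illustrative since the PRD only specifies the wording for unsupported types.

```python
import os

MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # 50MB default; FR-3.3 makes this configurable
SUPPORTED_EXTENSIONS = {"pdf", "docx", "txt", "md", "json", "xlsx", "pptx"}


def check_syncable(file_name: str, size_bytes: int,
                   max_bytes: int = MAX_FILE_SIZE_BYTES):
    """Return (ok, reason); reason is None when the file may be synced."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False, "File type not supported"
    if size_bytes > max_bytes:
        # Hypothetical message; the PRD does not define the wording for this case.
        return False, "File exceeds maximum size"
    return True, None
```

Returning the reason alongside the decision lets the sync job log skipped files without a second lookup.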

FR-4: AWS S3 Vectors Integration

FR-4.1: Create workspace-scoped S3 vector buckets

  • Bucket naming: automatos-vectors-{workspace_id}

  • Index naming: documents-index

  • Dimension: 1024 (matches BAAI/bge-large-en-v1.5)

  • Distance metric: COSINE
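The FR-4.1 settings can be collected into a single parameter set, sketched below. The helper names are hypothetical. The parameter keys (vectorBucketName, indexName, dataType, dimension, distanceMetric) and the bucket naming rule (3-63 characters of lowercase letters, digits, and hyphens) reflect the S3 Vectors API as understood at the time of writing and should be verified against the current boto3 documentation before use.

```python
import re

EMBEDDING_DIMENSION = 1024        # output size of BAAI/bge-large-en-v1.5
INDEX_NAME = "documents-index"


def vector_bucket_name(workspace_id: str) -> str:
    """Build the per-workspace bucket name and reject values that cannot be used."""
    name = f"automatos-vectors-{workspace_id}"
    # Assumed naming rule: 3-63 chars, lowercase letters, digits, hyphens,
    # starting and ending with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid vector bucket name: {name!r}")
    return name


def create_index_params(workspace_id: str) -> dict:
    """Keyword arguments intended for an S3 Vectors create-index call (assumed keys)."""
    return {
        "vectorBucketName": vector_bucket_name(workspace_id),
        "indexName": INDEX_NAME,
        "dataType": "float32",
        "dimension": EMBEDDING_DIMENSION,
        "distanceMetric": "cosine",
    }
```

Validating the name up front matters because workspace_id values containing uppercase letters or underscores would otherwise fail only at bucket creation time.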

FR-4.2: Vector insertion flow

FR-4.3: Vector querying flow

FR-5: Document Viewing & Download

FR-5.1: Users can view document content

  • Click document in file list → Opens preview modal

  • Download original file from cloud storage via Composio API

  • Display in browser using appropriate viewer:

    • PDF: Embedded PDF viewer

    • Text/Markdown: Syntax-highlighted text area

    • Images: Image preview

    • Office docs: Download prompt (no inline preview for MVP)

FR-5.2: Download flow

FR-5.3: Temporary caching for performance

  • Cache downloaded files for 1 hour in /tmp/automatos_previews/{hash}

  • Serve from cache if requested again within 1 hour

  • Background job cleans up cache files older than 1 hour
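The one-hour preview cache in FR-5.3 can be sketched with file modification times. The PRD does not say what {hash} is computed from, so this example assumes a SHA-256 of the document ID; cache_path, is_fresh, and cleanup are hypothetical helper names.

```python
import hashlib
import os
import time
from typing import Optional

CACHE_DIR = "/tmp/automatos_previews"
CACHE_TTL_SECONDS = 60 * 60  # 1 hour


def cache_path(document_id: str, cache_dir: str = CACHE_DIR) -> str:
    """Map a document ID to its cache file (assumption: hash of the ID)."""
    digest = hashlib.sha256(document_id.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def is_fresh(path: str, now: Optional[float] = None) -> bool:
    """True if the cached file exists and was written less than one hour ago."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) < CACHE_TTL_SECONDS


def cleanup(cache_dir: str = CACHE_DIR, now: Optional[float] = None) -> int:
    """Delete expired cache files and return how many were removed."""
    if not os.path.isdir(cache_dir):
        return 0
    removed = 0
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and not is_fresh(path, now):
            os.remove(path)
            removed += 1
    return removed
```

If the cache is shared across workspaces, including the workspace ID in the hashed key would keep previews from one workspace from being served to another.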

FR-6: Unified RAG Queries

FR-6.1: RAG endpoint queries all connected storage providers by default

  • Endpoint: POST /api/cloud-documents/rag/query

  • Parameters:

FR-6.2: Query process

FR-6.3: Result formatting includes source attribution

FR-7: Migration from Existing System

FR-7.1: Deprecate local document upload (phased approach)

Phase 1: Add cloud sync (keep upload)

  • Add "Connect Cloud Storage" tab to document management UI

  • Keep existing "Upload Files" tab functional

  • Both systems work in parallel during pilot

Phase 2: Migrate test documents

  • Delete 60 existing test documents from PostgreSQL + local storage

  • Clean up documents and document_chunks tables

  • Drop pgvector extension (optional, can keep for other features)

Phase 3: Remove upload UI (post-pilot)

  • Remove "Upload Files" tab

  • Redirect users to "Connect Cloud Storage"

  • Update documentation and help text

FR-7.2: Database migration script

FR-8: Error Handling & Resilience

FR-8.1: OAuth token expiration

  • Detect expired tokens on Composio API calls (401 errors)

  • Display banner: "Your Google Drive connection needs re-authorization"

  • Clicking banner triggers OAuth refresh flow

  • Auto-retry failed sync after re-auth

FR-8.2: File download failures

  • If Composio API fails (file deleted, permissions changed), log error

  • Mark document as sync_status: 'error' in cloud_documents

  • Display error in UI with actionable message: "File no longer accessible in Google Drive"

  • Retry logic: 3 attempts with exponential backoff
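The retry rule in FR-8.2 (3 attempts with exponential backoff) could look like the sketch below. DownloadError and download_with_retry are hypothetical names, and the 1-second base delay is an assumption; the sleep function is injectable so the behavior can be tested without waiting.

```python
import time


class DownloadError(Exception):
    """Hypothetical error raised when a provider download fails."""


def download_with_retry(download, max_attempts: int = 3,
                        base_delay: float = 1.0, sleep=time.sleep):
    """Call download() up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return download()
        except DownloadError:
            if attempt == max_attempts:
                raise  # caller marks the document sync_status: 'error'
            sleep(base_delay * 2 ** (attempt - 1))  # waits base_delay, then 2x base_delay
```

Failures that cannot succeed on retry, such as a deleted file or revoked permissions, are probably better surfaced immediately rather than retried; distinguishing them would require provider-specific error codes that this sketch does not model.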

FR-8.3: S3 Vectors quota limits

  • Monitor S3 Vectors bucket size and index capacity

  • Alert when approaching 2 billion vector limit per index

  • Graceful degradation: If S3 Vectors is unavailable, fall back to "View Document" only (no RAG)
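The quota monitoring and degradation rules in FR-8.3 can be expressed as two small checks. This is a sketch: the 80% warning threshold, the status labels, and the feature names are assumptions, since the PRD specifies only the 2 billion vector limit and the "View Document" fallback.

```python
MAX_VECTORS_PER_INDEX = 2_000_000_000


def capacity_status(vector_count: int, warn_ratio: float = 0.8) -> str:
    """Classify index usage; warn_ratio is an assumed alert threshold."""
    if vector_count >= MAX_VECTORS_PER_INDEX:
        return "full"
    if vector_count >= warn_ratio * MAX_VECTORS_PER_INDEX:
        return "warning"
    return "ok"


def available_features(s3_vectors_healthy: bool) -> set:
    """Document viewing always works; RAG requires S3 Vectors to be reachable."""
    features = {"view_document"}
    if s3_vectors_healthy:
        features.add("rag_query")
    return features
```

Capacity and availability are kept separate here because a full index blocks new inserts but should not by itself disable queries against existing vectors.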


Technical Architecture

System Overview

Data Flow: Document Sync

Data Flow: RAG Query


Database Schema

New Tables

Modified Tables

Indexes for Performance

Migration Script


API Endpoints

Cloud Storage Connections

GET /api/cloud-documents/connections

List all connected cloud storage providers for workspace.

Response:

POST /api/cloud-documents/connect/{app_name}

Initiate OAuth connection to cloud storage provider.

Parameters:

  • app_name: GOOGLEDRIVE | DROPBOX | ONEDRIVE | BOX | SHAREPOINT | AMAZONS3

Request:

Response:

DELETE /api/cloud-documents/connections/{connection_id}

Disconnect cloud storage provider (with confirmation).

Query Params:

  • delete_vectors: boolean (if true, delete all vectors; if false, mark disconnected but keep vectors)

Response:

Folder Navigation & Selection

GET /api/cloud-documents/connections/{connection_id}/folders

List folders in cloud storage (lazy loading for tree navigation).

Query Params:

  • path: string (optional, default: root) - e.g., "/Automatos/Projects"

Response:

GET /api/cloud-documents/connections/{connection_id}/files

List files in a specific folder.

Query Params:

  • path: string (optional, default: root)

  • file_types: string[] (optional) - e.g., ["pdf", "docx", "txt"]

  • recursive: boolean (optional, default: false)

Response:

POST /api/cloud-documents/connections/{connection_id}/select-folder

Set root folder for syncing.

Request:

Response:

Document Sync

POST /api/cloud-documents/connections/{connection_id}/sync

Manually trigger sync for a connection.

Request:

Response:

GET /api/cloud-documents/sync-jobs/{job_id}

Get sync job status.

Response:

GET /api/cloud-documents/connections/{connection_id}/sync-status

Get current sync status summary.

Response:

Document Operations

GET /api/cloud-documents/{document_id}

Get cloud document metadata.

Response:

GET /api/cloud-documents/{document_id}/download

Download original file from cloud storage.

Response:

  • Headers: Content-Type: {mime_type}, Content-Disposition: inline; filename="{file_name}"

  • Body: File content (binary)

GET /api/cloud-documents/{document_id}/preview

Get document preview (first 5 chunks).

Response:

DELETE /api/cloud-documents/{document_id}

Delete cloud document (removes vectors, keeps file in cloud storage).

Response:

RAG Queries

POST /api/cloud-documents/rag/query

Query documents using RAG across all connected storage providers.

Request:

Response:

POST /api/cloud-documents/search

Semantic search across cloud documents (simpler than RAG).

Request:

Response:


Backend Implementation

File Structure

Key Services

CloudDocumentSyncService

S3VectorsClient


Frontend Implementation

Component Structure

Key Components

CloudStorageConnections.tsx

FolderNavigator.tsx


Implementation Roadmap

Phase 1: MVP - Manual Sync (Weeks 1-3)

Week 1: Backend Foundation

Week 2: Sync Service

Week 3: Frontend + RAG

Phase 2: Auto-Sync via Webhooks (Weeks 4-5)

Week 4: Webhook Integration

Week 5: Background Sync Jobs

Phase 3: Polish & Scale (Week 6)

Week 6: Production Readiness


Migration Strategy

Step 1: Backup Existing Data (Day 1)

Step 2: Run Database Migration (Day 1)

Step 3: Delete Test Documents (Day 1)

Step 4: Deploy Backend Changes (Week 1)

Step 5: Deploy Frontend Changes (Week 3)

Step 6: User Migration Plan (Post-Launch)

For Existing Users (if any beyond test):

  1. Send email: "We've upgraded document management - connect your cloud storage"

  2. Keep old upload flow available for 30 days (parallel mode)

  3. Provide migration tool: "Upload your existing documents to Google Drive, then sync"

  4. After 30 days, remove upload UI, redirect to cloud storage

For New Users:

  • Onboarding flow shows cloud storage connection first

  • Guided setup: "Create an 'Automatos' folder in Google Drive"

  • Sample data: Pre-populate with demo documents for testing


Success Metrics

Quantitative Metrics

Metric | Current (Upload-Based) | Target (Cloud Sync) | Measurement Method
--- | --- | --- | ---
Storage Cost | $730/month (1000 users, 10GB each) | $56/month (90%+ reduction) | AWS billing dashboard
Documents Synced | 0 (manual upload only) | 500+ per workspace | Database count
RAG Query Latency | <200ms (pgvector) | <500ms (S3 Vectors cold), <100ms (warm) | Application metrics
Sync Success Rate | N/A | >95% | cloud_sync_jobs.status = 'completed' / total
User Adoption | N/A (no cloud sync) | 70% of active users connect at least 1 app | composio_connections count
OAuth Success Rate | N/A | >99% | composio_connections.status = 'active' / total
Vector Storage Size | ~10GB per 1000 docs | ~100MB per 1000 docs (100x smaller) | S3 Vectors bucket size
Chunk Quality (Similarity) | 0.75 avg similarity | 0.75+ avg similarity (maintain quality) | RAG query results

Qualitative Metrics


Dependencies

Internal Dependencies

  • PRD-36: Composio Integration (COMPLETE) - OAuth and app connections

  • PRD-11: Document Management (COMPLETE) - Reuse DocumentProcessor, SemanticChunker

  • Existing RAGService and ContextOptimizer

External Dependencies

  • AWS S3 Vectors - Service availability (GA as of Jan 2026)

  • Composio Platform - API stability for 500+ apps

  • Cloud Provider APIs - Google Drive, Dropbox, OneDrive uptime

Technical Dependencies

  • boto3 release that includes the s3vectors client (first shipped in mid-2025; 1.34.0 predates S3 Vectors support)

  • composio-python >= 0.5.0

  • PostgreSQL >= 14 (for new tables)

  • React >= 18, TypeScript >= 5


Testing Strategy

Unit Tests

Integration Tests

Frontend Tests


Timeline

Phase | Duration | Effort | Start Date | End Date
--- | --- | --- | --- | ---
Phase 1: MVP (Manual Sync) | 3 weeks | 120 hours | Week 1 | Week 3
- Backend foundation | 1 week | 40 hours | Week 1 | Week 1
- Sync service + API | 1 week | 40 hours | Week 2 | Week 2
- Frontend + RAG | 1 week | 40 hours | Week 3 | Week 3
Phase 2: Auto-Sync (Webhooks) | 2 weeks | 80 hours | Week 4 | Week 5
- Webhook integration | 1 week | 40 hours | Week 4 | Week 4
- Background jobs | 1 week | 40 hours | Week 5 | Week 5
Phase 3: Polish & Scale | 1 week | 40 hours | Week 6 | Week 6
Total | 6 weeks | 240 hours | |

Milestones:

  • ✅ Week 1: Database migration complete, S3 Vectors integration working

  • ✅ Week 2: Manual sync working end-to-end for Google Drive

  • ✅ Week 3: Frontend complete, RAG queries working across cloud storage

  • ✅ Week 4: Webhooks live, real-time sync operational

  • ✅ Week 5: Background jobs running, periodic sync as fallback

  • ✅ Week 6: Production-ready, monitoring dashboard live


Appendix

Environment Variables

Composio Actions Reference

Google Drive:

  • GOOGLEDRIVE_LIST_FILES - List files in a folder

  • GOOGLEDRIVE_DOWNLOAD_FILE - Download file content

  • GOOGLEDRIVE_GET_FILE_METADATA - Get file metadata

  • GOOGLEDRIVE_SEARCH_FILES - Search files by query

Dropbox:

  • DROPBOX_LIST_FOLDER - List files in a folder

  • DROPBOX_DOWNLOAD_FILE - Download file content

  • DROPBOX_GET_FILE_METADATA - Get file metadata

OneDrive:

  • ONEDRIVE_LIST_FILES - List files

  • ONEDRIVE_DOWNLOAD_FILE - Download file

  • ONEDRIVE_GET_FILE_METADATA - Get metadata

Webhooks/Triggers:

  • GOOGLEDRIVE_FILE_CREATED

  • GOOGLEDRIVE_FILE_UPDATED

  • GOOGLEDRIVE_FILE_DELETED

  • DROPBOX_FILE_ADDED

  • DROPBOX_FILE_MODIFIED

  • DROPBOX_FILE_DELETED

  • ONEDRIVE_FILE_CREATED

  • ONEDRIVE_FILE_UPDATED

AWS S3 Vectors API Reference

Cost Calculator

Assumptions:

  • 1000 active workspaces

  • Average 10GB documents per workspace (100 documents @ 100MB each)

  • Embedding dimension: 1024 (BAAI/bge-large-en-v1.5)

  • Average chunks per document: 20

  • Vector size: 1024 floats × 4 bytes = 4KB per vector

Storage:

  • Total documents: 1000 workspaces × 100 docs = 100,000 docs

  • Total chunks: 100,000 docs × 20 chunks = 2,000,000 chunks

  • Vector storage: 2M chunks × 4KB = 8GB

  • S3 Vectors cost: 8GB × $0.06/GB = $0.48/month

  • Metadata storage (PostgreSQL): ~100MB = $5/month

Operations:

  • Ingestion (one-time): 8GB × $0.20/GB = $1.60

  • Queries: 1M queries/month × $0.0025 per 1,000 queries = $2.50/month

Total: ~$8/month for 1000 workspaces ($0.48 vector storage + $5 metadata + $2.50 queries), plus $1.60 one-time ingestion (vs $730+ with local storage)
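The arithmetic above can be reproduced with a short script, which makes it easy to rerun under different assumptions. All prices are the figures quoted in this section rather than verified AWS pricing, and the query price is interpreted as per 1,000 queries, since that is the reading under which 1M queries cost $2.50. The constant and function names are illustrative.

```python
WORKSPACES = 1000
DOCS_PER_WORKSPACE = 100
CHUNKS_PER_DOC = 20
DIMENSION = 1024
BYTES_PER_FLOAT = 4

STORAGE_PRICE_PER_GB = 0.06      # $/GB-month, as quoted in this section
INGEST_PRICE_PER_GB = 0.20       # $/GB one-time, as quoted in this section
QUERY_PRICE_PER_1000 = 0.0025    # assumption: price is per 1,000 queries
POSTGRES_METADATA_COST = 5.00    # $/month, as quoted in this section
QUERIES_PER_MONTH = 1_000_000


def estimate() -> dict:
    """Recompute the cost calculator from its stated assumptions."""
    chunks = WORKSPACES * DOCS_PER_WORKSPACE * CHUNKS_PER_DOC
    vector_gb = chunks * DIMENSION * BYTES_PER_FLOAT / 1e9  # raw float32 data only
    storage = vector_gb * STORAGE_PRICE_PER_GB
    queries = QUERIES_PER_MONTH / 1000 * QUERY_PRICE_PER_1000
    return {
        "chunks": chunks,
        "vector_gb": vector_gb,
        "monthly": storage + queries + POSTGRES_METADATA_COST,
        "ingestion_one_time": vector_gb * INGEST_PRICE_PER_GB,
    }


if __name__ == "__main__":
    for key, value in estimate().items():
        print(f"{key}: {value}")
```

The estimate counts only the raw float32 vector data (about 8.2GB, using 4,096 bytes per vector), so any per-vector metadata or key storage would add to the monthly figure.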


Glossary

  • S3 Vectors: AWS service for storing and querying vector embeddings (generally available as of Jan 2026)

  • Composio: Integration platform managing OAuth for 500+ apps (Google Drive, Dropbox, etc.)

  • RAG: Retrieval-Augmented Generation - using document chunks to augment LLM prompts

  • Semantic Chunking: Intelligent text splitting based on semantic meaning (vs fixed-size)

  • Vector Embedding: Numerical representation of text (1024-dimensional array)

  • Cosine Similarity: Similarity measure for comparing vector embeddings (range -1 to 1; higher means more similar). Cosine distance is commonly defined as 1 - similarity

  • Entity: Composio's workspace-scoped authentication container

  • External File ID: Cloud provider's unique identifier for a file (Google Drive file ID, Dropbox path, etc.)

  • Sync Status: pending → syncing → synced (or error if any step fails)

  • Knapsack Algorithm: Dynamic programming optimization for selecting chunks within token budget

  • MMR: Maximal Marginal Relevance - diversity scoring algorithm

  • Context Optimizer: System for selecting optimal chunks for LLM context window
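To make the Cosine Similarity and MMR entries concrete, here is a minimal pure-Python sketch of MMR selection over embedding vectors. It is a textbook formulation, not the Context Optimizer's actual implementation; the function names and the lambda_ trade-off parameter are illustrative.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k: int, lambda_: float = 0.5):
    """Pick up to k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(query, candidates[i])
        # Redundancy is the highest similarity to anything already chosen.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_ * relevance - (1 - lambda_) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda_ = 1.0 the selection reduces to plain top-k by relevance; lower values push the result toward chunks that differ from those already selected.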


END OF PRD
