PRD 61: NL2SQL v2 — Top-10 Competitive Upgrade
Status: Draft
Priority: High
Effort: 35-45 hours
Dependencies: PRD-21 (Database Knowledge), PRD-30 (Modular Architecture)
Created: 2026-02-18
Research Base: Deep analysis of 9 leading open-source NL2SQL projects + Automatos codebase audit
Executive Summary
Deep research across the top NL2SQL/Text-to-SQL open-source projects (PandasAI 23.2K stars, Vanna 22.7K, DB-GPT 16K, WrenAI, Dataherald 3.6K, SQLCoder 4K, XiYan-SQL, DBHub, SQLBot 5.5K) reveals that Automatos already has a surprisingly strong NL2SQL implementation — more complete than most open-source alternatives. However, critical gaps remain in error correction, training data management, benchmark-quality generation, and MCP exposure that prevent it from being competitive with the best.
This PRD closes those gaps across 8 phases, taking Automatos from "good internal tool" to "competitive with Vanna/Dataherald" level.
Part 1: How the Top Projects Work
Tier 1: Full Platforms (Closest Competitors)
1. Vanna (22.7K stars) — RAG-Based NL2SQL
Core Innovation: Uses vector stores to store DDL schemas, documentation, and verified question/SQL pairs. Retrieves similar examples at generation time for few-shot prompting.
Pipeline:
generate_sql() → run_sql() → generate_plotly_code() → get_plotly_figure()
Training: vn.train(question, sql) / vn.add_ddl() / vn.add_documentation() — builds a RAG corpus of schema + examples. Auto-train: if a query fails, the correction is automatically added to the training data.
Follow-ups: generate_followup_questions() suggests next questions
License: MIT
Key Lesson: RAG-based few-shot example retrieval dramatically improves SQL accuracy
2. PandasAI (23.2K stars) — Code Generation + SQL
Core Innovation: Generates Python code (not just SQL) for data analysis. Agent-based architecture routes between SQL queries and pandas operations.
Connectors: MySQL, PostgreSQL, Snowflake, Databricks (production license required for Snowflake/Databricks)
License: MIT + EE directory (enterprise features require commercial license)
Key Lesson: Smart routing between SQL and code-based analysis; already partially adopted in Automatos via PandasAI integration
3. DB-GPT (16K stars) — Full AI Data Platform
Core Innovation: AWEL (Agentic Workflow Expression Language) DAG-based orchestration. Five-stage Text2SQL pipeline: Query Understanding → Schema Linking → SQL Generation → Execution → Visualization
Databases: MySQL, PostgreSQL, Oracle, MSSQL, ClickHouse, DuckDB, Hive, Spark (8 databases)
Benchmark: 82.5% Spider accuracy
Architecture: Modular packages (dbgpt-core, dbgpt-ext, dbgpt-serve, dbgpt-client)
Key Lesson: Modular package architecture; schema linking as explicit pipeline stage; fine-tuning framework for Text2SQL
4. WrenAI — Semantic Layer Focus
Core Innovation: Modeling Definition Language (MDL) for semantic layer. Strong emphasis on business metric definitions over raw SQL.
API: SQL Generation + Chart Generation endpoints (commercial plan required)
Key Lesson: Semantic layer as first-class citizen improves accuracy for business queries
Tier 2: Specialized Tools
5. Dataherald (3.6K stars) — Enterprise NL2SQL API
Core Innovation: 4 SQL generator implementations (Langchain Agent, Langchain Chain, LlamaIndex, Dataherald Agent). Agent with 7 specialized tools. "Golden SQL" training system.
Agent Tools: QuerySQLDatabase, GetCurrentTime, TablesSQLDatabase, SchemaSQLDatabase, InfoRelevantColumns, ColumnEntityChecker, GetFewShotExamples
Databases: PostgreSQL, BigQuery, Databricks, Snowflake
Training: Golden SQL pairs stored in vector DB + used for fine-tuning (min 20-30 samples/table)
Key Lesson: Golden SQL concept; agent with specialized database tools; self-correction
6. SQLCoder (4K stars) — Fine-Tuned Model
Core Innovation: Purpose-built fine-tuned model for Text-to-SQL (based on StarCoder/Code Llama)
Key Lesson: Fine-tuned models can beat GPT-4 on specific domains at lower cost
7. XiYan-SQL — MCP-Native NL2SQL
Core Innovation: State-of-the-art benchmark scores delivered via MCP server. Remote mode (API) or local mode.
Databases: MySQL, PostgreSQL
Key Lesson: MCP as delivery mechanism for NL2SQL; 2-22% better than raw database MCP servers
8. DBHub — MCP Database Gateway
Core Innovation: Zero-dependency MCP server for multi-database access. Custom parameterized SQL tools.
Databases: PostgreSQL, MySQL, SQL Server, MariaDB, SQLite
Key Lesson: MCP + custom tools pattern; read-only mode with row limits
9. SQLBot (5.5K stars) — Conversational SQL
Core Innovation: Multi-turn conversational interface for SQL
Key Lesson: Conversation state management for iterative query refinement
Benchmark Reality Check
| System | Accuracy | Notes |
|---|---|---|
| MCS-SQL + GPT-4 | 89.6% | With Spider training data |
| CHASE-SQL + Gemini 1.5 | 87.6% | Without training data |
| DAIL-SQL + GPT-4 | 86.6% | With training data |
| DIN-SQL + GPT-4 | 85.3% | With training data |
| DB-GPT | 82.5% | Reported |
| Spider 2.0 (best) | 23.8% | Real-world enterprise complexity |
| Real enterprise datasets | ~24% | "NL2SQL is a solved problem... Not!" |
Critical insight: Academic benchmarks (85-91%) dramatically overstate real-world accuracy (~24% on enterprise schemas). Schema linking brittleness causes 10-20% accuracy drops on paraphrased queries. Models are only effective for ~20% of realistic user queries.
Part 2: Automatos Current State (Honest Assessment)
What Already Works (Strengths)
Automatos has a substantial, production-ready NL2SQL system across 16+ files in orchestrator/modules/nl2sql/:
| Capability | Status | Quality | Location |
|---|---|---|---|
| NL→SQL Generation | Working | Production | query/nl2sql_service.py (343 lines) |
| SQL Validation (3-tier) | Working | Excellent | query/validator.py (254 lines) |
| Schema Introspection | Working | Good | schema/introspection.py |
| Schema Provider + Cache | Working | Good | schema/provider.py (374 lines) |
| Smart Agent (multi-turn) | Working | Good | intelligence/agent.py (403 lines) |
| Query Clarification | Working | Good | intelligence/clarifier.py (261 lines) |
| Query Rephrasing | Working | Basic | intelligence/rephraser.py |
| Result Explanation | Working | Basic | intelligence/explainer.py |
| Visualization Suggestion | Working | Good | intelligence/visualizer.py (310 lines) |
| PandasAI Integration | Working | Good | tools/services/pandas_ai_service.py |
| Semantic Layer (metrics/dims) | Working | Good | core/models/database_knowledge.py |
| Database Tool Auto-creation | Working | Good | tools/services/database_tool_integration.py |
| Full REST API | Working | Complete | api/database_knowledge.py (612 lines) |
| Multi-tenant Isolation | Working | Excellent | Workspace-based throughout |
| Audit Trail | Working | Complete | database_query_audit table |
| Frontend UI | Working | Extensive | DatabaseQueryExplorer.tsx (NL input, SQL, results, viz), SemanticLayerBuilder, QueryTemplatesGrid, DataWidget (chatbot results with table/pagination/CSV export), DataVisualizationModal, AddDatabaseModal, plus 15 analytics components |
| Credential Management | Working | Secure | Encrypted, runtime-resolved |
| Query Templates (20+) | Working | Good | Per-dialect templates |
| Analytics Dashboard | Working | Basic | api/database_analytics.py |
Total: ~3,500+ lines of working NL2SQL code
What's Missing (Gaps vs Top Projects)
| # | Gap | Impact | Who Has It | Priority |
|---|---|---|---|---|
| 1 | RAG-Based Few-Shot Examples | Accuracy drops 15-25% without examples | Vanna, Dataherald | Critical |
| 2 | SQL Error Self-Correction | First-try failures become dead ends | Vanna (auto-train), DB-GPT, Dataherald | Critical |
| 3 | Golden SQL Training System | No way to improve over time | Vanna, Dataherald | High |
| 4 | Schema Linking Stage | Sending full schema wastes tokens, reduces accuracy | DB-GPT, CHASE-SQL, DAIL-SQL | High |
| 5 | Multi-Database Connectors | Only PostgreSQL + MySQL | DB-GPT (8), Dataherald (4) | Medium |
| 6 | MCP Server Exposure | NL2SQL not available via MCP | XiYan-SQL, DBHub | Medium |
| 7 | Query Confidence Scoring | Users can't judge result reliability | Dataherald | Medium |
| 8 | SQL Generation Benchmarking | No way to measure/track accuracy | All serious projects | Medium |
5 Bugs Found
| # | Bug | Severity | Location | Detail |
|---|---|---|---|---|
| 1 | No table relevance scoring — sends ALL tables to LLM | High | query/nl2sql_service.py:_get_relevant_tables() | Uses keyword matching only; needs embedding-based similarity |
| 2 | Clarifier patterns too aggressive — triggers on simple queries | Medium | intelligence/clarifier.py | Time-range pattern matches "last" in "last_name"; grouping matches "by" in "baby" |
| 3 | Validator allows subqueries with mutations | Medium | query/validator.py | Nested INSERT INTO ... SELECT bypasses single-statement check |
| 4 | Schema cache never invalidates on DDL changes | Medium | schema/provider.py | _schema_cache uses TTL but no hook for ALTER TABLE / new columns |
| 5 | Visualizer chart config missing axis labels | Low | intelligence/visualizer.py | _build_chart_config() returns x/y fields but no labels/titles for charts |
Part 3: Build vs. Adopt Analysis
The Question
"Is it worth dropping the in-house version and adopting a really good open-source project?"
Verdict: BUILD ON EXISTING (Enhance, Don't Replace)
Why NOT to adopt Vanna/Dataherald/DB-GPT:
- Multi-tenant isolation: None of the 9 projects support workspace-based multi-tenancy. You'd have to retrofit it — harder than building from scratch.
- Credential management: Automatos uses encrypted, runtime-resolved credentials via CredentialStore. Open-source projects use connection strings in config files.
- Agent tool integration: PRD-17 auto-creates 3 database tools per source. No open-source project has this.
- Audit trail: Your database_query_audit table with user/agent/session tracking is more complete than any open-source project.
- Architecture fit: FastAPI + SQLAlchemy + workspace isolation. Vanna is a standalone library, DB-GPT is a full platform, Dataherald is Docker-only. None fit cleanly into your modular architecture.
- License risk: PandasAI EE requires a commercial license. WrenAI's API is paid. DB-GPT is AGPL-friendly but a massive dependency.
- Migration cost: Estimated 80-120 hours to rip out the existing code, integrate, retrofit multi-tenancy, and restore feature parity.
What TO adopt (techniques, not codebases):
| Technique | From | Effort | Impact |
|---|---|---|---|
| RAG-based few-shot retrieval | Vanna | 8h | +15-25% accuracy |
| Golden SQL training pairs | Dataherald | 4h | Continuous improvement |
| SQL error self-correction loop | DB-GPT, Vanna | 6h | Reduces failure rate 30-50% |
| Explicit schema linking stage | CHASE-SQL, DB-GPT | 4h | Better token efficiency + accuracy |
| MCP tool exposure | XiYan-SQL, DBHub | 3h | New integration channel |
Bottom line: The existing implementation represents ~3,500 lines of working, multi-tenant, production-ready code. Adopting a replacement would cost more than enhancing what exists. Instead, adopt the techniques (RAG for SQL, Golden SQL, self-correction) that the top projects use.
Same Verdict for RAG (from PRD-60)
The RAG system has a similar story — substantial existing implementation (semantic chunker, hybrid search, ingestion pipeline). Adopt techniques (parent-child chunks, RRF, knowledge graph) not codebases.
Existing Frontend Reality
The NL2SQL frontend is already extensive — unlike RAG (where the DocumentWidget broke), the database UI is largely intact:
| Component | Location | What It Covers |
|---|---|---|
| DatabaseQueryExplorer.tsx | frontend/components/knowledge/ | NL query input, generated SQL display, query results table, visualization toggle, query history, source selection |
| SemanticLayerBuilder | frontend/components/knowledge/ | Define metrics, dimensions, business terms |
| QueryTemplatesGrid | frontend/components/knowledge/ | 20+ query templates per dialect |
| DataWidget/index.tsx | frontend/components/widgets/ | Chatbot widget — table with pagination, search, CSV export, sort, chart view |
| DataVisualizationModal | frontend/components/knowledge/ | Full chart rendering |
| SimpleDataVisualization | frontend/components/knowledge/ | Inline chart in query explorer |
| AddDatabaseModal | frontend/components/knowledge/ | Database connection setup |
| document-management.tsx | frontend/components/documents/ | Main orchestrator with tabs for all of the above |
Frontend work in this PRD is minimal — only 3 small additions:
- TrainingExamplesManager.tsx (Phase 5) — genuinely new, added as a tab in the existing structure
- Confidence badge in DatabaseQueryExplorer.tsx (Phase 6) — small addition to an existing component
- Benchmark charts in DatabaseQueryAnalytics.tsx (Phase 8) — small addition to an existing component
Part 4: Implementation Plan
Phase 1: RAG-Based Few-Shot Example Retrieval (8h) — CRITICAL
What: Store verified question/SQL pairs in vector store. At query time, retrieve the most similar examples and include them in the LLM prompt (like Vanna's core approach).
Why: This is the single highest-impact improvement. Vanna's entire value proposition is built on this pattern. Few-shot examples improve accuracy by 15-25% on real-world queries.
Backend Changes
New file: orchestrator/modules/nl2sql/training/example_store.py
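A minimal sketch of what the example store could look like, assuming an in-memory store ranked by cosine similarity. The toy bag-of-words _embed stands in for a real embedding model, and the names ExampleStore, add_example, and retrieve are illustrative, not the final API:

```python
import math
from collections import Counter


def _embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real store would call an embedding model.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ExampleStore:
    """Stores verified question/SQL pairs; retrieves the most similar ones."""

    def __init__(self):
        self._examples = []  # list of (embedding, question, sql)

    def add_example(self, question: str, sql: str) -> None:
        self._examples.append((_embed(question), question, sql))

    def retrieve(self, question: str, k: int = 3) -> list:
        q = _embed(question)
        ranked = sorted(self._examples, key=lambda e: _cosine(q, e[0]), reverse=True)
        return [(ex[1], ex[2]) for ex in ranked[:k]]
```

In production the vectors would live in the existing pgvector/embedding infrastructure rather than in memory.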
Modify: orchestrator/modules/nl2sql/query/nl2sql_service.py
Add few-shot examples to the SQL generation prompt:
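A sketch of the prompt assembly, assuming a hypothetical build_sql_prompt helper that receives (question, sql) pairs already retrieved from the example store:

```python
def build_sql_prompt(question: str, schema_ddl: str, examples: list) -> str:
    """Assemble a few-shot prompt: schema first, then verified examples, then
    the new question. `examples` is a list of (question, sql) pairs."""
    parts = ["You are an expert SQL generator.", "Schema:", schema_ddl]
    if examples:
        parts.append("Verified examples of similar questions:")
        for q, sql in examples:
            parts.append(f"-- Question: {q}\n{sql}")
    # The model completes from the trailing "SQL:" marker.
    parts.append(f"-- Question: {question}\nSQL:")
    return "\n\n".join(parts)
```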
Database Migration
API Endpoints
Files to modify:
- orchestrator/modules/nl2sql/query/nl2sql_service.py — inject examples into prompt
- orchestrator/api/database_knowledge.py — add example management endpoints
- orchestrator/core/models/database_knowledge.py — add SQLAlchemy model
Phase 2: SQL Error Self-Correction Loop (6h) — CRITICAL
What: When generated SQL fails execution, automatically retry with error context. If retry succeeds, optionally auto-save as training example (like Vanna's auto_train).
Why: Currently, a failed SQL query is a dead end. DB-GPT, Vanna, and Dataherald all implement self-correction. This alone can reduce failure rates by 30-50%.
Backend Changes
Modify: orchestrator/modules/nl2sql/service.py — query_database() method
Modify: orchestrator/modules/nl2sql/query/nl2sql_service.py
Add error context to prompt building:
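The correction loop and the error-context injection can be sketched together; generate_sql and execute_sql are stand-ins for the real service methods, and the retry policy shown is an assumption:

```python
def query_with_correction(question, generate_sql, execute_sql, max_retries=2):
    """Generate SQL, execute it, and on failure retry with the database error
    in the prompt context. `execute_sql` raises on failure."""
    error_context = None
    for attempt in range(max_retries + 1):
        sql = generate_sql(question, error_context)
        try:
            return {"sql": sql, "rows": execute_sql(sql), "attempts": attempt + 1}
        except Exception as exc:
            # Feed the failed SQL and error back so the next attempt can fix it.
            error_context = f"Previous SQL failed:\n{sql}\nError: {exc}"
    raise RuntimeError(f"SQL generation failed after {max_retries + 1} attempts")
```

On a successful retry, the corrected pair can optionally be saved as a training example (Vanna's auto-train behavior).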
Files to modify:
- orchestrator/modules/nl2sql/service.py — add correction loop
- orchestrator/modules/nl2sql/query/nl2sql_service.py — add error context to prompts
- orchestrator/api/database_knowledge.py — add auto_correct parameter to query endpoint
Phase 3: Explicit Schema Linking Stage (4h) — HIGH
What: Before SQL generation, run a dedicated schema linking step that identifies the most relevant tables/columns for the query. Send only relevant schema to the LLM instead of the full schema.
Why: Current implementation uses keyword matching in _get_relevant_tables() which is brittle. DB-GPT and CHASE-SQL use explicit schema linking as a separate pipeline stage. This improves both accuracy (less noise) and token efficiency (smaller prompts).
Backend Changes
New file: orchestrator/modules/nl2sql/query/schema_linker.py
Modify: orchestrator/modules/nl2sql/query/nl2sql_service.py
Replace _get_relevant_tables() with schema linker:
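A sketch of the linker, with plain token overlap standing in for embedding-based scoring; the SchemaLinker.link signature is illustrative:

```python
import re


class SchemaLinker:
    """Select only the tables/columns relevant to a question.

    A production version would combine embedding similarity with the semantic
    layer; here token overlap against table/column names stands in for it.
    """

    def __init__(self, schema: dict):
        self.schema = schema  # {table_name: [column, ...]}

    def link(self, question: str, top_k: int = 3) -> dict:
        tokens = set(re.findall(r"[a-z]+", question.lower()))
        scores = {}
        for table, columns in self.schema.items():
            names = set(re.split(r"[_\W]+", table.lower()))
            for col in columns:
                names |= set(re.split(r"[_\W]+", col.lower()))
            scores[table] = len(tokens & names)
        relevant = sorted(scores, key=scores.get, reverse=True)[:top_k]
        # Only schema that actually matched is sent to the LLM.
        return {t: self.schema[t] for t in relevant if scores[t] > 0}
```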
Files to modify:
- orchestrator/modules/nl2sql/query/nl2sql_service.py — replace _get_relevant_tables()
- orchestrator/modules/nl2sql/__init__.py — export SchemaLinker
Phase 4: Bug Fixes (3h) — HIGH
Fix the 5 bugs identified during the code audit.
Bug 1: Table Relevance Scoring (in nl2sql_service.py)
Current: Keyword-only matching in _get_relevant_tables() — sends all tables if no keywords match.
Fix: Use embedding similarity + keyword matching. This is partially addressed by Phase 3 (Schema Linker), but as an immediate fix:
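One possible interim fix, sketched with a hypothetical rank_tables helper: combine a keyword hit with name-overlap similarity and cap the table count rather than falling back to the full schema:

```python
def rank_tables(question: str, tables: dict, max_tables: int = 5) -> list:
    """Hybrid relevance: exact-name keyword hits weighted with token overlap,
    plus a hard cap instead of the ALL-tables fallback."""
    q_tokens = set(question.lower().split())
    scored = []
    for name, columns in tables.items():
        parts = set(name.lower().split("_"))
        for col in columns:
            parts |= set(col.lower().split("_"))
        keyword = 2.0 if name.lower() in q_tokens else 0.0
        overlap = len(q_tokens & parts) / (len(parts) or 1)
        scored.append((keyword + overlap, name))
    scored.sort(reverse=True)
    # Never send everything: cap at max_tables even when nothing matches well.
    return [name for _, name in scored[:max_tables]]
```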
Bug 2: Clarifier False Positives (in clarifier.py)
Current: Regex patterns match inside words — "last_name" triggers time_range, "baby" triggers grouping.
Fix: Use word boundary matching:
File: orchestrator/modules/nl2sql/intelligence/clarifier.py
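The fix in miniature; the pattern names below are illustrative, not the clarifier's actual constants:

```python
import re

# Before: substring matching fires on "last" inside "last_name".
naive_time_range = re.compile(r"last|recent|since")

# After: \b word boundaries only match whole words, and "_" counts as a word
# character, so "last_name" no longer triggers the time-range clarifier.
time_range = re.compile(r"\b(last|recent|since)\b")
grouping = re.compile(r"\b(by|per|group)\b")
```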
Bug 3: Subquery Mutation Bypass (in validator.py)
Current: Only checks top-level for INSERT/UPDATE/DELETE. Nested subqueries can contain mutations.
Fix: Parse recursively:
File: orchestrator/modules/nl2sql/query/validator.py
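A sketch of the idea without a parser dependency: strip string literals, then reject mutation keywords anywhere in the statement, not just at the top level (the production validator should still parse properly; contains_mutation is illustrative):

```python
import re

MUTATION_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create)\b", re.IGNORECASE
)


def _strip_literals(sql: str) -> str:
    # Remove single-quoted strings so keywords inside literals can't false-positive.
    return re.sub(r"'(?:[^']|'')*'", "''", sql)


def contains_mutation(sql: str) -> bool:
    """True if a mutation keyword appears anywhere, including in subqueries."""
    return bool(MUTATION_KEYWORDS.search(_strip_literals(sql)))
```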
Bug 4: Schema Cache Invalidation (in provider.py)
Fix: Add cache invalidation method + hook on introspection:
File: orchestrator/modules/nl2sql/schema/provider.py
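A sketch of the provider with an explicit invalidate() hook alongside the existing TTL; introspect stands in for the real introspection call:

```python
import time


class SchemaProvider:
    """TTL-cached schema access plus an explicit invalidation hook."""

    def __init__(self, introspect, ttl_seconds: float = 300.0):
        self._introspect = introspect
        self._ttl = ttl_seconds
        self._cache = {}  # source_id -> (fetched_at, schema)

    def get_schema(self, source_id: str):
        entry = self._cache.get(source_id)
        if entry and time.monotonic() - entry[0] < self._ttl:
            return entry[1]
        schema = self._introspect(source_id)
        self._cache[source_id] = (time.monotonic(), schema)
        return schema

    def invalidate(self, source_id: str = None) -> None:
        # Called when DDL changes are detected, or from a manual refresh endpoint.
        if source_id is None:
            self._cache.clear()
        else:
            self._cache.pop(source_id, None)
```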
Bug 5: Chart Config Missing Labels (in visualizer.py)
Fix: Add axis labels to chart config:
File: orchestrator/modules/nl2sql/intelligence/visualizer.py
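A sketch of the fixed config builder; deriving labels by humanizing the field names is an assumption about the desired behavior:

```python
def build_chart_config(x_field: str, y_field: str, chart_type: str = "bar") -> dict:
    """Chart config including axis labels and a title derived from field names."""
    def humanize(field: str) -> str:
        return field.replace("_", " ").title()

    return {
        "type": chart_type,
        "x": x_field,
        "y": y_field,
        "title": f"{humanize(y_field)} by {humanize(x_field)}",
        "axis_labels": {"x": humanize(x_field), "y": humanize(y_field)},
    }
```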
Phase 5: Golden SQL Training System (4h) — HIGH
What: A complete system for managing verified question/SQL pairs, inspired by Dataherald's "Golden SQL" concept. Users can verify auto-generated pairs, import external examples, and track which examples are most effective.
Frontend: Training Examples Manager
New component: frontend/components/knowledge/TrainingExamplesManager.tsx
Integration note: The existing document-management.tsx orchestrator already has 10+ tab panels (CodeGraphPanel, MultimodalKnowledgePanel, DatabaseQueryExplorer, QueryTemplatesGrid, SemanticLayerBuilder, etc.). This new component should be added as a new tab/panel within that existing structure, NOT as a standalone page.
The existing DatabaseQueryExplorer.tsx already has NL query input, generated SQL display, query results, visualization toggle, and query history. The TrainingExamplesManager should complement it — accessed via a "Training Examples" tab or button within the database section.
Features:
List all training examples for a database source
Filter by verified/unverified/auto-generated
Inline edit SQL
One-click verify button
Import from CSV/JSON
Usage statistics (how often each example is retrieved)
Delete/archive examples
API Endpoints (in api/database_knowledge.py)
Files to create:
frontend/components/knowledge/TrainingExamplesManager.tsx
Files to modify:
orchestrator/api/database_knowledge.py— add endpointsorchestrator/core/models/database_knowledge.py— add NL2SQLTrainingExample model
Phase 6: Query Confidence Scoring (3h) — MEDIUM
What: Assign a confidence score to each generated SQL query before execution, so users know how reliable the result is likely to be.
Why: Dataherald implements this. Users need to know when to trust automated SQL and when to review manually.
Scoring Factors
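As a sketch of how such factors could combine — the weights and the factor set below are assumptions, not a finalized formula:

```python
def score_confidence(*, schema_link_score: float, example_similarity: float,
                     validator_warnings: int, used_clarification: bool) -> float:
    """Combine plausible signals into a 0-1 confidence score.

    Assumed factors: how well the schema linker matched, how similar the best
    retrieved training example was, validator warnings, and whether the query
    went through clarification.
    """
    score = 0.4 * schema_link_score + 0.4 * example_similarity
    score += 0.2 if not validator_warnings else max(0.0, 0.2 - 0.05 * validator_warnings)
    if used_clarification:
        score = min(1.0, score + 0.05)  # clarified queries tend to be better specified
    return round(min(max(score, 0.0), 1.0), 2)
```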
Files to create:
orchestrator/modules/nl2sql/query/confidence.py
Files to modify:
- orchestrator/modules/nl2sql/service.py — integrate scorer
- orchestrator/api/database_knowledge.py — return confidence in query response
- frontend/components/knowledge/DatabaseQueryExplorer.tsx — display confidence badge (small addition — component already has NL input, SQL display, results table, chart toggle, and query history)
Phase 7: MCP Server Exposure (3h) — MEDIUM
What: Expose NL2SQL as an MCP tool so external agents (Claude Desktop, Cursor, etc.) can query databases through Automatos.
Why: XiYan-SQL and DBHub prove that MCP is the dominant integration pattern for database tools. Automatos already has MCP gateway infrastructure (PRD-33).
MCP Tool Definitions
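A sketch of what three exposed tools could look like as MCP tool schemas; the tool names, descriptions, and input schemas here are assumptions, not the shipped gateway config:

```python
# Illustrative MCP tool definitions mirroring the existing REST surface.
NL2SQL_MCP_TOOLS = [
    {
        "name": "query_database",
        "description": "Ask a natural-language question against a registered database source.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "source_id": {"type": "string"},
                "question": {"type": "string"},
                "auto_correct": {"type": "boolean", "default": True},
            },
            "required": ["source_id", "question"],
        },
    },
    {
        "name": "list_database_sources",
        "description": "List database sources available in the caller's workspace.",
        "inputSchema": {"type": "object", "properties": {}},
    },
    {
        "name": "get_database_schema",
        "description": "Return the (cached) schema for a database source.",
        "inputSchema": {
            "type": "object",
            "properties": {"source_id": {"type": "string"}},
            "required": ["source_id"],
        },
    },
]
```

Workspace isolation and credential resolution stay server-side; the MCP caller never sees connection details.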
Files to modify:
- orchestrator/modules/tools/services/database_tool_integration.py — add MCP tool definitions
- MCP gateway configuration
Phase 8: Benchmarking & Monitoring (4h) — MEDIUM
What: Implement a benchmarking system to measure and track NL2SQL accuracy over time.
Why: "What gets measured gets improved." Without benchmarks, you can't know if changes help or hurt.
Benchmark Test Suite
New file: orchestrator/modules/nl2sql/benchmarks/runner.py
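A sketch of an execution-accuracy runner: a case passes when the generated SQL returns the same result set as the gold SQL, order-insensitively (generate_sql and execute_sql are stand-ins for the real services):

```python
def run_benchmark(cases, generate_sql, execute_sql):
    """Run question/gold-SQL cases and report execution accuracy.

    Each case is {"question": ..., "gold_sql": ...}; results are compared as
    sorted tuples so row order doesn't matter.
    """
    passed, failures = 0, []
    for case in cases:
        predicted = generate_sql(case["question"])
        try:
            got = sorted(map(tuple, execute_sql(predicted)))
            want = sorted(map(tuple, execute_sql(case["gold_sql"])))
            if got == want:
                passed += 1
            else:
                failures.append({"question": case["question"], "reason": "wrong_result"})
        except Exception as exc:
            failures.append({"question": case["question"], "reason": f"error: {exc}"})
    return {"accuracy": passed / len(cases) if cases else 0.0, "failures": failures}
```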
API Endpoints
Monitoring Dashboard
Add to existing DatabaseQueryAnalytics.tsx (part of the 15 existing analytics components):
Accuracy trend chart (benchmark scores over time)
Error category breakdown (syntax, schema linking, wrong aggregation, etc.)
Latency percentiles (p50, p95, p99)
Most common failure patterns
Note: The chatbot DataWidget (frontend/components/widgets/DataWidget/index.tsx) already displays NL2SQL results with table, pagination (25 rows/page), search, CSV export, sort, and chart view toggle. No changes needed to the chatbot widget — enhancements are to the analytics/management UI only.
Files to create:
- orchestrator/modules/nl2sql/benchmarks/runner.py
- orchestrator/modules/nl2sql/benchmarks/comparator.py
Files to modify:
- orchestrator/api/database_knowledge.py — add benchmark endpoints
- frontend/components/context/DatabaseQueryAnalytics.tsx — add accuracy charts
Priority Matrix
| Phase | Effort | Impact | Priority | Dependencies |
|---|---|---|---|---|
| 1. RAG Few-Shot Examples | 8h | Critical (+15-25% accuracy) | P0 | None |
| 2. Error Self-Correction | 6h | Critical (-30-50% failures) | P0 | None |
| 3. Schema Linking | 4h | High (token efficiency + accuracy) | P1 | None |
| 4. Bug Fixes | 3h | High (correctness) | P1 | None |
| 5. Golden SQL Training | 4h | High (continuous improvement) | P1 | Phase 1 |
| 6. Confidence Scoring | 3h | Medium (user trust) | P2 | Phase 3 |
| 7. MCP Exposure | 3h | Medium (integration channel) | P2 | None |
| 8. Benchmarking | 4h | Medium (measurability) | P2 | Phase 5 |
MVP (Phases 1-4): 21 hours — gets you to competitive parity with Vanna/Dataherald.
Full (all phases): 35 hours — best-in-class for a multi-tenant NL2SQL platform.
Competitive Comparison (After Implementation)
| Feature | Automatos (Now) | Automatos (After) | Vanna | Dataherald | DB-GPT |
|---|---|---|---|---|---|
| NL→SQL Generation | Basic LLM | RAG-enhanced + few-shot | RAG-based | Agent-based | Schema-linked |
| Error Self-Correction | None | Auto-retry + context | Auto-train | Agent retry | LLM feedback |
| Training Data Mgmt | None | Golden SQL system | train() API | Golden SQL API | Fine-tuning hub |
| Schema Linking | Keyword-only | Embedding + semantic layer | DDL retrieval | Schema+column tools | Dedicated stage |
| SQL Validation | Excellent (3-tier) | Excellent (3-tier + subquery fix) | Basic | Basic | Basic |
| Multi-tenant | Full isolation | Full isolation | None | None | None |
| Audit Trail | Complete | Complete | None | Basic | None |
| Confidence Scoring | None | Multi-factor scoring | None | Partial | None |
| MCP Tools | None | 3 tools exposed | None | None | None |
| Benchmarking | None | Automated + tracking | None | None | Fine-tuning eval |
| Databases | 2 (PG, MySQL) | 2 (PG, MySQL) | 8+ | 4 | 8 |
| Agent Tools | Auto-created | Auto-created + MCP | None | 7 agent tools | Agent framework |
| Visualization | 10 chart types | 10 chart types + labels | Plotly | None | GPT-Vis |
| Semantic Layer | Metrics + dimensions | Metrics + dimensions | None | None | None |
| License | Proprietary | Proprietary | MIT | MIT | MIT |
Files Summary
New Files
| File | Phase | Purpose |
|---|---|---|
| orchestrator/modules/nl2sql/training/example_store.py | 1 | RAG-based example storage/retrieval |
| orchestrator/modules/nl2sql/training/__init__.py | 1 | Package init |
| orchestrator/modules/nl2sql/query/schema_linker.py | 3 | Explicit schema linking stage |
| orchestrator/modules/nl2sql/query/confidence.py | 6 | Query confidence scoring |
| orchestrator/modules/nl2sql/benchmarks/runner.py | 8 | Benchmark test runner |
| orchestrator/modules/nl2sql/benchmarks/comparator.py | 8 | SQL result comparison |
| orchestrator/modules/nl2sql/benchmarks/__init__.py | 8 | Package init |
| frontend/components/knowledge/TrainingExamplesManager.tsx | 5 | Training examples UI |
Modified Files
| File | Phase(s) | Changes |
|---|---|---|
| orchestrator/modules/nl2sql/service.py | 2, 6 | Add correction loop, confidence scoring |
| orchestrator/modules/nl2sql/query/nl2sql_service.py | 1, 2, 3, 4 | Few-shot injection, error context, schema linker, bug fix |
| orchestrator/modules/nl2sql/query/validator.py | 4 | Subquery mutation fix |
| orchestrator/modules/nl2sql/intelligence/clarifier.py | 4 | Word boundary fix |
| orchestrator/modules/nl2sql/intelligence/visualizer.py | 4 | Chart labels fix |
| orchestrator/modules/nl2sql/schema/provider.py | 4 | Cache invalidation |
| orchestrator/modules/nl2sql/__init__.py | 1, 3, 6 | Export new classes |
| orchestrator/api/database_knowledge.py | 1, 2, 5, 6, 8 | New endpoints |
| orchestrator/core/models/database_knowledge.py | 1, 5 | New models |
| orchestrator/modules/tools/services/database_tool_integration.py | 7 | MCP tool definitions |
| frontend/components/knowledge/DatabaseQueryExplorer.tsx | 6 | Confidence display |
| frontend/components/context/DatabaseQueryAnalytics.tsx | 8 | Benchmark charts |
Database Migration
| Table | Phase | Purpose |
|---|---|---|
| nl2sql_training_examples | 1 | Training example storage |
| nl2sql_benchmark_runs | 8 | Benchmark history |
| nl2sql_benchmark_results | 8 | Individual benchmark results |
Success Criteria
Out of Scope (Future PRDs)
Additional database connectors (Snowflake, BigQuery, Redshift, ClickHouse)
Fine-tuned model training (like SQLCoder approach)
Multi-database JOINs (cross-source queries)
Natural language to NoSQL (MongoDB, etc.)
Real-time schema change detection (webhooks from database)
A/B testing different generation strategies
Query plan optimization suggestions
Row-level security integration
Estimated Total Effort: 35-45 hours
MVP (Phases 1-4): 21 hours
Priority: High
Dependencies: PRD-21 (completed), PRD-30 (completed)