Knowledge Graph & Entity Extraction
Purpose and Scope
The Knowledge Graph & Entity Extraction system extracts structured entities and relationships from unstructured documents, enabling semantic search, query understanding, and intelligent retrieval beyond simple keyword matching. This page covers:
Document Entity Extraction: Extracting entities (technologies, concepts, people, organizations) from uploaded documents using regex and LLM methods
Relationship Mapping: Identifying relationships between entities (uses, created_by, is_part_of, etc.)
Knowledge Graph Storage: Database schema for storing entities, mentions, and relationships
CodeGraph Integration: Code-specific knowledge graph for structural analysis (see Workspace Execution for code execution context)
For document ingestion and chunking, see Document Ingestion Pipeline. For semantic retrieval using extracted entities, see RAG Retrieval System.
System Architecture
The knowledge graph system operates as a post-processing layer on top of document ingestion, enriching chunks with structured metadata for improved query understanding.
Sources: orchestrator/modules/search/services/entity_extractor.py:1-403
Entity Extraction Pipeline
EntityExtractor Class
The EntityExtractor class orchestrates a two-phase extraction process combining fast regex patterns with accurate LLM-based extraction.
Core Methods:
extract_entities(): main entry point; orchestrates regex + LLM extraction (~2-5s per document)
_extract_with_regex(): fast pattern matching for common entities (<100ms)
_extract_with_llm(): accurate extraction using GPT-4o-mini (~2-4s)
extract_relationships(): identifies entity-entity relationships (~2-3s)
_deduplicate_entities(): merges duplicates by canonical name (<10ms)
Sources: orchestrator/modules/search/services/entity_extractor.py:40-88
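The flow through these methods can be sketched as follows. This is a simplified stand-in, not the actual implementation: the regex and LLM phases are stubbed, and only the canonical-name merge in _deduplicate_entities is shown in full.

```python
# Illustrative skeleton of the orchestration only; the two phases are stubs.

def _extract_with_regex(text: str) -> list[dict]:
    # phase 1 stub: the real version does fast pattern matching (<100ms)
    return [{"name": "React", "type": "technology", "confidence": 0.5}]

def _extract_with_llm(text: str) -> list[dict]:
    # phase 2 stub: the real version calls GPT-4o-mini (~2-4s)
    return [{"name": "react", "type": "technology", "confidence": 0.9}]

def _deduplicate_entities(entities: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for entity in entities:
        canonical = entity["name"].lower().strip()
        # on a duplicate, keep whichever copy has the higher confidence
        if canonical not in merged or entity["confidence"] > merged[canonical]["confidence"]:
            merged[canonical] = entity
    return list(merged.values())

def extract_entities(text: str) -> list[dict]:
    return _deduplicate_entities(_extract_with_regex(text) + _extract_with_llm(text))
```

Because both phases here emit the same entity under different casing, the merge keeps the single higher-confidence LLM copy.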
Two-Phase Extraction Strategy
Phase 1: Regex-Based Extraction
Fast pattern matching for common entity types:
Technology names: Capitalized words with optional versions (e.g., "React 18", "PostgreSQL")
Acronyms: 2-5 capital letters (e.g., "RAG", "API", "LLM")
Filtering: Excludes common words ("The", "This", "When")
Regex patterns achieve 0.5-0.6 confidence scores due to higher false positive rates.
Sources: orchestrator/modules/search/services/entity_extractor.py:90-121
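A minimal sketch of what such patterns might look like; the exact regexes in entity_extractor.py are not shown on this page, so these patterns are assumptions.

```python
import re

# Hypothetical approximations of the patterns described above.
TECH_PATTERN = re.compile(r"\b[A-Z][A-Za-z+#.]*(?:\s\d+(?:\.\d+)*)?\b")  # "React 18", "PostgreSQL"
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,5}\b")                          # "RAG", "API", "LLM"
COMMON_WORDS = {"The", "This", "When"}  # filtered to reduce false positives

def regex_entities(text: str) -> set[str]:
    matches = set(TECH_PATTERN.findall(text)) | set(ACRONYM_PATTERN.findall(text))
    return {m for m in matches if m not in COMMON_WORDS}
```

Capitalization-based matching is what keeps this phase under 100ms, at the cost of the higher false-positive rate noted above.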
Phase 2: LLM-Based Extraction
For documents under 8,000 characters, this phase uses GPT-4o-mini with a structured prompt.
LLM extraction achieves 0.9 confidence with lower false positive rates and better entity type classification.
Sources: orchestrator/modules/search/services/entity_extractor.py:123-183
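A hedged sketch of the structured-prompt approach, using a canned response in place of a live GPT-4o-mini call; the prompt wording and response schema here are assumptions, not the actual prompt.

```python
import json

def build_extraction_prompt(text: str, max_chars: int = 8000) -> str:
    # truncate long documents before sending to the model
    return (
        "Extract named entities from the text below. Respond with JSON: "
        '[{"name": "...", "type": "technology|concept|organization|person|location|event", '
        '"confidence": 0.0}]\n\n' + text[:max_chars]
    )

def parse_llm_response(raw: str) -> list[dict]:
    # malformed or non-list output yields no entities rather than an error
    try:
        entities = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(entities, list):
        return []
    return [e for e in entities if isinstance(e, dict) and e.get("name") and e.get("type")]

# canned response standing in for a live GPT-4o-mini call
canned = '[{"name": "React", "type": "technology", "confidence": 0.9}]'
```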
Entity Data Model
Entity Types:
technology: software, frameworks, tools (e.g., "React", "PostgreSQL", "Docker")
concept: algorithms, methodologies (e.g., "Machine Learning", "RAG", "Semantic Search")
organization: companies, institutions (e.g., "OpenAI", "Google", "MIT")
person: individuals (e.g., "John Doe", "Researcher Name")
location: physical locations (e.g., "San Francisco", "USA")
event: conferences, launches (e.g., "NeurIPS 2024", "Product Launch")
Sources: orchestrator/modules/search/services/entity_extractor.py:18-38
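The entity shape implied by this type list can be sketched as a dataclass; the field names here are illustrative assumptions, not the source's exact model.

```python
from dataclasses import dataclass

# Valid types, taken from the list above.
ENTITY_TYPES = {"technology", "concept", "organization", "person", "location", "event"}

@dataclass
class ExtractedEntity:
    name: str                 # surface form, e.g. "React 18"
    entity_type: str          # one of ENTITY_TYPES
    confidence: float         # 0.5-0.6 for regex, ~0.9 for LLM
    canonical_name: str = ""  # normalized lowercase form used for deduplication

    def __post_init__(self):
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if not self.canonical_name:
            self.canonical_name = self.name.lower().strip()
```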
Relationship Extraction
The extract_relationships() method identifies semantic connections between entities using LLM analysis.
Relationship Types
Relationship Type Definitions:
is_part_of: component or subset relationship (e.g., "Neural Networks" → "Machine Learning")
uses: dependency or usage (e.g., "React" → "JavaScript")
created_by: authorship or development (e.g., "GPT-4" → "OpenAI")
improves: enhancement or optimization (e.g., "GPT-4" → "GPT-3.5")
depends_on: required dependency (e.g., "Frontend" → "API")
alternative_to: competing or substitute (e.g., "PostgreSQL" → "MySQL")
related_to: general semantic connection (e.g., "Docker" → "Kubernetes")
Sources: orchestrator/modules/search/services/entity_extractor.py:185-270
Relationship Extraction Process
Algorithm:
Entity Filtering: Limit to top 20 entities (avoid token limits)
LLM Prompt: Send entity list + text to GPT-4o-mini
JSON Parsing: Extract structured relationships from response
Evidence Capture: Store supporting text snippets (max 500 chars)
Strength Assignment: Default 0.8 strength for LLM-identified relationships
Sources: orchestrator/modules/search/services/entity_extractor.py:185-270
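The five steps above can be sketched with a canned LLM response in place of a real API call; the source/target/type/evidence field names are assumptions about the JSON schema.

```python
import json

def extract_relationships(entities: list[str], llm_response: str,
                          max_entities: int = 20) -> list[dict]:
    top = set(entities[:max_entities])        # step 1: cap the entity list at 20
    # step 2 (sending entities + text to GPT-4o-mini) happens before this point
    try:
        raw = json.loads(llm_response)        # step 3: parse structured JSON
    except json.JSONDecodeError:
        return []
    relationships = []
    for rel in raw if isinstance(raw, list) else []:
        if rel.get("source") in top and rel.get("target") in top:
            relationships.append({
                "source": rel["source"],
                "target": rel["target"],
                "type": rel.get("type", "related_to"),
                "evidence": rel.get("evidence", "")[:500],  # step 4: cap evidence at 500 chars
                "strength": rel.get("strength", 0.8),       # step 5: default strength 0.8
            })
    return relationships
```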
Knowledge Graph Storage Schema
The knowledge graph uses three PostgreSQL tables with workspace isolation.
Table Descriptions
kb_entities: Core entity records with deduplication by canonical name
Unique constraint: canonical_name (normalized lowercase)
Mention count: incremented on each new mention (PageRank-like frequency signal)
Embedding: optional vector for semantic entity search
Workspace isolation: workspace_id for multi-tenancy
knowledge_entity_mentions: Links entities to document chunks
Position tracking: position_in_source for entity location
Context window: surrounding text for disambiguation
Extraction method: "llm" or "regex" for quality tracking
Composite unique constraint: (knowledge_item_id, entity_id, position_in_source)
entity_relationships: Entity-to-entity connections
Strength: 0.0-1.0 confidence score
Evidence: Text snippet supporting the relationship
Upsert logic: On conflict, keep stronger relationship and merge evidence
Sources: orchestrator/modules/search/services/entity_extractor.py:304-402
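A SQLite approximation of the three tables; column names follow the descriptions above, but types and constraints are simplified relative to the real PostgreSQL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kb_entities (
    id INTEGER PRIMARY KEY,
    workspace_id TEXT NOT NULL,
    canonical_name TEXT NOT NULL,       -- normalized lowercase
    entity_type TEXT,
    mention_count INTEGER DEFAULT 1,    -- frequency signal
    UNIQUE (workspace_id, canonical_name)
);
CREATE TABLE knowledge_entity_mentions (
    knowledge_item_id INTEGER,
    entity_id INTEGER REFERENCES kb_entities(id),
    position_in_source INTEGER,
    context TEXT,                       -- surrounding text for disambiguation
    extraction_method TEXT,             -- "llm" or "regex"
    UNIQUE (knowledge_item_id, entity_id, position_in_source)
);
CREATE TABLE entity_relationships (
    source_entity_id INTEGER REFERENCES kb_entities(id),
    target_entity_id INTEGER REFERENCES kb_entities(id),
    relationship_type TEXT,
    strength REAL,                      -- 0.0-1.0 confidence
    evidence TEXT,                      -- supporting snippet
    UNIQUE (source_entity_id, target_entity_id, relationship_type)
);
""")
```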
Database Operations
Helper Functions
Upsert Logic:
Query by canonical_name (case-insensitive)
If exists: increment mention_count, return the existing id
If new: insert the entity, return the new id
Deduplication Strategy: All variations map to a single canonical form (e.g., "react", "React", "REACT" → "react").
Mention linking connects an entity to a document chunk with:
Surrounding context (150 char window)
Confidence score from extraction method
Position for ordered mentions
Extraction method for provenance
Conflict handling: ON CONFLICT DO NOTHING prevents duplicate mention records.
Relationship storage creates or updates relationships:
Resolves entity names to IDs via canonical name lookup
On conflict: keeps stronger relationship, merges evidence text
Workspace-scoped for multi-tenancy
Sources: orchestrator/modules/search/services/entity_extractor.py:304-402
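The upsert behavior can be demonstrated on a minimal SQLite stand-in for kb_entities; the real code targets PostgreSQL and the table shape here is simplified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE kb_entities (
    id INTEGER PRIMARY KEY,
    canonical_name TEXT UNIQUE,
    mention_count INTEGER DEFAULT 1)""")

def upsert_entity(conn: sqlite3.Connection, name: str) -> int:
    canonical = name.lower().strip()   # "React", "REACT" -> "react"
    conn.execute(
        """INSERT INTO kb_entities (canonical_name) VALUES (?)
           ON CONFLICT (canonical_name)
           DO UPDATE SET mention_count = mention_count + 1""",
        (canonical,),
    )
    row = conn.execute(
        "SELECT id FROM kb_entities WHERE canonical_name = ?", (canonical,)
    ).fetchone()
    return row[0]

first = upsert_entity(conn, "React")
second = upsert_entity(conn, "REACT")   # same id; mention_count incremented
```

All casing variants map to one row, which is what prevents entity explosion.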
CodeGraph: Code-Specific Knowledge Graph
While document entity extraction focuses on concepts and technologies, CodeGraph builds a structural knowledge graph of code symbols using static analysis and PageRank ranking.
CodeGraph vs Document Entities
In each row below, the first value describes document entities and the second describes CodeGraph.
Source: uploaded documents (PDF, MD, DOCX) vs. indexed codebases (GitHub repos)
Entities: concepts, technologies, people, orgs vs. functions, classes, methods, interfaces
Relationships: semantic (uses, created_by, related_to) vs. structural (calls, imports, inherits)
Ranking: mention count vs. PageRank on the call graph
Storage: kb_entities, entity_relationships vs. codegraph_symbols, codegraph_relationships
Tools: search_knowledge, semantic_search vs. search_codebase, get_call_graph
Sources: orchestrator/modules/codegraph/ranking/pagerank_ranker.py:1-117
CodeGraph Architecture
Sources: orchestrator/modules/codegraph/ranking/pagerank_ranker.py:1-117, orchestrator/modules/agents/services/agent_platform_tools.py:98-206
PageRankRanker Class
The PageRankRanker implements Aider-inspired structural importance ranking, achieving 4-6% context usage vs 54-70% with naive approaches.
Algorithm:
Build DiGraph: Create NetworkX directed graph from call/import relationships
Run PageRank: Standard algorithm with damping factor α=0.85
Sort by Rank: Order symbols by importance score (descending)
Fit Budget: Iteratively add symbols until token budget exhausted
Return Top-K: Most structurally important symbols within budget
Token Estimation: ~4 chars per token heuristic for signatures + docstrings.
Sources: orchestrator/modules/codegraph/ranking/pagerank_ranker.py:20-116
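A pure-Python sketch of the ranking loop; the real ranker delegates to NetworkX's PageRank, so this power-iteration version is an illustration only. The damping factor and the ~4 chars/token heuristic come from the text above.

```python
def pagerank(edges: list[tuple[str, str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power-iteration PageRank over a directed call/import graph."""
    nodes = {n for e in edges for n in e}
    out_links: dict[str, list[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:  # dangling node: spread its rank uniformly
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

def fit_budget(ranked: list[tuple[str, str]], budget_tokens: int) -> list[str]:
    """Add symbols in rank order until the token budget is exhausted."""
    selected, used = [], 0
    for symbol, text in ranked:
        cost = len(text) // 4   # ~4 chars per token heuristic
        if used + cost > budget_tokens:
            break
        selected.append(symbol)
        used += cost
    return selected
```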
Integration with RAG System
Entity extraction enhances retrieval through three mechanisms:
1. Query Enhancement
When users query the knowledge base, entity lookup expands the query with related entities.
2. Semantic Filtering
Document chunks tagged with entities enable precise filtering of retrieval candidates.
3. Relationship-Based Context
When retrieving a chunk about "React", related entities such as "JavaScript" are automatically pulled into context.
Sources: orchestrator/modules/rag/service.py:210-294
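The first two mechanisms can be illustrated with in-memory stand-ins; the RELATIONSHIPS and CHUNK_ENTITIES structures here are simplified assumptions, not the RAG service's real data access.

```python
# Toy knowledge graph fragments for illustration.
RELATIONSHIPS = {
    ("react", "javascript"): "uses",
    ("react", "frontend"): "related_to",
}
CHUNK_ENTITIES = {
    "chunk-1": {"react", "javascript"},
    "chunk-2": {"postgresql"},
}

def expand_query(query_entities: set[str]) -> set[str]:
    # 1. query enhancement: pull in entities linked to the query terms
    expanded = set(query_entities)
    for (src, dst), _rel in RELATIONSHIPS.items():
        if src in query_entities:
            expanded.add(dst)
        if dst in query_entities:
            expanded.add(src)
    return expanded

def filter_chunks(entities: set[str]) -> list[str]:
    # 2. semantic filtering: keep chunks tagged with any matching entity
    return [cid for cid, tags in CHUNK_ENTITIES.items() if tags & entities]
```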
Agent Platform Tools
Agents access knowledge graphs via the AgentPlatformTools service.
Document Knowledge Graph Tools
Uses RAG service with entity-enhanced retrieval.
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:60-76
CodeGraph Tools
search_codebase: Fuzzy/semantic symbol search with PageRank ranking
get_call_graph: Traverse dependencies in both directions
analyze_architecture: High-level codebase overview with PageRank
Returns module structure, key classes, dependency patterns, and top-referenced symbols.
Sources: orchestrator/modules/agents/services/agent_platform_tools.py:98-206
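A hypothetical sketch of a get_call_graph-style traversal; the tool's real parameters and return shape are not shown on this page, so the names and CALLS data here are assumptions.

```python
from collections import deque

CALLS = {  # caller -> callees (toy data)
    "handler": ["service"],
    "service": ["db_query", "cache_get"],
}

def traverse(symbol: str, direction: str = "down", max_depth: int = 2) -> set[str]:
    """BFS over the call graph: 'down' follows callees, 'up' follows callers."""
    if direction == "down":
        graph = CALLS
    else:  # invert the graph: callee -> callers
        graph = {callee: [caller for caller, cs in CALLS.items() if callee in cs]
                 for cs_list in CALLS.values() for callee in cs_list}
    seen, queue = set(), deque([(symbol, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```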
Usage Examples
Extracting Entities from a Document
Extracting Relationships
Storing in Database
Querying the Knowledge Graph
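The four steps above can be covered by one end-to-end sketch using simplified stand-ins: regex-only extraction, a hard-coded relationship in place of the LLM phase, and SQLite in place of PostgreSQL.

```python
import re
import sqlite3

# 1. Extract entities from a document (regex phase only)
text = "React 18 uses JavaScript. React powers many frontends."
entities = {m.lower() for m in re.findall(r"\b[A-Z][a-zA-Z]+\b", text)
            if m not in {"The", "This", "When"}}

# 2. Extract relationships (hard-coded stand-in for the LLM phase)
relationships = [("react", "uses", "javascript", 0.8)]

# 3. Store in the database
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kb_entities (id INTEGER PRIMARY KEY, canonical_name TEXT UNIQUE,
                          mention_count INTEGER DEFAULT 1);
CREATE TABLE entity_relationships (source_id INTEGER, target_id INTEGER,
                                   relationship_type TEXT, strength REAL);
""")
ids = {}
for name in entities:
    conn.execute("""INSERT INTO kb_entities (canonical_name) VALUES (?)
                    ON CONFLICT (canonical_name)
                    DO UPDATE SET mention_count = mention_count + 1""", (name,))
    ids[name] = conn.execute("SELECT id FROM kb_entities WHERE canonical_name = ?",
                             (name,)).fetchone()[0]
for src, rel, dst, strength in relationships:
    conn.execute("INSERT INTO entity_relationships VALUES (?, ?, ?, ?)",
                 (ids[src], ids[dst], rel, strength))

# 4. Query the knowledge graph: everything "react" is connected to
related = conn.execute("""
    SELECT t.canonical_name, r.relationship_type
    FROM entity_relationships r
    JOIN kb_entities s ON s.id = r.source_id
    JOIN kb_entities t ON t.id = r.target_id
    WHERE s.canonical_name = 'react'
""").fetchall()
```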
Performance Characteristics
Extraction Performance
Regex extraction: <100ms (bottleneck: regex engine)
LLM entity extraction: 2-4s (bottleneck: OpenAI API latency)
LLM relationship extraction: 2-3s (bottleneck: OpenAI API latency)
Database upsert (batch): 50-200ms (bottleneck: PostgreSQL writes)
Total per document: 4-8s (bottleneck: LLM calls)
Scalability Considerations
LLM Rate Limits: the OpenAI API enforces rate limits; batch processing is recommended
Text Truncation: LLM methods only process the first 6,000-8,000 characters of a document
Deduplication: canonical-name matching prevents entity explosion
Workspace Isolation: all queries are filtered by workspace_id for multi-tenancy
Sources: orchestrator/modules/search/services/entity_extractor.py:47-88
Future Enhancements
Entity Embeddings: Store vector representations for semantic entity search
Graph Algorithms: Implement centrality measures (betweenness, eigenvector) for entity importance
Query Expansion: Auto-expand user queries with related entities
Entity Disambiguation: Handle homonyms (e.g., "React" as library vs. "react" as verb)
Cross-Workspace Entities: Shared marketplace entities for common technologies
Temporal Tracking: Track entity evolution across document versions