Knowledge Graph & Entity Extraction


Purpose and Scope

The Knowledge Graph & Entity Extraction system extracts structured entities and relationships from unstructured documents, enabling semantic search, query understanding, and intelligent retrieval beyond simple keyword matching. This page covers:

  • Document Entity Extraction: Extracting entities (technologies, concepts, people, organizations) from uploaded documents using regex and LLM methods

  • Relationship Mapping: Identifying relationships between entities (uses, created_by, is_part_of, etc.)

  • Knowledge Graph Storage: Database schema for storing entities, mentions, and relationships

  • CodeGraph Integration: Code-specific knowledge graph for structural analysis (see Workspace Execution for code execution context)

For document ingestion and chunking, see Document Ingestion Pipeline. For semantic retrieval using extracted entities, see RAG Retrieval System.


System Architecture

The knowledge graph system operates as a post-processing layer on top of document ingestion, enriching chunks with structured metadata for improved query understanding.


Sources: orchestrator/modules/search/services/entity_extractor.py:1-403


Entity Extraction Pipeline

EntityExtractor Class

The EntityExtractor class orchestrates a two-phase extraction process combining fast regex patterns with accurate LLM-based extraction.

Core Methods:

| Method | Purpose | Performance |
|---|---|---|
| extract_entities() | Main entry point; orchestrates regex + LLM | ~2-5s per document |
| _extract_with_regex() | Fast pattern matching for common entities | <100ms |
| _extract_with_llm() | Accurate extraction using GPT-4o-mini | ~2-4s |
| extract_relationships() | Identify entity-entity relationships | ~2-3s |
| _deduplicate_entities() | Merge duplicates by canonical name | <10ms |

Sources: orchestrator/modules/search/services/entity_extractor.py:40-88


Two-Phase Extraction Strategy


Phase 1: Regex-Based Extraction

Fast pattern matching for common entity types:

  • Technology names: Capitalized words with optional versions (e.g., "React 18", "PostgreSQL")

  • Acronyms: 2-5 capital letters (e.g., "RAG", "API", "LLM")

  • Filtering: Excludes common words ("The", "This", "When")

Regex matches are assigned confidence scores of 0.5-0.6 because pattern matching has a higher false-positive rate.
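The regex phase described above can be sketched as follows. This is a minimal illustration, not the code in entity_extractor.py: the patterns, the `COMMON_WORDS` list, the function name `extract_with_regex`, the per-pattern confidence split (0.5 for acronyms, 0.6 for capitalized names), and the type assigned to each match are all assumptions.

```python
import re

# Assumed stop list; the source names only these three examples.
COMMON_WORDS = {"The", "This", "When"}

# Capitalized word with an optional version number, e.g. "React 18".
TECH_PATTERN = re.compile(r"\b[A-Z][a-zA-Z]+(?:\s\d+(?:\.\d+)*)?\b")
# Acronym of 2-5 capital letters, e.g. "RAG".
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,5}\b")


def extract_with_regex(text):
    """Return candidate entities as dicts carrying a low regex confidence."""
    found = {}
    for match in ACRONYM_PATTERN.finditer(text):
        name = match.group()
        found[name] = {"name": name, "type": "concept",
                       "confidence": 0.5, "method": "regex"}
    for match in TECH_PATTERN.finditer(text):
        name = match.group()
        # Skip sentence-initial common words and names already captured.
        if name.split()[0] in COMMON_WORDS or name in found:
            continue
        found[name] = {"name": name, "type": "technology",
                       "confidence": 0.6, "method": "regex"}
    return list(found.values())
```

Any capitalized word that is not on the stop list will match, which is the false-positive behaviour the low confidence score is meant to reflect.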

Sources: orchestrator/modules/search/services/entity_extractor.py:90-121

Phase 2: LLM-Based Extraction

For documents under 8,000 characters, the extractor calls GPT-4o-mini with a structured prompt.

LLM-extracted entities are assigned a confidence of 0.9, reflecting a lower false-positive rate and better entity type classification than regex matching.
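The original prompt is not reproduced on this page, so the following is only a sketch of what a structured prompt and its response handling might look like. The prompt wording, the JSON response shape, and the function names `build_entity_prompt` and `parse_entity_response` are assumptions; the entity types, the 8,000-character limit, and the 0.9 confidence come from this page. The actual API call is left out.

```python
import json

ENTITY_TYPES = ("technology", "concept", "organization",
                "person", "location", "event")
MAX_CHARS = 8000  # character limit stated on this page


def build_entity_prompt(text):
    """Build a structured extraction prompt (illustrative wording only)."""
    snippet = text[:MAX_CHARS]
    return (
        "Extract named entities from the text below.\n"
        f"Allowed types: {', '.join(ENTITY_TYPES)}.\n"
        'Respond with JSON: {"entities": [{"name": "...", "type": "..."}]}\n\n'
        f"Text:\n{snippet}"
    )


def parse_entity_response(raw):
    """Parse the model reply, dropping malformed or unknown-type items."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    entities = []
    for item in payload.get("entities", []):
        if not isinstance(item, dict):
            continue
        if not item.get("name") or item.get("type") not in ENTITY_TYPES:
            continue
        entities.append({"name": item["name"], "type": item["type"],
                         "confidence": 0.9, "method": "llm"})
    return entities
```

Parsing defensively matters here because a model can return invalid JSON or an entity type outside the allowed set; in this sketch such items are dropped rather than raising.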

Sources: orchestrator/modules/search/services/entity_extractor.py:123-183


Entity Data Model


Entity Types:

| Type | Description | Examples |
|---|---|---|
| technology | Software, frameworks, tools | "React", "PostgreSQL", "Docker" |
| concept | Algorithms, methodologies | "Machine Learning", "RAG", "Semantic Search" |
| organization | Companies, institutions | "OpenAI", "Google", "MIT" |
| person | Individuals | "John Doe", "Researcher Name" |
| location | Physical locations | "San Francisco", "USA" |
| event | Conferences, launches | "NeurIPS 2024", "Product Launch" |

Sources: orchestrator/modules/search/services/entity_extractor.py:18-38
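As a rough illustration of the data model and of merging duplicates by canonical name, the sketch below uses a hypothetical `Entity` dataclass. The field names and the rule of keeping the higher-confidence duplicate are assumptions; the source states only that duplicates are merged by canonical name and that the canonical form is normalized lowercase.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    # Field names are illustrative, not taken from the source file.
    name: str
    entity_type: str        # one of the six types listed above
    confidence: float       # 0.5-0.6 for regex, 0.9 for LLM
    extraction_method: str  # "regex" or "llm"

    @property
    def canonical_name(self):
        """Normalized lowercase form used as the deduplication key."""
        return self.name.strip().lower()


def deduplicate_entities(entities):
    """Keep one entity per canonical name, preferring higher confidence."""
    best = {}
    for entity in entities:
        key = entity.canonical_name
        if key not in best or entity.confidence > best[key].confidence:
            best[key] = entity
    return list(best.values())
```

With this rule, an LLM result for "react" replaces an earlier regex hit for "React", so the better type classification survives the merge.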


Relationship Extraction

The extract_relationships() method identifies semantic connections between entities using LLM analysis.

Relationship Types


Relationship Type Definitions:

| Type | Semantic Meaning | Example |
|---|---|---|
| is_part_of | Component or subset relationship | "Neural Networks" → "Machine Learning" |
| uses | Dependency or usage | "React" → "JavaScript" |
| created_by | Authorship or development | "GPT-4" → "OpenAI" |
| improves | Enhancement or optimization | "GPT-4" → "GPT-3.5" |
| depends_on | Required dependency | "Frontend" → "API" |
| alternative_to | Competing or substitute | "PostgreSQL" → "MySQL" |
| related_to | General semantic connection | "Docker" → "Kubernetes" |

Sources: orchestrator/modules/search/services/entity_extractor.py:185-270


Relationship Extraction Process

Algorithm:

  1. Entity Filtering: Limit to top 20 entities (avoid token limits)

  2. LLM Prompt: Send entity list + text to GPT-4o-mini

  3. JSON Parsing: Extract structured relationships from response

  4. Evidence Capture: Store supporting text snippets (max 500 chars)

  5. Strength Assignment: Default 0.8 strength for LLM-identified relationships
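The five steps above can be sketched as a single function. This is an assumption-laden outline rather than the actual method: `call_llm` is a hypothetical callable that takes a prompt string and returns the model's JSON reply, the prompt wording and JSON shape are invented for illustration, and the entity list is assumed to arrive already ordered by importance.

```python
import json

RELATIONSHIP_TYPES = {"is_part_of", "uses", "created_by", "improves",
                      "depends_on", "alternative_to", "related_to"}
MAX_ENTITIES = 20
MAX_EVIDENCE_CHARS = 500
DEFAULT_STRENGTH = 0.8


def extract_relationships(text, entity_names, call_llm):
    """Sketch of the five steps; call_llm maps a prompt to a JSON string."""
    top = entity_names[:MAX_ENTITIES]                      # step 1
    prompt = (                                             # step 2
        "Identify relationships between these entities: "
        + ", ".join(top) + "\n"
        + "Allowed types: " + ", ".join(sorted(RELATIONSHIP_TYPES)) + "\n"
        + 'Respond with JSON: {"relationships": [{"source": "...", '
        + '"target": "...", "type": "...", "evidence": "..."}]}\n\n'
        + "Text:\n" + text
    )
    try:
        payload = json.loads(call_llm(prompt))             # step 3
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    known = set(top)
    results = []
    for item in payload.get("relationships", []):
        if not isinstance(item, dict):
            continue
        if item.get("type") not in RELATIONSHIP_TYPES:
            continue
        if item.get("source") not in known or item.get("target") not in known:
            continue
        results.append({
            "source": item["source"],
            "target": item["target"],
            "type": item["type"],
            # step 4: cap the supporting snippet at 500 characters
            "evidence": str(item.get("evidence", ""))[:MAX_EVIDENCE_CHARS],
            # step 5: default strength for LLM-identified relationships
            "strength": DEFAULT_STRENGTH,
        })
    return results
```

Passing the model call in as a parameter keeps the sketch testable without network access; rejecting relationships whose endpoints are not in the supplied entity list is an extra safeguard assumed here, not something the source confirms.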

Sources: orchestrator/modules/search/services/entity_extractor.py:185-270


Knowledge Graph Storage Schema

The knowledge graph uses three PostgreSQL tables with workspace isolation.


Table Descriptions

kb_entities: Core entity records with deduplication by canonical name

  • Unique constraint: canonical_name (normalized lowercase)

  • Mention count: Incremented on each new mention, giving a simple frequency-based importance signal

  • Embedding: Optional vector for semantic entity search

  • Workspace isolation: workspace_id for multi-tenancy

knowledge_entity_mentions: Links entities to document chunks

  • Position tracking: position_in_source for entity location

  • Context window: Surrounding text for disambiguation

  • Extraction method: "llm" or "regex" for quality tracking

  • Composite unique: (knowledge_item_id, entity_id, position_in_source)

entity_relationships: Entity-to-entity connections

  • Strength: 0.0-1.0 confidence score

  • Evidence: Text snippet supporting the relationship

  • Upsert logic: On conflict, keep stronger relationship and merge evidence

Sources: orchestrator/modules/search/services/entity_extractor.py:304-402


Database Operations

Helper Functions

Entity upsert logic:

  1. Query by canonical_name (case-insensitive)

  2. If exists: increment mention_count, return existing id

  3. If new: insert entity, return new id

Deduplication Strategy: All variations map to a single canonical form (e.g., "react", "React", "REACT" → "react").
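The three-step upsert can be illustrated with the sketch below. To stay self-contained it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the column list, the helper names `create_entities_table` and `upsert_entity`, and the per-workspace scoping of the lookup are assumptions rather than the real schema.

```python
import sqlite3


def create_entities_table(conn):
    """Minimal stand-in for kb_entities (assumed columns)."""
    conn.execute("""
        CREATE TABLE kb_entities (
            id INTEGER PRIMARY KEY,
            workspace_id TEXT NOT NULL,
            name TEXT NOT NULL,
            canonical_name TEXT NOT NULL,
            entity_type TEXT NOT NULL,
            mention_count INTEGER NOT NULL DEFAULT 1
        )
    """)


def upsert_entity(conn, workspace_id, name, entity_type):
    """Return the entity id, creating the row or bumping mention_count."""
    canonical = name.strip().lower()
    # 1. Look up by canonical name (lowercasing makes it case-insensitive).
    row = conn.execute(
        "SELECT id FROM kb_entities "
        "WHERE workspace_id = ? AND canonical_name = ?",
        (workspace_id, canonical),
    ).fetchone()
    if row is not None:
        # 2. Existing entity: increment the mention counter.
        conn.execute(
            "UPDATE kb_entities SET mention_count = mention_count + 1 "
            "WHERE id = ?",
            (row[0],),
        )
        return row[0]
    # 3. New entity: insert and return the generated id.
    cursor = conn.execute(
        "INSERT INTO kb_entities "
        "(workspace_id, name, canonical_name, entity_type) "
        "VALUES (?, ?, ?, ?)",
        (workspace_id, name, canonical, entity_type),
    )
    return cursor.lastrowid
```

A select-then-write sequence like this is open to races under concurrent writers; a production version would more likely rely on the unique constraint and a single `INSERT ... ON CONFLICT` statement.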


The mention helper links an entity to a document chunk and records:

  • Surrounding context (150 char window)

  • Confidence score from extraction method

  • Position for ordered mentions

  • Extraction method for provenance

Conflict handling: ON CONFLICT DO NOTHING prevents duplicate mention records.
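A sketch of mention storage under the same caveats: SQLite stands in for PostgreSQL, the column names beyond those listed above are guesses, `record_mention` is a hypothetical helper, and the 150-character window is interpreted here as 75 characters on each side of the mention.

```python
import sqlite3

CONTEXT_WINDOW = 150  # assumed to mean 75 characters on each side


def create_mentions_table(conn):
    """Minimal stand-in for knowledge_entity_mentions (assumed columns)."""
    conn.execute("""
        CREATE TABLE knowledge_entity_mentions (
            knowledge_item_id INTEGER NOT NULL,
            entity_id INTEGER NOT NULL,
            position_in_source INTEGER NOT NULL,
            context TEXT,
            confidence REAL,
            extraction_method TEXT,
            UNIQUE (knowledge_item_id, entity_id, position_in_source)
        )
    """)


def record_mention(conn, knowledge_item_id, entity_id, text,
                   start, end, confidence, method):
    """Store one mention; a repeat of the same position is ignored."""
    half = CONTEXT_WINDOW // 2
    context = text[max(0, start - half):end + half]
    conn.execute(
        """
        INSERT INTO knowledge_entity_mentions
            (knowledge_item_id, entity_id, position_in_source,
             context, confidence, extraction_method)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (knowledge_item_id, entity_id, position_in_source)
        DO NOTHING
        """,
        (knowledge_item_id, entity_id, start, context, confidence, method),
    )
```

Because the conflict target matches the composite unique key, re-running extraction over the same chunk leaves existing mention rows untouched.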


The relationship helper creates or updates relationships:

  • Resolves entity names to IDs via canonical name lookup

  • On conflict: keeps stronger relationship, merges evidence text

  • Workspace-scoped for multi-tenancy
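The conflict rule (keep the stronger relationship, merge evidence) might be expressed as a single upsert statement. In this sketch SQLite again stands in for PostgreSQL, the column names and the unique key are assumptions, the `" | "` evidence separator is invented, and name-to-ID resolution is omitted so the function takes entity IDs directly.

```python
import sqlite3


def create_relationships_table(conn):
    """Minimal stand-in for entity_relationships (assumed columns and key)."""
    conn.execute("""
        CREATE TABLE entity_relationships (
            workspace_id TEXT NOT NULL,
            source_entity_id INTEGER NOT NULL,
            target_entity_id INTEGER NOT NULL,
            relationship_type TEXT NOT NULL,
            strength REAL NOT NULL,
            evidence TEXT NOT NULL DEFAULT '',
            UNIQUE (workspace_id, source_entity_id,
                    target_entity_id, relationship_type)
        )
    """)


def upsert_relationship(conn, workspace_id, source_id, target_id,
                        rel_type, strength, evidence=""):
    """Insert a relationship, or strengthen and extend an existing one."""
    conn.execute(
        """
        INSERT INTO entity_relationships
            (workspace_id, source_entity_id, target_entity_id,
             relationship_type, strength, evidence)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (workspace_id, source_entity_id,
                     target_entity_id, relationship_type)
        DO UPDATE SET
            -- Two-argument MAX is SQLite's scalar form;
            -- PostgreSQL would use GREATEST here.
            strength = MAX(strength, excluded.strength),
            evidence = CASE
                WHEN evidence = '' THEN excluded.evidence
                WHEN excluded.evidence = '' THEN evidence
                ELSE evidence || ' | ' || excluded.evidence
            END
        """,
        (workspace_id, source_id, target_id, rel_type, strength, evidence),
    )
```

Doing the comparison inside the statement avoids a read-modify-write round trip and keeps the merge atomic per row.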

Sources: orchestrator/modules/search/services/entity_extractor.py:304-402


CodeGraph: Code-Specific Knowledge Graph

While document entity extraction focuses on concepts and technologies, CodeGraph builds a structural knowledge graph of code symbols using static analysis and PageRank ranking.

CodeGraph vs Document Entities

| Aspect | Document Entities | CodeGraph |
|---|---|---|
| Source | Uploaded documents (PDF, MD, DOCX) | Indexed codebases (GitHub repos) |
| Entities | Concepts, technologies, people, orgs | Functions, classes, methods, interfaces |
| Relationships | Semantic (uses, created_by, related_to) | Structural (calls, imports, inherits) |
| Ranking | Mention count | PageRank on call graph |
| Storage | kb_entities, entity_relationships | codegraph_symbols, codegraph_relationships |
| Tools | search_knowledge, semantic_search | search_codebase, get_call_graph |

Sources: orchestrator/modules/codegraph/ranking/pagerank_ranker.py:1-117


CodeGraph Architecture


Sources: orchestrator/modules/codegraph/ranking/pagerank_ranker.py:1-117, orchestrator/modules/agents/services/agent_platform_tools.py:98-206


PageRankRanker Class

The PageRankRanker implements Aider-inspired structural importance ranking, achieving 4-6% context usage vs 54-70% with naive approaches.

Algorithm:

  1. Build DiGraph: Create NetworkX directed graph from call/import relationships

  2. Run PageRank: Standard algorithm with damping factor α=0.85

  3. Sort by Rank: Order symbols by importance score (descending)

  4. Fit Budget: Iteratively add symbols until token budget exhausted

  5. Return Top-K: Most structurally important symbols within budget

Token Estimation: ~4 chars per token heuristic for signatures + docstrings.
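The ranking-and-budget loop can be sketched without external dependencies. Note the substitution: the source builds a NetworkX DiGraph and runs its PageRank, whereas this sketch uses a plain power iteration so that it runs on the standard library alone. The function names, the fixed iteration count, and the choice to stop at the first symbol that does not fit the budget are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {
            n: (1 - damping) / len(nodes) + damping * dangling / len(nodes)
            for n in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def select_symbols(edges, sources, token_budget):
    """Pick the highest-ranked symbols whose text fits the token budget.

    sources maps a symbol name to its signature + docstring text.
    """
    rank = pagerank(edges)
    ordered = sorted(sources, key=lambda s: rank.get(s, 0.0), reverse=True)
    chosen, used = [], 0
    for symbol in ordered:
        cost = max(1, len(sources[symbol]) // 4)  # ~4 chars per token
        if used + cost > token_budget:
            break
        chosen.append(symbol)
        used += cost
    return chosen
```

With edges pointing from caller to callee, a widely called helper accumulates rank from every caller and is selected before the entry point that merely calls others.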

Sources: orchestrator/modules/codegraph/ranking/pagerank_ranker.py:20-116


Integration with RAG System

Entity extraction enhances retrieval through three mechanisms:

1. Query Enhancement

When users query the knowledge base, an entity lookup expands the query with the entities it mentions.
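One plausible reading of this step is sketched below: known entities mentioned in the query are detected by canonical name and attached to the query as structured terms. The function name `expand_query`, the shape of `known_entities` (canonical name to display name), and the returned dictionary are all assumptions; the real logic lives in the RAG service and is not shown on this page.

```python
import re


def expand_query(query, known_entities):
    """Attach known entities found in the query text.

    known_entities maps canonical (lowercase) names to display names.
    """
    lowered = query.lower()
    matched = []
    for canonical, display in known_entities.items():
        # Whole-word match so "rag" does not fire inside "fragment".
        if re.search(r"\b" + re.escape(canonical) + r"\b", lowered):
            matched.append(display)
    return {"text": query, "entities": matched}
```

The word-boundary match works for alphanumeric names; names ending in punctuation such as "C++" would need a different boundary rule.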

2. Semantic Filtering

Document chunks are tagged with the entities they mention, which enables precise filtering.
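As a small illustration, filtering can be modelled as set operations over each chunk's entity tags. The chunk structure (a dict with an `entities` list) and the `match_all` switch are hypothetical.

```python
def filter_chunks_by_entities(chunks, required, match_all=False):
    """Keep chunks tagged with any (or all) of the required entities."""
    wanted = {name.lower() for name in required}
    selected = []
    for chunk in chunks:
        tags = {tag.lower() for tag in chunk["entities"]}
        hit = wanted <= tags if match_all else bool(wanted & tags)
        if hit:
            selected.append(chunk)
    return selected
```

Comparing lowercase forms mirrors the canonical-name convention used for stored entities.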

3. Relationship-Based Context

When retrieving a chunk about "React", the system automatically pulls in related entities.
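A sketch of how related entities might be gathered from stored relationships follows. The tuple layout, the `min_strength` threshold, and the `limit` are illustrative assumptions; in the real system this would be a query against entity_relationships.

```python
def related_entities(entity, relationships, min_strength=0.5, limit=5):
    """Return (neighbour, relationship_type, strength), strongest first.

    relationships is a list of (source, type, target, strength) tuples.
    """
    key = entity.lower()
    neighbours = []
    for source, rel_type, target, strength in relationships:
        if strength < min_strength:
            continue
        # Follow the edge in either direction.
        if source.lower() == key:
            neighbours.append((target, rel_type, strength))
        elif target.lower() == key:
            neighbours.append((source, rel_type, strength))
    neighbours.sort(key=lambda item: item[2], reverse=True)
    return neighbours[:limit]
```

Following edges in both directions means a chunk about "React" can surface both what React uses and what depends on React.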

Sources: orchestrator/modules/rag/service.py:210-294


Agent Platform Tools

Agents access knowledge graphs via the AgentPlatformTools service.

Document Knowledge Graph Tools

The document tools (search_knowledge, semantic_search) use the RAG service with entity-enhanced retrieval.

Sources: orchestrator/modules/agents/services/agent_platform_tools.py:60-76


CodeGraph Tools

search_codebase: Fuzzy/semantic symbol search with PageRank ranking

get_call_graph: Traverse dependencies in both directions

analyze_architecture: High-level codebase overview with PageRank

Returns module structure, key classes, dependency patterns, and top-referenced symbols.
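The bidirectional traversal described for get_call_graph can be sketched as two breadth-first walks, one over callees and one over callers. The function name `traverse_call_graph`, the edge-list input, and the `max_depth` parameter are hypothetical; the actual tool reads from the CodeGraph tables.

```python
from collections import deque


def traverse_call_graph(symbol, calls, max_depth=2):
    """Collect callers and callees of a symbol up to max_depth hops.

    calls is a list of (caller, callee) pairs.
    """
    forward, backward = {}, {}
    for caller, callee in calls:
        forward.setdefault(caller, set()).add(callee)
        backward.setdefault(callee, set()).add(caller)

    def walk(adjacency):
        seen, queue = set(), deque([(symbol, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the depth limit
            for nxt in adjacency.get(node, ()):
                if nxt not in seen and nxt != symbol:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return seen

    return {"callers": walk(backward), "callees": walk(forward)}
```

The depth limit keeps results small on large codebases, and tracking visited nodes makes the walk safe on recursive call cycles.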

Sources: orchestrator/modules/agents/services/agent_platform_tools.py:98-206


Usage Examples

Extracting Entities from a Document


Extracting Relationships


Storing in Database


Querying the Knowledge Graph


Performance Characteristics

Extraction Performance

| Operation | Avg Time | Bottleneck |
|---|---|---|
| Regex extraction | <100ms | Regex engine |
| LLM entity extraction | 2-4s | OpenAI API latency |
| LLM relationship extraction | 2-3s | OpenAI API latency |
| Database upsert (batch) | 50-200ms | PostgreSQL writes |
| Total per document | 4-8s | LLM calls |

Scalability Considerations

  1. LLM Rate Limits: OpenAI API has rate limits; batch processing recommended

  2. Text Truncation: LLM methods process only the first 6,000-8,000 characters of a document

  3. Deduplication: Canonical name matching prevents entity explosion

  4. Workspace Isolation: All queries filtered by workspace_id for multi-tenancy

Sources: orchestrator/modules/search/services/entity_extractor.py:47-88


Future Enhancements

  1. Entity Embeddings: Store vector representations for semantic entity search

  2. Graph Algorithms: Implement centrality measures (betweenness, eigenvector) for entity importance

  3. Query Expansion: Auto-expand user queries with related entities

  4. Entity Disambiguation: Handle homonyms (e.g., "React" as library vs. "react" as verb)

  5. Cross-Workspace Entities: Shared marketplace entities for common technologies

  6. Temporal Tracking: Track entity evolution across document versions

