Entity & Concept Extraction

After facts are extracted from sources, the system identifies the entities and concepts mentioned within each fact. This per-fact extraction is what connects raw evidence to the knowledge graph's structural nodes.

Entities vs. concepts

The system distinguishes two fundamental categories:

Entities

Entities are specific, named real-world things:

People — "Albert Einstein", "Marie Curie"
Organizations — "NASA", "World Health Organization"
Locations — "Paris", "Mount Everest"
Publications — "Nature", "The Lancet"

Entities have proper names and refer to unique, identifiable things in the world.

Concepts

Concepts are abstract topics, ideas, techniques, or phenomena:

"photosynthesis", "machine learning", "democratic governance"
"quantum entanglement", "supply chain management"
"cognitive behavioral therapy", "pyramid construction techniques"

Concepts describe categories, processes, theories, or subjects that can be explored and discussed.

Per-fact extraction

Entity and concept extraction runs once per fact: for each individual fact, the LLM lists the entities and concepts mentioned in that specific fact. This per-fact granularity is a deliberate design choice, intended to reduce hallucinated cross-fact associations.

If extraction were done at the document level, the LLM might associate entities and concepts that appear in different paragraphs but have no actual relationship. By extracting per-fact, every entity-fact and concept-fact link is grounded in a specific, verifiable statement.
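As a rough illustration of how per-fact output can be turned into grounded links, the sketch below inverts a list of per-fact mentions into a name-to-fact-indices mapping. The function name `link_mentions` and the input shape (one list of names per fact) are assumptions made for this example, not the system's actual interface, and the LLM call itself is replaced by hand-written sample data.

```python
from collections import defaultdict
from typing import Dict, List


def link_mentions(per_fact_mentions: List[List[str]]) -> Dict[str, List[int]]:
    """Invert per-fact mentions into {name: [fact indices]}.

    per_fact_mentions[i] holds the names the LLM reported for fact i
    (hypothetical input shape). A name is linked only to facts that
    actually mention it, so no cross-fact association is invented.
    """
    links: Dict[str, List[int]] = defaultdict(list)
    for fact_index, names in enumerate(per_fact_mentions):
        # dict.fromkeys de-duplicates within one fact while keeping order
        for name in dict.fromkeys(names):
            links[name].append(fact_index)
    return dict(links)


# Hand-written stand-in for LLM output over three facts
sample = [
    ["Marie Curie", "radioactivity"],
    ["NASA"],
    ["Marie Curie", "Marie Curie"],
]
links = link_mentions(sample)
```

Here "Marie Curie" is linked to facts 0 and 2 only, and "NASA" to fact 1 only; the two never become associated just because they share a document.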

Extraction schema

For each extracted entity or concept, the system records:

| Field | Description |
| --- | --- |
| name | The entity or concept name (2-150 characters) |
| node_type | `entity` or `concept` |
| entity_subtype | For entities: `person`, `organization`, `location`, or `publication` |
| fact_indices | Which facts (by index) mention this entity/concept |
| aliases | Alternative names provided by the LLM |
| extraction_role | How the entity appears: `mentioned`, `subject`, or `actor` |
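One way to picture this record is as a small data class. The sketch below is illustrative only: the field names come from the table above, but the class name `ExtractedNode`, the Python types, and the checks in `__post_init__` (for example, that a concept carries no subtype) are assumptions rather than the system's actual schema code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NODE_TYPES = {"entity", "concept"}
ENTITY_SUBTYPES = {"person", "organization", "location", "publication"}
EXTRACTION_ROLES = {"mentioned", "subject", "actor"}


@dataclass
class ExtractedNode:
    """Hypothetical container for one extracted entity or concept."""
    name: str
    node_type: str
    fact_indices: List[int]
    extraction_role: str
    entity_subtype: Optional[str] = None
    aliases: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node_type: {self.node_type!r}")
        if self.extraction_role not in EXTRACTION_ROLES:
            raise ValueError(f"unknown extraction_role: {self.extraction_role!r}")
        # Assumption: entities must carry a subtype, concepts must not.
        if self.node_type == "entity" and self.entity_subtype not in ENTITY_SUBTYPES:
            raise ValueError(f"unknown entity_subtype: {self.entity_subtype!r}")
        if self.node_type == "concept" and self.entity_subtype is not None:
            raise ValueError("concepts do not take an entity_subtype")


nasa = ExtractedNode(
    name="NASA",
    node_type="entity",
    fact_indices=[0, 2],
    extraction_role="subject",
    entity_subtype="organization",
    aliases=["National Aeronautics and Space Administration"],
)
```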

Validation rules

Extracted names go through validation to filter out noise:

Length: Must be 2-150 characters
No pure initials: Rejects patterns like "K. M. A."
No citation artifacts: Filters out "et al." and similar academic citation fragments
Alphabetic content: Must be >40% alphabetic characters (rejects mostly punctuation/numbers)
No excessive repetition: Rejects names with repeated substrings

These rules ensure that only meaningful, well-formed entity and concept names enter the graph.
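The rules above could be approximated with a single predicate. This is a minimal sketch, not the system's validator: the function name `is_valid_name`, the regular expressions, and the repetition threshold (a substring of three or more characters repeated three times in a row) are all assumptions, since the exact patterns are not specified here. The alphabetic ratio is computed over all characters including spaces, which is also an assumption.

```python
import re

MIN_LEN, MAX_LEN = 2, 150
MIN_ALPHA_RATIO = 0.40

# Pure initials such as "K. M. A." or "J.R." (single letters, each with a period)
_INITIALS_RE = re.compile(r"^(?:[A-Za-z]\.\s*)+$")
# Citation fragments such as "et al." (assumed pattern; the real list may be longer)
_CITATION_RE = re.compile(r"\bet\s+al\b\.?", re.IGNORECASE)
# Assumed repetition rule: a 3+ character substring repeated 3+ times back to back
_REPEAT_RE = re.compile(r"(.{3,}?)\1{2,}")


def is_valid_name(name: str) -> bool:
    """Return True if the name passes all five illustrative checks."""
    name = name.strip()
    if not (MIN_LEN <= len(name) <= MAX_LEN):
        return False
    if _INITIALS_RE.match(name):
        return False
    if _CITATION_RE.search(name):
        return False
    alpha_count = sum(ch.isalpha() for ch in name)
    if alpha_count / len(name) <= MIN_ALPHA_RATIO:  # must be strictly above 40%
        return False
    if _REPEAT_RE.search(name):
        return False
    return True
```

With these assumed patterns, "Marie Curie" and "NASA" pass, while "K. M. A.", "Smith et al.", "12345-6", and "abcabcabc" are rejected.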

From extraction to seeds

Extracted entities and concepts don't become graph nodes directly. Instead, they become seeds — lightweight proto-nodes that accumulate facts over time. When a seed gathers enough evidence, it gets promoted to a full node in the knowledge graph.

This two-step process (extraction -> seed -> node) prevents the graph from being cluttered with nodes backed by only a single mention. It ensures every node in the graph has a meaningful factual base.
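The seed lifecycle can be sketched as a small in-memory store. Everything concrete here is hypothetical: the class names `Seed` and `SeedStore`, the case-insensitive key, and especially `PROMOTION_THRESHOLD`, since the amount of evidence a seed actually needs before promotion is not stated in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

PROMOTION_THRESHOLD = 3  # hypothetical value; the real criterion is not given here


@dataclass
class Seed:
    """Lightweight proto-node that accumulates distinct supporting facts."""
    name: str
    node_type: str
    fact_ids: Set[int] = field(default_factory=set)
    promoted: bool = False


class SeedStore:
    def __init__(self, threshold: int = PROMOTION_THRESHOLD) -> None:
        self.threshold = threshold
        self.seeds: Dict[Tuple[str, str], Seed] = {}

    def add_mention(self, name: str, node_type: str, fact_id: int) -> Seed:
        """Record that fact_id mentions name; promote once enough facts accrue."""
        key = (node_type, name.casefold())  # assumed: names match case-insensitively
        seed = self.seeds.setdefault(key, Seed(name, node_type))
        seed.fact_ids.add(fact_id)  # a set, so a repeated fact is counted once
        if not seed.promoted and len(seed.fact_ids) >= self.threshold:
            seed.promoted = True  # stand-in for creating the full graph node
        return seed


store = SeedStore(threshold=2)
store.add_mention("NASA", "entity", fact_id=10)
seed = store.add_mention("nasa", "entity", fact_id=11)
```

In this example the second distinct fact pushes the "NASA" seed over the (assumed) threshold of two, so `seed.promoted` becomes true; a single mention alone would have left it as a seed.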
