Entity & Concept Extraction

After facts are extracted from sources, the system identifies the entities and concepts mentioned within each fact. This per-fact extraction is what connects raw evidence to the knowledge graph's structural nodes.

Entities vs. concepts

The system distinguishes two fundamental categories:

Entities

Entities are specific, named real-world things:

  • People — "Albert Einstein", "Marie Curie"
  • Organizations — "NASA", "World Health Organization"
  • Locations — "Paris", "Mount Everest"
  • Publications — "Nature", "The Lancet"

Entities have proper names and refer to unique, identifiable things in the world.

Concepts

Concepts are abstract topics, ideas, techniques, or phenomena:

  • "photosynthesis", "machine learning", "democratic governance"
  • "quantum entanglement", "supply chain management"
  • "cognitive behavioral therapy", "pyramid construction techniques"

Concepts describe categories, processes, theories, or subjects that can be explored and discussed.

Per-fact extraction

Entity and concept extraction runs once per fact: for each individual fact, the LLM lists the entities and concepts mentioned in that specific fact. This granularity is a deliberate design choice, intended to reduce hallucinated cross-fact associations.

If extraction were done at the document level, the LLM might associate entities and concepts that appear in different paragraphs but have no actual relationship. By extracting per-fact, every entity-fact and concept-fact link is grounded in a specific, verifiable statement.
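To illustrate how per-fact output keeps every link grounded, here is a minimal sketch. It assumes the LLM's result is available as one list of names per fact; the function name `link_mentions` and the input shape are hypothetical, not part of the system's documented API.

```python
from collections import defaultdict


def link_mentions(per_fact_mentions):
    """Map each mentioned name to the indices of the facts that mention it.

    `per_fact_mentions[i]` is the list of names extracted from fact i
    (an assumed input shape, for illustration only).
    """
    links = defaultdict(list)
    for fact_index, names in enumerate(per_fact_mentions):
        for name in names:
            # Record each fact at most once per name.
            if fact_index not in links[name]:
                links[name].append(fact_index)
    return dict(links)


per_fact_mentions = [
    ["Marie Curie", "Nature"],   # fact 0
    ["photosynthesis"],          # fact 1
    ["Marie Curie"],             # fact 2
]
links = link_mentions(per_fact_mentions)
```

Because a name is only ever linked to facts that mention it, "photosynthesis" and "Marie Curie" are never associated here, even though both come from the same source.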

Extraction schema

For each extracted entity or concept, the system records:

| Field | Description |
| --- | --- |
| name | The entity or concept name (2-150 characters) |
| node_type | entity or concept |
| entity_subtype | For entities: person, organization, location, or publication |
| fact_indices | Which facts (by index) mention this entity/concept |
| aliases | Alternative names provided by the LLM |
| extraction_role | How the entity appears: mentioned, subject, or actor |
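One way to picture a record with these fields is a small dataclass. This is a sketch only: the field names and allowed values come from the table above, while the class name `ExtractedNode` and the checks in `__post_init__` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NODE_TYPES = {"entity", "concept"}
ENTITY_SUBTYPES = {"person", "organization", "location", "publication"}
EXTRACTION_ROLES = {"mentioned", "subject", "actor"}


@dataclass
class ExtractedNode:  # hypothetical name, not the system's actual class
    name: str
    node_type: str
    fact_indices: List[int]
    extraction_role: str
    entity_subtype: Optional[str] = None
    aliases: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Assumed checks: reject values outside the documented sets.
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node_type: {self.node_type!r}")
        if self.extraction_role not in EXTRACTION_ROLES:
            raise ValueError(f"unknown extraction_role: {self.extraction_role!r}")
        if self.entity_subtype is not None and self.entity_subtype not in ENTITY_SUBTYPES:
            raise ValueError(f"unknown entity_subtype: {self.entity_subtype!r}")


curie = ExtractedNode(
    name="Marie Curie",
    node_type="entity",
    fact_indices=[0, 2],
    extraction_role="subject",
    entity_subtype="person",
)
```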

Validation rules

Extracted names go through validation to filter out noise:

  • Length: Must be 2-150 characters
  • No pure initials: Rejects patterns like "K. M. A."
  • No citation artifacts: Filters out "et al." and similar academic citation fragments
  • Alphabetic content: Must be >40% alphabetic characters (rejects mostly punctuation/numbers)
  • No excessive repetition: Rejects names with repeated substrings

These rules ensure that only meaningful, well-formed entity and concept names enter the graph.
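The rules above could be approximated with a single predicate. The following is a rough sketch, not the system's actual validator: the regular expressions are assumptions, and the repetition check (a chunk of three or more characters repeated at least three times in a row) is an invented threshold, since the source does not define "excessive".

```python
import re

# Assumed patterns; the real implementation may differ.
INITIALS_RE = re.compile(r"(?:[A-Za-z]\.\s*)+")             # e.g. "K. M. A."
CITATION_RE = re.compile(r"\bet\s+al\b\.?", re.IGNORECASE)   # e.g. "Smith et al."
REPEAT_RE = re.compile(r"(.{3,}?)\1{2,}")                    # e.g. "abcabcabc"


def is_valid_name(name: str) -> bool:
    """Return True if `name` passes all five validation rules."""
    name = name.strip()
    # Length: 2-150 characters.
    if not 2 <= len(name) <= 150:
        return False
    # No pure initials.
    if INITIALS_RE.fullmatch(name):
        return False
    # No citation artifacts.
    if CITATION_RE.search(name):
        return False
    # Alphabetic content must exceed 40% of all characters.
    alpha = sum(ch.isalpha() for ch in name)
    if alpha / len(name) <= 0.40:
        return False
    # No excessive repetition (threshold is an assumption).
    if REPEAT_RE.search(name):
        return False
    return True
```

Under these assumptions, "Albert Einstein" and "NASA" pass, while "K. M. A.", "Smith et al.", and "12345-678" are rejected.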

From extraction to seeds

Extracted entities and concepts don't become graph nodes directly. Instead, they become seeds — lightweight proto-nodes that accumulate facts over time. When a seed gathers enough evidence, it gets promoted to a full node in the knowledge graph.

This two-step process (extraction -> seed -> node) prevents the graph from being cluttered with nodes backed by only a single mention. It ensures every node in the graph has a meaningful factual base.
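The seed lifecycle can be sketched as a small store that promotes a seed once it has gathered enough distinct facts. The source does not say how much evidence is "enough", so `promotion_threshold` and its default of 3 are placeholders, and `Seed` and `SeedStore` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Seed:
    name: str
    fact_ids: Set[int] = field(default_factory=set)


class SeedStore:
    def __init__(self, promotion_threshold: int = 3):
        # Assumed criterion: number of distinct supporting facts.
        self.promotion_threshold = promotion_threshold
        self.seeds: Dict[str, Seed] = {}
        self.nodes: Dict[str, Seed] = {}

    def add_fact(self, name: str, fact_id: int) -> bool:
        """Attach a fact to `name`; return True if this call promoted the seed."""
        if name in self.nodes:
            # Already a full node: keep accumulating evidence.
            self.nodes[name].fact_ids.add(fact_id)
            return False
        seed = self.seeds.setdefault(name, Seed(name))
        seed.fact_ids.add(fact_id)
        if len(seed.fact_ids) >= self.promotion_threshold:
            self.nodes[name] = self.seeds.pop(name)
            return True
        return False
```

With a threshold of 2, for example, the first fact about "NASA" creates a seed and the second distinct fact promotes it to a node; repeating the same fact does not count twice.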