Seeds & Routing
Seeds are lightweight proto-nodes — containers that accumulate facts about an entity or concept before being promoted to full graph nodes. They are the bridge between raw fact extraction and the structured knowledge graph.
What seeds are
When entity and concept extraction identifies a name within a fact, the system creates or updates a seed for that name. Seeds track:
- The entity/concept name and type
- How many facts reference this seed
- Which facts are linked to it
- A contextual embedding (updated as more facts arrive)
- Status information for disambiguation and merging
Seeds solve an important problem: not every mention deserves a full graph node. By accumulating evidence in seeds first, the system ensures that only well-supported topics become permanent nodes.
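As a rough illustration, the tracked fields listed above could be modeled as a small record. This is a minimal sketch, not the system's actual schema: the class name `Seed`, every field name, and the `link_fact` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Seed:
    """Hypothetical in-memory shape of a seed; field names are illustrative."""
    name: str
    seed_type: str                              # e.g. "entity" or "concept"
    status: str = "active"                      # status used for disambiguation and merging
    fact_ids: List[str] = field(default_factory=list)   # which facts are linked
    embedding: Optional[List[float]] = None     # contextual embedding, refreshed over time
    merged_into_key: Optional[str] = None       # set only once the seed has been merged

    @property
    def fact_count(self) -> int:
        """How many facts reference this seed."""
        return len(self.fact_ids)

    def link_fact(self, fact_id: str) -> None:
        """Link a fact once; repeated links of the same fact are ignored."""
        if fact_id not in self.fact_ids:
            self.fact_ids.append(fact_id)


seed = Seed(name="Mars", seed_type="entity")
seed.link_fact("fact-1")
seed.link_fact("fact-1")   # duplicate, ignored
seed.link_fact("fact-2")
```

Deriving the fact count from the linked facts, rather than storing it separately, keeps the two from drifting apart in this sketch; a real store may well keep a denormalized counter instead.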
Seed statuses
Each seed has one of four statuses:
| Status | Meaning |
|---|---|
| active | Normal seed, accumulating facts |
| ambiguous | Has been split into disambiguated children (e.g., "Mars" split into "Mars (planet)" and "Mars (Roman god)") |
| promoted | Has been converted to a full node in the knowledge graph |
| merged | Has been consolidated into another seed (with a pointer to the target) |
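The four statuses map naturally onto an enumeration. The sketch below is illustrative only: `SeedStatus` and `check_merge_pointer` are hypothetical names, and the rule that non-merged seeds carry no pointer is an assumption rather than something the table states.

```python
from enum import Enum
from typing import Optional


class SeedStatus(str, Enum):
    ACTIVE = "active"
    AMBIGUOUS = "ambiguous"
    PROMOTED = "promoted"
    MERGED = "merged"


def check_merge_pointer(status: SeedStatus, merged_into_key: Optional[str]) -> None:
    """Raise ValueError when the merge pointer is inconsistent with the status."""
    if status is SeedStatus.MERGED and not merged_into_key:
        raise ValueError("a merged seed needs a pointer to its target")
    # Assumption: only merged seeds carry a pointer.
    if status is not SeedStatus.MERGED and merged_into_key:
        raise ValueError("only merged seeds carry a merge pointer")
```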
Fact routing
When a new fact mentions an entity or concept, the system must route that fact to the correct seed. This is straightforward for active seeds but requires disambiguation logic for ambiguous ones.
Routing algorithm
- Look up the seed by its deterministic key (based on name + type)
- If active or promoted — link the fact directly (normal path)
- If merged — follow the merge chain (up to 5 hops) to find the target seed
- If ambiguous — route to the correct child using disambiguation
- If not found — check phonetic matches for potential typos, then create a new seed
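The routing steps above can be sketched as a single function over an in-memory dictionary of seeds. Everything here is an assumption made for illustration: the names `seed_key`, `route_fact`, `pick_child` and `phonetic_lookup`, the key normalization (lowercased name, SHA-256 prefix), and the choice to reuse a phonetic match when one is found. Only the order of the checks and the 5-hop limit come from the text.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

MAX_MERGE_HOPS = 5  # hop limit stated in the routing steps


@dataclass
class Seed:
    name: str
    seed_type: str
    status: str = "active"
    fact_ids: List[str] = field(default_factory=list)
    merged_into_key: Optional[str] = None


def seed_key(name: str, seed_type: str) -> str:
    """Deterministic key from name + type (the normalization is an assumption)."""
    raw = f"{seed_type}:{name.strip().lower()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def route_fact(
    seeds: Dict[str, Seed],
    name: str,
    seed_type: str,
    fact_id: str,
    pick_child: Optional[Callable[[Seed, str], Seed]] = None,
    phonetic_lookup: Optional[Callable[[str, str], Optional[Seed]]] = None,
) -> Seed:
    """Link fact_id to the seed that should receive it and return that seed."""
    key = seed_key(name, seed_type)
    seed = seeds.get(key)

    if seed is None:
        # Not found: try a phonetic match first, otherwise create a new seed.
        seed = phonetic_lookup(name, seed_type) if phonetic_lookup else None
        if seed is None:
            seed = Seed(name=name, seed_type=seed_type)
            seeds[key] = seed

    # Merged: follow the pointer chain, giving up after MAX_MERGE_HOPS.
    hops = 0
    while seed.status == "merged":
        if hops == MAX_MERGE_HOPS or seed.merged_into_key is None:
            raise LookupError(f"could not resolve merge chain for {name!r}")
        seed = seeds[seed.merged_into_key]
        hops += 1

    # Ambiguous: delegate to a disambiguation callback to choose a child.
    if seed.status == "ambiguous":
        if pick_child is None:
            raise LookupError(f"{name!r} is ambiguous and no picker was given")
        seed = pick_child(seed, fact_id)

    # Active or promoted: link the fact directly.
    seed.fact_ids.append(fact_id)
    return seed
```

The hop limit doubles as cycle protection: if two seeds ever point at each other, the loop raises instead of spinning forever.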
Disambiguation
When a seed has been split into disambiguated children (e.g., "Mars" -> "Mars (planet)" + "Mars (Roman god)"), new facts mentioning "Mars" need to be routed to the correct child.
The system uses a multi-strategy approach:
- Embedding similarity: Compare the new fact's embedding against each child seed's embedding in Qdrant. The closest child wins when it leads by a clear margin.
- Keyword heuristics: When the embedding scores are inconclusive, text-based matching provides additional signal.
- LLM fallback: When the top two children score too close to call, an LLM is asked to pick the correct child based on the fact's full context.
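One way the three strategies could be chained is shown below. This is a sketch under stated assumptions, not the system's implementation: vectors are compared in memory with plain cosine similarity rather than through Qdrant, the `margin` value of 0.05 is invented, the keyword heuristic is a simple word-overlap count, and `llm_pick` is a hypothetical callback standing in for the LLM call. The order (embeddings, then keywords, then LLM) is also an assumption.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set


@dataclass
class Child:
    vector: Sequence[float]   # the child seed's embedding
    keywords: Set[str]        # lowercase hint words, purely illustrative


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def choose_child(
    fact_text: str,
    fact_vector: Sequence[float],
    children: Dict[str, Child],
    llm_pick: Callable[[str, List[str]], str],
    margin: float = 0.05,
) -> str:
    """Return the name of the child seed that should receive the fact."""
    scored = sorted(
        ((cosine(fact_vector, child.vector), name) for name, child in children.items()),
        reverse=True,
    )
    # 1. Embedding similarity: accept the best child if it leads clearly.
    if len(scored) == 1 or scored[0][0] - scored[1][0] >= margin:
        return scored[0][1]

    # 2. Keyword heuristics: count words shared with each child's hints.
    words = set(fact_text.lower().split())
    hits = sorted(
        ((len(words & child.keywords), name) for name, child in children.items()),
        reverse=True,
    )
    if hits[0][0] > hits[1][0]:
        return hits[0][1]

    # 3. LLM fallback: still too close, so hand over the full fact text.
    return llm_pick(fact_text, [name for _, name in scored])


children = {
    "Mars (planet)": Child([1.0, 0.0], {"orbit", "planet", "rover"}),
    "Mars (Roman god)": Child([0.0, 1.0], {"god", "temple", "roman"}),
}
```

Keeping the LLM as the last step means the expensive call is only made for facts that neither cheaper signal can settle.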
Seed deduplication
The system detects and merges duplicate seeds that refer to the same entity/concept but with different spellings or names:
- Phonetic codes — Seeds are assigned phonetic codes (similar to Soundex) to detect names that sound alike
- Trigram similarity — Character-level trigram comparison catches typos and minor spelling variations
- Embedding comparison — Semantic similarity between seed embeddings identifies conceptual duplicates
When duplicates are found, they are merged: one seed absorbs the other's facts, and the absorbed seed is marked merged and gets a merged_into_key pointer to the surviving seed.
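The two string-based signals and the merge step can be sketched as follows. Treat this as illustrative: `phonetic_code` is a simplified Soundex-style code rather than the system's actual algorithm, the 0.4 trigram threshold and the "either signal is enough" rule are assumptions, and seeds are plain dictionaries with hypothetical field names. Embedding comparison is left out to keep the example self-contained.

```python
from typing import Dict, Set

_DIGITS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODE = {letter: digit for digit, letters in _DIGITS.items() for letter in letters}


def phonetic_code(name: str) -> str:
    """Simplified Soundex-style code: first letter plus up to three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODE.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODE.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # uncoded letters (vowels etc.) reset the repeat check
    return (code + "000")[:4]


def trigrams(text: str) -> Set[str]:
    padded = f"  {text.lower()} "  # padding lets word boundaries form trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def looks_like_duplicate(a: str, b: str, min_trigram: float = 0.4) -> bool:
    # Assumption: either signal alone is enough to flag a candidate pair.
    return phonetic_code(a) == phonetic_code(b) or trigram_similarity(a, b) >= min_trigram


def merge_seeds(survivor: Dict, duplicate: Dict, survivor_key: str) -> None:
    """The survivor absorbs the duplicate's facts; the duplicate keeps a pointer."""
    for fact_id in duplicate["fact_ids"]:
        if fact_id not in survivor["fact_ids"]:
            survivor["fact_ids"].append(fact_id)
    duplicate["status"] = "merged"
    duplicate["merged_into_key"] = survivor_key
```

In practice a pair flagged by these cheap checks would likely be confirmed with the embedding comparison before merging, since names that sound alike are not always the same entity.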
Re-embedding
Seeds are re-embedded at configurable fact-count thresholds (e.g., 5, 10, 25, 50 facts), so that a seed's contextual embedding becomes richer and more representative as facts accumulate. Each re-embedding pass runs these steps:
- Gather the top facts for context
- Build context text: name + type + top facts + aliases
- Compute a hash of the context text
- If the hash changed since last embedding — re-embed and update Qdrant
- Update the context hash on the seed
This progressive re-embedding means that seeds become better at disambiguation over time.
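The re-embedding steps can be sketched as a hash-guarded update. The names below are all hypothetical: `maybe_reembed`, the dictionary fields, and the `embed` and `upsert` callbacks, which stand in for the embedding model and the Qdrant write. "Top facts" is taken to mean the first `top_n` facts in the list, which is an assumption, as is the use of SHA-256 for the context hash.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

REEMBED_THRESHOLDS = (5, 10, 25, 50)  # example thresholds from the text


def build_context(name: str, seed_type: str, top_facts: Sequence[str], aliases: Sequence[str]) -> str:
    """Context text: name + type + top facts + aliases, one item per line."""
    return "\n".join([name, seed_type, *top_facts, *aliases])


def maybe_reembed(
    seed: Dict,
    embed: Callable[[str], List[float]],
    upsert: Callable[[str, List[float]], None],
    top_n: int = 10,
) -> bool:
    """Re-embed the seed if it sits on a threshold and its context changed."""
    if len(seed["facts"]) not in REEMBED_THRESHOLDS:
        return False

    # Gather the top facts and build the context text.
    context = build_context(seed["name"], seed["type"], seed["facts"][:top_n], seed["aliases"])

    # Hash the context and skip the embedding call if nothing changed.
    digest = hashlib.sha256(context.encode("utf-8")).hexdigest()
    if digest == seed.get("context_hash"):
        return False

    upsert(seed["key"], embed(context))   # stand-in for the vector store update
    seed["context_hash"] = digest         # remember what was embedded
    return True
```

The hash check makes the operation idempotent: running it twice at the same threshold costs one hash computation, not a second embedding call.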
Promotion to nodes
When a seed accumulates sufficient facts (a configurable threshold), it is promoted to a full node in the knowledge graph:
- The seed's status changes to promoted
- A new node is created with the seed's name, type, and linked facts
- The node enters the node pipeline for dimension generation, definition synthesis, and parent selection
- Edge candidates from the seed's fact co-occurrences are resolved into graph edges
Promotion is the moment when accumulated evidence becomes structured knowledge in the graph.
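A minimal sketch of the promotion check and of how edge candidates might be derived from fact co-occurrence follows. The threshold value, the function names, the dictionary shapes, and the reading of "co-occurrence" as "two seeds linked to the same fact" are all assumptions made for illustration; the node pipeline itself is out of scope here.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Optional, Tuple

PROMOTION_THRESHOLD = 10  # hypothetical value; the real threshold is configurable


def promote_if_ready(seed: Dict, threshold: int = PROMOTION_THRESHOLD) -> Optional[Dict]:
    """Promote an active seed with enough facts and return the new node."""
    if seed["status"] != "active" or len(seed["fact_ids"]) < threshold:
        return None
    seed["status"] = "promoted"
    return {"name": seed["name"], "type": seed["type"], "fact_ids": list(seed["fact_ids"])}


def edge_candidates(fact_to_seeds: Dict[str, List[str]]) -> Counter:
    """Count how often each pair of seeds is linked to the same fact."""
    counts: Counter = Counter()
    for seed_keys in fact_to_seeds.values():
        for pair in combinations(sorted(set(seed_keys)), 2):
            counts[pair] += 1
    return counts
```

Sorting each fact's seed keys before pairing means `("a", "b")` and `("b", "a")` are counted as the same candidate edge.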