Mar 17, 2026
8 minute read

Grounding AI Memory: How Cognee Uses Ontologies to Build Structured Knowledge

David Myriel, AI Researcher

Modern AI agents forget. Hand them a document today and they can answer questions about it—but give them two weeks and a hundred more documents, and the signal collapses into noise. Vector databases help with recall, but raw semantic similarity isn't knowledge. It's proximity.

Cognee takes a different approach. Rather than treating knowledge as a bag of embedded chunks, it builds a structured, persistent knowledge graph using a pipeline it calls ECL: Extract, Cognify, Load. At the heart of the Cognify step is a mechanism that many RAG-based systems skip entirely: ontology-based entity validation.

This post explains what that means technically, why it matters for real-world applications, and how to integrate it into your own projects.


The Problem with Unstructured Knowledge Graphs

If you've ever tried analyzing multiple scientific papers or enterprise documents with AI, you know the frustration: traditional search treats each document as an isolated island, missing the connections between related concepts across sources.

LLM extraction of entities and relationships is not deterministic, either. The same concept might surface as "car manufacturer", "automobile maker", or "vehicle producer" depending on source phrasing. These fragments are semantically close, but structurally they're distinct nodes in a graph. The result: a fragmented, redundant knowledge base that degrades retrieval quality and breaks cross-document reasoning.

This is the graph fragmentation problem, and it's where ontologies come in.


What Is an Ontology in Cognee?

In Cognee, an ontology is an optional RDF/OWL file you provide as a reference vocabulary. It acts as a formal schema that ensures entity types ("classes") and entity mentions ("individuals") extracted from your data are linked to canonical, well-defined concepts.

A typical ontology defines:

  • Classes — types of things in a domain (e.g., CarManufacturer, ElectricCar, SoftwareCompany)
  • Individuals — specific instances of those classes (e.g., BMW, Tesla, Apple)
  • Object properties — relationships between entities (e.g., produces, develops)
  • Class hierarchies — inheritance via rdfs:subClassOf (e.g., ElectricCar is a subclass of Car)

Cognee parses ontologies via RDFLib, so any format RDFLib supports works: RDF/XML (.owl, .rdf), Turtle (.ttl), N-Triples, JSON-LD, and others.
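As an illustration, a minimal Turtle ontology covering the automotive examples above might look like this (the namespace is made up for the example):

```turtle
@prefix :     <http://example.org/auto#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Car             a owl:Class .
:ElectricCar     a owl:Class ; rdfs:subClassOf :Car .
:CarManufacturer a owl:Class .

:produces a owl:ObjectProperty ;
    rdfs:domain :CarManufacturer ;
    rdfs:range  :Car .

:Tesla a :CarManufacturer .    # individuals
:BMW   a :CarManufacturer .
```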


Why Use an Ontology?

Three concrete benefits, straight from the design:

Consistency — standardize how entities and types are represented across all your documents. "automobile maker" and "car manufacturer" collapse into a single canonical node.

Enrichment — bring inherited relationships from a domain schema into your graph automatically. If your ontology says ElectricCar is a subclass of Car, Cognee adds that structural relationship even when source documents never state it explicitly.

Control — align Cognee's graph with existing enterprise or scientific vocabularies. If your organization already uses SNOMED CT for medical concepts or FIBO for finance, you can ground Cognee's extraction in those same standards.


How It Works: Inside the Pipeline

The ontology system hooks into the cognify() step. Here's the flow:

Step 1 — LLM Extraction: Cognee uses Instructor-powered structured output to generate a KnowledgeGraph (a list of typed Node and Edge objects) from each document chunk. Entity names at this stage are unconstrained — entirely determined by the LLM's interpretation.

Step 2 — Resolver and Lookup: The RDFLibOntologyResolver parses your OWL file and builds two in-memory lookup dictionaries: one for classes, one for individuals. Keys are normalized to lowercase with underscores. The lookup is built once and cached.
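The normalization and lookup step can be pictured as follows — a toy sketch of the idea, not cognee's actual internals:

```python
def normalize_key(label: str) -> str:
    """Lowercase and underscore-join, e.g. 'Car Manufacturer' -> 'car_manufacturer'."""
    return label.strip().lower().replace(" ", "_")

def build_lookup(labels: list[str]) -> dict[str, str]:
    """Map normalized keys back to the original ontology labels.
    Built once per ontology and then reused for every extracted entity."""
    return {normalize_key(label): label for label in labels}

classes = build_lookup(["CarManufacturer", "ElectricCar", "SoftwareCompany"])
print(classes)
```

Normalizing both sides (ontology labels and LLM-extracted mentions) is what lets a free-text mention line up with a CamelCase class name at all.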

Step 3 — Fuzzy Matching: Exact matching would be too brittle. Cognee's FuzzyMatchingStrategy uses Python's difflib.get_close_matches() at a configurable cutoff (default: 0.80). It checks for an exact match first, then falls back to fuzzy matching. "car manufacturer" → CarManufacturer; "automobile maker" → same canonical node.

Step 4 — Canonicalization and Subgraph Expansion: For matched entities, Cognee replaces the LLM-derived name with the canonical ontology URI-derived name (eliminating cross-document duplicates), then performs a BFS traversal to extract the surrounding ontology structure — rdfs:subClassOf hierarchies, owl:ObjectProperty edges — and injects those relationships directly into the knowledge graph. Every node is tagged ontology_valid = True if matched, False otherwise.
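The subgraph expansion amounts to a BFS up the class hierarchy, collecting edges to inject. A minimal sketch over a dict-based toy hierarchy (not cognee's code):

```python
from collections import deque

# Toy subclass hierarchy (child -> parents), standing in for parsed rdfs:subClassOf triples.
subclass_of = {
    "ElectricCar": ["Car"],
    "Car": ["Vehicle"],
    "CarManufacturer": ["Manufacturer"],
}

def expand_subgraph(start: str) -> list[tuple[str, str, str]]:
    """BFS upward from a matched class, emitting the hierarchy edges to inject."""
    edges, queue, seen = [], deque([start]), {start}
    while queue:
        node = queue.popleft()
        for parent in subclass_of.get(node, []):
            edges.append((node, "rdfs:subClassOf", parent))
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return edges

print(expand_subgraph("ElectricCar"))
```

This is how a document that only ever mentions "electric car" still ends up connected to Vehicle in the final graph.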


A Real-World Example: Medical Research

The contrast between running with and without an ontology is most visible in complex domains like medicine. Consider analyzing a collection of research papers on cardiovascular disease, type 2 diabetes, and hypertension.

Without ontology, a query about "symptoms of cardiovascular diseases" might return results that are general and disconnected — information about individual papers without awareness of how concepts like obesity, diet, blood pressure, and cardiovascular risk are related to each other across the literature.

With a medical ontology (SNOMED CT, MeSH, or a curated subset), the same query returns detailed explanations linking symptoms to diseases, surfacing associations like obesity as a risk factor for cardiovascular disease, and connecting nutrient data across multiple studies. The knowledge graph visualization shows those connections explicitly — participants, age groups, dietary factors, conditions, and treatments all as linked nodes rather than disconnected fragments.

This isn't magic — it's the ontology providing the structural connective tissue that the LLM extraction alone can't infer reliably from isolated document chunks.


Sourcing and Preparing Your Ontology

Cognee works best with manually curated, focused ontologies tailored to your dataset. Large public ontologies like Wikidata or DBpedia define millions of classes — too broad to use wholesale. Matching precision drops, performance suffers, and you end up with false positives.

The practical approach is to work with a subset:

  • Pick only the terms (classes, properties, individuals) relevant to your domain
  • Extract those terms plus immediate context (parent classes, related properties)
  • Save in a format RDFLib can parse

Common public sources to draw subsets from:

  • General vocabularies: schema.org, Dublin Core Terms, SKOS, PROV-O, FOAF
  • Knowledge graph backbones: DBpedia Ontology, Wikidata (Wikibase RDF)
  • Healthcare: SNOMED CT (licensed), ICD, UMLS, MeSH, HL7/FHIR RDF
  • Finance: FIBO (Financial Industry Business Ontology)
  • Geo/IoT: GeoSPARQL, SOSA/SSN, GeoNames
  • Units: QUDT

If none of these fit, a small hand-crafted ontology with a few dozen classes and relationships is often more effective than an adapted public one. Start minimal, test matching behavior with resolver.find_closest_match(), and expand from there.


Configuring Ontologies in Cognee

Simplest — pass the file path directly to cognify():
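A minimal sketch of that call — the ontology_file_path keyword follows cognee's ontology demo examples, and the file paths are illustrative; verify the parameter name against the current API docs:

```python
import asyncio
import cognee

async def main():
    # Ingest a document, then build the graph grounded in the ontology.
    await cognee.add("data/cardio_study.txt")
    await cognee.cognify(ontology_file_path="ontologies/medicine.owl")

asyncio.run(main())
```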

Via environment variables:
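Something along these lines — the variable names here are assumptions, so check cognee's configuration reference before relying on them:

```shell
# Assumed variable names -- verify against cognee's configuration docs.
export ONTOLOGY_FILE_PATH="ontologies/medicine.owl"
export MATCHING_STRATEGY_CUTOFF="0.80"
```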

Programmatic control — custom resolver and matching cutoff:
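A sketch of the programmatic route. The import path and constructor arguments below are assumptions based on the module layout and class names described in this post; check them against the cognee source:

```python
import asyncio
import cognee
# Assumed import path -- the module lives under cognee/modules/ontology/.
from cognee.modules.ontology import RDFLibOntologyResolver, FuzzyMatchingStrategy

async def main():
    resolver = RDFLibOntologyResolver(
        ontology_file="ontologies/medicine.owl",
        matching_strategy=FuzzyMatchingStrategy(cutoff=0.85),  # stricter than the 0.80 default
    )
    await cognee.add("data/cardio_study.txt")
    await cognee.cognify(ontology_resolver=resolver)

asyncio.run(main())
```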

REST API for multi-tenant / SaaS contexts:
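For the REST route, something like the following — the endpoint and field names are hypothetical, so consult the cognee API reference:

```shell
# Hypothetical endpoint and field names -- check the cognee API reference.
curl -X POST "http://localhost:8000/api/v1/cognify" \
  -H "Authorization: Bearer $API_TOKEN" \
  -F "ontology_file=@ontologies/medicine.owl"
```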


Graceful Degradation

If no ontology is provided, Cognee uses a default resolver with an empty lookup. All entities receive ontology_valid = False and the graph is built entirely from LLM extraction — no errors, no broken pipeline. Ontology support is strictly additive.
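The fallback behavior can be pictured as a lookup miss that degrades to a flag, nothing more (a toy sketch, not cognee's code):

```python
def tag_entity(name: str, lookup: dict[str, str]) -> dict:
    """Tag an extracted entity. With the default empty lookup,
    every entity simply comes back with ontology_valid=False."""
    canonical = lookup.get(name.strip().lower().replace(" ", "_"))
    return {
        "name": canonical or name,        # keep the LLM-derived name if unmatched
        "ontology_valid": canonical is not None,
    }

print(tag_entity("car manufacturer", {}))  # no ontology configured
```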

This makes it safe to introduce incrementally: start with LLM-only extraction, validate the graph quality, then layer in an ontology as your domain model matures.


Getting Started

Full working examples are available in the cognee repository: the basic ontology demo and the advanced ontology demo.


The Bigger Picture

Cognee's ontology system is a concrete implementation of an idea that's easy to articulate but hard to execute well: LLM-extracted knowledge should be grounded in formal structure when that structure exists.

The pluggable BaseOntologyResolver interface means the system isn't locked to RDFLib or OWL. The FuzzyMatchingStrategy handles the messy reality that natural language and formal schemas rarely align perfectly. The ontology_valid flag on every node gives downstream consumers the information they need to make trust-based decisions about graph content. And the subsetting discipline keeps it performant and precise.

For developers building production AI memory systems — whether for medical research, financial analysis, or enterprise knowledge management — this is the kind of infrastructure that separates a working prototype from a system you can trust.


Cognee is open source. The ontology module lives at cognee/modules/ontology/. Full documentation at docs.cognee.ai and the community is active on Discord.

Cognee is the fastest way to start building reliable AI agent memory.
