Cognee Benchmark Results
Production-ready AI Memory, tested and measured.
Benchmark Overview
We benchmarked Cognee against leading memory frameworks, including Mem0, Graphiti, and LightRAG, using a subset of 24 HotPotQA multi-hop questions designed to test complex reasoning and factual consistency. Benchmarks were executed on Modal Cloud with 45 repeated runs per system to ensure reproducibility and to reduce the noise introduced by LLM-based evaluation variance.
Key performance metrics
Results for Cognee
[Chart: Human-like correctness, DeepEval correctness, DeepEval F1, DeepEval EM]
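For QA benchmarks like HotPotQA, F1 and EM are conventionally defined as token-level overlap and normalized exact match between the predicted and gold answers. The sketch below shows those conventional definitions in plain Python, along with per-run averaging over repeated runs; it is illustrative only and may differ in detail from the exact scoring implementation used in these benchmarks, and the sample data is made up.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answers."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative (prediction, gold) pairs for two repeated runs; real runs cover the full subset.
runs = [
    [("Chicago", "Chicago, Illinois"), ("1912", "1912")],
    [("Chicago, Illinois", "Chicago, Illinois"), ("1912", "1912")],
]
per_run_f1 = [sum(f1_score(p, g) for p, g in run) / len(run) for run in runs]
print(sum(per_run_f1) / len(per_run_f1))  # mean F1 across repeated runs
```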
Real-World Evaluation
Unlike typical QA tests that reward surface-level matches, our benchmark measures information correctness, reasoning depth, and faithfulness. We evaluate answers not just by what they say, but by whether they actually make sense.
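As an illustration of what an LLM-judged correctness metric of this kind can look like, the sketch below grades a system answer against a gold answer with a rubric prompt. The model name, rubric wording, and judge_correctness helper are hypothetical; this is neither Cognee's actual judge prompt nor DeepEval's implementation, and judge variance of exactly this sort is why runs are repeated.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a question-answering system. Given the question, the gold "
    "answer, and the system's answer, reply with a single number between 0 and 1: "
    "1 if the system's answer conveys the same information as the gold answer and "
    "is faithful to it, 0 if it is wrong or unsupported, and intermediate values "
    "for partially correct answers. Reply with the number only."
)


def judge_correctness(question: str, gold: str, answer: str) -> float:
    """Ask an LLM judge to score semantic correctness rather than string overlap."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": f"Question: {question}\nGold answer: {gold}\nSystem answer: {answer}",
            },
        ],
        temperature=0,
    )
    # Assumes the judge follows the "number only" instruction.
    return float(response.choices[0].message.content.strip())


score = judge_correctness(
    "Which city hosted the 1904 Summer Olympics?",
    "St. Louis",
    "The 1904 Summer Olympics were held in St. Louis, Missouri.",
)
print(score)
```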
Optimized Cognee configurations
Cognee Graph Completion with Chain-of-Thought (CoT) shows significant performance improvements over the previous non-optimized configuration:
Human-like Correctness: +25% (0.738 → 0.925)
DeepEval Correctness: +49% (0.569 → 0.846)
DeepEval F1: +314% (0.203 → 0.841)
DeepEval EM: +1618% (0.04 → 0.687)
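For orientation, here is a minimal sketch of what a graph-completion query against Cognee's Python API can look like. It assumes the async cognee.add / cognee.cognify / cognee.search entry points and a GRAPH_COMPLETION search type; the exact name and availability of the chain-of-thought variant differ between releases, so treat this as illustrative rather than as the benchmarked configuration.

```python
import asyncio

import cognee
from cognee import SearchType  # assumption: SearchType is exported at the top level


async def main() -> None:
    # Ingest source documents into Cognee's memory layer.
    await cognee.add("Alan Turing was born in London in 1912 and studied at Cambridge.")

    # Build the knowledge graph and vector index over the ingested data.
    await cognee.cognify()

    # Ask a multi-hop question through the graph-completion retriever.
    answers = await cognee.search(
        query_text="Where was the author of 'Computing Machinery and Intelligence' born?",
        query_type=SearchType.GRAPH_COMPLETION,
    )
    print(answers)


if __name__ == "__main__":
    asyncio.run(main())
```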
Comprehensive Metrics Comparison
1. Cognee delivers the most contextually accurate and human-like answers across all evaluated systems.
2. Its hybrid graph + vector memory produces responses that reflect true comprehension rather than mere keyword overlap.
3. Cognee's architecture runs on Modal's distributed infrastructure, scaling easily from single-instance tests to multi-node workloads (see the sketch after this list).
4. Its graph-completion retrievers consistently outperform simpler retrievers in both correctness and performance.
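To illustrate the scaling point in item 3, below is a hedged sketch of fanning repeated benchmark runs out on Modal with Function.map. The app name, container image, and run_benchmark body are placeholders rather than the actual harness behind the numbers above.

```python
import modal

# Placeholder image and app; a real benchmark image would pin cognee and its dependencies.
image = modal.Image.debian_slim().pip_install("cognee")
app = modal.App("cognee-benchmark", image=image)


@app.function(timeout=3600)
def run_benchmark(run_id: int) -> dict:
    """One full benchmark run; returns the metrics for that run (placeholder body)."""
    # ... load the HotPotQA subset, query the memory system, score the answers ...
    return {"run_id": run_id, "f1": 0.0, "em": 0.0}


@app.local_entrypoint()
def main():
    # Fan 45 repeated runs out across containers and average the results.
    results = list(run_benchmark.map(range(45)))
    mean_f1 = sum(r["f1"] for r in results) / len(results)
    print(f"mean F1 over {len(results)} runs: {mean_f1:.3f}")
```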
Looking for a custom deployment? Chat with our engineers!