Cognee Benchmark Results
Production-ready AI Memory, tested and measured.
Benchmark Overview
We benchmarked Cognee against leading memory frameworks, including Mem0, Graphiti, and LightRAG, using a subset of 24 HotPotQA multi-hop questions designed to test complex reasoning and factual consistency. Benchmarks were executed on Modal Cloud with 45 repeated runs per system to ensure reproducibility and to reduce the noise introduced by LLM-based evaluation variance.
Key performance metrics
Results for Cognee
[Chart: Human-like correctness, DeepEval correctness, DeepEval F1, DeepEval EM]
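For QA benchmarks like HotPotQA, F1 and EM are conventionally defined as token-level overlap and normalized exact match between the predicted and gold answers. The sketch below shows those conventional definitions in plain Python, along with per-run averaging over repeated runs; it is illustrative only and may differ in detail from the exact scoring implementation used in these benchmarks, and the sample data is made up.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answers."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative (prediction, gold) pairs for two repeated runs; real runs cover the full subset.
runs = [
    [("Chicago", "Chicago, Illinois"), ("1912", "1912")],
    [("Chicago, Illinois", "Chicago, Illinois"), ("1912", "1912")],
]
per_run_f1 = [sum(f1_score(p, g) for p, g in run) / len(run) for run in runs]
print(sum(per_run_f1) / len(per_run_f1))  # mean F1 across repeated runs
```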
Real-World Evaluation
Unlike typical QA tests that reward surface-level matches, our benchmark measures information correctness, reasoning depth, and faithfulness. We evaluate answers not just by what they say, but by whether they actually make sense.
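As an illustration of what an LLM-judged correctness metric of this kind can look like, the sketch below grades a system answer against a gold answer with a rubric prompt. The model name, rubric wording, and judge_correctness helper are hypothetical; this is neither Cognee's actual judge prompt nor DeepEval's implementation, and judge variance of exactly this sort is why runs are repeated.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a question-answering system. Given the question, the gold "
    "answer, and the system's answer, reply with a single number between 0 and 1: "
    "1 if the system's answer conveys the same information as the gold answer and "
    "is faithful to it, 0 if it is wrong or unsupported, and intermediate values "
    "for partially correct answers. Reply with the number only."
)


def judge_correctness(question: str, gold: str, answer: str) -> float:
    """Ask an LLM judge to score semantic correctness rather than string overlap."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": f"Question: {question}\nGold answer: {gold}\nSystem answer: {answer}",
            },
        ],
        temperature=0,
    )
    # Assumes the judge follows the "number only" instruction.
    return float(response.choices[0].message.content.strip())


score = judge_correctness(
    "Which city hosted the 1904 Summer Olympics?",
    "St. Louis",
    "The 1904 Summer Olympics were held in St. Louis, Missouri.",
)
print(score)
```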
Optimized Cognee configurations
Cognee Graph Completion with Chain-of-Thought (CoT) shows significant performance improvements over the previous non-optimized configuration:
Human-like Correctness: +25% (0.738 → 0.925)
DeepEval Correctness: +49% (0.569 → 0.846)
DeepEval F1: +314% (0.203 → 0.841)
DeepEval EM: +1618% (0.04 → 0.687)
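For orientation, here is a minimal sketch of what a graph-completion query against Cognee's Python API can look like. It assumes the async cognee.add / cognee.cognify / cognee.search entry points and a GRAPH_COMPLETION search type; the exact name and availability of the chain-of-thought variant differ between releases, so treat this as illustrative rather than as the benchmarked configuration.

```python
import asyncio

import cognee
from cognee import SearchType  # assumption: SearchType is exported at the top level


async def main() -> None:
    # Ingest source documents into Cognee's memory layer.
    await cognee.add("Alan Turing was born in London in 1912 and studied at Cambridge.")

    # Build the knowledge graph and vector index over the ingested data.
    await cognee.cognify()

    # Ask a multi-hop question through the graph-completion retriever.
    answers = await cognee.search(
        query_text="Where was the author of 'Computing Machinery and Intelligence' born?",
        query_type=SearchType.GRAPH_COMPLETION,
    )
    print(answers)


if __name__ == "__main__":
    asyncio.run(main())
```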
Comprehensive Metrics Comparison
1. Cognee delivers the most contextually accurate and human-like answers across all evaluated systems.
2. Its hybrid graph + vector memory produces responses that reflect true comprehension rather than mere keyword overlap.
3. Cognee's architecture runs on Modal's distributed infrastructure, scaling easily from single-instance tests to multi-node workloads (see the sketch after this list).
4. Its graph-completion retrievers consistently outperform simpler retrievers in both correctness and performance.
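To illustrate the scaling point in item 3, below is a hedged sketch of fanning repeated benchmark runs out on Modal with Function.map. The app name, container image, and run_benchmark body are placeholders rather than the actual harness behind the numbers above.

```python
import modal

# Placeholder image and app; a real benchmark image would pin cognee and its dependencies.
image = modal.Image.debian_slim().pip_install("cognee")
app = modal.App("cognee-benchmark", image=image)


@app.function(timeout=3600)
def run_benchmark(run_id: int) -> dict:
    """One full benchmark run; returns the metrics for that run (placeholder body)."""
    # ... load the HotPotQA subset, query the memory system, score the answers ...
    return {"run_id": run_id, "f1": 0.0, "em": 0.0}


@app.local_entrypoint()
def main():
    # Fan 45 repeated runs out across containers and average the results.
    results = list(run_benchmark.map(range(45)))
    mean_f1 = sum(r["f1"] for r in results) / len(results)
    print(f"mean F1 over {len(results)} runs: {mean_f1:.3f}")
```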
Looking for a custom deployment? Chat with our engineers!