Evaluation Results

AI Memory Benchmark Results

Understanding how well different AI memory systems retain and utilize context across interactions is crucial for enhancing LLM performance.

We evaluated the cognee AI memory system against other leading tools. The comparison covers cognee with Dreamify (our proprietary tool), cognee in its vanilla setting, Zep/Graphiti, and Mem0. The analysis compares performance metrics to help developers select an AI memory solution for their applications.

The evaluation results are based on the following metrics:

Key Performance Metrics

Results for Cognee (Dreamify)

Human-LLM Correctness: 0.89
DeepEval Correctness: 0.75
DeepEval F1: 0.71
DeepEval EM: 0.54
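
For readers unfamiliar with F1 and EM (exact match) as answer-quality metrics, the sketch below shows one common way they are computed for question answering: token overlap for F1 and normalized string equality for EM. The normalization rules and the function names `normalize`, `exact_match`, and `token_f1` are assumptions for illustration. The benchmark itself may compute these scores differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    These normalization rules are an assumption, not the benchmark's own.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions an answer such as "Paris, France" against the reference "Paris" scores 0 on EM but still earns partial credit on F1, which is one reason EM scores tend to sit well below F1 scores.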

Benchmark Comparison

Dreamify: our hyperparameter framework increases accuracy further

Cognee with Dreamify improves on all four metrics. Percentages are relative changes, with the before and after scores in parentheses:

Human-LLM Correctness: ~+6% (0.84 → 0.89)
DeepEval Correctness: ~+32% (0.57 → 0.75)
DeepEval F1: ~+255% (0.20 → 0.71)
DeepEval EM: ~+1250% (0.04 → 0.54)
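
The percentages above are relative changes, so a small baseline score produces a very large percentage even when the absolute gain is modest. A minimal sketch of the arithmetic, using the before and after scores listed above (the helper name `relative_change_pct` is illustrative):

```python
def relative_change_pct(baseline: float, improved: float) -> float:
    """Relative change from baseline to improved, in percent."""
    return (improved - baseline) / baseline * 100.0


# (before, after) score pairs taken from the list above.
scores = {
    "Human-LLM Correctness": (0.84, 0.89),
    "DeepEval Correctness": (0.57, 0.75),
    "DeepEval F1": (0.20, 0.71),
    "DeepEval EM": (0.04, 0.54),
}

for name, (before, after) in scores.items():
    absolute_gain = after - before
    print(
        f"{name}: {relative_change_pct(before, after):+.0f}% "
        f"(absolute gain {absolute_gain:+.2f})"
    )
```

Reading the absolute gain next to the relative change helps when comparing metrics: the EM gain of 0.50 points and the F1 gain of 0.51 points are nearly equal in absolute terms, even though their relative changes differ by about a factor of five.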

Comprehensive Metrics Comparison

Dive Deeper

Read Our Blog Post: Insights into our evaluation methods, implications for AI development, and a deeper analysis of results.

Access the Benchmark Code: Available on GitHub to replicate and validate our evaluations independently.

What is Next?

Continuous improvement is key. We are actively enhancing our benchmarks, integrating new metrics, and evaluating additional AI memory solutions. Stay tuned for updates and more detailed analysis.
