💡 CRAG (Comprehensive RAG) is a new RAG benchmark dataset

CRAG: Comprehensive RAG Benchmark Dataset

Curiosity: How can we create a realistic benchmark for RAG systems? What makes CRAG more challenging than existing datasets?

CRAG (Comprehensive RAG) is a new benchmark dataset that provides robust and challenging test cases for evaluating RAG and QA systems. Even advanced LLMs such as GPT-4 reach less than 34% accuracy on it, which shows how hard the benchmark is.

Paper: https://arxiv.org/pdf/2406.04744

The Problem

Retrieve: Existing RAG datasets have limitations.

| Issue | Description | Impact |
| --- | --- | --- |
| Lack of Diversity | Limited question types | ⚠️ Incomplete evaluation |
| Complexity Gap | Don’t represent real-world QA | ⚠️ Suboptimal assessment |
| Evaluation Issues | Poor performance metrics | ⚠️ Unreliable results |

Result: Suboptimal performance evaluation of RAG systems.

CRAG Dataset Overview

Innovate: Comprehensive benchmark for RAG evaluation.

graph TB
    A[CRAG Dataset] --> B[4,409 QA Pairs]
    A --> C[5 Domains]
    A --> D[8 Question Categories]
    A --> E[Mock APIs]
    A --> F[Score System]
    
    E --> E1[Web Search]
    E --> E2[KG Search]
    
    F --> F1[Penalize Hallucinations]
    F --> F2[Reliable Evaluation]
    
    style A fill:#e1f5ff
    style B fill:#fff3cd
    style F fill:#d4edda

Dataset Features

Retrieve: CRAG’s comprehensive features.

| Feature | Details | Benefit |
| --- | --- | --- |
| QA Pairs | 4,409 pairs | ⬆️ Large scale |
| Domains | 5 domains | ⬆️ Diversity |
| Categories | 8 question types | ⬆️ Coverage |
| Complexity | Simple facts to complex queries | ⬆️ Real-world |
| Mock APIs | Web and KG search | ⬆️ Realistic |
| Score System | Penalizes hallucinations | ⬆️ Reliable |

Evaluation Tasks

Innovate: Comprehensive task coverage.

Task Types:

  • Web Retrieval: Realistic web search scenarios
  • Structured Querying: Knowledge Graph queries
  • Summarization: Multi-document summarization

Coverage: From simple facts to complex multi-hop queries.
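To make the web and KG retrieval tasks more concrete, here is a minimal sketch of how a RAG system might sit behind two search interfaces. Everything in it is an assumption for illustration: the names `WebSearch`, `KGSearch`, `InMemoryWeb`, `InMemoryKG`, and `gather_context` are hypothetical and are not CRAG's actual mock API, and the toy page and triple are made up.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Page:
    url: str
    text: str


class WebSearch(Protocol):
    """Hypothetical interface for a web-search backend."""
    def search(self, query: str, top_k: int) -> list[Page]: ...


class KGSearch(Protocol):
    """Hypothetical interface for a knowledge-graph backend."""
    def lookup(self, entity: str, relation: str) -> list[str]: ...


class InMemoryWeb:
    """Toy web search: ranks pages by word overlap with the query."""
    def __init__(self, pages):
        self._pages = list(pages)

    def search(self, query, top_k=5):
        terms = set(query.lower().split())
        scored = [(len(terms & set(p.text.lower().split())), p) for p in self._pages]
        # Keep only pages sharing at least one word, best overlap first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [page for _, page in ranked[:top_k]]


class InMemoryKG:
    """Toy knowledge graph: (subject, relation) -> list of objects."""
    def __init__(self, triples):
        self._index = {}
        for subj, rel, obj in triples:
            self._index.setdefault((subj.lower(), rel.lower()), []).append(obj)

    def lookup(self, entity, relation):
        return list(self._index.get((entity.lower(), relation.lower()), []))


def gather_context(question, entity, relation, web, kg, top_k=2):
    """Collect structured facts and free-text passages for one question."""
    return {
        "kg_facts": kg.lookup(entity, relation),
        "passages": [p.text for p in web.search(question, top_k)],
    }


if __name__ == "__main__":
    # Made-up data, only to exercise the interfaces.
    web = InMemoryWeb([Page("https://example.com/a", "Acme Corp was founded in 1999")])
    kg = InMemoryKG([("Acme Corp", "founded", "1999")])
    print(gather_context("When was Acme Corp founded", "Acme Corp", "founded", web, kg))
```

Keeping both backends behind small interfaces means the same pipeline can be tested against in-memory stubs first and against a benchmark's own mock APIs later.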

Performance Results

Retrieve: CRAG reveals significant challenges.

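| System | Result | Notes |
| --- | --- | --- |
| Advanced LLMs (GPT-4) | <34% accuracy | Highlights challenge |
| Direct RAG | 44% accuracy | Needs improvement |
| SOTA Industry RAG | 63% | Share of questions answered without hallucination |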

Key Findings:

  • Even the best LLMs struggle (<34% accuracy)
  • Direct RAG only reaches 44% accuracy
  • Industry solutions answer 63% of questions without hallucination (best case)

Score System Innovation

Innovate: Better evaluation through hallucination penalties.

Key Feature: Penalizes hallucinated answers more than missing answers

Benefits:

  • ✅ Encourages accuracy over completeness
  • ✅ Reduces false information
  • ✅ More reliable evaluation
  • ✅ Better reflects real-world needs

Key Takeaways

Retrieve: CRAG provides a comprehensive benchmark with 4,409 QA pairs across 5 domains and 8 categories, including realistic retrieval scenarios and a score system that penalizes hallucinations.

Innovate: By creating a challenging benchmark that even GPT-4 struggles with (<34% accuracy), CRAG encourages development of more advanced RAG solutions; even state-of-the-art industry RAG solutions answer only 63% of questions without hallucination.

Curiosity → Retrieve → Innovation: Start with curiosity about RAG evaluation, retrieve insights from CRAG’s comprehensive approach, and innovate by building RAG systems that can handle the complexity and diversity of real-world QA tasks.

Next Steps:

  • Read the full paper
  • Test your RAG on CRAG
  • Analyze performance gaps
  • Improve your systems
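For the "analyze performance gaps" step, one option is to break graded results down by domain or question category instead of looking at a single overall number. The sketch below assumes each graded row is a dict with a `verdict` of `"correct"`, `"missing"`, or `"hallucinated"` plus a grouping field. The field names, the `score_by_slice` helper, the domain labels, and the +1/0/-1 weighting are all assumptions for illustration, and the rows are made up.

```python
from collections import defaultdict


def score_by_slice(results, key):
    """Group graded results by `key` and report rates plus a penalized score."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row[key]].append(row["verdict"])

    report = {}
    for name, verdicts in buckets.items():
        n = len(verdicts)
        correct = verdicts.count("correct")
        hallucinated = verdicts.count("hallucinated")
        report[name] = {
            "n": n,
            "accuracy": correct / n,
            "hallucination": hallucinated / n,
            # Assumed weighting: +1 correct, 0 missing, -1 hallucinated.
            "score": (correct - hallucinated) / n,
        }
    return report


if __name__ == "__main__":
    # Made-up graded rows; the domain labels are placeholders.
    rows = [
        {"domain": "finance", "verdict": "correct"},
        {"domain": "finance", "verdict": "hallucinated"},
        {"domain": "sports", "verdict": "correct"},
        {"domain": "sports", "verdict": "missing"},
    ]
    for domain, stats in score_by_slice(rows, "domain").items():
        print(domain, stats)
```

Slices with a high hallucination rate are usually the first place to look, since they pull the penalized score down the most.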

Figure: CRAG abstract

Summary (translated from Korean)

😉 Here is a challenging, real-world benchmark for testing your RAG pipelines! Even LLMs such as GPT-4 struggle on it, reaching less than 34% accuracy.

Existing RAG datasets lack diversity and fail to capture the complexity of real-world QA tasks, which leads to suboptimal performance evaluation.

💡 CRAG (Comprehensive RAG) is a new RAG benchmark dataset that provides robust and challenging test cases for evaluating RAG and QA systems, encouraging progress toward reliable LLM-based question answering.

  • ⛳ CRAG contains 4,409 QA pairs across 5 domains and 8 question categories, ranging from simple facts to complex queries.
  • ⛳ It provides mock APIs for web and KG (Knowledge Graph) search, giving realistic retrieval scenarios.
  • ⛳ It introduces a score system that penalizes hallucinated answers more than missing answers, for reliable evaluation.
  • ⛳ It offers tasks for web retrieval, structured querying, and summarization, so RAG solutions can be evaluated comprehensively.

Contributions

  • 👉 The most advanced LLMs achieve <34% accuracy on CRAG, highlighting the challenge.
  • 👉 Direct application of RAG improves accuracy to only 44%, indicating the need for more advanced solutions.
  • 👉 State-of-the-art industry RAG solutions answer 63% of questions without hallucination.

This post is licensed under CC BY 4.0 by the author.