CRAG: Comprehensive RAG Benchmark Dataset

Curiosity: How can we create a realistic benchmark for RAG systems? What makes CRAG more challenging than existing datasets?

CRAG (Comprehensive RAG) is a new benchmark dataset that provides robust and challenging test cases for evaluating RAG and QA systems. Even GPT-4 struggles, achieving less than 34% accuracy, highlighting the challenge.

Paper: https://arxiv.org/pdf/2406.04744

The Problem

Retrieve: Existing RAG datasets have limitations.

| Issue | Description | Impact |
| --- | --- | --- |
| Lack of diversity | Limited question types | ⚠️ Incomplete evaluation |
| Complexity gap | Doesn't represent real-world QA | ⚠️ Suboptimal assessment |
| Evaluation issues | Poor performance metrics | ⚠️ Unreliable results |

Result: Suboptimal performance evaluation of RAG systems.

CRAG Dataset Overview

Innovate: Comprehensive benchmark for RAG evaluation.

```mermaid
graph TB
    A[CRAG Dataset] --> B[4,409 QA Pairs]
    A --> C[5 Domains]
    A --> D[8 Question Categories]
    A --> E[Mock APIs]
    A --> F[Score System]

    E --> E1[Web Search]
    E --> E2[KG Search]

    F --> F1[Penalize Hallucinations]
    F --> F2[Reliable Evaluation]

    style A fill:#e1f5ff
    style B fill:#fff3cd
    style F fill:#d4edda
```

Dataset Features

Retrieve: CRAG’s comprehensive features.

| Feature | Details | Benefit |
| --- | --- | --- |
| QA pairs | 4,409 pairs | ⬆️ Large scale |
| Domains | 5 domains | ⬆️ Diversity |
| Categories | 8 question types | ⬆️ Coverage |
| Complexity | Simple facts to complex queries | ⬆️ Real-world realism |
| Mock APIs | Web and KG search | ⬆️ Realistic retrieval |
| Score system | Penalizes hallucinations | ⬆️ Reliable evaluation |
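For hands-on inspection, here is a minimal loading sketch, assuming the released QA pairs ship as JSONL with `domain` and `question_type` fields (the field names are illustrative; check the dataset's actual schema before relying on them):

```python
import json
from collections import Counter

def load_crag(path):
    """Read one QA record per line from a JSONL file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def summarize(records):
    """Tally records per domain and per question category."""
    return {
        "domains": Counter(r.get("domain") for r in records),
        "question_types": Counter(r.get("question_type") for r in records),
    }
```

With the full benchmark loaded, `summarize` should report five domains and eight question categories.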

Evaluation Tasks

Innovate: Comprehensive task coverage.

Task Types:

  • Web Retrieval: Realistic web search scenarios
  • Structured Querying: Knowledge Graph queries
  • Summarization: Multi-document summarization

Coverage: From simple facts to complex multi-hop queries.
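To exercise these retrieval tasks locally before touching the real benchmark, a toy stand-in for CRAG's mock web and KG endpoints can look like this (class and method names are illustrative, not CRAG's actual API):

```python
class MockRetriever:
    """Toy stand-in for CRAG-style mock web search and KG search endpoints."""

    def __init__(self, web_pages, kg_triples):
        self.web_pages = web_pages  # list of (title, text) pairs
        self.kg = kg_triples        # list of (subject, relation, object) triples

    def web_search(self, query, k=3):
        """Naive keyword overlap over page text, returning the top-k hits."""
        terms = query.lower().split()
        scored = [(sum(t in text.lower() for t in terms), title, text)
                  for title, text in self.web_pages]
        scored.sort(reverse=True)
        return [(title, text) for score, title, text in scored[:k] if score > 0]

    def kg_search(self, subject, relation):
        """Exact-match lookup over the stored triples."""
        return [o for s, r, o in self.kg if s == subject and r == relation]
```

Swapping this stand-in for the benchmark's real mock APIs lets you debug the RAG pipeline's plumbing separately from retrieval quality.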

Performance Results

Retrieve: CRAG reveals significant challenges.

| System | Accuracy | Notes |
| --- | --- | --- |
| Advanced LLMs (GPT-4) | <34% | Highlights the challenge |
| Direct RAG | 44% | Needs improvement |
| SOTA industry RAG | 63% | Without hallucination |

Key Findings:

  • Even best LLMs struggle (<34%)
  • Direct RAG only reaches 44%
  • Industry solutions achieve 63% (best case)

Score System Innovation

Innovate: Better evaluation through hallucination penalties.

Key Feature: Penalizes hallucinated answers more than missing answers

Benefits:

  • βœ… Encourages accuracy over completeness
  • βœ… Reduces false information
  • βœ… More reliable evaluation
  • βœ… Better reflects real-world needs
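The scheme above can be sketched as a simple scoring function. The weights here (+1 accurate, 0 missing, -1 hallucinated) follow the paper's description of penalizing hallucinations more heavily than abstentions, but treat them as an assumption if your CRAG version specifies different values:

```python
def crag_score(results):
    """Average score over answers: +1 accurate, 0 missing, -1 hallucinated.

    `results` is a list of labels: "accurate", "missing", or "hallucinated".
    Weights are assumed from the paper's description of the score system.
    """
    weights = {"accurate": 1.0, "missing": 0.0, "hallucinated": -1.0}
    if not results:
        return 0.0
    return sum(weights[r] for r in results) / len(results)

# A system that answers 6/10 correctly, abstains twice, hallucinates twice:
print(crag_score(["accurate"] * 6 + ["missing"] * 2 + ["hallucinated"] * 2))  # 0.4
```

Under these weights, abstaining on uncertain questions beats guessing wrong, which is exactly the incentive the benchmark intends.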

Key Takeaways

Retrieve: CRAG provides a comprehensive benchmark with 4,409 QA pairs across 5 domains and 8 categories, including realistic retrieval scenarios and a score system that penalizes hallucinations.

Innovate: By creating a challenging benchmark that even GPT-4 struggles with (<34% accuracy), CRAG encourages development of more advanced RAG solutions, with industry SOTA reaching 63% accuracy.

Curiosity β†’ Retrieve β†’ Innovation: Start with curiosity about RAG evaluation, retrieve insights from CRAG’s comprehensive approach, and innovate by building RAG systems that can handle the complexity and diversity of real-world QA tasks.

Next Steps:

  • Read the full paper
  • Test your RAG on CRAG
  • Analyze performance gaps
  • Improve your systems

CRAG Abstract


πŸ˜‰ λ‹€μŒμ€ RAG νŒŒμ΄ν”„λΌμΈμ„ ν…ŒμŠ€νŠΈν•  수 μžˆλŠ” μ–΄λ €μš΄ μ‹€μ œ λ²€μΉ˜λ§ˆν¬μž…λ‹ˆλ‹€! GPT-4와 같은 LLM쑰차도 34% 미만의 정확도λ₯Ό λ‹¬μ„±ν•˜λŠ” 데 어렀움을 κ²ͺκ³  μžˆμŠ΅λ‹ˆλ‹€.

κΈ°μ‘΄ RAG 데이터 μ„ΈνŠΈλŠ” 닀양성이 λΆ€μ‘±ν•˜κ³  μ‹€μ œ QA μž‘μ—…μ˜ λ³΅μž‘μ„±μ„ λ‚˜νƒ€λ‚΄μ§€ λͺ»ν•˜μ—¬ μ„±λŠ₯ 평가가 μ΅œμ ν™”λ˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.

πŸ’‘ CRAG(Comprehensive RAG)λŠ” RAG 및 QA μ‹œμŠ€ν…œμ„ ν‰κ°€ν•˜κΈ° μœ„ν•œ κ°•λ ₯ν•˜κ³  도전적인 ν…ŒμŠ€νŠΈ μΌ€μ΄μŠ€λ₯Ό μ œκ³΅ν•˜λŠ” μƒˆλ‘œμš΄ RAG 벀치마크 데이터 μ„ΈνŠΈλ‘œ, μ‹ λ’°ν•  수 μžˆλŠ” LLM 기반 질문 λ‹΅λ³€μ˜ λ°œμ „μ„ μž₯λ €ν•©λ‹ˆλ‹€.

  • β›³ CRAGμ—λŠ” 5개 도메인과 8개 질문 범주에 걸쳐 4,409개의 QA 쌍이 ν¬ν•¨λ˜μ–΄ 있으며, κ°„λ‹¨ν•œ 사싀뢀터 λ³΅μž‘ν•œ μΏΌλ¦¬κΉŒμ§€ λ‹€λ£Ήλ‹ˆλ‹€.
  • β›³ μ›Ή 및 KG(Knowledge Graph) 검색을 μœ„ν•œ λͺ¨μ˜ APIλ₯Ό μ œκ³΅ν•˜μ—¬ ν˜„μ‹€μ μΈ 검색 μ‹œλ‚˜λ¦¬μ˜€λ₯Ό μ œκ³΅ν•©λ‹ˆλ‹€.
  • β›³ λ―Έκ²° 닡변보닀 ν™˜κ°μ— κ±Έλ¦° 닡변에 더 λ§Žμ€ νŽ˜λ„ν‹°λ₯Ό μ£ΌλŠ” 점수 μ‹œμŠ€ν…œμ„ λ„μž…ν•˜μ—¬ μ‹ λ’°ν•  수 μžˆλŠ” 평가λ₯Ό 보μž₯ν•©λ‹ˆλ‹€.
  • β›³ μ›Ή 검색, ꡬ쑰적 쿼리 및 μš”μ•½μ„ μœ„ν•œ μž‘μ—…μ„ μ œκ³΅ν•˜μ—¬ RAG μ†”λ£¨μ…˜μ„ μ’…ν•©μ μœΌλ‘œ 평가할 수 μžˆμŠ΅λ‹ˆλ‹€.

Contributions

  • πŸ‘‰ The most advanced LLMs achieve <34% accuracy on CRAG, highlighting the challenge.
  • πŸ‘‰ Direct application of RAG improves accuracy to only 44%, indicating the need for more advanced solutions.
  • πŸ‘‰ State-of-the-art industry RAG solutions reach 63% accuracy without hallucination.
This post is licensed under CC BY 4.0 by the author.