
💡 CRAG (Comprehensive RAG) is a new RAG benchmark dataset

😉 Here's a tough real-world benchmark to put your RAG pipeline to the test! Even the most advanced LLMs, such as GPT-4, struggle, reaching at most 34% accuracy.

Existing RAG datasets lack diversity and fail to represent the complexity of real-world QA tasks, so evaluations built on them give an incomplete picture of how a system actually performs.

💡 CRAG (Comprehensive RAG) is a new RAG benchmark dataset that provides robust, challenging test cases for evaluating RAG and QA systems, with the aim of encouraging progress in reliable LLM-based question answering.

  • ⛳ CRAG includes 4,409 QA pairs across five domains and eight question categories, ranging from simple facts to complex queries.
  • ⛳ It provides mock APIs for web and Knowledge Graph (KG) search, simulating realistic retrieval scenarios.
  • ⛳ It introduces a scoring system that penalizes hallucinated answers more heavily than missing answers, so that a system is rewarded for abstaining rather than guessing wrong.
  • ⛳ It offers tasks covering web retrieval, structured querying, and summarization, allowing comprehensive evaluation of RAG solutions.
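
To make the scoring idea concrete, here is a minimal sketch of a metric that penalizes hallucinated answers more than missing ones. The label names, the weights (+1 for correct, 0 for missing, -1 for hallucinated), and the function name `benchmark_score` are assumptions chosen for illustration, not the benchmark's official implementation; check the paper for the exact scheme.

```python
from collections import Counter

# Assumed per-label weights (hypothetical, for illustration only):
# a wrong answer costs a point, while an abstention costs nothing.
WEIGHTS = {"correct": 1.0, "missing": 0.0, "hallucinated": -1.0}


def benchmark_score(labels):
    """Return the mean weight over a list of per-question labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    unknown = set(counts) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(WEIGHTS[label] * n for label, n in counts.items())
    return total / len(labels)


# Two hypothetical systems with the same number of correct answers.
cautious = ["correct"] * 6 + ["missing"] * 4       # says "I don't know" when unsure
guessing = ["correct"] * 6 + ["hallucinated"] * 4  # answers wrongly when unsure

print(benchmark_score(cautious))  # 0.6
print(benchmark_score(guessing))  # 0.2
```

Both systems have the same plain accuracy (60%), but under these assumed weights the one that abstains scores three times higher than the one that guesses, which is the behavior such a metric is meant to reward.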

Contribution

  • 👉 The most advanced LLMs achieve at most 34% accuracy on CRAG, which shows how hard the benchmark is.
  • 👉 Adding RAG in a straightforward way improves accuracy to only 44%, indicating the need for more advanced solutions.
  • 👉 State-of-the-art industry RAG solutions answer only 63% of questions without any hallucination.

Link to the paper: https://arxiv.org/pdf/2406.04744

 CRAG abstract

This post is licensed under CC BY 4.0 by the author.