
How to evaluate an LLM?

LLM evaluation frameworks & tools every AI/ML engineer should know.

LLM evaluation frameworks

LLM evaluation frameworks and tools are important because they provide standardized benchmarks to measure and improve the performance, reliability and fairness of language models.

It is also very important to have metrics in place to evaluate LLMs. These metrics act as scoring mechanisms that assess an LLM's outputs against given criteria.
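To make this concrete, here is a minimal sketch of one such scoring mechanism: a normalized exact-match metric averaged over a batch of outputs. The function names and sample data are hypothetical, not taken from any particular framework.

```python
# Minimal sketch of a scoring metric for LLM outputs (illustrative only;
# the function names and sample data below are hypothetical).

def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized prediction matches the reference, else 0.0."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().strip().split())
    return float(normalize(prediction) == normalize(reference))

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Average exact-match score over a batch of model outputs."""
    scores = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    preds = ["Paris", "4", "a blue whale"]
    refs = ["paris", "4", "Blue whale"]
    print(f"accuracy = {accuracy(preds, refs):.2f}")  # 2 of 3 match -> 0.67
```

Real frameworks layer richer criteria on top of this idea (semantic similarity, faithfulness, LLM-as-a-judge), but they all reduce to scoring outputs against some reference or rubric.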

Here is my article on evaluating large language models. 👉 https://levelup.gitconnected.com/evaluating-large-language-models-a-developers-guide-ffd21a055feb

MMLU-Pro, released by TIGER-Lab on Hugging Face, continues these vital efforts by offering a more robust and challenging massive multi-task language understanding dataset! 🎉 😍

Evaluating LLMs is both crucial and challenging, especially with existing benchmarks like MMLU reaching saturation.

TL;DR: 📊

  • 📚 12K complex questions across various disciplines, with careful human verification
  • 🔒 Augmented to 10 options per question (instead of 4) to reduce random guessing (see the prompt-formatting sketch after this list)
  • 📊 56% of questions from MMLU, 34% from STEM websites, and the rest from TheoremQA and SciBench
  • 🔍 Performance drops without chain-of-thought reasoning, indicating a more challenging benchmark!
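Below is a minimal sketch of pulling the dataset and rendering one of these 10-option items as a chain-of-thought prompt. It assumes the Hugging Face `datasets` library and that each record exposes `question`, `options`, and `answer` fields; check the dataset card if the split or schema differs.

```python
# Sketch: load MMLU-Pro and format a 10-option multiple-choice prompt.
# The split name and field names (question, options, answer) are assumptions;
# verify them against the dataset card before relying on this.
from datasets import load_dataset

LETTERS = "ABCDEFGHIJ"  # up to 10 answer options per question

def format_prompt(example: dict) -> str:
    """Render one item as a chain-of-thought style multiple-choice prompt."""
    lines = [example["question"]]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(example["options"])]
    lines.append("Let's think step by step, then give a single answer letter.")
    return "\n".join(lines)

if __name__ == "__main__":
    ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
    item = ds[0]
    print(format_prompt(item))
    print("gold answer:", item["answer"])
```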

Results compared to MMLU (the drops are sanity-checked in the snippet after the links below)

  • 📉 GPT-4o drops by 17 points (from 0.887 to 0.7149)
  • 📉 Mixtral 8x7B drops by 31 points (from 0.714 to 0.404)
  • 📉 Llama-3-70B drops by 27 points (from 0.820 to 0.5541)
  1. Dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
  2. Leaderboard: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro#4-leaderboard
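The drops above are absolute differences between the two accuracy scores. A quick check of the arithmetic, with the numbers copied from the list:

```python
# Quick arithmetic check of the reported drops (accuracies copied from the
# results list above; drops are in absolute percentage points).
results = {
    "GPT-4o": (0.887, 0.7149),
    "Mixtral 8x7B": (0.714, 0.404),
    "Llama-3-70B": (0.820, 0.5541),
}
for model, (mmlu, mmlu_pro) in results.items():
    drop = (mmlu - mmlu_pro) * 100
    print(f"{model}: {mmlu:.3f} -> {mmlu_pro:.4f}  (drop = {drop:.0f} points)")
```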
This post is licensed under CC BY 4.0 by the author.