
How to Evaluate an LLM?

LLM evaluation frameworks & tools every AI/ML engineer should know.

LLM evaluation frameworks

LLM evaluation frameworks and tools are important because they provide standardized benchmarks to measure and improve the performance, reliability and fairness of language models.

It is also very important to have metrics in place for evaluating LLMs. These metrics act as scoring mechanisms that assess an LLM's outputs against given criteria.
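
As a concrete (if deliberately simple) illustration of such a scoring mechanism, here is a minimal sketch of an exact-match accuracy metric. The function name and normalization are purely illustrative assumptions; real evaluation frameworks ship much richer metrics (semantic similarity, LLM-as-a-judge, task-specific scorers, and so on).

```python
# Minimal sketch of a scoring mechanism: exact-match accuracy.
# Illustrative only -- the normalization (strip + lowercase) is an assumption,
# not how any particular framework defines the metric.

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Score each output 1 if it matches its reference after normalization, else 0."""
    assert len(predictions) == len(references) and references
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)

# Example: three model answers scored against gold answers.
preds = ["Paris", "4", "blue whale"]
golds = ["paris", "four", "Blue whale"]
print(exact_match_accuracy(preds, golds))  # 0.666... ("4" does not match "four")
```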

Here is my article on evaluating large language models: 👉 https://levelup.gitconnected.com/evaluating-large-language-models-a-developers-guide-ffd21a055feb

MMLU-Pro, released by TIGER-Lab on Hugging Face, continues these vital efforts by offering a more robust and challenging massive multi-task language understanding dataset! 🎉

Evaluating LLMs is both crucial and challenging, especially with existing benchmarks like MMLU reaching saturation.

TL;DR: 📊

  • 📚 12K complex questions across various disciplines, with careful human verification
  • 🔢 Augmented to 10 answer options per question (instead of 4) to reduce random guessing (see the loading sketch after this list)
  • 📊 56% of questions from MMLU, 34% from STEM websites, and the rest from TheoremQA and SciBench
  • 🔍 Performance drops without chain-of-thought reasoning, indicating a more challenging benchmark!
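
To make that structure concrete, here is a hedged sketch of loading MMLU-Pro from the Hugging Face Hub and rendering one question with its 10 options. The split and field names ("question", "options", "answer") are assumptions based on the dataset card linked below; verify them against the actual schema before relying on this.

```python
# Hedged sketch: load MMLU-Pro and format a 10-option multiple-choice prompt.
# Assumes the `datasets` library is installed and that the test split exposes
# "question", "options", and "answer" fields (check the dataset card).
from datasets import load_dataset
from string import ascii_uppercase

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def format_prompt(example: dict) -> str:
    """Render one question with its answer options labeled A, B, C, ... J."""
    lines = [example["question"]]
    for letter, option in zip(ascii_uppercase, example["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Think step by step, then answer with a single letter:")
    return "\n".join(lines)

example = mmlu_pro[0]
print(format_prompt(example))
print("Gold answer:", example["answer"])  # e.g. "B"
```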

Results compared to MMLU

  • 📉 GPT-4o drops by 17% (from 0.887 to 0.7149)
  • 📉 Mixtral 8x7B drops by 31% (from 0.714 to 0.404)
  • 📉 Llama-3-70B drops by 27% (from 0.820 to 0.5541)
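
The drops quoted above appear to be absolute percentage points (MMLU score minus MMLU-Pro score); a quick check of the arithmetic, using only the numbers reported above:

```python
# Sanity-check the reported MMLU -> MMLU-Pro drops (absolute percentage points).
scores = {
    "GPT-4o":       (0.887, 0.7149),
    "Mixtral 8x7B": (0.714, 0.404),
    "Llama-3-70B":  (0.820, 0.5541),
}
for model, (mmlu, mmlu_pro) in scores.items():
    drop = (mmlu - mmlu_pro) * 100
    print(f"{model}: {drop:.1f}-point drop")
# GPT-4o: 17.2, Mixtral 8x7B: 31.0, Llama-3-70B: 26.6 -- matching the ~17/31/27 figures.
```
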
  1. Dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
  2. Leaderboard: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro#4-leaderboard
This post is licensed under CC BY 4.0 by the author.