
All about evaluation for LLMs.

The Universe of Evaluation (Evalverse)

🔥 Evalverse-space (Updated on May 17, 2024)

What's new?

  • Weekly scores: Arena Elo (240515), Arena Elo (240508), Arena Elo (240501)
  • New benchmarks: AlpacaEval 2.0, MMLU-Pro
  • New models: GPT-4o-0513, Grok-1, OpenELM, Qwen-Max-0428, Snowflake-Arctic-Instruct, Yi-Large
  • New tab: 🏆 Full leaderboard

New benchmarks highly correlated with human preferences (Arena Elo)

  • LC-AlpacaEval 2.0 (April 6): https://lnkd.in/gqmzrjyb
  • MMLU-Pro (May 15): https://lnkd.in/gNeh2RHP

Check all of this out together in Evalverse-space.

  • 👉 Hugging Face Space: https://lnkd.in/gR75pHfC

 Evalverse Chart
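
The post describes the new benchmarks as highly correlated with human preferences (Arena Elo) but does not say how that correlation is measured. One common choice is Spearman rank correlation, which compares how two score lists order the same models. The sketch below assumes that measure. The model names and scores are hypothetical placeholders, not values from Evalverse-space or any leaderboard.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical placeholder scores for illustration only (not real results).
arena_elo = {"model_a": 1250, "model_b": 1200, "model_c": 1150, "model_d": 1100}
benchmark = {"model_a": 70.0, "model_b": 55.0, "model_c": 60.0, "model_d": 40.0}

models = sorted(arena_elo)
rho = spearman([arena_elo[m] for m in models], [benchmark[m] for m in models])
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the benchmark orders models almost the same way as Arena Elo does. With only a handful of models the estimate is noisy, so it is best read alongside the full leaderboard.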
This post is licensed under CC BY 4.0 by the author.