All about LLM evaluation.

Evalverse: The Universe of LLM Evaluation

Curiosity: How do we systematically evaluate LLM performance across diverse tasks? What benchmarks and metrics provide meaningful insights into model capabilities?

Evalverse is an evaluation platform that aggregates LLM benchmarks, model comparisons, and performance metrics in one place. It is updated regularly, so it offers an up-to-date view of the LLM landscape.

Evalverse Overview

graph TB
    A[Evalverse Platform] --> B[Weekly Scores]
    A --> C[Benchmarks]
    A --> D[Models]
    A --> E[Leaderboards]
    
    B --> B1[Arena Elo]
    C --> C1[AlpacaEval 2.0]
    C --> C2[MMLU-Pro]
    D --> D1[GPT-4o]
    D --> D2[Latest Models]
    E --> E1[Full Rankings]
    
    style A fill:#e1f5ff
    style B fill:#fff3cd
    style C fill:#d4edda
    style D fill:#f8d7da
    style E fill:#e7d4f8

Latest Updates (May 17, 2024)

| Category | Updates | Details |
|---|---|---|
| Weekly Scores | Arena Elo rankings | 240515, 240508, 240501 |
| New Benchmarks | AlpacaEval 2.0, MMLU-Pro | Human preference correlation |
| New Models | 6 models added | GPT-4o, Grok-1, OpenELM, etc. |
| New Features | Full leaderboard tab | Comprehensive rankings |

New Benchmarks

Retrieve: Benchmarks highly correlated with human preferences provide more meaningful evaluation.

1. LC-AlpacaEval 2.0

| Aspect | Details | Source / Notes |
|---|---|---|
| Release Date | April 6, 2024 | arXiv |
| Focus | Length-controlled (LC) win rates | Reduces bias toward longer outputs |
| Correlation | High with human preferences | Reliable metrics |

2. MMLU-Pro

| Aspect | Details | Source / Notes |
|---|---|---|
| Release Date | May 15, 2024 | HuggingFace |
| Focus | Advanced reasoning | Challenging tasks |
| Correlation | High with Arena Elo | Human-aligned |
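
The "Correlation" rows above describe how closely a benchmark's ordering of models matches a human-preference ordering such as Arena Elo. One common way to quantify that agreement is Spearman rank correlation. The sketch below is illustrative only: the scores are made-up placeholders rather than Evalverse data, and the helper names `average_ranks` and `spearman` are hypothetical, not part of any Evalverse API.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five placeholder models (NOT real leaderboard data).
benchmark_scores = [71.2, 65.0, 58.4, 49.9, 62.3]
arena_elo = [1250, 1160, 1120, 1105, 1190]

print(round(spearman(benchmark_scores, arena_elo), 3))
```

A value near 1 means the benchmark orders models almost the same way the human-preference ranking does; a value near 0 means the two orderings are largely unrelated.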

New Models Added

| Model | Provider | Key Features |
|---|---|---|
| GPT-4o-0513 | OpenAI | Multimodal, real-time |
| Grok-1 | xAI | Large-scale reasoning |
| OpenELM | Apple | Efficient, open-source |
| Qwen-Max-0428 | Alibaba | Multilingual, large-scale |
| Snowflake-Arctic-Instruct | Snowflake | Enterprise-focused |
| Yi-Large | 01.AI | High performance |

Evaluation Metrics

graph LR
    A[Evaluation Metrics] --> B[Arena Elo]
    A --> C[Task-Specific]
    A --> D[Human Preference]
    
    B --> B1[Head-to-Head]
    C --> C1[MMLU]
    C --> C2[HumanEval]
    D --> D1[AlpacaEval]
    
    style A fill:#e1f5ff
    style B fill:#fff3cd
    style C fill:#d4edda
    style D fill:#f8d7da
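
To make the "Arena Elo" and "Head-to-Head" nodes concrete, here is a minimal sketch of the classic Elo update applied to pairwise model battles. It assumes the conventional 400-point scale and a K-factor of 32; the model names and battle outcomes are placeholders, and the leaderboard's actual rating computation may differ from this textbook version.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, outcome_a, k=32.0):
    """Return updated (rating_a, rating_b) after one battle.

    outcome_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k=32 is a conventional choice, assumed here for illustration.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical battle log: (model_a, model_b, outcome for model_a).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
battles = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]

for a, b, outcome in battles:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)

print(ratings)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating in the pool stays constant; only the gap between the models changes as human votes accumulate.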

Why Evalverse Matters

Retrieve: Evalverse provides:

  • Centralized evaluation hub
  • Regular updates with the latest models
  • Multiple benchmark perspectives
  • Human preference correlation

Innovate: By using Evalverse, you can:

  • Compare models systematically
  • Track performance trends
  • Make informed model choices
  • Understand the evaluation landscape
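
As one way to "compare models systematically" across several benchmarks, the sketch below min-max normalizes each benchmark to the range 0 to 1 and then takes a weighted average per model. Everything in it is an assumption for illustration: the model names, scores, weights, and the `rank_models` helper are hypothetical, and it assumes every model has a score on every benchmark.

```python
def min_max_normalize(scores):
    """Rescale one benchmark's scores to [0, 1] so different scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models scored the same; give each a neutral value.
        return {model: 0.5 for model in scores}
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}


def rank_models(benchmarks, weights):
    """Return (model, weighted score) pairs sorted from best to worst."""
    total_weight = sum(weights.values())
    totals = {}
    for name, scores in benchmarks.items():
        for model, value in min_max_normalize(scores).items():
            share = weights[name] / total_weight
            totals[model] = totals.get(model, 0.0) + share * value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores (NOT real leaderboard data).
benchmarks = {
    "arena_elo": {"model_a": 1250, "model_b": 1180, "model_c": 1100},
    "mmlu_pro": {"model_a": 60.0, "model_b": 66.0, "model_c": 48.0},
}
# Example weighting: human preference counts twice as much as MMLU-Pro.
weights = {"arena_elo": 2.0, "mmlu_pro": 1.0}

for model, score in rank_models(benchmarks, weights):
    print(f"{model}: {score:.3f}")
```

The weights encode what matters for a given use case, so changing them (for example, weighting a reasoning benchmark more heavily) can legitimately change which model comes out on top.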

Access Evalverse

👉 HuggingFace Space: https://huggingface.co/spaces/upstage/evalverse-space

Features:

  • Interactive leaderboards
  • Model comparisons
  • Benchmark details
  • Weekly score tracking

Key Takeaways

Retrieve: Evalverse provides a comprehensive, regularly updated view of LLM evaluation, aggregating benchmarks, model scores, and human preference metrics.

Innovate: By leveraging Evalverse's centralized evaluation data, you can make informed decisions about model selection and understand the current state of LLM capabilities.

Curiosity → Retrieve → Innovation: Start with curiosity about model performance, retrieve insights from Evalverse's comprehensive data, and innovate by selecting the best models for your specific use cases.

Evalverse Chart

Original Announcement (translated from Korean)

The Space of Evaluation (Evalverse)

🔥 evalverse-space (updated May 17, 2024)

New features

  • Weekly scores: Arena Elo (240515), Arena Elo (240508), Arena Elo (240501)
  • New benchmarks: AlpacaEval 2.0, MMLU-Pro
  • New models: GPT-4o-0513, Grok-1, OpenELM, Qwen-Max-0428, Snowflake-Arctic-Instruct, Yi-Large
  • New tab: 🏆 Full leaderboard

New benchmarks that correlate highly with human preference (Arena Elo)

  • LC-AlpacaEval 2.0 (April 6) https://lnkd.in/gqmzrjyb
  • MMLU-Pro (May 15) https://lnkd.in/gNeh2RHP

Check them all out together on Evalverse-space.

  • 👉 HuggingFace Space: https://lnkd.in/gR75pHfC

This post is licensed under CC BY 4.0 by the author.