
๐—ฅ๐—ผ๐˜‚๐˜๐—ฒ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—พ๐˜‚๐—ฒ๐—ฟ๐˜† ๐˜๐—ผ ๐—ฎ ๐˜€๐—บ๐—ฎ๐—น๐—น๐—ฒ๐—ฟ ๐—Ÿ๐—Ÿ๐—  ๐˜„๐—ต๐—ฒ๐—ป ๐—ฝ๐—ผ๐˜€๐˜€๐—ถ๐—ฏ๐—น๐—ฒ

RouteLLM ⇒ **cut 50% of cost** ✂️

The LMSYS team maintains Chatbot Arena, a great evaluation system based on thousands of matches: when a user submits a query, they receive answers from two hidden models A and B, and vote between the two. This preference data allows them to build an Elo ranking, which is a great indicator of model strength.

The team found another great use for the preference data they gathered: training a router that sends user queries to the most appropriate model.

The main idea is that many queries do not require a strong model: for instance, "summarize this paragraph in 1 sentence" can be solved very well by a small model like Llama-3-8B, which is orders of magnitude cheaper to run than the usual behemoths. If you manage to selectively route all easy queries to the smaller LLM, you can save a lot on costs with minimal performance reduction (a few queries will be poorly answered due to mis-routing).
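As a toy illustration of the idea (not the paper's method: RouteLLM learns its router from preference data, whereas the difficulty score below is a hand-written stand-in):

```python
# Toy cost-aware router: send a query to the weak model unless an
# estimated difficulty score exceeds a threshold.
# NOTE: this heuristic is illustrative only; RouteLLM trains routers
# on human preference data, not hand-written rules.

STRONG, WEAK = "gpt-4", "mixtral-8x7b"

def difficulty(query: str) -> float:
    """Crude stand-in for a learned win-probability predictor."""
    hard_markers = ("prove", "derive", "step by step", "optimize")
    score = 0.2 + 0.1 * min(len(query.split()) / 50, 1.0)
    score += 0.4 * any(m in query.lower() for m in hard_markers)
    return min(score, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    return STRONG if difficulty(query) >= threshold else WEAK

print(route("Summarize this paragraph in 1 sentence."))        # mixtral-8x7b
print(route("Prove that sqrt(2) is irrational, step by step."))  # gpt-4
```

The interesting design question is the threshold: raising it routes more queries to the cheap model, trading answer quality for cost.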

So the team set out to train a router that, given a query, chooses the most appropriate LLM to answer it, between a strong/expensive one and a weak/cheap one.

๐Ÿ› ๏ธ Create a router between GPT-4 (strong model) and Mixtral-8x7B (small model)

๐Ÿ”ข Use preference data from 80k labels

  • โ†’ Augment this with gold preference data for specific benchmarks
  • โ†’ Define custom metrics to measure perf gain from routing
  • โ†’ Test on MT-Bench, GSM8k, and MMLU

💥 Achieve 95% of GPT-4 performance on MT-Bench while cutting costs by over 2x

✨ Overhead costs are minimal: even the most expensive routing method adds under 0.4% of GPT-4's generation cost
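The ~2x figure is easy to sanity-check with back-of-envelope arithmetic (the price ratio and routing fraction below are assumptions for illustration, not numbers from the paper):

```python
# Back-of-envelope cost check with assumed relative prices:
# say the strong model costs ~25x the weak one per query, and the
# router sends half of all queries to the weak model.
strong_cost = 25.0   # relative cost per query, strong model
weak_cost = 1.0      # relative cost per query, weak model
frac_strong = 0.5    # fraction of queries routed to the strong model

routed = frac_strong * strong_cost + (1 - frac_strong) * weak_cost
savings = strong_cost / routed
print(f"cost reduction vs. all-GPT-4: {savings:.1f}x")  # → 1.9x
```

With a larger price gap or a smarter router that sends fewer queries to the strong model, the savings grow quickly.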

🧂 Grain of salt: MT-Bench is the benchmark where this method performs best, and introducing "gold data" from the benchmark probably biased results upwards. So the "95% performance for 2x cost reduction" figure will likely be less impressive in a real-world setting.

  • ๐™๐™š๐™–๐™™ ๐™ฉ๐™๐™š ๐™ฅ๐™–๐™ฅ๐™š๐™ง ๐™๐™š๐™ง๐™š ๐Ÿ‘‰ https://huggingface.co/papers/2406.18665
  • ๐˜พ๐™ค๐™™๐™š ๐™ง๐™š๐™ฅ๐™ค ๐™ž๐™จ ๐™๐™š๐™ง๐™š (already 1.7k stars) ๐Ÿ‘‰ https://github.com/lm-sys/RouteLLM


This post is licensed under CC BY 4.0 by the author.