The giant leaps of open-source models for Vision Models

Vision language models

Andrew Reed built a cool space showing that open-source (OS) LLMs are catching up with closed-source LLMs in Elo rating in the Arena (link below). The same dynamic is playing out for vision: the field is still evolving fast, but soon OS models will be able to match GPT-4o's vision skills.

I witnessed the Idefics team's work and their many late nights before they released Idefics-2-8b. Now they have published a paper that summarizes their insights!

Here's a summary of what they found:

➤ Performance of VLMs is largely driven by the performance of their text-only backbones. In ablation studies, replacing llama-1-7b with Mistral-7b directly brings +7% performance 🤯

➤ They compared two competing architectures:

  • 🔀 Cross-attention architecture: images are encoded through the vision backbone, and their information is injected into the text processing at various places
  • 🔢 Fully autoregressive architecture: the output of the vision encoder is directly concatenated to the sequence of text embeddings, and the entire sequence is passed as input to the LM (cf. image)

The outcome of the comparison ⇒ the fully autoregressive architecture outperforms the cross-attention architecture when you fine-tune the whole system using LoRA
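To make the difference concrete, here is a toy sketch of where the image information ends up in each design. It is only an illustration under simplifying assumptions: tokens are plain strings, the function names are hypothetical and not taken from the Idefics-2 code, and the image tokens are simply placed before the text.

```python
# Toy illustration only: tokens are plain strings, and the function names
# are hypothetical, not taken from the Idefics-2 codebase.

def fully_autoregressive_inputs(image_tokens, text_tokens):
    """Concatenate image tokens with text tokens into one sequence that
    the language model processes like ordinary input."""
    return {
        "lm_sequence": list(image_tokens) + list(text_tokens),
        "cross_attention_memory": [],
    }


def cross_attention_inputs(image_tokens, text_tokens):
    """Keep the LM sequence text-only; image tokens are set aside and read
    by cross-attention blocks placed at various points in the LM."""
    return {
        "lm_sequence": list(text_tokens),
        "cross_attention_memory": list(image_tokens),
    }


image_tokens = [f"<img_{i}>" for i in range(64)]  # 64 vision tokens, as in the post
text_tokens = ["Describe", "this", "picture"]

fa = fully_autoregressive_inputs(image_tokens, text_tokens)
ca = cross_attention_inputs(image_tokens, text_tokens)

print(len(fa["lm_sequence"]))  # 67: image and text share one sequence
print(len(ca["lm_sequence"]))  # 3: the LM sequence stays text-only
```

In the fully autoregressive case the LM sequence grows with every vision token, which is why the number of tokens per image directly affects compute.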

➡️ These findings led to several architectural improvements in Idefics-2:

➤ Replaced the cross-attention architecture with a fully autoregressive architecture

➤ Enabled handling images with varying aspect ratios

➤ Made it possible to split an image in 4, so that it is encoded on 320 vision tokens instead of 64, if you want to increase performance at the cost of more compute
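The jump from 64 to 320 tokens is easier to follow as arithmetic. The sketch below assumes that the four crops are encoded alongside the original image, each at 64 tokens, which is the reading consistent with the 320 figure (5 × 64). The function and parameter names are hypothetical.

```python
TOKENS_PER_IMAGE = 64  # vision tokens per encoded image, as stated in the post


def vision_token_count(split: bool, n_crops: int = 4, keep_original: bool = True) -> int:
    """Return the number of vision tokens spent on one input image.

    Assumption: when splitting, the original image is encoded in addition
    to the crops, and every encoded image costs TOKENS_PER_IMAGE tokens.
    """
    if not split:
        return TOKENS_PER_IMAGE
    n_images = n_crops + (1 if keep_original else 0)
    return n_images * TOKENS_PER_IMAGE


print(vision_token_count(split=False))  # 64
print(vision_token_count(split=True))   # 320
```

Splitting therefore multiplies the vision-token budget per image by five, which is the extra compute mentioned above.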

✨ As a result, Idefics-2 reaches state-of-the-art performance for this model size! Now just a few more steps to catch up to GPT-4o!

Congrats on this great release, Léo Tronchon, Hugo Laurençon, and Victor Sanh! 👏

👉 Read the Idefics-2 paper: https://huggingface.co/papers/2405.02246

🚀 Andrew's space that shows OS models catching up (for text models): https://huggingface.co/spaces/andrewrreed/closed-vs-open-arena-elo

⚔️ Compare vision models in the Vision arena: https://huggingface.co/spaces/WildVision/vision-arena

This post is licensed under CC BY 4.0 by the author.