Alibaba presents FunAudioLLM

Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs).

At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity.

SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, while SenseVoice-Large supports high-precision ASR for over 50 languages. CosyVoice excels at multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following.

The SenseVoice and CosyVoice models have been open-sourced on ModelScope and Hugging Face, and the corresponding training, inference, and fine-tuning code has been released on GitHub.

By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology.
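To make the integration idea concrete, here is a minimal sketch of how a speech-to-speech translation pipeline could be wired together: a recognizer produces a transcript, an LLM translates it, and a synthesizer speaks the result in the original speaker's voice. Everything named here (`Transcript`, `speech_to_speech_translate`, and the three stand-in functions) is hypothetical and chosen for illustration. None of it is the actual SenseVoice, CosyVoice, or LLM API, so the real calls would have to be substituted from the released code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    """Hypothetical container for what a speech-understanding model returns."""
    text: str
    language: str
    emotion: str


def speech_to_speech_translate(
    audio: bytes,
    target_language: str,
    recognize: Callable[[bytes], Transcript],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str, bytes], bytes],
) -> bytes:
    """Chain recognition, LLM translation, and synthesis into one call."""
    # Step 1: speech understanding (the role SenseVoice plays in the report).
    transcript = recognize(audio)
    # Step 2: text-to-text translation by an LLM.
    translated = translate(transcript.text, transcript.language, target_language)
    # Step 3: speech generation (the role CosyVoice plays), passing the input
    # audio as a voice reference so the output could keep the speaker's timbre.
    return synthesize(translated, target_language, audio)


# Stand-in components so the sketch runs without any model weights.
def fake_recognize(audio: bytes) -> Transcript:
    return Transcript(text="hello", language="en", emotion="neutral")


def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_synthesize(text: str, language: str, reference_audio: bytes) -> bytes:
    return text.encode("utf-8")


result = speech_to_speech_translate(
    b"\x00\x01", "ko", fake_recognize, fake_translate, fake_synthesize
)
print(result)
```

Passing the three stages in as plain callables keeps the pipeline independent of any particular model, so the same structure could serve the other applications mentioned above, such as emotional voice chat, by also forwarding `transcript.emotion` to the LLM prompt.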

This post is licensed under CC BY 4.0 by the author.