
Hugging Face x AWS

Deploy Llama 3 70B on AWS Inferentia2 with Hugging Face Optimum

Are you struggling with GPU access or availability and want to use Meta Llama 3 70B in your Amazon Web Services (AWS) environment? 🤔 Excited to share Meta's Llama 3 70B on AWS Inferentia2 using Hugging Face Optimum!

TL;DR: 📌

  • 🤗 Easy setup using Hugging Face Optimum and SageMaker SDK
  • 🔥 Deploy Llama 3 70B on inf2.48xlarge with Hugging Face TGI
  • ⚡ Create an interactive Gradio demo with streaming responses
  • 🔓 Leverage pre-compiled configurations for Llama 3 70B from Hugging Face Hub
  • ⏰ Benchmarked with llmperf: ~132.8 tokens/second throughput and 23.46 ms/token latency

That's not the limit! We are just getting started and are already improving performance and working on more supported models. 🤗



Lilys AI summary: https://lilys.ai/digest/681590

Deploy Llama 3 70B on AWS Inferentia2 with Hugging Face Optimum

1.๋ฉ”ํƒ€์˜ ์ตœ์‹  ์˜คํ”ˆ LLM ๋ชจ๋ธ, Llama 3์— ๋Œ€ํ•œ ๋‚ด์šฉ

  • 2024๋…„ 4์›” ๋ฐœํ‘œ๋œ Meta์˜ ์ตœ์‹  ์˜คํ”ˆ LLM์ธ Llama 3์€ 15์กฐ ํ† ํฐ์— ๋Œ€ํ•ด ํ›ˆ๋ จ๋˜์—ˆ์œผ๋ฉฐ 8์ฒœ ๊ฐœ ํ† ํฐ๊นŒ์ง€ ์ง€์›ํ•˜๋Š” ์ปจํ…์ŠคํŠธ ๊ธธ์ด ์ฐฝ์„ ๊ฐ€์ง„ ์šฐ์ˆ˜ํ•œ ์˜คํ”ˆ LLM ์ค‘ ํ•˜๋‚˜์ด๋‹ค.
  • Meta๋Š” ์ธ๊ฐ„ ํ”ผ๋“œ๋ฐฑ์— ๋Œ€ํ•œ ๊ฐ•ํ™” ํ•™์Šต์œผ๋กœ ๋Œ€ํ™”ํ˜• ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ–ˆ์œผ๋ฉฐ 1์ฒœ๋งŒ ๊ฐœ ์ด์ƒ์˜ ์ธ๊ฐ„ ์ฃผ์„์— ๋Œ€ํ•ด ์ ์šฉํ–ˆ๋‹ค.
  • ํ•ด๋‹น ๋ธ”๋กœ๊ทธ์—์„œ๋Š” AWS Inferentia2์— Hugging Face Optimum์„ ํ†ตํ•ด Meta-Llama-3-70B-Instruct ๋ชจ๋ธ์„ ๋ฐฐํฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค.
  • Hugging Face LLM Inf2 Container๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ AWS Inferentia2์— LLM์„ ์‰ฝ๊ฒŒ ๋ฐฐํฌํ•˜๋Š” ๋ฐฉ๋ฒ•, Text Generation Inference ๋ฐ Optimum Neuron์— ์˜ํ•ด ๊ตฌ๋™๋˜๋Š” ์ƒˆ๋กœ์šด ๋ชฉ์ ์ง€์› ์ถ”๋ก  ์ปจํ…Œ์ด๋„ˆ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.
  • ๋ธ”๋กœ๊ทธ์—์„œ๋Š” ๊ฐœ๋ฐœ ํ™˜๊ฒฝ ์„ค์ •, ์ƒˆ๋กœ์šด Hugging Face LLM Inf2 DLC ๊ฒ€์ƒ‰, Inferentia2์— Llama 3 70B ๋ฐฐํฌ, ๋ชจ๋ธ๋กœ ์ถ”๋ก  ๋ฐ ์ฑ„ํŒ…, llmperf๋ฅผ ํ†ตํ•œ Inferentia2์—์„œ llama 3 70B ๋ฒค์น˜๋งˆํ‚น, ์ฒญ์†Œ๊นŒ์ง€ ๋‹ค๋ฃฌ๋‹ค. ๐Ÿš€

2.๏ธAWS Inferentia 2 ์†Œ๊ฐœ

  • AWS Inferentia (Inf2)์€ ๋”ฅ๋Ÿฌ๋‹ ์ถ”๋ก  ์ž‘์—…์„ ์œ„ํ•œ ๋ชฉ์ ์œผ๋กœ ์„ค๊ณ„๋œ EC2์ž…๋‹ˆ๋‹ค.
  • Inferentia 2๋Š” AWS Inferentia์˜ ํ›„์† ์ œํ’ˆ์œผ๋กœ, ์ตœ๋Œ€ 4๋ฐฐ ๋” ๋†’์€ ์ฒ˜๋ฆฌ๋Ÿ‰ ๋ฐ ์ตœ๋Œ€ 10๋ฐฐ ๋‚ฎ์€ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • | ์ธ์Šคํ„ด์Šค ์‚ฌ์ด์ฆˆ | ๊ฐ€์†๊ธฐ | Neuron ์ฝ”์–ด | ๊ฐ€์†๊ธฐ ๋ฉ”๋ชจ๋ฆฌ | vCPU | CPU ๋ฉ”๋ชจ๋ฆฌ | ์˜จ๋””๋งจ๋“œ ๊ฐ€๊ฒฉ ($/์‹œ๊ฐ„) | | โ€” | โ€” | โ€” | โ€” | โ€” | โ€” | โ€” | | inf2.xlarge | 1 | 2 | 32 | 4 | 16 | 0.76 | | inf2.8xlarge | 1 | 2 | 32 | 32 | 128 | 1.97 | | inf2.24xlarge | 6 | 12 | 192 | 96 | 384 | 6.49 | | inf2.48xlarge | 12 | 24 | 384 | 192 | 768 | 12.98 | ์ถ”๊ฐ€๋กœ, Inferentia 2๋Š” C++์—์„œ ์‚ฌ์šฉ์ž ์ง€์ • ์—ฐ์‚ฐ์ž ๋ฐ
    1
    
    FP8
    
    (cFP8)๊ณผ ๊ฐ™์€ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์œ ํ˜•์„ ์ง€์›ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

3. Set up the development environment and SageMaker

  • The `sagemaker` Python SDK will be used to deploy Llama 3 to Amazon SageMaker.
  • You need a configured AWS account and the `sagemaker` Python SDK installed.
  • If you use SageMaker from a local environment, you need access to an IAM Role with the required permissions.
  • See here for more details on permissions.

4.๏ธ์ƒˆ๋กœ์šด ํ—ˆ๊น…ํŽ˜์ด์Šค LLM Inf2 DLC ๊ฒ€์ƒ‰

  • ์ƒˆ๋กœ์šด ํ—ˆ๊น…ํŽ˜์ด์Šค TGI Neuronx DLC๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ AWS Inferentia2์—์„œ ์ถ”๋ก ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • 1
    
    sagemaker
    
    SDK์˜
    1
    
    get_huggingface_llm_image_uri
    
    ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์›ํ•˜๋Š”
    1
    
    backend
    
    ,
    1
    
    session
    
    ,
    1
    
    region
    
    ,
    1
    
    version
    
    ์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์ ์ ˆํ•œ ํ—ˆ๊น…ํŽ˜์ด์Šค TGI Neuronx DLC URI๋ฅผ ๊ฒ€์ƒ‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ชจ๋“  ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฒ„์ „์€ ์—ฌ๊ธฐ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
# TODO: enable on release
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface-neuronx",
  version="0.0.22"
)

print(f"llm image uri: {llm_image}")

5. Deploy Llama 3 70B to Inferentia2

  • At inference time, AWS Inferentia2 does not support dynamic shapes, so the sequence length and batch size must be specified in advance.
  • To make the most of the power of Inferentia2, a neuron model cache was created with pre-compiled configurations for popular models, including Llama 3 70B.
  • This removes the need to compile the model yourself; pre-compiled models can be used straight from the cache.
  • Suitable configurations for Llama 3 70B can be found on the Hugging Face Hub; if none exists, you can compile one yourself with the Optimum CLI or open a request in the cache repository.
  • Before deploying Llama 3 70B to Inferentia2, the required TGI Neuronx endpoint configuration must be defined; the `inf2.48xlarge` instance type is recommended.
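The endpoint configuration described above amounts to a set of container environment variables that pin the static shapes. A minimal sketch, assuming the Hugging Face TGI Neuronx variable names; the concrete values are illustrative and should be adjusted to your model and instance:

```python
# Sketch of a TGI Neuronx endpoint configuration for inf2.48xlarge.
# The variable names follow the Hugging Face TGI Neuronx container;
# the concrete values here are illustrative assumptions.
instance_type = "ml.inf2.48xlarge"

config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3-70B-Instruct",
    "HF_NUM_CORES": "24",            # Neuron cores to shard the model across
    "HF_AUTO_CAST_TYPE": "fp16",     # dtype used for the compiled model
    "MAX_BATCH_SIZE": "4",           # static batch size (no dynamic shapes)
    "MAX_INPUT_LENGTH": "4000",      # static maximum prompt length
    "MAX_TOTAL_TOKENS": "4096",      # static prompt + generation length
    "MESSAGES_API_ENABLED": "true",  # expose the chat-style Messages API
}

# Inferentia2 compiles for fixed shapes, so the total token budget
# must cover the prompt plus the generated tokens.
assert int(config["MAX_INPUT_LENGTH"]) < int(config["MAX_TOTAL_TOKENS"])
print(config["HF_MODEL_ID"])
```

Because the shapes are baked in at compile time, requests longer than `MAX_INPUT_LENGTH` would be rejected rather than handled dynamically.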

6.๏ธ๋ชจ๋ธ ์ถ”๋ก  ๋ฐ ์ฑ„ํŒ… ์‹คํ–‰

  • ๋ฐฐํฌ๋œ ์—”๋“œํฌ์ธํŠธ์—์„œ ์ถ”๋ก ์„ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์„ธ๋ฐ€ํ•˜๋‹ค.
  • ๋ฉ”์‹œ์ง€ API๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ชจ๋ธ๊ณผ ๋Œ€ํ™”์‹์œผ๋กœ ์ƒํ˜ธ ์ž‘์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
  • 1
    
    system
    
    ,
    1
    
    assistant
    
    ,
    1
    
    user
    
    ๊ฐ€ ๋ฉ”์‹œ์ง€ ์—ญํ• ๋กœ ์ •์˜๋œ๋‹ค.
  • ๋ชจ๋ธ์—์„œ ๋ฐ›์€ ์‘๋‹ต์„ ๊ทธ๋ผ๋””์˜ค ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์— ์ŠคํŠธ๋ฆฌ๋ฐํ•˜์—ฌ ์‚ฌ์šฉ์ž ๊ฒฝํ—˜์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.
  • ๊ทธ๋ผ๋””์˜ค ์•ฑ
    1
    
    share=True
    
    ๋ฅผ ํ†ตํ•ด 72์‹œ๊ฐ„ ๋™์•ˆ ๋ชจ๋ธ์„ ํ…Œ์ŠคํŠธํ•˜๊ณ  ๊ณต์œ ํ•  ์ˆ˜ ์žˆ๋‹ค.
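Since the Messages API follows the OpenAI chat schema, a request body is just a list of role-tagged messages. A minimal sketch of building such a payload; the model id, prompt, and generation parameters are illustrative assumptions:

```python
import json

# Build a chat request for the Messages API (OpenAI-compatible schema).
# Model id and generation parameters are illustrative assumptions.
payload = {
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AWS Inferentia2?"},
    ],
    "max_tokens": 256,
    "stream": True,  # stream tokens back, e.g. into a Gradio app
}

body = json.dumps(payload)  # this JSON is sent to the endpoint
roles = [m["role"] for m in payload["messages"]]
print(roles)  # → ['system', 'user']
```

An `assistant` message from a previous turn can be appended to `messages` to continue a multi-turn conversation.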

7. Evaluate and benchmark Llama 3 70B on AWS Inferentia2

  • Llama 3 70B was successfully deployed to Amazon SageMaker and tested.
  • Next, the model is benchmarked to check its performance.
  • A fork of llmperf with `sagemaker` support will be used.
  • After installing the `llmperf` package, the benchmark is run with `5` concurrent users and a maximum of `50` requests.

8. Important:
For an accurate measurement of the `first-time-to-token` value, the benchmark should be run from the same host or in the production region.

  • The benchmark measures `first-time-to-token`, `latency (ms/token)`, and `throughput (tokens/s)`.
  • Detailed results are available in the `results` folder.
  • This benchmark was started from Europe while the endpoint runs in us-east-1; since the measurement includes the network round trip, this significantly inflates the `first-time-to-token` value.

9. Configure the Python LLM performance test run

  • Configure the Messages API and tell `llmperf` that the Messages API is in use.
  • When running the `token_benchmark_ray.py` script, the configuration specifies the model name, the `sagemaker` LLM API, a maximum of 50 completed requests, a 600-second time limit, 5 concurrent requests, and `results` as the output directory.
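Assembled as a command line, the configuration above could look like the following sketch. The flag names mirror llmperf's `token_benchmark_ray.py`; the `MESSAGES_API_ENABLED` variable and the model id are assumptions based on the setup described here:

```python
import os

# Tell the llmperf fork that the endpoint speaks the Messages API
# (assumed environment variable name, matching the setup above).
os.environ["MESSAGES_API_ENABLED"] = "true"

# Arguments for llmperf's token_benchmark_ray.py, mirroring the
# configuration in the text; values are illustrative assumptions.
benchmark_args = [
    "python", "token_benchmark_ray.py",
    "--model", "meta-llama/Meta-Llama-3-70B-Instruct",
    "--llm-api", "sagemaker",
    "--max-num-completed-requests", "50",
    "--timeout", "600",
    "--num-concurrent-requests", "5",
    "--results-dir", "results",
]

print(" ".join(benchmark_args))
```

In practice this command line would be run from a shell (or via `subprocess.run(benchmark_args)`) on a host with the llmperf fork installed.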

10. Parse and display the results

  • The results are parsed and displayed cleanly.
  • The summary.json file is read and the results are printed.
  • With 5 concurrent requests, the average input token length, average output token length, average time to first token, average throughput, and average latency are printed.
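Parsing the summary boils down to plain JSON handling. In this sketch the summary file and its metric key names are stand-ins (check them against what your llmperf run actually writes into `results`):

```python
import glob
import json
import os
import tempfile

# Write a tiny stand-in summary file so the sketch is self-contained;
# in practice llmperf writes *_summary.json into the results directory.
# The key names below are assumptions, not llmperf's guaranteed schema.
results_dir = tempfile.mkdtemp()
sample = {
    "number_input_tokens_mean": 550,
    "number_output_tokens_mean": 150,
    "ttft_s_mean": 1.2,
    "throughput_token_per_s_mean": 132.8,
    "inter_token_latency_s_mean": 0.02346,
}
with open(os.path.join(results_dir, "demo_summary.json"), "w") as f:
    json.dump(sample, f)

# Read the first summary file and print the headline metrics.
summary_path = glob.glob(os.path.join(results_dir, "*summary.json"))[0]
with open(summary_path) as f:
    summary = json.load(f)

throughput = summary["throughput_token_per_s_mean"]
latency_ms = summary["inter_token_latency_s_mean"] * 1000
print(f"throughput: {throughput} tokens/s, latency: {latency_ms:.2f} ms/token")
```

With the sample values above this prints the post's headline numbers: ~132.8 tokens/s and 23.46 ms/token.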

11. Benchmark results for Llama 3 70B on AWS Inferentia2

  • Results for generating 150 tokens with 5 concurrent requests
  • Llama 3 70B was successfully tested and benchmarked on AWS Inferentia2
  • The benchmark is not a complete picture of the model's performance, but it provides a good first indication
  • For production use, a longer benchmark adapted to the production workload is recommended, testing the model with different numbers of replicas
This post is licensed under CC BY 4.0 by the author.