Accelerating Large-scale AI Models with vLLM | LLM Acceleration

2025. 12. 16. 00:08 · Engineering/Distributed Training & Inference

See the link for the experiment code and detailed results: https://github.com/ldj7672/Vision-AI-Tutorials/tree/main/inference_acceleration

 


 

1. Overview

Large-scale AI models commonly share a Transformer-based architecture with billions of parameters, and during inference the attention computation and KV cache management act as the main bottlenecks. In particular, GPU memory usage and inference latency rise sharply as the number of requests grows or the context gets longer.
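
To make the memory growth concrete, here is a rough sizing sketch. The formula counts one key and one value vector per token, per layer, per KV head. The layer count, head count, and head dimension below are illustrative assumptions for a 7B-class model, not values taken from this post; check the model's own config before relying on them.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem, num_requests=1):
    """Bytes of KV cache: a key and a value vector per token, layer, and KV head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens * num_requests


# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_elem=2)  # 2 bytes = FP16/BF16

one_request = kv_cache_bytes(num_tokens=32_768, **cfg)
eight_requests = kv_cache_bytes(num_tokens=32_768, num_requests=8, **cfg)

print(one_request / 2**30, "GiB for one 32K-token request")        # 1.75
print(eight_requests / 2**30, "GiB for eight concurrent requests")  # 14.0
```

The cache grows linearly in both context length and the number of concurrent requests, which is why it competes with the model weights for GPU memory.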

 

vLLM is a high-performance inference engine designed to address these bottlenecks in large-model inference. It substantially improves inference efficiency through system-level optimizations such as PagedAttention, continuous batching, and CUDA Graph. This post summarizes experiments that accelerate inference of a large-scale AI model with vLLM and analyzes which settings and mechanisms actually contribute to the performance gains.

 

The experiments target Qwen2-VL-7B-Instruct, a Vision-Language Model (VLM), and compare Hugging Face Transformers-based inference with vLLM-based inference. The acceleration characteristics and insights observed here also apply to large LLMs and multimodal models in general.

 

 

2. Accelerating Large-scale Models with vLLM

2.1 PagedAttention and KV Cache Management

vLLM's most important differentiator is its PagedAttention-based KV cache management. The conventional Transformers inference path allocates the KV cache for each request in a contiguous memory region. This is simple to implement, but GPU memory fragments quickly as the number of requests grows or as sequence lengths vary.

 

vLLM manages the KV cache by splitting it into pages. The KV entries of a sequence are not kept in one large contiguous buffer; they are spread across multiple pages that are connected only logically. This brings the following structural advantages.

  • Less GPU memory fragmentation and higher memory utilization
  • Flexible allocation and reclamation of KV cache across requests
  • Stable memory usage even with sequences of different lengths

This structure goes beyond simply saving memory: it is a key factor in sustaining inference performance when many requests are served concurrently. The benefit of PagedAttention-based KV cache management is even more pronounced when the number of input tokens is variable or grows sharply, as with multimodal models.
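
The page-based bookkeeping described above can be sketched with a toy allocator. This is a simplified illustration of the idea, not vLLM's implementation; the class name, method names, and page size are all hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: maps each sequence to a list of fixed-size physical pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids (logical order)
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; take a new page only when the last one is full."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 pages
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 page
alloc.free("a")              # pages of "a" become reusable immediately
for _ in range(40):
    alloc.append_token("c")  # 40 tokens need 3 pages, which need not be adjacent
print(alloc.block_tables)
```

Because a sequence only ever wastes the unused tail of its last page, memory freed by one request can be reused by any other request regardless of length.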

 

2.2 Continuous Batching

Conventional LLM/VLM inference pipelines are often designed around static batching. In that case new requests must wait until the current batch finishes, which leaves the GPU idle.

 

vLLM uses continuous batching. Instead of waiting for the running batch to finish, it adds newly arrived requests to the running batch right away. This has the following effects.

  • Lower request waiting time (latency)
  • Maximized GPU utilization
  • Efficient parallel processing even when request lengths differ

This matters especially for large-model inference. The larger the model, the more expensive a single inference is, so even brief GPU idle time has a large effect on overall system throughput. vLLM's continuous batching removes this inefficiency by design.
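
The difference between the two scheduling policies can be illustrated with a small simulation that counts decoding steps. It is a deliberately simplified model (one token per request per step, a fixed number of batch slots, all requests already queued) and not vLLM's scheduler; the function names and request lengths are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; only then is the next batch admitted."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every decoding step, finished requests leave and waiting requests fill free slots."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding step: every running request produces one token.
        running = [remaining - 1 for remaining in running if remaining - 1 > 0]
        steps += 1
    return steps


# Output lengths (in tokens) of eight queued requests, alternating short and long.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
print(static_batching_steps(lengths, batch_size=2))      # 32
print(continuous_batching_steps(lengths, batch_size=2))  # 22
```

With static batching every short request holds its slot until the long request in the same batch finishes; with continuous batching that slot is refilled on the next step.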

 

2.3 CUDA Graph

Inference with large Transformer models executes the same computation pattern over and over. The kernel launch overhead between CPU and GPU that this incurs becomes a cost that is hard to ignore.

 

vLLM uses CUDA Graph to capture the CUDA operations that repeat during inference as a graph and then replays it. This achieves the following optimizations.

  • Lower kernel launch and synchronization overhead
  • Lower latency from minimal CPU involvement
  • Stable performance under repeated inference

 

CUDA Graph helps even for a single request, but it makes a large difference in serving environments where inference is repeated with the same model configuration. In these experiments, too, inference time increased significantly when CUDA Graph was disabled, confirming that it is an important component of vLLM's performance.
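
A back-of-the-envelope cost model shows why launch overhead matters for decoding, where each step issues many small kernels. All numbers below are hypothetical placeholders, not measurements from this experiment, and the model ignores everything except launch and execution time.

```python
def eager_step_us(num_kernels, launch_us, kernel_us):
    """Eager mode: the CPU launches kernels one by one while the GPU runs them
    asynchronously, so each kernel costs roughly the slower of launch and execution."""
    return num_kernels * max(launch_us, kernel_us)


def graph_step_us(num_kernels, graph_launch_us, kernel_us):
    """Graph replay: one launch for the whole captured graph, then back-to-back kernels."""
    return graph_launch_us + num_kernels * kernel_us


# Hypothetical decode step: 500 small kernels, 5 us to launch each, 2 us to execute each.
eager = eager_step_us(num_kernels=500, launch_us=5, kernel_us=2)
graph = graph_step_us(num_kernels=500, graph_launch_us=10, kernel_us=2)
print(eager, graph)  # 2500 1010
```

In this model the benefit disappears once kernels take longer to execute than to launch, which is one reason the effect is most visible in the token-by-token decoding phase.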

 

3. Experimental Setup

3.1 Environment

  • GPU: NVIDIA A100-SXM4-80GB
  • CUDA: 12.8
  • PyTorch: 2.9.0+cu128
  • Model: Qwen2-VL-7B-Instruct
  • Measurement: average over 10 inference runs per configuration
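
The measurement procedure (10 runs per configuration, averaged) can be sketched as a small timing helper. The warm-up call is an assumption added here so that one-time costs such as model compilation or graph capture do not skew the average; the post does not say whether the original experiment used one.

```python
import statistics
import time


def benchmark(fn, runs=10, warmup=1):
    """Call fn() `runs` times after `warmup` untimed calls; return (mean, stdev) in seconds.

    fn must block until its result is ready, otherwise asynchronous GPU work is not timed.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    mean = statistics.mean(times)
    stdev = statistics.stdev(times) if runs > 1 else 0.0
    return mean, stdev


# Stand-in workload; in the experiment this would be one inference call.
mean_s, stdev_s = benchmark(lambda: sum(range(100_000)), runs=10)
print(f"mean {mean_s * 1e3:.2f} ms, stdev {stdev_s * 1e3:.2f} ms")
```

Reporting the standard deviation alongside the mean makes it easier to tell a real difference between configurations from run-to-run noise.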

3.2 Comparison Targets and Experiment Design

The experiments consist of five configurations, each designed to remove or change one of vLLM's core optimizations at a time.

1) Transformers Baseline

  • Inference with Hugging Face Transformers
  • Standard FP16 setting
  • Static batching and the default KV cache management

This is the baseline for measuring reference performance without vLLM.

2) vLLM + No Chunked Prefill

  • Uses vLLM
  • FP32 precision
  • enable_chunked_prefill=False

This experiment checks vLLM's basic acceleration (PagedAttention, continuous batching, CUDA Graph) with Chunked Prefill disabled. It observes how much Chunked Prefill matters for relatively short input sequences.

3) vLLM + No CUDA Graph

  • Uses vLLM
  • FP32 precision
  • enforce_eager=True (disables CUDA Graph)

This experiment isolates the effect of CUDA Graph on inference performance. It observes how much kernel launch overhead affects performance under repeated calls.

 

4) vLLM Baseline (FP32)

  • vLLM default settings
  • FP32 precision
  • Chunked Prefill and CUDA Graph both enabled

This experiment checks the acceleration obtainable from vLLM's default configuration alone. It shows the performance ceiling achievable while preserving accuracy as much as possible.

 

5) vLLM + BF16

  • vLLM default settings
  • BF16 precision (dtype="bfloat16")

This measures the performance change when compute precision is lowered to BF16. It is the best configuration, taking advantage of the A100 GPU's hardware acceleration for BF16, and recorded the highest throughput and lowest inference latency in these experiments.
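
The four vLLM configurations can be written down as keyword arguments for the engine constructor. This is a sketch under assumptions: the configuration names and the `build_engine` helper are hypothetical, the model id is the assumed Hugging Face repo name, and the FP32 baseline relies on the installed vLLM version enabling Chunked Prefill and CUDA Graph by default, as described above. The actual experiment code is in the linked repository.

```python
MODEL = "Qwen/Qwen2-VL-7B-Instruct"  # assumed Hugging Face repo id

# Engine arguments per configuration; everything not listed keeps vLLM's defaults.
VLLM_CONFIGS = {
    "vllm_no_chunked_prefill": {"dtype": "float32", "enable_chunked_prefill": False},
    "vllm_no_cuda_graph": {"dtype": "float32", "enforce_eager": True},
    "vllm_baseline_fp32": {"dtype": "float32"},
    "vllm_bf16": {"dtype": "bfloat16"},
}


def build_engine(name):
    """Create a vLLM engine for one configuration (requires vllm and a GPU)."""
    from vllm import LLM  # imported lazily so the table can be inspected without vLLM

    return LLM(model=MODEL, **VLLM_CONFIGS[name])


for name, kwargs in VLLM_CONFIGS.items():
    print(name, kwargs)
```

Keeping the configurations in one table makes it explicit that each variant differs from the FP32 baseline in exactly one setting.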

 

4. Results

vLLM's default settings alone yielded roughly a 4.5x performance improvement over the Transformers baseline, and applying BF16 achieved a speedup of more than 7x.

  • Transformers Baseline
    • Transformers uses the least memory, but its inference time is the longest because it lacks optimized KV cache management and batching. The overhead is large even for a single request.
  • vLLM Baseline
    • PagedAttention and CUDA Graph alone reduce inference time substantially. It delivers a speedup of more than 4x while keeping FP32 precision.
  • vLLM + BF16
    • Applying BF16 greatly increases compute throughput.
    • With almost no loss in accuracy, inference speed improves by about a further 39% over FP32, for an overall speedup of more than 7x over Transformers.
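
How the two reported figures combine depends on what the 39% refers to, which the text does not spell out. The short calculation below works through both readings using only the rounded numbers quoted above; it is arithmetic on the stated figures, not additional measurement data.

```python
baseline_speedup = 4.5  # vLLM FP32 vs. Transformers ("roughly 4.5x")
bf16_gain = 0.39        # "about a further 39%" from BF16

# Reading A: BF16 throughput is 39% higher than FP32.
speedup_if_throughput = baseline_speedup * (1 + bf16_gain)

# Reading B: BF16 inference time is 39% lower than FP32.
speedup_if_latency = baseline_speedup / (1 - bf16_gain)

print(f"{speedup_if_throughput:.1f}x if throughput rose by 39%")  # 6.3x
print(f"{speedup_if_latency:.1f}x if latency fell by 39%")        # 7.4x
```

Only the second reading is consistent with the stated overall speedup of more than 7x, so the 39% most likely describes a reduction in inference time.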

VLM inference is structurally expensive, but vLLM delivers large performance gains even for a single request. Especially when deploying multimodal models to a serving environment, vLLM is closer to a necessity than an option as an inference engine. I hope these experiments serve as a useful reference for designing VLM serving and acceleration strategies.
