Fully Understanding DeepSpeed!

2025. 7. 7. 21:21 · 🛠️ Engineering/Distributed Training & Inference

1. What Is DeepSpeed?

DeepSpeed is a large-scale distributed training optimization library developed by Microsoft. It emerged to efficiently train models with billions to tens of billions of parameters, which are hard to train with PyTorch DDP alone. Its defining trait is that it bundles a range of techniques for saving GPU memory and speeding up training, including ZeRO optimization, mixed precision, offloading, and pipeline/tensor parallelism. It runs on top of PyTorch, so it is easy to adopt and integrate.

 

2. Key Features

2.1 ZeRO (Zero Redundancy Optimizer)

This is DeepSpeed's core feature. It shards parameters, gradients, and optimizer states across GPUs, which sharply reduces per-GPU memory usage. ZeRO-3, for example, shards even the parameters themselves, so training works without any single GPU holding the entire model at all times. PyTorch DDP replicates the full model on every GPU and therefore uses a lot of memory, whereas ZeRO removes that redundancy.

  • ZeRO-1: shards optimizer states
  • ZeRO-2: shards optimizer states + gradients
  • ZeRO-3: shards optimizer states + gradients + parameters
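
To make the three stages concrete, here is a rough per-GPU memory estimate. It is a sketch under assumptions that are not part of this post: mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for optimizer states (the accounting used in the ZeRO paper), with activations and temporary buffers ignored. The function name and the 7.5B-parameter, 64-GPU setup are purely illustrative.

```python
def zero_memory_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory in GB (activations excluded).

    Assumption: mixed-precision Adam, i.e. 2 bytes (FP16 weights)
    + 2 bytes (FP16 gradients) + 12 bytes (FP32 weights, momentum,
    variance) per parameter.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1: optimizer states are sharded
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2: gradients are sharded as well
    if stage >= 3:
        params /= num_gpus  # ZeRO-3: parameters are sharded as well
    return num_params * (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_memory_gb(7.5e9, num_gpus=64, stage=stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, stage 0 (plain replication, as in DDP) needs about 120 GB per GPU, while stages 1, 2, and 3 need roughly 31.4 GB, 16.6 GB, and 1.9 GB.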

2.2 Mixed Precision Training

FP16 or BF16 is used to speed up computation and reduce GPU memory usage. For example, when training a 16GB model on a 32GB GPU, applying FP16 can allow a batch size close to 2x larger on the same GPU, although the actual gain depends on the model and on how much memory the optimizer states take.
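
The trade-off behind half precision can be seen without a GPU. The sketch below uses Python's standard struct module, not DeepSpeed, to show that an FP16 value takes half the bytes of an FP32 value but has a much smaller range and coarser precision. This is why FP16 training is typically paired with loss scaling. The helper name to_fp16 is illustrative.

```python
import struct


def to_fp16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


fp32_bytes = struct.calcsize("<f")  # bytes per FP32 value
fp16_bytes = struct.calcsize("<e")  # bytes per FP16 value
print(f"FP32: {fp32_bytes} bytes, FP16: {fp16_bytes} bytes")

# Precision is coarser: 0.1 is not preserved exactly.
print("0.1 in FP16:", to_fp16(0.1))

# Very small values (for example tiny gradients) underflow to zero.
print("1e-8 in FP16:", to_fp16(1e-8))

# Values beyond the FP16 maximum (65504) cannot be represented.
try:
    to_fp16(70000.0)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True
print("70000 overflows FP16:", overflowed)
```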

2.3 Gradient Accumulation & CPU Offloading

The batch is split across several steps and the gradients are accumulated, and optimizer states are placed in CPU memory to reduce GPU memory pressure. CPU offload is easier to apply than with DDP or FSDP.
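
Gradient accumulation can be checked with plain arithmetic. The sketch below is not DeepSpeed code: it uses a hypothetical one-parameter model y = w * x with mean squared error and made-up data, and shows that averaging the gradients of equally sized micro-batches gives the same gradient as one large batch.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

# Gradient computed on the full batch of 8 samples at once.
full_grad = mse_grad(w, xs, ys)

# The same batch processed as 4 micro-batches of 2 samples each.
accumulation_steps = 4
micro = len(xs) // accumulation_steps
accumulated = 0.0
for i in range(accumulation_steps):
    lo, hi = i * micro, (i + 1) * micro
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated)
assert math.isclose(full_grad, accumulated, rel_tol=1e-9)
```

Only one micro-batch of activations has to fit in GPU memory at a time, which is where the memory saving comes from.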

2.4 Pipeline & Tensor Parallelism

The model is split into multiple stages or at the tensor level and trained in parallel across many GPUs. A deep network such as a Transformer can be divided into pipeline stages, or the tensor operations themselves can be partitioned and computed in a distributed way, which enables finer-grained parallel training than PyTorch's basic DDP/FSDP support.

 

2.5 Comparison of Distributed Training Frameworks

Method | What is parallelized? | Characteristics
DDP | Data parallelism | Every GPU holds a full replica of the same model and trains on a different data batch
FSDP | Data + parameter sharding | Parameter sharding reduces GPU memory; still data parallel
DeepSpeed Pipeline | The model itself, split by stage | The model is split across GPUs to spread memory; forward/backward also run sequentially

 

If DDP and FSDP are frameworks for data parallelism, DeepSpeed supports not only ZeRO (data + parameter parallelism) but also pipeline parallelism, which makes it much better suited to training very large models.

 

3. DeepSpeed Pipeline Parallelism: A Working Example

Example: 3-stage pipeline

  • Assume the model is split into 3 stages (e.g. by dividing the Transformer blocks)
  • There are also 3 GPUs (GPU0, GPU1, GPU2)
Mini-batch 1
GPU0: stage1 --> output1
GPU1:           stage2 --> output2
GPU2:                      stage3 --> final_output

Mini-batch 2
GPU0: stage1 --> output1
GPU1:           stage2 --> output2
GPU2:                      stage3 --> final_output

...

In other words, the forward pass works like this:

  • GPU0 computes stage1 → passes the output to GPU1
  • GPU1 computes stage2 → passes the output to GPU2
  • GPU2 computes stage3 → final output

At the same time, to reduce the pipeline bubble:

time step1: GPU0(mini-batch1 stage1)
time step2: GPU0(mini-batch2 stage1), GPU1(mini-batch1 stage2)
time step3: GPU0(mini-batch3 stage1), GPU1(mini-batch2 stage2), GPU2(mini-batch1 stage3)
...

Splitting a batch into smaller pieces this way (DeepSpeed calls them micro-batches) lets different stages run at the same time, which minimizes idle time (the bubble).
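
The staggered schedule above can be reproduced with a few lines of Python. This is a simplified forward-only model that assumes every stage takes one equal time step. It is not DeepSpeed's scheduler, which also interleaves backward passes, and the function names are illustrative.

```python
def forward_schedule(num_stages: int, num_microbatches: int):
    """Return, per time step, the (gpu, microbatch) pairs that are active.

    At time step t (0-based), stage s works on micro-batch t - s,
    provided that micro-batch exists.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        schedule.append(active)
    return schedule


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of GPU time slots left idle during the forward pass."""
    total_slots = num_stages * (num_stages + num_microbatches - 1)
    busy_slots = num_stages * num_microbatches
    return 1 - busy_slots / total_slots


if __name__ == "__main__":
    for t, active in enumerate(forward_schedule(3, 4), start=1):
        cells = ", ".join(f"GPU{s}(mb{m + 1} stage{s + 1})" for s, m in active)
        print(f"time step{t}: {cells}")
    print("bubble with 4 micro-batches:", round(bubble_fraction(3, 4), 3))
    print("bubble with 32 micro-batches:", round(bubble_fraction(3, 32), 3))
```

In this model the idle fraction is (S - 1) / (S + M - 1) for S stages and M micro-batches, so it shrinks as the batch is split into more pieces.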

 

Backward flow

GPU2: backward(stage3)
GPU1: backward(stage2)
GPU0: backward(stage1)

The backward pass flows in the opposite direction of the forward pass.

 

✅ Pipeline parallelism
Split the model into stages and distribute them across GPUs → run forward/backward sequentially → split the batch to keep the pipeline full and minimize the bubble

 

4. Usage

4.1 Installation

pip install deepspeed

PyTorch, CUDA, and NCCL must be installed, and for large-scale distributed runs it helps to have a scheduler environment such as Slurm ready.

4.2 Basic Training Script Structure

DeepSpeed starts distributed training through the deepspeed launcher.

deepspeed --num_gpus=4 train.py --deepspeed --deepspeed_config ds_config.json

Here, ds_config.json is the configuration file that holds the training, ZeRO, and FP16 settings.

4.3 Example ds_config.json

{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}
  • train_batch_size: the effective (global) batch size targeted for training. It equals the per-GPU micro-batch size × gradient_accumulation_steps × the number of GPUs, so dividing it by gradient_accumulation_steps and the number of GPUs gives the per-GPU micro-batch size.
  • gradient_accumulation_steps: accumulates gradients over several steps and then performs a single optimizer step, emulating a larger batch size.
  • fp16.enabled: computes in float16 (half precision) instead of float32, which speeds up computation and reduces GPU memory usage.
  • zero_optimization.stage: the ZeRO optimization stage. Stage 2 shards optimizer states and gradients, which greatly reduces the memory footprint.
  • offload_optimizer.device: moves optimizer states to the CPU to further reduce GPU memory pressure.
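
The relationship between the batch-size fields can be worked through by hand. The sketch below is plain Python, not a DeepSpeed API: it takes the two batch-related values from the example config and the 4 GPUs from the launcher command, then derives the per-GPU micro-batch size. The helper name micro_batch_per_gpu is illustrative.

```python
import json

# Batch-related fields copied from the example ds_config.json.
DS_CONFIG = """
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4
}
"""


def micro_batch_per_gpu(config: dict, num_gpus: int) -> int:
    """Derive the per-GPU micro-batch size from the global batch size.

    Uses: train_batch_size =
          micro_batch_per_gpu * gradient_accumulation_steps * num_gpus
    """
    global_batch = config["train_batch_size"]
    accum = config.get("gradient_accumulation_steps", 1)
    denom = accum * num_gpus
    if global_batch % denom != 0:
        raise ValueError(
            f"train_batch_size={global_batch} is not divisible by "
            f"gradient_accumulation_steps * num_gpus = {denom}"
        )
    return global_batch // denom


config = json.loads(DS_CONFIG)
print(micro_batch_per_gpu(config, num_gpus=4))  # 64 / (4 * 4) = 4
```

So with this config on 4 GPUs, each GPU processes 4 samples per forward/backward pass and the optimizer steps once every 4 passes.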
Set up a GPU cluster → manage resources with Slurm → PyTorch model → wrap with DeepSpeed + ds_config → run with srun/sbatch on Slurm

DeepSpeed is a framework built for training very large models: ZeRO dramatically reduces GPU memory pressure, and combining it with FP16, offloading, and parallelism makes large-scale distributed training efficient.
