PyTorch FSDP (Fully Sharded Data Parallel): A Complete Walkthrough!

뭅즤 2025. 7. 6. 19:15

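1. What is FSDP (Fully Sharded Data Parallel)?

1.1 The FSDP concept

FSDP is an advanced distributed training technique provided by PyTorch. Unlike conventional DDP, which replicates every model parameter on every GPU, FSDP splits the parameters (together with their gradients and optimizer states) into shards and stores one shard per GPU. This greatly reduces per-GPU memory usage.

With FSDP, each GPU stores only its own shard rather than the whole model. During forward and backward, the GPUs exchange shards (all-gather) to temporarily rebuild the parameters needed for the current computation, then free the gathered copy so that only the local shard remains. In backward, the gradients are additionally summed and split across GPUs (reduce-scatter), so each GPU keeps only the gradient shard that matches its parameter shard.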

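1.2 DDP vs FSDP

Item	DDP	FSDP
Model parameters	Each GPU holds a full replica of the model	Parameters are split into shards, one per GPU
GPU memory	High, since every GPU stores the whole model	Lower, since each GPU stores only its shard
Data processing	Each GPU processes a different slice of the data in parallel	Each GPU processes a different slice of the data in parallel (same as DDP)
Inter-GPU communication	Gradient synchronization only (all-reduce)	Parameter all-gather in forward and backward, plus gradient reduce-scatter in place of all-reduce

In short, FSDP combines the memory savings of model parallelism with the fast parallel processing of data parallelism.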
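
To put rough numbers on the memory row of the table, here is a minimal back-of-the-envelope sketch. It assumes fp32 weights and gradients and an Adam optimizer (two fp32 moments per parameter), and it ignores activations and buffers. The 1B-parameter model and the function name are hypothetical, chosen only for illustration.

```python
# Rough per-GPU memory for model states, ignoring activations and buffers.
# Assumptions (not from the article): fp32 everywhere, Adam optimizer.
BYTES_PARAM = 4  # one fp32 weight
BYTES_GRAD = 4   # one fp32 gradient
BYTES_ADAM = 8   # Adam keeps two fp32 moments per parameter


def model_state_bytes_per_gpu(num_params: int, world_size: int, sharded: bool) -> int:
    """Bytes of weights + gradients + optimizer state held by one GPU."""
    total = num_params * (BYTES_PARAM + BYTES_GRAD + BYTES_ADAM)
    # DDP replicates everything; full sharding divides it by the number of GPUs.
    return total // world_size if sharded else total


if __name__ == "__main__":
    n = 1_000_000_000  # hypothetical 1B-parameter model
    gib = 1024 ** 3
    ddp = model_state_bytes_per_gpu(n, world_size=4, sharded=False)
    fsdp = model_state_bytes_per_gpu(n, world_size=4, sharded=True)
    print(f"DDP : {ddp / gib:.1f} GiB per GPU")
    print(f"FSDP: {fsdp / gib:.1f} GiB per GPU (plus transiently gathered layers)")
```

The FSDP figure is only the steady-state part. The layer currently being computed is gathered on top of it for a short time.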

2. Understanding FSDP

GPU0 : shard 0
GPU1 : shard 1
GPU2 : shard 2
GPU3 : shard 3

So the model parameters are split across several GPUs as shown above. How does training actually work, then?

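2.1 The FSDP forward pass

Before forward starts,
the GPUs all-gather the parameters of the layer that is about to run
→ each GPU temporarily holds the full parameters needed for that forward computation
Once the forward computation finishes,
each GPU frees the gathered parameters and keeps only its own shard (this step needs no communication; reduce-scatter is used for gradients in backward, not here)
→ GPU memory is released again.

In other words, an individual GPU does not hold the whole model at all times. It gathers parameters from the other GPUs only for the duration of the computation, and goes back to holding only its shard once the computation is done.

Put more formally: a GPU can run forward without permanently holding the whole model because it gathers the parameters over the network (all-gather) only when they are needed.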
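
The gather, compute, free cycle can be mimicked in plain Python. This is only a toy simulation: lists stand in for tensors, the "layer" is a dot product, and the helper names (`shard`, `all_gather`, `forward_on_rank`) are made up for this sketch rather than taken from PyTorch. Note that the unsharded copy is simply dropped after forward, with no communication.

```python
from typing import List, Tuple


def shard(flat_params: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter list into equal shards, zero-padding the tail."""
    shard_len = -(-len(flat_params) // world_size)  # ceiling division
    padded = flat_params + [0.0] * (shard_len * world_size - len(flat_params))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Every rank receives all shards; concatenating them rebuilds the tensor."""
    full = [x for s in shards for x in s]
    return full[:numel]  # drop the padding added by shard()


def forward_on_rank(
    rank: int, shards: List[List[float]], numel: int, x: List[float]
) -> Tuple[float, List[float]]:
    """Run a toy layer on one rank and report what stays resident afterwards."""
    full_weights = all_gather(shards, numel)  # temporarily materialize the layer
    y = sum(w * xi for w, xi in zip(full_weights, x))  # toy layer: dot product
    del full_weights  # free the gathered copy; no communication is involved
    return y, shards[rank]  # only the local shard stays resident


if __name__ == "__main__":
    shards = shard([1.0, 2.0, 3.0, 4.0, 5.0], world_size=4)
    output, resident = forward_on_rank(0, shards, numel=5, x=[1.0] * 5)
    print(output, resident)
```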

2.2 FSDP forward / backward communication flow

[ GPU0 ]   [ GPU1 ]   [ GPU2 ]   [ GPU3 ]
   |          |          |          |
   |          |          |          |
   |<---- all-gather parameters --->|
   |  (GPUs exchange their shards)  |
   |  => each GPU temporarily holds |
   |     the full parameters it     |
   |     needs for the computation  |
   |          |          |          |
   |   forward & backward compute   |
   |          |          |          |
   |<---- reduce-scatter grads ---->|
   |  (gradients are summed across  |
   |   GPUs; each GPU keeps only    |
   |   its own gradient shard)      |
   |          |          |          |
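
The difference between DDP's all-reduce and the reduce-scatter that FSDP applies to gradients can be sketched with lists. This is a simulation under simplifying assumptions: the gradient length is divisible by the number of ranks, and the averaging that real implementations also apply is left out. The function names are illustrative, not PyTorch API.

```python
from typing import List


def reduce_scatter(per_rank_grads: List[List[float]]) -> List[List[float]]:
    """Sum gradients element-wise across ranks, then give rank r only chunk r.

    per_rank_grads[r] is the full-size gradient computed on rank r from its
    own mini-batch. The length is assumed divisible by the number of ranks.
    """
    world_size = len(per_rank_grads)
    numel = len(per_rank_grads[0])
    chunk = numel // world_size
    summed = [sum(g[i] for g in per_rank_grads) for i in range(numel)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_reduce(per_rank_grads: List[List[float]]) -> List[List[float]]:
    """DDP-style: every rank ends up with the full summed gradient."""
    numel = len(per_rank_grads[0])
    summed = [sum(g[i] for g in per_rank_grads) for i in range(numel)]
    return [list(summed) for _ in per_rank_grads]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    print("reduce-scatter:", reduce_scatter(grads))
    print("all-reduce    :", all_reduce(grads))
```

After reduce-scatter each rank holds exactly the gradient slice for the parameter shard it owns, which is all the sharded optimizer step needs.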

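2.3 Doesn't each GPU still end up holding the whole model, even if only briefly?

While studying this, a question comes up. If each GPU ends up holding the whole model during forward/backward anyway, why bother sharding at all? If the whole model fits, why not just keep it there? All it seems to add is inter-GPU communication, so where is the efficiency?

✅ First, let's look at what actually occupies GPU memory

Item	Created when	Why it is needed	GPU memory impact	Notes
weights	Model initialization	Forward computation and updates	Relatively small	- Model parameters (W, b) - Always resident on the GPU when not sharded
gradients	backward	Updating the weights	Similar to weights	- Result of computing ∂Loss/∂W - Used by the optimizer after backward
activations	During forward	Needed in backward to apply the chain rule	Usually the largest	- Output of each layer - Grows with batch size and sequence length

In other words, the model parameters are the weights, the values produced during forward are the activations, and those activations are used in backward to compute the gradients.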
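
A small counting sketch makes the three rows concrete. It assumes a hypothetical stack of square Linear layers and counts one saved output per layer, which is a deliberate simplification of what autograd really keeps. The function name and the sizes are illustrative only.

```python
from typing import Dict


def mlp_memory_elements(
    num_layers: int, hidden: int, batch_size: int, seq_len: int
) -> Dict[str, int]:
    """Count elements for a hypothetical stack of hidden x hidden Linear layers."""
    weights = num_layers * (hidden * hidden + hidden)  # W and b per layer
    gradients = weights  # one gradient value per weight
    # Simplification: one saved output of shape (batch, seq, hidden) per layer.
    activations = num_layers * batch_size * seq_len * hidden
    return {"weights": weights, "gradients": gradients, "activations": activations}


if __name__ == "__main__":
    for batch in (1, 8, 64):
        counts = mlp_memory_elements(num_layers=24, hidden=1024, batch_size=batch, seq_len=2048)
        print(batch, counts)
```

Weights and gradients stay the same as the batch grows, while the activation count scales with batch size and sequence length.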

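✅ What FSDP actually does

layer1 : all-gather only the shards it needs → forward → free the gathered weights
layer2 : all-gather only the shards it needs → forward → free the gathered weights
...

The short answer is that FSDP does not all-gather the whole model onto each GPU at once during forward/backward.
FSDP manages the model in units, each covering one layer or a group of layers.
So when forward computes one layer, FSDP
- all-gathers only that layer's parameter shards across GPUs to rebuild its full parameters
- runs the forward computation
- frees the gathered parameters when the computation finishes, keeping only the local shard.
Backward works the same way in reverse order: each layer's parameters are all-gathered again, its gradients are computed, and those gradients are reduce-scattered so that each GPU keeps only its own gradient shard.

So when the model is wrapped layer by layer, no GPU holds the whole model's parameters at once during forward/backward. Even when the model cannot fit on a single GPU, FSDP can keep training by computing layer by layer (unit by unit).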

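How much is gathered at a time depends on how the model is wrapped. If the whole model is wrapped as a single FSDP unit (for example, FSDP(model) with no auto_wrap_policy), all parameters are all-gathered together at the start of forward, so the peak grows and the memory pattern looks much like DDP. This is why people often simplify it to “the GPU holds everything during forward/backward.”
When layers or blocks are wrapped as separate FSDP units instead (through auto_wrap_policy or manual nested wrapping), all-gather runs per unit, so only the unit currently being computed is fully materialized and memory usage is spread out over the pass.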
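
One way to get per-unit gathering is to pass an `auto_wrap_policy` when wrapping. The sketch below uses `size_based_auto_wrap_policy` from `torch.distributed.fsdp.wrap`. The threshold of one million parameters and the helper names `make_policy` and `wrap_per_submodule` are assumptions for illustration, and a sensible threshold depends on the model.

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def make_policy(min_num_params: int) -> functools.partial:
    """Build a policy that wraps submodules with at least min_num_params parameters."""
    return functools.partial(size_based_auto_wrap_policy, min_num_params=min_num_params)


def wrap_per_submodule(model: nn.Module, min_num_params: int = 1_000_000) -> FSDP:
    """Wrap `model` so that large submodules become their own FSDP units.

    Call this only after dist.init_process_group() has run and the CUDA
    device for this process has been selected.
    """
    return FSDP(model, auto_wrap_policy=make_policy(min_num_params))
```

For transformer models, wrapping by block class (for example with `transformer_auto_wrap_policy`) is a common alternative to a size threshold.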

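Phase	DDP	FSDP
During forward/backward	Each GPU keeps the full parameters + activations + gradients	Per unit: all-gather the needed shards → compute → free them (gradients are reduce-scattered in backward); the whole model is never resident at once
After forward/backward (steady state)	Each GPU keeps the full parameters	Each GPU keeps only its own shard; the optimizer step also runs on the shard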

✅ FSDP forward flow

[Layer1]
GPU0: shard0
GPU1: shard1
GPU2: shard2
GPU3: shard3

    ↓ all-gather (Layer1)
GPU0: full weight1
GPU1: full weight1
GPU2: full weight1
GPU3: full weight1

    ↓ forward compute (Layer1)
    ↓ free full weight1 (keep only the local shard)
GPU0: shard0 (Layer1)
GPU1: shard1 (Layer1)
GPU2: shard2 (Layer1)
GPU3: shard3 (Layer1)


[Layer2]
GPU0: shard0
GPU1: shard1
GPU2: shard2
GPU3: shard3

    ↓ all-gather (Layer2)
GPU0: full weight2
GPU1: full weight2
GPU2: full weight2
GPU3: full weight2

    ↓ forward compute (Layer2)
    ↓ free full weight2 (keep only the local shard)
GPU0: shard0 (Layer2)
GPU1: shard1 (Layer2)
GPU2: shard2 (Layer2)
GPU3: shard3 (Layer2)


[Layer3]
GPU0: shard0
GPU1: shard1
GPU2: shard2
GPU3: shard3

    ↓ all-gather (Layer3)
GPU0: full weight3
GPU1: full weight3
GPU2: full weight3
GPU3: full weight3

    ↓ forward compute (Layer3)
    ↓ free full weight3 (keep only the local shard)
GPU0: shard0 (Layer3)
GPU1: shard1 (Layer3)
GPU2: shard2 (Layer3)
GPU3: shard3 (Layer3)

...
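
The layer-by-layer walk can be turned into a peak-memory estimate. The sketch below counts parameter elements only and is a simplification: it counts the local shard once plus the remote part fetched by all-gather, whereas a real implementation may allocate a separate full-size buffer and may prefetch the next unit. The function name and layer sizes are hypothetical.

```python
from typing import List


def peak_param_elements(
    layer_sizes: List[int], world_size: int, per_layer_units: bool = True
) -> int:
    """Peak number of parameter elements resident on one GPU during forward.

    Each GPU always holds its shard of every layer. On top of that it holds
    the gathered remainder of whichever unit is currently being computed.
    Sizes are assumed divisible by world_size to keep the arithmetic exact.
    """
    resident_shards = sum(size // world_size for size in layer_sizes)
    # One unit per layer, or a single unit covering the whole model.
    units = layer_sizes if per_layer_units else [sum(layer_sizes)]
    peak = 0
    for unit in units:
        gathered_extra = unit - unit // world_size  # part fetched from other GPUs
        peak = max(peak, resident_shards + gathered_extra)
    return peak


if __name__ == "__main__":
    sizes = [400, 400, 400]  # hypothetical parameter counts per layer
    print("per-layer units:", peak_param_elements(sizes, world_size=4))
    print("single unit    :", peak_param_elements(sizes, world_size=4, per_layer_units=False))
```

With per-layer units the peak stays well below the full model size, while a single unit covering everything brings the peak back to the full parameter count.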

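3. Prerequisites for using FSDP

3.1 Environment setup

PyTorch 1.12 or later is required
CUDA and NCCL must be installed (NCCL provides fast inter-GPU communication)
Every GPU server needs the same code, data, and libraries installed

3.2 Preparing a multi-GPU environment

Prepare a single-machine or multi-machine (multi-node) environment
For multi-machine runs, settings such as MASTER_ADDR and MASTER_PORT are required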
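
These settings reach each training process as environment variables. The helper below is a small sketch for inspecting them. The function name `read_dist_env` and the fallback defaults (a single local process on port 29500) are assumptions made for this example, not something PyTorch requires.

```python
import os
from typing import Dict, Mapping


def read_dist_env(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Collect the settings that env:// process-group initialization relies on."""
    return {
        # Defaults below are assumptions suitable for a single local process.
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


if __name__ == "__main__":
    print(read_dist_env())
```

When the script is started with torchrun, RANK, LOCAL_RANK, and WORLD_SIZE are filled in by the launcher, so printing this dictionary on each process is a quick sanity check.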

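4. How to use PyTorch FSDP

4.1 Basic code structure

import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data import DataLoader, DistributedSampler

# Initialize the distributed environment (torchrun sets RANK, LOCAL_RANK, WORLD_SIZE)
dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ['LOCAL_RANK']))  # bind this process to its own GPU

# Create the model and wrap it with FSDP
# (MyModel and MyDataset are placeholders for your own classes)
model = MyModel().cuda()
fsdp_model = FSDP(model)  # no auto_wrap_policy: the whole model is a single FSDP unit

# Set up the data loader; the sampler gives each rank a different slice of the data
dataset = MyDataset()
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Create the optimizer after wrapping so that it sees the sharded parameters
optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-4)

# Training loop
for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle differently every epoch
    for inputs, targets in dataloader:
        inputs, targets = inputs.cuda(), targets.cuda()
        outputs = fsdp_model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Clean up the distributed environment
dist.destroy_process_group()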

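4.2 Running FSDP in a multi-machine environment

In a multi-machine environment, you can launch training with torchrun directly or through a job scheduler such as Slurm.

torchrun example (run on every node; use --node_rank=1 on the second node, and set MASTER_ADDR and MASTER_PORT to the first node's address and a free port)

torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py

Slurm script example

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node; torchrun spawns the 4 workers
#SBATCH --gres=gpu:4

# Use the first node of the allocation as the rendezvous host (port 29500 is an assumption)
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun --nnodes=$SLURM_JOB_NUM_NODES \
              --nproc_per_node=4 \
              --rdzv_id=$SLURM_JOB_ID \
              --rdzv_backend=c10d \
              --rdzv_endpoint=$MASTER_ADDR:29500 \
              train.py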

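Things to watch out for when using FSDP

FSDP saves a lot of GPU memory, but inter-GPU communication (all-gather, reduce-scatter) increases.
Because each GPU holds only a shard, checkpointing needs care: either gather the shards into one full state dict before saving, or save sharded checkpoints in which each rank writes its own shard.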
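
As one possible way to handle the "gather and save" option, the sketch below uses `FSDP.state_dict_type` with `StateDictType.FULL_STATE_DICT`. The helper name `save_full_checkpoint` and the choice to offload to CPU and write from rank 0 only are assumptions for this example. Recent PyTorch versions also offer the `torch.distributed.checkpoint` utilities, which may be preferable for very large models.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)


def save_full_checkpoint(fsdp_model: FSDP, path: str) -> None:
    """Gather all shards into one state dict and write it from rank 0 only."""
    # offload_to_cpu avoids holding the full model on the GPU;
    # rank0_only materializes the full state dict on rank 0 alone.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, cfg):
        state = fsdp_model.state_dict()  # collective call: every rank must reach it
    if dist.get_rank() == 0:
        torch.save(state, path)
```

Every rank has to call the function, since building the state dict involves all-gather, even though only rank 0 writes the file.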

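FSDP is a powerful PyTorch feature for using GPU memory efficiently when training very large models. It is especially effective when the model is too large for GPU memory, and it combines the advantages of data parallelism with the memory-management advantages of model parallelism.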