[Gen AI] LDM (Latent Diffusion Models) Explained

2025. 6. 29. 18:14 · Research/Generative AI

์ƒ์„ฑ ๋ชจ๋ธ์—์„œ Diffusion์€ ๊ณ ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ํ•ต์‹ฌ ๊ธฐ์ˆ ๋กœ ์ž๋ฆฌ ์žก์•˜์ง€๋งŒ, DDPM์ฒ˜๋Ÿผ ํ”ฝ์…€ ๊ณต๊ฐ„์—์„œ ์ง์ ‘ ๋…ธ์ด์ฆˆ๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฐฉ์‹์—๋Š” ์น˜๋ช…์ ์ธ ๋‹จ์ ์ด ์žˆ์—ˆ๋‹ค. ๋ฐ”๋กœ ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ด๋‹ค.

 

 

Related post: [Gen AI] Diffusion Model and DDPM Explained — mvje.tistory.com

 

 

For example, running diffusion directly on a 256×256 image (adding and removing noise pixel by pixel) means repeatedly processing feature maps amounting to hundreds of MB. The burden grows steeply with resolution and quickly hits GPU memory limits. Latent Diffusion Models (LDM) were introduced to solve exactly this problem.

 

💡 Want to try training an LDM yourself? → LDM_MNIST

 

Vision-AI-Tutorials/Image_Generation/LDM_MNIST at main · ldj7672/Vision-AI-Tutorials (github.com) — a collection of hands-on Computer Vision & AI examples.


1. What is LDM?

 

LDM (Latent Diffusion Model) was first proposed in the CVPR 2022 paper "High-Resolution Image Synthesis with Latent Diffusion Models". It is the direct predecessor of Stable Diffusion, and its key strategy is to move the diffusion process out of pixel space, the biggest bottleneck of the DDPM formulation, into a latent space.

 

That is, LDM first encodes a high-resolution image into a much smaller latent space with a VAE (Variational Autoencoder). It then learns to inject and remove noise in that latent space, in exactly the same way as DDPM. Finally, the latent is decoded back into the original image. This makes the feature maps the diffusion process has to handle dozens of times smaller, so high-resolution images can be generated much faster and with far fewer resources.

 

For example, encoding a 256×256×3 image with the VAE yields a latent of about 32×32×4. Running diffusion there means each step touches roughly 48× fewer values (196,608 vs 4,096). As a result, high-resolution image generation with models like Stable Diffusion became feasible even on ordinary consumer GPUs.
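The size reduction can be checked with a line of arithmetic, using the 256×256×3 → 32×32×4 figures above:

```python
# Number of values the diffusion UNet must touch per image, per step.
pixel_elems = 256 * 256 * 3    # pixel space: 196,608 values
latent_elems = 32 * 32 * 4     # latent space: 4,096 values

reduction = pixel_elems / latent_elems
print(reduction)  # 48.0
```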

 

Put very simply,
LDM passes an <image> through the VAE Encoder to obtain a <latent vector>, injects noise into that latent vector,
and trains a UNet to predict the noise. An actual LDM therefore consists of a VAE + UNet; the input data is an image, and conditioning information such as a class label or text embedding can be used alongside it.

 

2. LDM Architecture and Training Process

LDM์€ ํฌ๊ฒŒ ์„ธ ๊ฐ€์ง€ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค.

2.1 Encoding: image → latent

์›๋ณธ ์ด๋ฏธ์ง€ xโ‚€๋ฅผ VAE Encoder๋ฅผ ํ†ตํ•ด latent zโ‚€๋กœ ์••์ถ•ํ•œ๋‹ค.

x₀ → Encoder → z₀

This latent z₀ is a compact representation that preserves the image's structural and visual information while being less sensitive to pixel-level noise. The VAE is trained before diffusion training begins; afterwards, the Encoder and Decoder are frozen.
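To make the shapes concrete, here is a toy stand-in for the autoencoder: three stride-2 convolutions give the same 8× spatial downsampling as in the 256×256×3 → 32×32×4 example. This is purely illustrative; a real LDM autoencoder is far deeper and is KL-regularized with perceptual and adversarial losses.

```python
import torch
import torch.nn as nn

# Toy autoencoder, only to demonstrate the x0 -> z0 -> x0_hat shapes.
class ToyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(64, 4, 4, stride=2, padding=1),   # 64 -> 32
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1),  # 32 -> 64
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), # 64 -> 128
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),  # 128 -> 256
        )

x0 = torch.randn(1, 3, 256, 256)   # original image
ae = ToyAutoencoder()
z0 = ae.encoder(x0)                # compact latent
x_hat = ae.decoder(z0)             # reconstruction
print(z0.shape, x_hat.shape)       # [1, 4, 32, 32], [1, 3, 256, 256]
```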

 

2.2 Latent Diffusion: noise injection and prediction in latent space

Everything DDPM did in pixel space is now performed in latent space.

 

Forward Process

z₀ → z₁ → z₂ → ... → z_T
z_t = sqrt(ᾱ_t) * z₀ + sqrt(1-ᾱ_t) * ε
  • Noise is gradually added to z₀ according to the timestep t to produce z_t.
  • In training, z₀ is obtained directly from x₀; then a random noise ε and a timestep t are sampled, and z_t is produced in a single shot with the formula above.
  • This is exactly the same as in DDPM.
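The closed-form noising above can be sketched in a few lines of PyTorch. The linear β schedule here is an assumption for illustration; LDM inherits its schedule from the DDPM setup it builds on.

```python
import torch

# Closed-form forward process in latent space.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product: abar_t

def q_sample(z0, t, eps):
    """z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps

z0 = torch.randn(8, 4, 32, 32)    # batch of latents from the encoder
t = torch.randint(0, T, (8,))     # one random timestep per sample
eps = torch.randn_like(z0)        # the noise the UNet will learn to predict
zt = q_sample(z0, t, eps)
print(zt.shape)                   # torch.Size([8, 4, 32, 32])
```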

 

Below is a simplified PyTorch example of a UNet that injects the time embedding and class embedding directly into each stage.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def timestep_embedding(t, dim=128):
    """Sinusoidal embedding: [batch] integer timesteps -> [batch, dim]."""
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    ).to(t.device)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


class LDMUNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Time embedding (128 → 512)
        self.time_mlp = nn.Sequential(
            nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 512)
        )
        # Class embedding (256 → 512); the extra index serves as a
        # "null" class for the unconditional case
        self.class_emb = nn.Embedding(num_classes + 1, 256)
        self.class_mlp = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 512)
        )

        # Input projection
        self.input_proj = nn.Conv2d(4, 64, 3, padding=1)  # [4 → 64]

        # Encoder
        self.enc1 = nn.Conv2d(64 + 512, 128, 4, stride=2, padding=1) # 32→16
        self.enc2 = nn.Conv2d(128 + 512, 256, 4, stride=2, padding=1) # 16→8
        self.enc3 = nn.Conv2d(256 + 512, 512, 4, stride=2, padding=1) # 8→4

        # Middle
        self.middle1 = nn.Conv2d(512 + 512, 768, 3, padding=1)
        self.middle2 = nn.Conv2d(768 + 512, 512, 3, padding=1)

        # Decoder (in channels = upsampled features + skip + condition)
        self.dec3 = nn.ConvTranspose2d(512 + 512 + 512, 256, 4, stride=2, padding=1) # 4→8
        self.dec2 = nn.ConvTranspose2d(256 + 256 + 512, 128, 4, stride=2, padding=1) # 8→16
        self.dec1 = nn.ConvTranspose2d(128 + 128 + 512, 64, 4, stride=2, padding=1)  # 16→32

        # Output
        self.output_proj = nn.Conv2d(64 + 512, 4, 3, padding=1) # back to [4]

    def forward(self, x, t, class_labels=None):
        """
        - x: [batch, 4, 32, 32] (latent z_t)
        - t: [batch] timestep
        - class_labels: [batch] class index
        """
        # Create condition embedding
        t_emb = self.time_mlp(timestep_embedding(t))  # [batch, 512]
        if class_labels is None:
            class_labels = torch.full(
                (x.size(0),), self.class_emb.num_embeddings - 1,
                device=x.device, dtype=torch.long
            )
        c_emb = self.class_mlp(self.class_emb(class_labels))  # [batch, 512]
        cond_emb = (t_emb + c_emb).unsqueeze(-1).unsqueeze(-1)  # [batch, 512, 1, 1]

        # Input projection
        x = self.input_proj(x)  # [batch, 64, 32, 32]

        # Encoder with condition
        x = torch.cat([x, cond_emb.expand(-1, -1, 32, 32)], dim=1)
        x1 = F.relu(self.enc1(x))  # [batch, 128, 16, 16]

        x1_cat = torch.cat([x1, cond_emb.expand(-1, -1, 16, 16)], dim=1)
        x2 = F.relu(self.enc2(x1_cat))  # [batch, 256, 8, 8]

        x2_cat = torch.cat([x2, cond_emb.expand(-1, -1, 8, 8)], dim=1)
        x3 = F.relu(self.enc3(x2_cat))  # [batch, 512, 4, 4]

        # Middle with condition
        x3_cat = torch.cat([x3, cond_emb.expand(-1, -1, 4, 4)], dim=1)
        x_mid = F.relu(self.middle1(x3_cat))  # [batch, 768, 4, 4]
        x_mid = torch.cat([x_mid, cond_emb.expand(-1, -1, 4, 4)], dim=1)
        x_mid = F.relu(self.middle2(x_mid))   # [batch, 512, 4, 4]

        # Decoder with condition
        x_mid_cat = torch.cat([x_mid, x3, cond_emb.expand(-1, -1, 4, 4)], dim=1)
        x = F.relu(self.dec3(x_mid_cat))  # [batch, 256, 8, 8]

        x = torch.cat([x, x2, cond_emb.expand(-1, -1, 8, 8)], dim=1)
        x = F.relu(self.dec2(x))          # [batch, 128, 16, 16]

        x = torch.cat([x, x1, cond_emb.expand(-1, -1, 16, 16)], dim=1)
        x = F.relu(self.dec1(x))          # [batch, 64, 32, 32]

        # Output projection with final condition
        x = torch.cat([x, cond_emb.expand(-1, -1, 32, 32)], dim=1)
        return self.output_proj(x)        # [batch, 4, 32, 32]
  • The UNet example above builds cond_emb by summing the timestep embedding and class embedding, then injects it into the encoder and decoder so that every stage sees both the diffusion time and the condition.
  • cond_emb is originally a 1D vector, but it is broadcast to [batch, 512, H, W] so it can be concatenated with the spatial feature maps in the encoder/decoder.
  • Even with this simplified structure, the "time + class guided UNet" idea behind LDM can be explored intuitively.
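The broadcasting trick in the second bullet looks like this on its own (sizes match one UNet stage in the example):

```python
import torch

# Attaching a 1D condition vector to a spatial feature map by broadcasting
# it over H and W and concatenating along the channel dimension.
B = 2
cond_emb = torch.randn(B, 512).unsqueeze(-1).unsqueeze(-1)  # [B, 512, 1, 1]
feat = torch.randn(B, 64, 32, 32)                           # spatial features

fused = torch.cat([feat, cond_emb.expand(-1, -1, 32, 32)], dim=1)
print(fused.shape)  # torch.Size([2, 576, 32, 32]) — 64 + 512 channels
```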

 

 

Reverse Process

noise_pred = UNet(z_t, t)
L = E[||ε - noise_pred||²]
  • The UNet being trained takes z_t and t as input and predicts the noise ε mixed into z_t.
  • The gap between the prediction and the actual noise ε is reduced with an MSE loss.
  • By repeating this across many timesteps t, the model learns to look at any noisy z_t and predict exactly which noise it contains.

 

2.3 Decoding: latent → image

์ƒ์„ฑ์ด ๋๋‚œ latent zโ‚€๋Š” ๊ณ ์ •๋œ VAE Decoder๋ฅผ ํ†ตํ•ด ๋‹ค์‹œ ๊ณ ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€๋กœ ๋ณต์›๋œ๋‹ค.

z₀ → Decoder → x̂₀

์ด ๊ณผ์ •์„ ํ†ตํ•ด latent์—์„œ ์ž˜ ๋งŒ๋“ค์–ด์ง„ ๊ตฌ์กฐ๊ฐ€ ๋””์ฝ”๋”ฉ๋˜์–ด ํ”ฝ์…€ ๊ณต๊ฐ„์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ด๋ฏธ์ง€๋กœ ๋ฐ”๋€๋‹ค.

 

To summarize...

✅ LDM Training Flow

[Image x₀]
     |
     v
[VAE Encoder]
     |
     v
[latent z₀]
     |
     v
[Noise injection]
z_t = sqrt(ᾱ_t) * z₀ + sqrt(1-ᾱ_t) * ε
     |
     v
+---------------------+
|  UNet(z_t, t, cond) |
|  → noise_pred       |
+---------------------+
     |
     v
[MSE Loss]
 = || noise_pred - ε ||²
     |
     v
[Backpropagation]
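The whole flow above condenses into one training step. The single Conv2d here is a stand-in for the conditional UNet, and in a real run z₀ would come from the frozen VAE Encoder; both are assumptions made to keep the sketch self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One LDM training step on toy tensors.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

net = nn.Conv2d(4, 4, 3, padding=1)   # stand-in for UNet(z_t, t, cond)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

z0 = torch.randn(8, 4, 32, 32)        # latents (really: frozen VAE Encoder)
t = torch.randint(0, T, (8,))
eps = torch.randn_like(z0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps   # noise injection

noise_pred = net(zt)                  # a real UNet also takes t and cond
loss = F.mse_loss(noise_pred, eps)    # || noise_pred - ε ||²
opt.zero_grad()
loss.backward()
opt.step()
print(f"loss = {loss.item():.4f}")
```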

 

3. Inference & DDIM Sampling

์ƒ์„ฑ(Inference) ๋‹จ๊ณ„์—์„œ๋Š” ์™„์ „ํžˆ ๋žœ๋คํ•œ latent noise z_T์—์„œ ์‹œ์ž‘ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ํ•™์Šต๋œ UNet์„ ์ด์šฉํ•ด ์ ์ง„์ ์œผ๋กœ ๋…ธ์ด์ฆˆ๋ฅผ ์ œ๊ฑฐํ•˜๋ฉฐ z_{T-1}, z_{T-2}, ..., zโ‚€์œผ๋กœ ์ด๋™ํ•œ๋‹ค.

 

DDPM-style sampling is stochastic, reducing the noise a little at a time over hundreds to thousands of steps. LDM, in contrast, typically uses DDIM (Denoising Diffusion Implicit Models) sampling, which runs the same β schedule deterministically in far fewer steps. DDIM reinterprets the reverse process as a deterministic ODE rather than a stochastic process, removing or scaling the noise-sampling term, so that

  • ๋” ์ ์€ step์—์„œ๋„ ๊นจ๋—ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ๊ณ ,
  • ๋™์ผํ•œ z_T์—์„œ ํ•ญ์ƒ ๊ฐ™์€ xฬ‚โ‚€๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

์ฆ‰, DDIM์€ ํ•œ ์Šคํ…๋‹น ๋” ๋งŽ์€ ๋…ธ์ด์ฆˆ๋ฅผ ์ œ๊ฑฐํ•˜๋ฉฐ ์ด๋™ํ•˜๊ณ , η ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ†ตํ•ด sampling stochasticity๋ฅผ ์กฐ์ ˆํ•  ์ˆ˜๋„ ์žˆ๋‹ค. ์ด ๋•๋ถ„์— ๋ณดํ†ต 50~100 ์Šคํ… ์ •๋„๋งŒ์œผ๋กœ๋„ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ธ๋‹ค.
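One deterministic DDIM update (η = 0) can be sketched as below, following the η = 0 form of the DDIM update rule: first predict z₀ from the current z_t and the predicted noise, then jump directly to the earlier timestep. The timestep values are illustrative, as if 1000 training steps were skipped down to 50 sampling steps with stride 20.

```python
import torch

# Deterministic DDIM step: z_t -> z_{t_prev}, using the same alpha_bar
# cumulative product as in training.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddim_step(zt, eps_pred, t, t_prev):
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    # Predict z0 from z_t and the predicted noise, then re-noise to t_prev.
    z0_hat = (zt - (1 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()
    return ab_prev.sqrt() * z0_hat + (1 - ab_prev).sqrt() * eps_pred

zt = torch.randn(1, 4, 32, 32)
eps_pred = torch.randn_like(zt)   # would come from the trained UNet
z_prev = ddim_step(zt, eps_pred, t=980, t_prev=960)
print(z_prev.shape)
```

Because there is no noise term, running this step twice on the same inputs gives bit-identical results, which is exactly the determinism described above.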

 

To summarize...

✅ LDM Inference (Sampling) Flow

[Random latent z_T ~ N(0,I)]
     |
     v
for t = T...1:
    +---------------------+
    |  UNet(z_t, t, cond) |
    |  → noise_pred       |
    +---------------------+
     |
     v
    [Denoise step]
    z_{t-1} = f(z_t, noise_pred, t)

     (repeat)
     |
     v
[latent z₀]
     |
     v
[VAE Decoder]
     |
     v
[Image x̂₀ (sample)]

 

4. Text-to-Image (T2I) and LDM

 

Stable Diffusion์€ ์ด LDM ๊ตฌ์กฐ๋ฅผ ๊ทธ๋Œ€๋กœ ๊ฐ€์ ธ์™€, UNet์— text embedding์„ Cross-Attention์œผ๋กœ ์—ฐ๊ฒฐํ•ด ์กฐ๊ฑด๋ถ€ ์ƒ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ๋‹ค. ์ฆ‰, T2I๋Š” UNet์˜ ์ž…๋ ฅ์œผ๋กœ

(zโ‚œ, t, text_embedding)

์„ ์ฃผ์–ด,

  • zโ‚œ์™€ t๋ฅผ ๋ณด๊ณ  noise๋ฅผ ์˜ˆ์ธกํ•˜๋˜,
  • attention query-key-value์— text embedding์„ ๋„ฃ์–ด prompt์— ๋งž๊ฒŒ ์ด๋ฏธ์ง€๋ฅผ ๋งŒ๋“ค์–ด๊ฐ„๋‹ค.

๊ฒฐ๊ณผ์ ์œผ๋กœ “astronaut riding a horse” ๊ฐ™์€ ๋ฌธ์žฅ์„ ์ž…๋ ฅํ•˜๋ฉด, ์ด ์กฐ๊ฑด์— ๋งž๊ฒŒ latent ๊ณต๊ฐ„์—์„œ ๋…ธ์ด์ฆˆ๋ฅผ ์ œ๊ฑฐํ•ด๋‚˜๊ฐ€๋ฉด์„œ ์›ํ•˜๋Š” ์ด๋ฏธ์ง€๊ฐ€ ํƒ„์ƒํ•œ๋‹ค.


 

LDM์€ “ํ”ฝ์…€ ๊ณต๊ฐ„ ๋Œ€์‹  latent ๊ณต๊ฐ„์—์„œ diffusion์„ ์ˆ˜ํ–‰”ํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ์™€ ์†๋„๋ฅผ ํš๊ธฐ์ ์œผ๋กœ ๊ฐœ์„ ํ•œ DDPM์˜ ํ™•์žฅํŒ์ด๋ฉฐ,
์ด๋ฅผ ํ†ตํ•ด Stable Diffusion ๊ฐ™์€ ๊ณ ํ•ด์ƒ๋„ Text-to-Image ์ƒ์„ฑ์ด ๊ฐ€๋Šฅํ•ด์กŒ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ์ดํ›„ ControlNet, Inpainting, 3D NeRF reconstruction ๋“ฑ ๋‹ค์–‘ํ•œ ๋””ํ“จ์ „ ๊ธฐ๋ฐ˜ ๊ธฐ์ˆ ์˜ ํ‘œ์ค€์ด ๋˜์—ˆ์œผ๋ฉฐ, ์—ฌ์ „ํžˆ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ƒ์„ฑ(ํ…์ŠคํŠธ-์ด๋ฏธ์ง€-์˜ค๋””์˜ค) ๋ถ„์•ผ์—์„œ ํ™œ๋ฐœํžˆ ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ๋‹ค.

๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Generative AI' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[Gen AI] Diffusion Transformer (DiT) ์™„๋ฒฝ ์ดํ•ดํ•˜๊ธฐ!  (0) 2025.07.15
[Gen AI] Diffusion ๋ชจ๋ธ ์ƒ˜ํ”Œ๋ง & ํ•™์Šต ํŠธ๋ฆญ ์ •๋ฆฌ  (4) 2025.07.08
[Gen AI] Diffusion Model๊ณผ DDPM ๊ฐœ๋… ์„ค๋ช…  (0) 2025.03.31
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION  (0) 2025.03.23
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Zero-1-to-3: Zero-shot One Image to 3D Object | Single-view object reconstruction  (0) 2025.03.22
'๐Ÿ› Research/Generative AI' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [Gen AI] Diffusion Transformer (DiT) ์™„๋ฒฝ ์ดํ•ดํ•˜๊ธฐ!
  • [Gen AI] Diffusion ๋ชจ๋ธ ์ƒ˜ํ”Œ๋ง & ํ•™์Šต ํŠธ๋ฆญ ์ •๋ฆฌ
  • [Gen AI] Diffusion Model๊ณผ DDPM ๊ฐœ๋… ์„ค๋ช…
  • [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    CV DOODLE
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (203) N
      • ๐Ÿ“– Fundamentals (33)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (15)
        • NLP (2)
        • etc. (1)
      • ๐Ÿ› Research (68) N
        • Deep Learning (7)
        • Image Classification (2)
        • Detection & Segmentation (17)
        • OCR (7)
        • Multi-modal (4)
        • Generative AI (9) N
        • 3D Vision (3)
        • Material & Texture Recognit.. (8)
        • NLP & LLM (11)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (7)
        • Distributed Training (4)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (86)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • C++ (1)
        • etc. (6)
      • ๐Ÿ’ฌ ETC (3)
        • ์ฑ… ๋ฆฌ๋ทฐ (3)
  • ๋งํฌ

  • ์ธ๊ธฐ ๊ธ€

  • ํƒœ๊ทธ

    ํŒŒ์ด์ฌ
    diffusion
    generative ai
    ๋”ฅ๋Ÿฌ๋‹
    multi-modal
    ๊ฐ์ฒด ๊ฒ€์ถœ
    pytorch
    ml
    LLM
    ChatGPT
    segmentation
    object detection
    pandas
    ๋„์ปค
    AI
    material recognition
    OpenCV
    Text recognition
    OpenAI
    3D Vision
    airflow
    CNN
    ๊ฐ์ฒด๊ฒ€์ถœ
    ์ปดํ“จํ„ฐ๋น„์ „
    OCR
    deep learning
    Computer Vision
    Python
    ํ”„๋กฌํ”„ํŠธ์—”์ง€๋‹ˆ์–ด๋ง
    nlp
  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[Gen AI] LDM (Latent Diffusion Models) ๊ฐœ๋… ์„ค๋ช…
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”