[T2I] Back to Basics: Let Denoising Generative Models Denoise | Just image Transformers (JiT) Review

2025. 11. 29. 14:28 · Research/Image•Video Generation
๋ฐ˜์‘ํ˜•

1. Intro

Back to Basics: Let Denoising Generative Models Denoise (JiT) has recently been one of the hottest papers in the diffusion field. The core idea is very simple: a diffusion model is fundamentally a model that restores clean images, so why do most implementations predict only the noise (ε) or the velocity (v)? Starting from exactly this question, JiT reaches a very intuitive yet powerful conclusion: directly predicting the clean image (x) simply works better. In high-resolution pixel space the effect is especially dramatic.

1.1 ๋ฌธ์ œ์˜์‹: ์™œ x-prediction์ธ๊ฐ€?

Existing diffusion models mostly use ε-prediction or v-prediction. Both of these targets, however, are high-dimensional signals dominated by noise, and predicting them directly demands high model capacity.

 

Natural images x, in contrast, inherently live on a low-dimensional manifold (paper Fig. 1), so the amount of information the model must predict is far smaller. The paper argues very convincingly that when model capacity is limited, directly predicting x is therefore much more stable.

For example, a 32×32 patch of a 512×512 image is 3,072-dimensional, and handling that directly is very hard for the model. But x has clearer structure than noise and carries manifold structure, which makes it much easier for the model to learn.

 

1.2 JiT's Core Philosophy

JiT (Just image Transformers) has a clear philosophy:

  • Use a plain ViT as-is
  • No separate tokenizer, VAE, or perceptual loss needed
  • No latent space either; diffusion runs purely in pixel space
  • Fix the diffusion prediction target to x

In other words, it demonstrates that the combination of "a Transformer as-is + images as-is" is enough to perform high-resolution diffusion successfully.

 

2. Analyzing the Diffusion Prediction Space

๋…ผ๋ฌธ์—์„œ๋Š” x, ฯต, v ์„ธ ๊ฐ€์ง€๋ฅผ prediction target์œผ๋กœ ๋‘˜ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋ฅผ loss space์™€ ์กฐํ•ฉํ•˜๋ฉด ์ด 9๊ฐ€์ง€ ๊ฒฝ์šฐ๊ฐ€ ๋œ๋‹ค๊ณ  ์ •๋ฆฌํ•œ๋‹ค. (Table 1) ์„ธ ๊ฒฝ์šฐ๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ ์„ฑ๊ฒฉ์„ ๊ฐ€์ง„๋‹ค.

2.1 x-prediction

  • ๋ชจ๋ธ ์ถœ๋ ฅ์ด ์ง์ ‘ ํด๋ฆฐ ์ด๋ฏธ์ง€ ๋ณต์›
  • manifold ์ƒ์˜ ๊ตฌ์กฐ์ ์ธ ๋ฐ์ดํ„ฐ๋ฅผ ์˜ˆ์ธก → ํ•™์Šต ์šฉ์ด
  • 256×256 ์ด์ƒ ๊ณ ํ•ด์ƒ๋„์—์„œ๋„ ์•ˆ์ •์ ์œผ๋กœ ๋™์ž‘
  • ํŠนํžˆ high-dim pixel space์—์„œ ๋ชจ๋ธ capacity ์š”๊ตฌ๊ฐ€ ๊ฐ€์žฅ ๋‚ฎ์Œ

2.2 ฯต-prediction

  • ๋ชจ๋ธ์ด clean image๋ฅผ ์˜ˆ์ธกํ•˜์ง€ ์•Š๊ณ  ๋…ธ์ด์ฆˆ ฯต๋ฅผ ์ง์ ‘ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹
  • ๋…ธ์ด์ฆˆ๋ฅผ ์˜ˆ์ธกํ•ด์•ผ ํ•˜๋ฏ€๋กœ ๊ณ ์ฐจ์› ๊ณต๊ฐ„ ์ „์ฒด๋ฅผ modeling ํ•„์š”
  • latent space์—์„  ์ข‹์ง๋‚œ, ๊ณ ์ฐจ์› pixel-space์—์„œ๋Š” ๋ชจ๋ธ capacity๊ฐ€ ๋ถ€์กฑํ•˜๋ฉด catastrophic failure ๋ฐœ์ƒ
  • ์‹คํ—˜์—์„œ ์‹ค์ œ๋กœ FID 300 ์ด์ƒ์œผ๋กœ ๋ถ•๊ดด
  • DDPM, DDIM, Stable Diffusion(LDM), DiT ๋“ฑ ๋Œ€๋ถ€๋ถ„

2.3 v-prediction

  • Used by flow matching / rectified flow models
  • Still an off-manifold value mixing high-dimensional information with noise, so in pixel space it carries the same collapse risk as ε-prediction

Key observations

  • In 256×256 pixel space, ε/v prediction collapses completely while x-prediction works normally (Table 2(a))
  • At low resolutions like 64×64, the capacity issue is milder and all 9 combinations perform decently (Table 2(b))

This makes the answer to "why have pixel diffusion models depended on latent spaces until now?" clear: directly predicting ε/v is just too hard in high-dimensional space. JiT flips the target to x and shows experimentally that the problem then disappears.
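These three targets are all linear in one another. Under the interpolation convention implied by the v_pred formula later in this post (z_t = t·x + (1−t)·ε, pure noise at t=0), any one of x/ε/v determines the other two given (z_t, t). A minimal numpy sketch (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3072)    # clean "image" (one flattened patch)
eps = rng.normal(size=3072)  # Gaussian noise
t = 0.3

# Interpolation convention matching the post: z_0 is noise, z_1 is clean.
z_t = t * x + (1 - t) * eps
v = x - eps                  # ground-truth velocity dz/dt

# Given (z_t, t), each target determines the other two:
x_from_v = z_t + (1 - t) * v
v_from_x = (x - z_t) / (1 - t)
eps_from_x = (z_t - t * x) / (1 - t)

assert np.allclose(x_from_v, x)
assert np.allclose(v_from_x, v)
assert np.allclose(eps_from_x, eps)
```

This algebraic equivalence is why the target/loss grid of Table 1 is well-defined: any predicted target can be converted into any loss space, and the choice only starts to matter when model capacity is tight.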

 

3. JiT Architecture

JiT (Just image Transformers) is, as the name says, just an image Transformer. It boldly removes the complex components of existing diffusion systems (e.g., VAE, latent tokenizer, perceptual loss, multi-scale U-Net) and processes raw images (pixels) directly with a Transformer.

 

์‹ ๊ธฐํ•˜๊ฒŒ๋„(?) ์ด ๋‹จ์ˆœํ•œ ๊ตฌ์กฐ๊ฐ€ ๊ณ ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€ ์ƒ์„ฑ์—์„œ collapse ์—†์ด ์•ˆ์ •์ ์œผ๋กœ ์ž‘๋™ํ•œ๋‹ค.

 

3.1 Patchify → ViT → Patch Reconstruction

JiT๋Š” ์ด๋ฏธ์ง€๋ฅผ Vision Transformer(ViT)์ฒ˜๋Ÿผ ๊ณ ์ • ํฌ๊ธฐ ํŒจ์น˜(tokens)๋กœ ๋‚˜๋ˆ„์–ด ์ž…๋ ฅํ•œ๋‹ค. ์ด๋ฏธ์ง€ ํฌ๊ธฐ๊ฐ€ 512×512๋ผ๊ณ  ๊ฐ€์ •ํ•˜๋ฉด, ํŒจ์น˜ ์‚ฌ์ด์ฆˆ์— ๋”ฐ๋ผ ์ž…๋ ฅ ํ† ํฐ์˜ ํ˜•ํƒœ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

 

Patch Size  Patch Dim  # Tokens  Notes
16×16       768        1024      Many tokens → slow
32×32       3072       256       Balances compute efficiency and per-patch expressiveness
64×64       12288      64        Few tokens → fast, but very large token dimension

 

JiT์˜ ๊ธฐ๋ณธ ์„ค์ •์€ p=32๋กœ, ๊ฐ patch๋Š” 3072์ฐจ์›(=32×32×3)์ด๋ผ๋Š” ๋งค์šฐ ํฐ ๋ฒกํ„ฐ๋‹ค.

 

The important point: a typical Diffusion Transformer uses latents (4-8 channels) or CNN feature maps as tokens, whereas JiT uses raw pixel patches as tokens outright. No tokenizer is needed anymore; the image itself is the model's input token. What makes this possible, the paper argues, is the stability of x-prediction.
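Patchify itself is just a reshape. A minimal numpy sketch of the raw-pixel tokenization described above (the function name is mine):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, 3) image into (H//p * W//p, p*p*3) raw-pixel tokens."""
    H, W, C = img.shape
    return (img.reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)   # group the two patch-grid axes
               .reshape(-1, p * p * C))

img = np.arange(512 * 512 * 3, dtype=np.float32).reshape(512, 512, 3)
tokens = patchify(img, p=32)
print(tokens.shape)  # (256, 3072): 256 tokens, each a 3072-dim raw patch
```

Running the same function with p=16 or p=64 reproduces the other rows of the table above ((1024, 768) and (64, 12288) respectively).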

 

3.2 Bottleneck embedding

Feeding 3,072-dimensional patches into the Transformer as-is would cost a lot of memory and compute, so JiT uses a "bottleneck" structure in the patch-embedding stage.

3072 (raw patch)
   → 32 (bottleneck)
      → 768 (transformer hidden dim)

๋†€๋ž๊ฒŒ๋„ bottleneck ํฌ๊ธฐ๋ฅผ ๊ทน๋‹จ์ ์œผ๋กœ ์ค„์—ฌ๋„ ์„ฑ๋Šฅ์ด ๊ฑฐ์˜ ๋–จ์–ด์ง€์ง€ ์•Š๋Š”๋‹ค. (๋…ผ๋ฌธ Figure 4: d′=32๋กœ ์ค„์—ฌ๋„ ImageNet FID ๊ฐœ์„ )

 

๋…ผ๋ฌธ์—์„œ ๊ฐ•์กฐํ•˜๋“ฏ clean image x๋Š” ์›๋ž˜ low-dimensional manifold ์œ„์— ์žˆ๊ธฐ ๋•Œ๋ฌธ์— raw pixel ์ •๋ณด ์ „์ฒด๋ฅผ ๋ณด์กดํ•  ํ•„์š”๊ฐ€ ์—†๋‹ค. ์ฆ‰, patch์˜ ๋ชจ๋“  ๋””ํ…Œ์ผ์„ ์œ ์ง€ํ•  ํ•„์š”๊ฐ€ ์—†๊ณ  manifold ๊ตฌ์กฐ๋งŒ ์ž˜ ์ถ”์ถœํ•˜๋ฉด Transformer๊ฐ€ ์•ˆ์ •์ ์œผ๋กœ ๋ณต์›ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

This experiment shows that the old assumption that "pixel diffusion is impossible because of the patch dimension" was wrong.

 

 

3.3 Transformer Backbone — Plain ViT, But Diffusionized

After patch embedding, an almost vanilla ViT is used:

  • SwiGLU FFN
  • RMSNorm
  • qk-Norm
  • Rotary Positional Embedding (RoPE)
  • AdaLN-zero class conditioning

 

์ฆ‰, ๊ณ ๋„ํ™”๋œ U-Net ๊ตฌ์กฐ๋‚˜ latent ํŠนํ™” ๋ชจ๋“ˆ์ด ์•„๋‹ˆ๋ผ, ์‚ฌ์‹ค์ƒ ์ผ๋ฐ˜ ์–ธ์–ด ๋ชจ๋ธ/๋น„์ „ ๋ชจ๋ธ๊ณผ ๋™์ผํ•œ Transformer๋กœ diffusion์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์ด๋Š” ๋ชจ๋ธ ๊ตฌ์กฐ๊ฐ€ ํŠน์ • ๋„๋ฉ”์ธ์— ์ข…์†๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์˜๋ฏธ์ด๊ธฐ๋„ ํ•˜๋‹ค.

 

3.4 Output: The Model Always Predicts Clean Image Patches (x_pred)

Here the core of JiT appears: at every step the Transformer takes the noisy image z_t and directly predicts the patches of the clean image (x_pred).

x_pred = net(z_t, t)

 

๋ฌผ๋ก  ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ํ•œ ๋ฒˆ์— ์ด๋ฏธ์ง€๊ฐ€ ๋ณต์›๋˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๊ณ  x_pred๋Š” ์ตœ์ข… ๊ฒฐ๊ณผ๋ฌผ์ด ์•„๋‹ˆ๋ผ flow ๊ณ„์‚ฐ์„ ์œ„ํ•œ ์ค‘๊ฐ„ ์ถ”์ •์น˜๋‹ค.

 

3.5 v-loss-Based Flow Matching — x_pred → v_pred

The velocity is computed from the x_pred the Transformer predicted:

v_pred = (x_pred - z_t) / (1 - t)

It is then compared with the ground-truth v and trained with a v-loss.
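Putting 3.4 and 3.5 together, one training step can be sketched in numpy (the `net` here is a stand-in identity function, not the real ViT, and the interpolation z_t = t·x + (1−t)·ε is my assumption, chosen to match the v_pred formula above):

```python
import numpy as np

rng = np.random.default_rng(0)

def net(z_t, t):
    # Stand-in for the JiT Transformer: any function returning an
    # x-shaped prediction. Identity here, just to show the wiring.
    return z_t

x = rng.normal(size=(256, 3072))   # clean patches of one image
eps = rng.normal(size=x.shape)
t = 0.7

z_t = t * x + (1 - t) * eps        # noisy input (z_0 = noise convention)
v_target = x - eps                 # ground-truth velocity

x_pred = net(z_t, t)               # 1) predict with x
v_pred = (x_pred - z_t) / (1 - t)  # 2) convert x_pred -> v_pred
v_loss = np.mean((v_pred - v_target) ** 2)  # 3) train with v
```

The network output is always in x-space; only the loss is evaluated in v-space.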

 

Why this step matters:

  • x-prediction is stable in pixel space
  • the v-loss balances gradients
  • it connects naturally to flow-matching ODE sampling

 

In other words: "predict with x, train with v."

 

 

3.6 Sampling: multi-step ODE solver (Heun/Euler)

JiT is not a model that "predicts the clean image in one shot"; it still performs multi-step sampling.

 

The sampling procedure is:

  1. Generate the initial noise image z₀
  2. Patchify → embedding
  3. Predict x_pred(t) with the Transformer
  4. Compute v_pred from x_pred(t)
  5. Update z (Heun/Euler ODE step)
  6. Patchify again and repeat
  7. After about 50 steps, arrive at the final clean image
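The loop above with a plain Euler step, sketched in numpy (the trained Transformer is replaced by a stand-in that returns a fixed "clean" vector, and the patchify/embedding steps are elided):

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = rng.normal(size=3072)  # stands in for the "clean image"

def net(z_t, t):
    # Stand-in for the trained Transformer: a perfect x-prediction.
    return x_star

def sample(steps=50):
    z = rng.normal(size=3072)                 # 1. initial noise z_0
    ts = np.linspace(0.0, 1.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x_pred = net(z, t)                    # 3. predict clean patches
        v_pred = (x_pred - z) / (1.0 - t)     # 4. x_pred -> v_pred
        z = z + (t_next - t) * v_pred         # 5. Euler ODE step
    return z                                  # 7. final clean image

out = sample()
print(np.max(np.abs(out - x_star)))  # ~0: the ODE lands on x_star
```

A Heun sampler would add a second net evaluation per step to correct the Euler estimate; the structure is otherwise identical.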

 

4. Experimental Results

 

Figure 2 (the toy experiment) builds a 2-dimensional (2D) data distribution, maps it into high-dimensional spaces such as 256D, 1024D, and 4096D with a random projection matrix, and then trains models to recover the original distribution from the high-dimensional data using the three targets (x/ε/v).

 

The result is very intuitive:

  • x-prediction: recovers the original 2D manifold accurately no matter how large D grows
  • ε-prediction: the data collapses into a blob
  • v-prediction: nearly collapses in high dimensions

In other words, predicting the intrinsically low-dimensional structure (x) is easy, while directly predicting noise spread across the full high-dimensional space (ε/v) is very hard. This experiment intuitively backs the paper's central claim: pixel diffusion failed not because of compute, but because of the wrong prediction target.
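The setup is easy to reproduce in miniature (my construction, following the figure's description): embed 2D samples into R^D with a random matrix; the clean data x then spans only a 2D subspace, while ε spans all available dimensions, and v = x − ε inherits that full-dimensional spread:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 1024, 256
P = rng.normal(size=(D, 2))     # random linear embedding R^2 -> R^D

x2d = rng.normal(size=(n, 2))   # samples from the intrinsic 2D distribution
x = x2d @ P.T                   # clean data in R^1024: still only rank 2
eps = rng.normal(size=x.shape)  # the eps target occupies all D dimensions

# Effective dimensionality of each target via matrix rank:
print(np.linalg.matrix_rank(x), np.linalg.matrix_rank(eps))  # 2 256
```

No matter how large D gets, x stays rank 2, which is the toy-experiment version of "the information the model must predict is far smaller for x".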

 

 

The benchmark results show that JiT matches Latent Diffusion (DiT-XL/2) in FID with far fewer FLOPs. Reaching this level of performance in pixel space without a VAE is the key point, and it can be seen as one of the first cases of 1024×1024 pixel diffusion running without collapse.

 

 

Figure 7์„ ๋ณด๋ฉด x-prediction์ด v-prediction๋ณด๋‹ค loss๊ฐ€ ๋” ๋‚ฎ๊ณ  ์•ˆ์ •์ ์ธ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๊ณ , ์‹ค์ œ ๋ณต์› ๊ฒฐ๊ณผ๋„ x-prediction์ด t๊ฐ€ ๋‚ฎ์„ ๋•Œ ์กฐ๊ธˆ ๋” ํ’ˆ์งˆ์ด ์ข‹๋‹ค. 

 

 


The paper shows clearly that pixel diffusion was hard not because of patch dimension or compute, but because it predicted off-manifold, high-dimensional targets like ε/v. In other words, the limitation of pixel diffusion was a matter of design choice, not a structural one.

 

A few caveats remain: the experiments are only at ImageNet scale, so there are no T2I or multi-modal conditioning results, and GPU-memory efficiency is likely worse than LDM-style models. Still, if this direction is right, follow-up research seems bound to come.

๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Imageโ€ขVideo Generation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[Video Gen] HunyuanVideo:A Systematic Framework For Large Video Generative Models ๋ฆฌ๋ทฐ  (0) 2025.11.30
[Omni] OmniGen2: Exploration to Advanced Multimodal Generation | ํ†ตํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ƒ์„ฑ ๋ชจ๋ธ  (1) 2025.11.30
[Gen AI] T2I & TI2I ๋ฐ์ดํ„ฐ์…‹ ๋ฐ ๋ฒค์น˜๋งˆํฌ ์ •๋ฆฌ | ์ด๋ฏธ์ง€ ์ƒ์„ฑ & ํŽธ์ง‘ ๋ฐ์ดํ„ฐ์…‹  (0) 2025.11.01
[Gen AI] BAGEL: Unified Multimodal Design - ์ดํ•ด์™€ ์ƒ์„ฑ์˜ ํ†ตํ•ฉ ๊ตฌ์กฐ  (0) 2025.10.31
[Gen AI] Qwen-Image ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ถ„์„ | T2I, TI2I | ์ด๋ฏธ์ง€ ์ƒ์„ฑ ํŽธ์ง‘ ๋ชจ๋ธ  (0) 2025.09.15
'๐Ÿ› Research/Image•Video Generation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [Video Gen] HunyuanVideo:A Systematic Framework For Large Video Generative Models ๋ฆฌ๋ทฐ
  • [Omni] OmniGen2: Exploration to Advanced Multimodal Generation | ํ†ตํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ƒ์„ฑ ๋ชจ๋ธ
  • [Gen AI] T2I & TI2I ๋ฐ์ดํ„ฐ์…‹ ๋ฐ ๋ฒค์น˜๋งˆํฌ ์ •๋ฆฌ | ์ด๋ฏธ์ง€ ์ƒ์„ฑ & ํŽธ์ง‘ ๋ฐ์ดํ„ฐ์…‹
  • [Gen AI] BAGEL: Unified Multimodal Design - ์ดํ•ด์™€ ์ƒ์„ฑ์˜ ํ†ตํ•ฉ ๊ตฌ์กฐ
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    moovzi’s Doodle
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (213)
      • ๐Ÿ“– Fundamentals (34)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (16)
        • NLP (2)
        • etc. (1)
      • ๐Ÿ› Research (75)
        • Deep Learning (7)
        • Perception (19)
        • OCR (7)
        • Multi-modal (5)
        • Image•Video Generation (18)
        • 3D Vision (4)
        • Material • Texture Recognit.. (8)
        • Large-scale Model (7)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (8)
        • Distributed Training & Infe.. (5)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (92)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Distributed Computing (6)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • C++ (1)
        • etc. (6)
      • ๐Ÿ’ฌ ETC (4)
        • ์ฑ… ๋ฆฌ๋ทฐ (4)
  • ๋งํฌ

    • ๋ฆฌํ‹€๋ฆฌ ํ”„๋กœํ•„ (๋ฉ˜ํ† ๋ง, ๋ฉด์ ‘์ฑ…,...)
    • ใ€Ž๋‚˜๋Š” AI ์—”์ง€๋‹ˆ์–ด์ž…๋‹ˆ๋‹คใ€
    • Instagram
    • Brunch
    • Github
  • ์ธ๊ธฐ ๊ธ€

  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[T2I] Back to Basics: Let Denoising Generative Models Denoise | Just image Transformers (JiT) ๋ฆฌ๋ทฐ
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”