[Video Gen] HunyuanVideo: A Systematic Framework For Large Video Generative Models Review

2025. 11. 30. 22:25 · 🏛 Research/Image•Video Generation
๋ฐ˜์‘ํ˜•

์˜ค๋Š˜์€ Video Generation ๋ถ„์•ผ์—์„œ ์œ ๋ช…ํ•œ HunyanVideo ์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด๊ณ ์ž ํ•œ๋‹ค.

 

1. Overview

 

HunyuanVideo is a large-scale open-source text-to-video model: a multi-scale video generation system designed to achieve high resolution, scene consistency, natural motion, and prompt fidelity all at once. The paper sets three core goals.

  1. High-resolution, high-quality video generation (1080p+)
  2. Stronger scene and temporal consistency
  3. Stronger composition of complex scenes and fine-grained detail

To get there, rather than extending an image-based diffusion model, it adopts a dedicated Temporal-DiT architecture that handles video natively.

 

2. Data Pre-processing

To build a large-scale, high-quality video generation model, HunyuanVideo designs its video preprocessing pipeline very strictly. Instead of simply scraping web videos and training on them, it runs a data-curation process as rigorous as those used for image generation models. Preprocessing consists of five broad stages.

2.1 Clip Segmentation

  • Web videos are cut into clips of a fixed length range (e.g. 2–8 seconds).
  • Clips that are too short (very brief static shots) or too long are discarded.

2.2 Quality Filtering

A variety of VLM-based filters remove clips that would not help training:

  • Low resolution or heavy noise
  • Excessive camera shake
  • Nearly static frames (motionless)
  • Uninformative background footage
  • Heavy subtitles or watermarks

์ „์ฒ˜๋ฆฌ ํ•„ํ„ฐ๋Š” Qwen-VL, CLIP scoring, motion scoring ๋“ฑ์„ ์กฐํ•ฉํ•ด ๊ตฌ์„ฑ๋œ๋‹ค.

2.3 Motion & Dynamics Filtering

Motion information is the core of a video generation model, so videos that are too static contribute little to training.

  • Motion intensity is measured with optical flow.
  • Overly static clips are excluded.
  • Clips that move so fast they suffer heavy motion blur are excluded as well.
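A rough sketch of such a filter (pure Python, using mean absolute frame difference as a crude stand-in for the paper's optical-flow score; the thresholds are made up for illustration):

```python
def motion_intensity(frames):
    """Mean absolute difference between consecutive frames.

    `frames` is a list of equally sized 2D luminance grids
    (lists of lists of floats). A crude stand-in for optical-flow
    magnitude: 0.0 means perfectly static.
    """
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(
            abs(c - p)
            for row_p, row_c in zip(prev, cur)
            for p, c in zip(row_p, row_c)
        )
        n_pixels = len(prev) * len(prev[0])
        diffs.append(total / n_pixels)
    return sum(diffs) / len(diffs)


def keep_clip(frames, lo=0.01, hi=0.5):
    """Drop clips that are nearly static (score < lo) or so fast
    they are likely motion-blurred (score > hi). Thresholds are
    illustrative, not from the paper."""
    score = motion_intensity(frames)
    return lo <= score <= hi
```

In the real pipeline the score would come from an optical-flow model and the thresholds would be tuned on the corpus; the keep/drop logic stays the same shape.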

2.4 Scene Stability

Clips whose content jumps abruptly (scene cuts, jump cuts) interfere with learning temporal consistency.

  • A shot-detection algorithm locates cut positions.
  • Clips containing multiple cuts are removed.

2.5 MLLM Captioning

This stage matters a great deal: a video generation model only trains stably when the text conditioning is accurate.

  • A Qwen2-VL-based MLLM generates detailed video captions.
  • A system prompt instructs it to describe scene context, objects, actions, and even the camera viewpoint, e.g.:

"A brown dog running along a beach while the camera slowly follows from behind. Waves move softly in the background."

 

High-quality captions like this maximize prompt fidelity.

 

 

3. Model Architecture

HunyuanVideo์˜ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜๋Š” ํฌ๊ฒŒ ์„ธ ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋‰œ๋‹ค.

  1. Causal 3D VAE: compresses videos/images into a spatiotemporal latent space
  2. Diffusion Backbone (Video DiT): a Transformer that generates video in the 3D latent space
  3. Text Encoder (MLLM + CLIP): produces rich text conditioning

Together, these three modules form an end-to-end generation pipeline: text → 3D latent video → pixel video.

 

3.1 Causal 3D VAE

Many video models first train an image VAE and then graft a temporal axis onto it; HunyuanVideo instead trains a dedicated video 3D VAE from scratch. The module serves two purposes:

  • Convert high-resolution video into a latent space compressed along both space and time
  • Drastically reduce the number of tokens the Diffusion Transformer must process, making training and inference tractable

 

 

Compression scheme

Given an input video of size (T+1)×3×H×W, the 3D VAE applies CausalConv3D repeatedly and outputs a latent of the following shape:

  • Output shape: ((T/ct)+1) × C × (H/cs) × (W/cs)
  • Paper defaults: ct = 4, cs = 8, C = 16

In other words, the time axis shrinks 4×, the spatial axes shrink 8×, and the channel count expands to 16, yielding a spatiotemporal latent. Even a 1080p video therefore becomes far smaller in latent space.
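The shape arithmetic above can be sanity-checked with a tiny helper (the stride and channel values are the paper defaults just quoted):

```python
def latent_shape(n_frames, height, width, ct=4, cs=8, c_latent=16):
    """Map a (T+1) x 3 x H x W video to its 3D-VAE latent shape
    ((T/ct)+1) x C x (H/cs) x (W/cs), using ct=4, cs=8, C=16."""
    t = n_frames - 1  # n_frames = T + 1
    assert t % ct == 0 and height % cs == 0 and width % cs == 0
    return (t // ct + 1, c_latent, height // cs, width // cs)

# e.g. a 129-frame 1080x1920 clip compresses to a 33 x 16 x 135 x 240 latent
```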

 

Why CausalConv3D?

  • It encodes video efficiently while preserving temporal order.
  • The causal structure ("the current frame sees only past frames") also positions the model well for future autoregressive extensions or online processing.
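As a toy illustration of the causal idea, here is a 1D causal convolution along the time axis in pure Python. The real CausalConv3D also convolves height and width; this sketch only shows the left-padding trick that makes each output depend solely on past frames:

```python
def causal_temporal_conv(frames, kernel):
    """1D causal convolution along time: output[t] depends only on
    frames[max(0, t-k+1) .. t]. Left-pads the past with zeros so the
    sequence length is preserved; the future is never padded or read."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]
```

Causality is easy to verify: editing a future frame leaves all earlier outputs unchanged, which is exactly the property that makes autoregressive extension possible.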

Training details

  • Images and videos are mixed at a 4:1 ratio so both static images and dynamic video reconstruct well.
  • Beyond L1 and KL losses, LPIPS (perceptual) and GAN losses are used to raise visual quality.
  • Frame intervals are randomly sampled to improve reconstruction of fast-motion video.

The trained 3D VAE then serves as the compressed representation space that the diffusion stage maps videos in and out of.

 

3.2 Unified Image & Video Diffusion Backbone (Video DiT)

HunyuanVideo์˜ ํ•ต์‹ฌ์€ ์ด๋ฏธ์ง€์™€ ๋น„๋””์˜ค๋ฅผ ํ•˜๋‚˜์˜ ํ†ตํ•ฉ Transformer(DiT)๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋ฅผ "Unified Image and Video Generative Architecture"๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

 

Input composition

  1. Video/image latents
    • Latents produced by the 3D VAE: T×C×H×W
    • An image is treated as a "one-frame video" and handled in the same format.
  2. Text conditioning (hidden states)
    • A token sequence obtained by encoding the text with an MLLM (Hunyuan-Large family)
    • A global text embedding (the final token) from a CLIP text encoder is used alongside it to reinforce global semantics.
  3. Noise and timestep information
    • Since this is Rectified Flow-based diffusion, conditioning on the timestep t is included.

3D Patchification

The video latent passes through a 3D Conv (kernel size kt×kh×kw) that converts it into spatiotemporal patch tokens.

  • Token count: (T/kt) × (H/kh) × (W/kw)
  • Each token is flattened into a (kt×kh×kw×C)-dimensional vector.

The entire video thus unrolls into a single 1D token sequence that is fed to the Transformer.
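To get a feel for the token budget, the two formulas above can be evaluated directly. The 1×2×2 kernel and the 33×16×90×160 latent (roughly what a 129-frame 720p clip maps to under Section 3.1's strides) are illustrative choices, not values confirmed by the paper:

```python
def patchify_stats(t, h, w, c, kt=1, kh=2, kw=2):
    """Token count and flattened per-token dimension for a
    T x C x H x W latent patchified with a kt x kh x kw kernel.
    The 1x2x2 kernel here is an illustrative choice."""
    assert t % kt == 0 and h % kh == 0 and w % kw == 0
    n_tokens = (t // kt) * (h // kh) * (w // kw)
    token_dim = kt * kh * kw * c
    return n_tokens, token_dim

# a 33 x 16 x 90 x 160 latent -> 118800 tokens of dimension 64
```

Even after 4×/8× VAE compression, full attention over ~10^5 tokens is expensive, which is why the VAE compression ratios matter so much.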

 

3.3 Full Spatio-temporal Attention + Dual/Single-Stream Structure

Earlier video diffusion models (Open-Sora, Imagen Video, MagicVideo, etc.) mostly used one of two structures:

  • A 2D spatial U-Net plus 1D temporal blocks
  • Or factorized attention that separates space and time (2D + 1D)

These designs save compute, but they struggle to model spatial-temporal interactions fully, so scene and camera-motion consistency suffers.

 

HunyuanVideo extends a dual-stream attention block similar to the one used in FLUX into the video domain, and uses a full-attention Transformer that fuses space and time completely.

 

Dual-stream → Single-stream

  • Dual-stream phase
    • Video tokens and text tokens are processed in separate streams.
    • Each modality learns its own representation without interference.
  • Single-stream phase
    • The two token streams are then concatenated and fed through one Transformer for multimodal joint attention.
    • This is where the text conditioning and the video latents fuse deeply.

Thanks to this structure, vision and text representations first train stably without interfering with each other, and the later layers can capture complex multimodal interactions.
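A caricature of the two phases in pure Python: each modality first runs self-attention within its own stream, then the token lists are concatenated so attention runs jointly. The tiny dimensions, and the use of raw tokens as Q/K/V with no learned projections, are toy simplifications, not the paper's blocks:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(tokens):
    """Plain self-attention with Q = K = V = tokens, over however
    many tokens are passed in (no projections, single head)."""
    out = []
    for q in tokens:
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in tokens])
        out.append([
            sum(w * v[d] for w, v in zip(scores, tokens))
            for d in range(len(q))
        ])
    return out

def dual_then_single(video_tokens, text_tokens):
    # Dual-stream phase: each modality attends only within itself.
    video_tokens = attention(video_tokens)
    text_tokens = attention(text_tokens)
    # Single-stream phase: concatenate and attend jointly,
    # so text and video tokens mix directly.
    return attention(video_tokens + text_tokens)
```

In the real model each phase has its own learned projections, norms, and modulation; the point here is only the stream topology (separate, then concatenated).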

 

Properties of full spatio-temporal attention

  • Self-attention between tokens spans space and time in full, with no factorization in between.
  • An object in one frame can attend directly to the same object in other frames.
  • As a result, shape preservation, motion consistency, and camera-movement rendering all surpass the earlier 2D+1D designs.

 

3.4 3D RoPE: Positional Embedding Encoding Space and Time Together

Video needs positional information along three axes: time T, height H, and width W. HunyuanVideo handles this with a 3D extension of RoPE (Rotary Position Embedding).

  • RoPE is computed separately for the time T, height H, and width W axes.
  • The query/key channels are split into three chunks (dt, dh, dw), and each chunk is multiplied by the RoPE of its corresponding axis.
  • The chunks are then concatenated back into the final query/key before attention is computed.
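The per-axis split can be sketched in a few lines of pure Python (tiny per-axis dimensions chosen for illustration). The key property carried over from 1D RoPE is that the q·k dot product depends only on relative positions along each axis:

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Standard RoPE on one axis: rotate channel pairs of `vec`
    by angles pos * base**(-2i/d) for pair index i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_3d(vec, t, h, w, dims=(4, 4, 4)):
    """Split the channels into (dt, dh, dw) chunks, rotate each chunk
    with the position of its own axis, then concatenate."""
    dt, dh, dw = dims
    a, b, c = vec[:dt], vec[dt:dt + dh], vec[dt + dh:]
    return rope_1d(a, t) + rope_1d(b, h) + rope_1d(c, w)
```

Shifting both the query's and the key's positions by the same (Δt, Δh, Δw) leaves q·k unchanged, which is exactly what lets one model cover many resolutions, aspect ratios, and durations.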

This scheme brings several advantages.

  • One model can handle videos of varied resolutions, aspect ratios, and lengths.
  • Better extrapolation along the time axis, generalizing to longer videos.
  • More natural learning of spatial-temporal relationships.

 

3.5 Text Encoder: MLLM + Bidirectional Refiner + CLIP

Instead of a plain CLIP text encoder or a T5-family model, HunyuanVideo uses a multimodal LLM (MLLM) that has undergone visual instruction tuning. The reasons for this choice:

  1. An MLLM describes visual context in fine detail, so it richly represents the camera shots, scene transitions, and mood that video prompts frequently contain.
  2. A system prompt can steer it toward a model-friendly prompt style, yielding text embeddings the diffusion backbone understands easily.
  3. Being causal-attention based, it meshes well with autoregressive prompt processing.

From the diffusion model's perspective, however, bidirectional text representations are preferable, so the paper adds a Token Refiner block that reshapes the MLLM's causal features into bidirectional form. On top of that, the final token of a CLIP-Large text encoder is injected as global guidance, combining fine-grained MLLM features with a global CLIP feature.

 

3.6 How It Differs from Other Video Generation Models

To summarize, the HunyuanVideo architecture differs from existing open-source video models in the following ways.

  1. It trains a video-specific 3D VAE from scratch.
    • Where many models reuse an image VAE, it uses a video-optimized CausalConv3D VAE that handles high-resolution, high-frame-rate video directly in latent space.
  2. It is a unified Transformer using full spatiotemporal attention.
    • Instead of 2D + 1D factorized attention, full attention over space and time together yields stronger motion and scene consistency.
  3. It fuses modalities via a dual-stream → single-stream design.
    • Text and vision representations first train separately for stability, then fuse deeply in a single stream.
  4. It uses an MLLM as the text encoder.
    • It captures fine scene, camera, and style descriptions better than CLIP/T5-based encoders, supplemented by a token refiner plus a CLIP global feature.

Thanks to this design, HunyuanVideo is a unified generative backbone that covers both images and videos, while also being optimized for high-resolution, long-duration, high-quality video generation.

 

3.7 Model Scaling

 

From the Loss-vs-Compute (FLOPs) relationship of the DiT-T2X model family (image and video generation models), Figure 10 derives two main scaling laws.

 

(1) Compute C ↔ Model Parameters N scaling law

(2) Compute C ↔ Dataset Tokens D scaling law
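Written out, the two laws take the familiar power-law form. The symbols below, and the C ≈ 6ND FLOPs heuristic, are the generic forms used in LLM scaling-law work, not values from the paper, which fits its own constants and exponents to the Figure 10 frontier:

```latex
N_{\mathrm{opt}}(C) = A \cdot C^{\alpha}, \qquad
D_{\mathrm{opt}}(C) = B \cdot C^{\beta}, \qquad
C \approx 6\,N\,D
```

Here N is parameter count, D is the number of training tokens, and C is total compute; fitting α and β to the loss-vs-FLOPs curves is what lets the paper pick model and data sizes numerically.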

 

In short, large video generation models keep improving as they scale up, and the optimal model and dataset sizes can be determined numerically.

 

 

4. Training


HunyuanVideo uses several dedicated strategies for training on large-scale video data.

4.1 Multi-Stage Training

  1. Stage 1: Image Pretraining
    • 2D image data builds spatial understanding
    • Stabilizes the early phase via an Image-DiT
  2. Stage 2: Low-Res Video
    • Trains mainly on 256p/360p short clips
    • Stabilizes temporal attention
  3. Stage 3: High-Res Video
    • Extends to 1080p and up to 4K
    • Learns long-range temporal memory

This three-stage schedule is what lets the model achieve compute efficiency and video quality at the same time.

 

4.2 Balanced Training 

Video prompts are far more complex than image prompts, so the LLaMA-based text encoder is specifically tuned to strengthen text understanding.

When text quality is poor, the following problems appear:

  • Camera instructions are not followed
  • Motion descriptions get dropped
  • Character consistency degrades

To address this, the text and video parts are trained separately while the learning-rate schedules are adjusted so that cross-attention stays balanced.

 

4.3 Long-Video Training

Typical video models are limited to clips of 2–5 seconds, while HunyuanVideo also handles 10–30 second videos.

It does so with the following techniques:

  • long-context temporal attention
  • video chunking
  • temporal hierarchical encoding

 

5. Sampling

HunyuanVideo uses Rectified Flow-based sampling and generates video as follows.

  1. The text encoder produces the prompt embedding
  2. A noise video latent is prepared
  3. The Video-DiT performs iterative denoising
  4. The finished latent is decoded by the Video-VAE decoder
  5. The final video is produced
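Steps 2–4 amount to integrating a learned velocity field from noise (t = 1) down to data (t = 0). A minimal Euler-integration sketch, where the trained Video-DiT is replaced by an oracle that returns the true straight-line velocity toward a known target:

```python
def sample_rectified_flow(model_v, x_noise, n_steps=8):
    """Euler-integrate dx/dt = v(x, t) from t=1 (noise) down to
    t=0 (data): the rectified-flow sampling loop in its simplest form."""
    x = list(x_noise)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        v = model_v(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(target):
    """For straight paths x_t = (1 - t) * x0 + t * eps, the true
    velocity is v = (x_t - x0) / t. A stand-in for the trained DiT."""
    def v(x, t):
        return [(xi - x0i) / t for xi, x0i in zip(x, target)]
    return v
```

Because rectified-flow paths are straight, the oracle case converges exactly; with a real network the predicted velocity is approximate, and more Euler steps trade compute for fidelity.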

Because sampling operates on spatiotemporal latents rather than individual frames, motion consistency is excellent.

 

6. Experimental Results

 

HunyuanVideo posts strong results across a range of benchmarks:

  • Excellent high-resolution (1080p+) quality
  • Improved character consistency and camera-motion consistency
  • Fewer artifacts even in complex scenes
  • Capable of long-duration video generation

Actual samples likewise show stable motion flow and strong detail reproduction.

 

 

 


HunyuanVideo is a representative study that clearly shows how a large-scale video generation model can be designed and scaled stably. Its components (the 3D VAE, full spatiotemporal attention, and MLLM-based text conditioning) combine organically to enable high-resolution, long-duration, consistent video generation. It stands as a meaningful demonstration that video generation models, like image and language models, can grow into full-fledged foundation models.

๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Imageโ€ขVideo Generation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[T2V] Goku: Flow Based Video Generative Foundation Models ๋ฆฌ๋ทฐ  (0) 2026.01.04
[Omni] OmniGen2: Exploration to Advanced Multimodal Generation | ํ†ตํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ƒ์„ฑ ๋ชจ๋ธ  (1) 2025.11.30
[T2I] Back to Basics: Let Denoising Generative Models Denoise | Just image Transformers (JiT) ๋ฆฌ๋ทฐ  (0) 2025.11.29
[Gen AI] T2I & TI2I ๋ฐ์ดํ„ฐ์…‹ ๋ฐ ๋ฒค์น˜๋งˆํฌ ์ •๋ฆฌ | ์ด๋ฏธ์ง€ ์ƒ์„ฑ & ํŽธ์ง‘ ๋ฐ์ดํ„ฐ์…‹  (0) 2025.11.01
[Gen AI] BAGEL: Unified Multimodal Design - ์ดํ•ด์™€ ์ƒ์„ฑ์˜ ํ†ตํ•ฉ ๊ตฌ์กฐ  (0) 2025.10.31
'๐Ÿ› Research/Image•Video Generation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [T2V] Goku: Flow Based Video Generative Foundation Models ๋ฆฌ๋ทฐ
  • [Omni] OmniGen2: Exploration to Advanced Multimodal Generation | ํ†ตํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ƒ์„ฑ ๋ชจ๋ธ
  • [T2I] Back to Basics: Let Denoising Generative Models Denoise | Just image Transformers (JiT) ๋ฆฌ๋ทฐ
  • [Gen AI] T2I & TI2I ๋ฐ์ดํ„ฐ์…‹ ๋ฐ ๋ฒค์น˜๋งˆํฌ ์ •๋ฆฌ | ์ด๋ฏธ์ง€ ์ƒ์„ฑ & ํŽธ์ง‘ ๋ฐ์ดํ„ฐ์…‹
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    moovzi’s Doodle
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (213)
      • ๐Ÿ“– Fundamentals (34)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (16)
        • NLP (2)
        • etc. (1)
      • ๐Ÿ› Research (75)
        • Deep Learning (7)
        • Perception (19)
        • OCR (7)
        • Multi-modal (5)
        • Image•Video Generation (18)
        • 3D Vision (4)
        • Material • Texture Recognit.. (8)
        • Large-scale Model (7)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (8)
        • Distributed Training & Infe.. (5)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (92)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Distributed Computing (6)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • C++ (1)
        • etc. (6)
      • ๐Ÿ’ฌ ETC (4)
        • ์ฑ… ๋ฆฌ๋ทฐ (4)
  • ๋งํฌ

    • ๋ฆฌํ‹€๋ฆฌ ํ”„๋กœํ•„ (๋ฉ˜ํ† ๋ง, ๋ฉด์ ‘์ฑ…,...)
    • ใ€Ž๋‚˜๋Š” AI ์—”์ง€๋‹ˆ์–ด์ž…๋‹ˆ๋‹คใ€
    • Instagram
    • Brunch
    • Github
  • ์ธ๊ธฐ ๊ธ€

  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[Video Gen] HunyuanVideo:A Systematic Framework For Large Video Generative Models ๋ฆฌ๋ทฐ
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”