[T2V] Goku: Flow Based Video Generative Foundation Models (Review)

2026. 1. 4. 21:43 · 🏛 Research/Image•Video Generation
๋ฐ˜์‘ํ˜•

1. Intro

Goku is less a purely academic paper than a design proposal that lays out, in one place, every component needed to actually train and operate a production-level joint image/video generative foundation model: tokenizer, architecture, data curation, and the distributed training system. The paper frames the bottlenecks of video generation in three parts.

  • Representation bottleneck: with the added temporal axis, the ways a scene can change (scene transitions, camera motion, action dynamics) become drastically more complex.
  • Data bottleneck: large-scale video-text pairs suffer heavily from noise, watermarks, low-quality samples, and distribution bias; as a result, curation quality ends up determining model performance.
  • System bottleneck: video tokens produce very long sequences, so full-attention training effectively requires large-scale distributed training infrastructure: sequence parallelism, sharding, checkpointing, and fault tolerance.

 

2. Goku: Generative Flow Models for Visual Creation

Goku consists of (1) an Image-Video Joint VAE tokenizer, (2) a text encoder (Flan-T5), and (3) a Rectified Flow (RF) based video/image Transformer.

                         ┌───────────────────────────────┐
                         │          Text Prompt          │
                         └───────────────┬───────────────┘
                                         │
                                         ▼
                              ┌────────────────────┐
                              │   Flan-T5 Encoder  │
                              │  (text embeddings) │
                              └─────────┬──────────┘
                                        │  (used as cross-attn cond)
                                        ▼


============================= TRAINING (Rectified Flow) =============================

   ┌───────────────────────┐
   │  Image / Video Pixels │
   └───────────┬───────────┘
               │
               ▼
   ┌───────────────────────────────┐
   │ Image-Video Joint VAE Encoder │
   │  - image stride: 8×8          │
   │  - video stride: 8×8×4        │
   └───────────┬───────────────────┘
               │
               ▼
        latent target x1
               │
               ├───────────────┐
               │               │
               │        sample x0 ~ N(0, I)
               │               │
               └───────┬───────┘
                       ▼
        RF interpolation (t ∈ [0,1]):
        x_t = t·x1 + (1-t)·x0
                       │
                       ▼
   ┌───────────────────────────────────────────────┐
   │               Goku Transformer                │
   │  - full attention + 3D RoPE                   │
   │  - Patch n' Pack (sequence packing)           │
   │  - QK-Norm, adaLN-Zero (t-conditioning)       │
   │  - cross-attention to Flan-T5 embeddings      │
   └───────────┬───────────────────────────────────┘
               │
               ▼
        predict velocity  v̂(x_t, t, text)
               │
               ▼
        loss:  || v̂ - v ||²   (velocity regression in latent space)


============================== INFERENCE (Sampling) ================================

   sample latent x0 ~ N(0, I)
               │
               ▼
   integrate / sample in latent space (ODE solve / RF sampling)
   with text conditioning via Flan-T5 (cross-attn)
               │
               ▼
        generated latent x1*
               │
               ▼
   ┌───────────────────────────────┐
   │ Image-Video Joint VAE Decoder │
   └───────────┬───────────────────┘
               │
               ▼
        generated image / video pixels

 

2.1 Tokenizer: Image-Video Joint VAE

In Transformer-based generation, cost is determined by token count. Goku uses an Image-Video Joint VAE that compresses images and videos into a shared latent space, unifying T2I/T2V/I2V in a single framework.

 

Compression Spec (Stride)

  • Image: spatial stride 8×8
  • Video: spatial-temporal stride 8×8×4

That is, video is downsampled along the temporal axis as well, keeping the token count (and hence attention cost) under control.
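As a quick sanity check on what these strides mean for sequence length, here is a small sketch (my own arithmetic, computed before any Transformer-side patchification the model may apply on top of the latent grid):

import torch  # not strictly needed here; kept for consistency with later sketches

def latent_grid(frames: int, height: int, width: int, video: bool = True):
    # Video stride 8×8×4 -> (T/4, H/8, W/8); image stride 8×8 -> (1, H/8, W/8).
    t = frames // 4 if video else 1
    return t, height // 8, width // 8

t, h, w = latent_grid(frames=128, height=720, width=1280)  # hypothetical clip
print(t * h * w)  # 460,800 latent positions: why temporal compression matters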

 

Design Interpretation

  • The joint VAE sets the upper bound on generation quality.
  • Temporal-stride (×4) compression can sacrifice motion detail, so it is later compensated by the elements below.
    • (a) full-attention temporal modeling that learns long-range temporal dependencies directly
    • (b) motion-score-based filtering that secures a motion distribution that is easier to learn
    • (c) injecting the motion score into captions to strengthen the motion-control signal in the text condition

2.2 Model Architecture

2.2.1 Block Composition

Goku uses conditional Transformer blocks composed of the following (a minimal sketch follows the list).

  • Self-Attention: interaction among latent tokens
  • Cross-Attention: injects the text condition (Flan-T5 embeddings)
  • FFN
  • adaLN-Zero: stable conditioning on t (the timestep)
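A minimal PyTorch sketch of such a block. Names and the exact modulation layout are my assumptions; the paper only lists the components, and cross-attention is left unmodulated here, a common choice the paper does not spell out.

import torch
import torch.nn as nn

class GokuBlockSketch(nn.Module):
    # Self-attn -> cross-attn -> FFN, with adaLN-Zero modulation from the
    # timestep embedding t_emb.
    def __init__(self, d: int, heads: int, d_text: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d, elementwise_affine=False)
        self.xattn = nn.MultiheadAttention(d, heads, kdim=d_text, vdim=d_text,
                                           batch_first=True)
        self.norm3 = nn.LayerNorm(d, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ada = nn.Linear(d, 6 * d)        # regress scale/shift/gate pairs from t
        nn.init.zeros_(self.ada.weight)       # "Zero": residual branches start as identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, t_emb, text_emb):
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        x = x + self.xattn(self.norm2(x), text_emb, text_emb, need_weights=False)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.ffn(h)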

2.2.2 What Choosing Full Attention Means

Video Transformers often decompose or factorize temporal/spatial attention for cost reasons. Goku adopts plain full attention to model motion and long-range dependencies strongly.

  • Benefit: motion evolution, camera movement, and object interactions within a scene can be learned without simplification.
  • Cost: the sequence length explodes, so FlashAttention + (SP/FSDP) + activation checkpointing are presupposed.

2.2.3 3D RoPE (Position Encoding)

3D RoPE (space + time) is applied to image/video tokens.

  • It exploits RoPE's extrapolation behavior across varying resolutions/lengths.
  • It reads as a choice designed to stay stable across the joint-training curriculum, where the resolution stage changes. (A coordinate-based sketch follows the list.)
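A small sketch of coordinate-based 3D RoPE, assuming the common layout that splits the head dimension into three sub-bands, one per (t, h, w) axis; the paper does not restate the exact split.

import torch

def rope_1d(x, pos, base=10000.0):
    # x: (..., d) with d even; pos: (...) integer coordinates along one axis.
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, device=x.device).float() / d)  # (d/2,)
    ang = pos.float()[..., None] * freqs                                   # (..., d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(q, coords):
    # q: (..., d) with d divisible by 6; coords: (..., 3) per-token (t, h, w).
    d = q.shape[-1] // 3
    parts = [rope_1d(q[..., i * d:(i + 1) * d], coords[..., i]) for i in range(3)]
    return torch.cat(parts, dim=-1)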

2.2.4 Patch n' Pack (Sequence Packing)

Goku applies NaViT-style packing: image/video samples of different lengths and resolutions are packed into one long sequence to form a minibatch. The goal is to avoid per-length/resolution buckets and excessive padding, and to raise GPU utilization.

ํ•ต์‹ฌ์€ ๋‘ ๊ฐ€์ง€์ด๋‹ค.

  • Concatenate: ์„œ๋กœ ๋‹ค๋ฅธ ์ƒ˜ํ”Œ์˜ ํ† ํฐ์„ ์‹œํ€€์Šค ์ถ•์œผ๋กœ ์ด์–ด๋ถ™์ธ๋‹ค.
  • Block-diagonal attention mask: ์„œ๋กœ ๋‹ค๋ฅธ ์ƒ˜ํ”Œ ๊ฐ„ ํ† ํฐ์ด attentionํ•˜์ง€ ์•Š๋„๋ก ์ฐจ๋‹จํ•œ๋‹ค.

For example,

  • image latent length = 1,024
  • video-A latent length = 4,096
  • video-B latent length = 3,072

With padding, everything is padded to max = 4,096, so the image wastes 3,072 tokens. Patch n' Pack instead builds:

  • packed length = 1,024 + 4,096 + 3,072 = 8,192
  • the attention mask activates only the diagonal blocks, as below (a mask-building sketch follows the diagram)
[ image  ][ video-A ][ video-B ]
|██████|........|........|
|......|████████|........|
|......|........|████████|
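A minimal sketch of building such a block-diagonal mask (my own helper, not the paper's code):

import torch

def packed_attention_mask(lengths: list[int]) -> torch.Tensor:
    # lengths: token counts of the packed samples, e.g. [1024, 4096, 3072].
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)  # False = blocked
    start = 0
    for n in lengths:
        mask[start:start + n, start:start + n] = True   # attend only within a sample
        start += n
    return mask

mask = packed_attention_mask([1024, 4096, 3072])  # 8192×8192, block-diagonal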

 

Additionally,

  • packing weakens the meaning of a global sequence index, so it is natural to keep per-token (t, h, w) coordinates and compute 3D RoPE from those coordinates;
  • when images and videos are mixed in the same batch in stage-2, packing flexibly adjusts the image/video ratio while reducing padding loss.

2.2.5 Q-K Normalization (QK-Norm)

To mitigate the loss spikes that occur intermittently in large-scale Transformer training, normalization is applied to q and k before the attention dot product.

  • Method: q ← RMSNorm(q), k ← RMSNorm(k), then softmax(q kᵀ / √d)
  • Intuition: it limits swings in the scale of q·k, stabilizing the variance of the softmax input.

Looking at it slightly more technically, normalizing q and k makes the attention logit behave like a cosine similarity rather than a raw inner product.

  • Without normalization: logit = ‖q‖ · ‖k‖ · cos(θ)
  • After normalization: logit ≈ cos(θ) (the scale term largely drops out)

That is, it mitigates the failure mode where ‖q‖ or ‖k‖ grows abnormally for particular tokens/heads and the softmax saturates to one side, making training more stable.
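A compact sketch of the mechanism (parameter-free RMS normalization shown here; real QK-Norm layers typically carry a learnable scale):

import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, H, S, D). RMS-normalize q and k over the head dim
    # before the dot product, bounding the softmax input scale.
    q = q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(-1, keepdim=True) + eps)
    return F.scaled_dot_product_attention(q, k, v)  # softmax(q kᵀ / √d) v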

2.2.6 Model Scale

The paper presents this as a table of model configurations; the variants that appear in this review are Goku-1B (used for the RF pilot experiment), Goku-2B, and Goku-8B.

2.3 Flow-Matching Training (Rectified Flow)

Goku์˜ ํ•™์Šต์€ rectified flow(RF) ๊ธฐ๋ฐ˜ flow formulation์— ๋ฟŒ๋ฆฌ๋ฅผ ๋‘”๋‹ค. ํ•ต์‹ฌ์€ prior์—์„œ ์‹œ์ž‘ํ•ด target data ๋ถ„ํฌ๋กœ ์ƒ˜ํ”Œ์„ ์—ฐ์†์ ์œผ๋กœ ์ด๋™์‹œํ‚ค๋Š” velocity field๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ํ•™์Šต ์ƒ˜ํ”Œ์€ linear interpolation์œผ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค.

  • target(๋ฐ์ดํ„ฐ) ์ƒ˜ํ”Œ: x1
  • prior(๋…ธ์ด์ฆˆ) ์ƒ˜ํ”Œ: x0 ~ N(0, I)
  • ๋ณด๊ฐ„ ๊ณ„์ˆ˜: t ∈ [0, 1]

x_t = t · x1 + (1 - t) · x0

 

๋ชจ๋ธ์€ x_t๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ velocity v_t = d x_t / d t ๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต๋œ๋‹ค. ๊ตฌํ˜„ ๊ด€์ ์—์„œ๋Š” “RF objective๋กœ latent์—์„œ velocity regression์„ L2๋กœ ๋งž์ถ˜๋‹ค”๋กœ ์ดํ•ดํ•˜๋ฉด ๋œ๋‹ค. ๋…ผ๋ฌธ์€ pilot experiment๋กœ ImageNet-1K(256×256) class-conditional ์„ค์ •์—์„œ DDPM ๋Œ€๋น„ RF๊ฐ€ ๋” ๋น ๋ฅธ ์ˆ˜๋ ด์„ ๋ณด์ธ๋‹ค๊ณ  ํ•œ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, Goku-1B(RF)๋Š” ์•ฝ 400k step์—์„œ DDPM์ด 1000k step ์ˆ˜์ค€์—์„œ ๋„๋‹ฌํ•˜๋Š” ์„ฑ๋Šฅ๋Œ€(์˜ˆ: FID-50K)์— ๋” ๋นจ๋ฆฌ ์ ‘๊ทผํ•œ๋‹ค. ์ธํผ๋Ÿฐ์Šค๋Š” latent์—์„œ ODE solve(๋˜๋Š” RF sampling ์ ˆ์ฐจ)์— ํ•ด๋‹นํ•˜๋ฉฐ, ์–ป์–ด์ง„ latent๋ฅผ Joint VAE decoder๋กœ ๋ณต์›ํ•ด ํ”ฝ์…€ ๊ณต๊ฐ„์˜ ์ด๋ฏธ์ง€/๋น„๋””์˜ค๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.

2.4 Training Details

๋ชจ๋ธ ํ•™์Šต์˜ ํ•ต์‹ฌ์€ (1) multi-stage curriculum, (2) cascaded resolution, (3) long-seq ํ•™์Šต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ณ‘๋ ฌํ™”·์ฒดํฌํฌ์ธํŒ…·fault tolerance์ด๋‹ค.

2.4.1 Multi-stage Training

๋…ผ๋ฌธ์€ joint image-and-video ํ•™์Šต์„ ํ•œ ๋ฒˆ์— ์ง์ ‘ ์ตœ์ ํ™”ํ•˜๊ธฐ ์–ด๋ ต๋‹ค๊ณ  ์ „์ œํ•˜๊ณ , ์•„๋ž˜ 3๋‹จ๊ณ„๋กœ ๋ถ„ํ•ดํ•œ๋‹ค.

  • Stage-1: Text–Semantic Pairing
    • text-to-image ์ค‘์‹ฌ pretraining์œผ๋กœ ํ…์ŠคํŠธ-์‹œ๊ฐ ์˜๋ฏธ ๋งคํ•‘์„ ๋จผ์ € ์•ˆ์ •ํ™”ํ•œ๋‹ค.
    • object attributes, spatial configuration, contextual coherence ๊ฐ™์€ “์ •์  ์‹œ๊ฐ ๊ฐœ๋…”์„ ์šฐ์„  ํ•™์Šตํ•˜๋Š” ๋‹จ๊ณ„์ด๋‹ค.
  • Stage-2: Image-and-Video Joint Learning
    • ์ด๋ฏธ์ง€์™€ ๋น„๋””์˜ค๋ฅผ unified token sequence๋กœ ํ†ตํ•ฉํ•ด joint ํ•™์Šตํ•œ๋‹ค.
    • ๊ณ ํ’ˆ์งˆ ๋น„๋””์˜ค ๋ฐ์ดํ„ฐ ํ™•๋ณด๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์—, ๊ณ ํ’ˆ์งˆ ์ด๋ฏธ์ง€๊ฐ€ ๊ฐ€์ง„ ํ’๋ถ€ํ•œ ์‹œ๊ฐ ์ •๋ณด๋ฅผ joint ํ•™์Šต์—์„œ ๋น„๋””์˜ค๋กœ ์ „์ด์‹œํ‚ค๋Š” ์„ค๊ณ„๋ฅผ ๊ฐ•์กฐํ•œ๋‹ค.
  • Stage-3: Modality-specific Finetuning
    • ์ตœ์ข… ๋‹จ๊ณ„์—์„œ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ๋ณ„๋กœ ๋ถ„๋ฆฌํ•ด ๋ฏธ์„ธ์กฐ์ •ํ•œ๋‹ค.
    • T2I๋Š” image-centric adjustment๋กœ “๋” ๋ณด๊ธฐ ์ข‹์€ ์ด๋ฏธ์ง€” ๋ฐฉํ–ฅ์„ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค.
    • T2V๋Š” temporal smoothness, motion continuity, stability ๊ฐœ์„ ์— ์ดˆ์ ์„ ๋‘”๋‹ค.

2.4.2 Cascaded Resolution Training

๋…ผ๋ฌธ์€ Stage-2์˜ joint training์—์„œ cascade resolution์„ ์ ์šฉํ•œ๋‹ค.

  • ์ดˆ๊ธฐ: 288×512(low-res)์—์„œ text–semantic–motion์˜ ํ•ต์‹ฌ ์ƒํ˜ธ์ž‘์šฉ์„ ์ €๋น„์šฉ์œผ๋กœ ๋จผ์ € ํ•™์Šตํ•œ๋‹ค.
  • ์ดํ›„: 480×864 → 720×1280๋กœ ๋‹จ๊ณ„์ ์œผ๋กœ ํ•ด์ƒ๋„๋ฅผ ์ƒ์Šน์‹œ์ผœ ๋””ํ…Œ์ผ๊ณผ fidelity๋ฅผ ์ •๋ จํ•œ๋‹ค.

2.4.3 Efficiency & Long-seq Training System

Even after the VAE, Goku's video token counts remain very large; the paper states that the longest sequence exceeds 220K tokens. To cope, it uses 3D parallelism (sequence/data/parameter axes) and presents the components as follows.

  • FlashAttention + Sequence Parallelism
    • FlashAttention and sequence parallelism mitigate the memory/compute burden incurred by adopting full attention.
  • Sequence Parallelism (Ulysses implementation)
    • Samples are sharded along the sequence dimension.
    • During attention, an all-to-all distributes Q/K/V shards so that each worker processes the full sequence but handles only a subset of heads.
    • After the computation, another all-to-all re-aggregates the results, recombining the head and sequence dimensions.
  • FSDP with HYBRID_SHARD
    • Parameters, gradients, and optimizer states are sharded.
    • HYBRID_SHARD (FULL_SHARD plus replication across groups) narrows the scope of all-gather/reduce-scatter, lowering communication cost.
  • Fine-grained Activation Checkpointing
    • Selective / fine-grained AC is designed to balance communication overhead against compute.
    • The focus is on minimizing the activations that must be stored while maximizing GPU utilization.
  • Cluster Fault Tolerance (MegaScale)
    • On the premise that node failures become more likely at large cluster scale, it adopts self-check diagnostics, multi-level monitoring, and fast restart/recovery.
  • Saving/Loading: ByteCheckpoint
    • Checkpoints include not only model parameters but also EMA parameters, optimizer states, and random states.
    • ByteCheckpoint saves/loads partitioned checkpoints in parallel and supports resharding, making transitions across training scales flexible.
    • The paper reports that checkpointing the 8B model across thousands of GPUs blocks training for less than 4 seconds.

 

3. Infrastructure Optimization

Goku's bottleneck is less "the model is big" than the long sequences produced by video latent tokens. The paper states that the longest sequence exceeds 220K tokens, and on that premise combines 3D parallelism (sequence/data/parameter axes), fine-grained activation checkpointing, cluster fault tolerance, and high-performance checkpointing (ByteCheckpoint) to make training work.

 

Expressed compactly, the structure is:

[Sequence axis]  Ulysses Sequence Parallelism (all-to-all for Q/K/V shards)
[Data axis]      Data Parallel groups (replicated compute)
[Param axis]     FSDP HYBRID_SHARD (FULL_SHARD within shard-group)
+ Fine-grained Activation Checkpointing
+ MegaScale Fault Tolerance
+ ByteCheckpoint (parallel save/load + resharding)

3.1 Model Parallelism Strategies: Withstanding 220K Tokens with 3D Parallelism

Since model size and sequence length (>220K tokens) grow simultaneously, the paper takes as given that no single parallelism axis suffices for training. It therefore uses 3D parallelism spanning three axes: input sequence, data, and model parameters.

3.1.1 Sequence Parallelism (SP)

Sequence parallelism shards the input along the sequence dimension. It removes redundant computation in independent layers such as LayerNorm, reduces memory usage, and eases handling of non-conforming inputs (samples that differ in length/padding). The paper uses Ulysses as the implementation.

  • From the start of the training loop, samples are sharded across the sequence-parallel group.
  • During attention, an all-to-all redistributes the Q/K/V shards so that
    • each worker processes the "full sequence"
    • while handling only a "subset of the attention heads."
  • After computing head-wise attention in parallel, another all-to-all aggregates the results, recombining the head and sharded-sequence dimensions.

In other words, SP is used as the precondition for keeping attention full even when the sequence gets very long. A conceptual sketch of the exchange follows.
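A conceptual sketch of the Ulysses exchange (shapes and the helper name are mine; production implementations also handle gradients, uneven shards, and fused attention kernels):

import torch
import torch.distributed as dist

def seq_to_head_shards(x: torch.Tensor, group) -> torch.Tensor:
    # x: (B, S_local, H, D), every head but only a sequence shard.
    # Returns (B, S_local * P, H // P, D): the full sequence, a head subset.
    P = dist.get_world_size(group)
    B, S, H, D = x.shape
    x = x.reshape(B, S, P, H // P, D).permute(2, 0, 1, 3, 4).contiguous()  # (P, B, S, H/P, D)
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)  # send head chunks out, receive seq shards
    return out.permute(1, 0, 2, 3, 4).reshape(B, P * S, H // P, D)

# Apply to Q, K, V before attention; a second all-to-all (the inverse reshape)
# restores the sequence sharding and full head set after attention.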

3.1.2 FSDP

Instead of plain data parallelism, the paper uses FSDP (Fully Sharded Data Parallelism), specifically the HYBRID_SHARD strategy.

  • Within a shard group, parameters are sharded with FULL_SHARD;
  • across shard groups, parameters are replicated, effectively implementing DP;
  • as a result, the scope of all-gather/reduce-scatter is confined to within a shard group, lowering communication cost.

This is the strategy commonly called HSDP (Hybrid Sharded Data Parallel); in PyTorch it corresponds to a small configuration choice, sketched below.
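A minimal sketch using PyTorch's built-in strategy enum, assuming distributed initialization has already happened (e.g., via torchrun) and `model` is the Transformer on the current device:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # FULL_SHARD inside a group,
                                                      # replicate across groups
    device_id=torch.cuda.current_device(),
)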

3.2 Activation Checkpointing

๋…ผ๋ฌธ์€ 3.1์˜ ๋ณ‘๋ ฌํ™”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํฌ๊ฒŒ ์ ˆ์•ฝํ•˜์ง€๋งŒ, rank ๊ฐ„ ํ†ต์‹ ์ด ๋Š˜์–ด๋‚˜ ์ „์ฒด ์„ฑ๋Šฅ์ด ๋น„์„ ํ˜•์ ์œผ๋กœ ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ์Œ์„ ์ง€์ ํ•œ๋‹ค. ์ด๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด fine-grained Activation Checkpointing(AC)๋ฅผ ์„ค๊ณ„ํ•œ๋‹ค.

ํ•ต์‹ฌ์€ “๋ฌด์กฐ๊ฑด ์ „๋ถ€ checkpoint”๊ฐ€ ์•„๋‹ˆ๋ผ, ํ”„๋กœํŒŒ์ผ๋ง ๊ด€์ ์—์„œ compute์™€ communication์˜ overlap์„ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก

  • activation ์ €์žฅ์ด ํ•„์š”ํ•œ ๋ ˆ์ด์–ด ์ˆ˜๋ฅผ ์ค„์ด๊ณ 
  • GPU utilization์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ
  • selective checkpointing์„ ์ ์šฉํ•œ ๊ฒƒ์ด๋‹ค.
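A sketch of the mechanism with a naive every-k policy (the paper derives its actual policy from profiling; this only illustrates selective checkpointing):

from torch.utils.checkpoint import checkpoint

def forward_blocks(blocks, x, ckpt_every=2):
    # Recompute only every `ckpt_every`-th block, trading some extra memory
    # for less recomputation than blanket checkpointing.
    for i, blk in enumerate(blocks):
        if i % ckpt_every == 0:
            x = checkpoint(blk, x, use_reentrant=False)
        else:
            x = blk(x)
    return x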

3.3 Cluster Fault Tolerance

The larger the GPU cluster, the higher the probability of node failures, and these directly worsen training efficiency (time/cost). On this premise, the paper adopts MegaScale's fault-tolerance techniques.

  • self-check diagnostics
  • multi-level monitoring
  • fast restart / recovery

The goal is not to eliminate failures but to minimize training downtime when they occur and to keep the overall system running stably.

3.4 Saving and Loading Training Stages

In large-scale training, a checkpoint is an operational component, not a mere backup. The paper states that checkpoints include the following state.

  • model parameters
  • EMA parameters
  • optimizer states
  • random states

This (1) makes restarts possible in an environment where cluster faults grow more likely, (2) guarantees reproducibility, and (3) also matters for debugging (including unintentional bugs and adversarial attacks). A minimal sketch of such a full-state checkpoint payload follows.
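A single-process sketch of that payload (ByteCheckpoint itself writes partitioned shards in parallel; model, ema_model, and optimizer are assumed to be in scope):

import torch

state = {
    "model": model.state_dict(),
    "ema": ema_model.state_dict(),          # EMA parameters
    "optimizer": optimizer.state_dict(),
    "rng": {                                # random states for exact resumption
        "torch": torch.get_rng_state(),
        "cuda": torch.cuda.get_rng_state_all(),
    },
}
torch.save(state, "checkpoint.pt")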

 

ByteCheckpoint is adopted for this purpose.

  • parallel save/load of partitioned checkpoints (high I/O efficiency)
  • resharding support for distributed checkpoints
  • flexible handling of training-scale transitions even when the rank count and storage backend change

๋…ผ๋ฌธ์€ ๊ฒฝํ—˜์ ์œผ๋กœ, 8B ๋ชจ๋ธ์„ ์ˆ˜์ฒœ GPU์—์„œ checkpointํ•  ๋•Œ training block์ด 4์ดˆ ๋ฏธ๋งŒ์ด๋ผ๊ณ  ๋ณด๊ณ ํ•œ๋‹ค.

 

4. Data Curation Pipeline

The Goku paper organizes its data curation pipeline into five stages.

  1. image/video collection
  2. video extraction & clipping
  3. image/video filtering
  4. captioning
  5. data distribution balancing

4.1 Data Overview

๋…ผ๋ฌธ์€ raw ๋ฐ์ดํ„ฐ๋ฅผ public academic dataset + internet resources + proprietary(ํŒŒํŠธ๋„ˆ์‹ญ ๊ธฐ๋ฐ˜)๋กœ๋ถ€ํ„ฐ ์ˆ˜์ง‘ํ•˜๊ณ , rigorous filtering ์ดํ›„ ์ตœ์ข… ํ•™์Šต ์„ธํŠธ๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ œ์‹œํ•œ๋‹ค.

 

The breakdown is as follows.

  • Text-to-Image
    • public 100M: LAION
    • internal 60M: high-quality in-house data
    • The paper states the strategy explicitly: pre-train on public data, fine-tune on internal data.
    • final training data: 160M image-text pairs
  • Text-to-Video
    • public 11M clips + in-house 25M clips
    • Public sources include Panda-70M, InternVid, OpenVid-1M, and Pexels.
    • They are not used as-is: the same curation pipeline is applied so that only high-quality samples remain.
    • final training data: 36M video-text pairs

4.2 Data Processing and Filtering

Video data is not training-ready through collection alone. The paper applies quality-determining preprocessing, clipping, and filtering step by step.

4.2.1 Preprocessing & Standardization

Internet videos vary widely in codec, length, FPS, and bitrate. The paper first performs computationally cheap first-pass filtering, then standardizes the encoding to H.264. Per-attribute thresholds are given in Table 3 of the paper. Because this stage runs before expensive filters such as the aesthetic model, it cuts the cost of the whole pipeline.

 

4.2.2 Video Clips Extraction

๋…ผ๋ฌธ์€ 2-stage clipping์„ ์‚ฌ์šฉํ•œ๋‹ค.

  1. PySceneDetect๋กœ shot boundary detection์„ ์ˆ˜ํ–‰ํ•ด coarse clip์„ ๋งŒ๋“ ๋‹ค.
  2. coarse clip ๋‚ด๋ถ€์—์„œ 1fps๋กœ ํ”„๋ ˆ์ž„์„ ์ƒ˜ํ”Œ๋งํ•˜๊ณ , ๊ฐ ํ”„๋ ˆ์ž„์˜ DINOv2 feature๋ฅผ ๊ตฌํ•œ ๋’ค ์ธ์ ‘ ํ”„๋ ˆ์ž„ cosine similarity๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
    • similarity๊ฐ€ ์ž„๊ณ„๊ฐ’ ์•„๋ž˜๋กœ ๋‚ด๋ ค๊ฐ€๋ฉด shot change๋กœ ๊ฐ„์ฃผํ•ด clip์„ ์ถ”๊ฐ€ ๋ถ„ํ• ํ•œ๋‹ค

ํ•ด์ƒ๋„๋ณ„ DINO similarity threshold ์ˆ˜์น˜๋Š” ์œ„ Table 4์— ์ •๋ฆฌ๋˜์–ด ์žˆ๋‹ค. ์ถ”๊ฐ€๋กœ, clip ๊ธธ์ด๋Š” ์ตœ๋Œ€ 10์ดˆ๋กœ ์ œํ•œํ•œ๋‹ค.

4.2.3 Clip Diversity

๊ฐ™์€ source video์—์„œ ๋‚˜์˜จ clip๋“ค์ด ์œ ์‚ฌํ•˜๋ฉด ๋ฐ์ดํ„ฐ ๋‹ค์–‘์„ฑ์ด ๋ฌด๋„ˆ์ง„๋‹ค. ๋…ผ๋ฌธ์€ ๊ฐ clip์˜ keyframe์— ๋Œ€ํ•ด perceptual hashing์„ ๊ณ„์‚ฐํ•˜๊ณ , ๋‘ clip์˜ hash๊ฐ€ ์œ ์‚ฌ(์ค‘๋ณต ๊ฐ€๋Šฅ์„ฑ ๋†’์Œ)ํ•˜๋ฉด aesthetic score๊ฐ€ ๋” ๋†’์€ clip์„ ์œ ์ง€ํ•œ๋‹ค.

4.2.4 Visual Aesthetic Filtering

๋…ผ๋ฌธ์€ keyframe์— ๋Œ€ํ•ด aesthetic model score๋ฅผ ๊ตฌํ•ด ํ‰๊ท ์„ ์ทจํ•˜๊ณ , ํ•ด์ƒ๋„๋ณ„ threshold๋กœ low-quality clip์„ ์ œ๊ฑฐํ•œ๋‹ค.

4.2.5 OCR Filtering

Watermark- and subtitle-heavy videos can corrupt both generation quality and the data distribution. The paper detects text on keyframes with an internal OCR model and defines the text coverage ratio as the area of the largest text bbox in a keyframe divided by the total keyframe area. Thresholds are set per resolution.

4.2.6 Motion Filtering: RAFT Optical-Flow-Based Motion Score

For video, "how much it moves" drives both data quality and learning difficulty. The paper computes the mean optical flow with RAFT and derives a motion score from it. Additionally, to strengthen motion control, the motion score is appended to the caption, as sketched below.
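A small sketch of such a score, assuming per-pair flow fields (e.g., from RAFT) are already computed; the caption format is my assumption, not the paper's:

import torch

def motion_score(flows: list[torch.Tensor]) -> float:
    # flows: (2, H, W) optical-flow fields between consecutive frames.
    mags = [f.norm(dim=0).mean() for f in flows]    # mean flow magnitude per pair
    return torch.stack(mags).mean().item()

# Appending the score to the caption as a control signal:
# caption = f"{caption} motion score: {motion_score(flows):.1f}"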

4.2.7 Multi-level Training Data (data volume per resolution stage)

The paper specifies a multi-level scheme in which the data volume shrinks as resolution and filtering strictness increase (Table 4).

Each level is defined by a combination of thresholds over resolution + DINO-sim + aesthetic + OCR + motion score.

4.3 Captioning

Goku builds on dense captions to strengthen text-visual alignment.

  • Images: InternVL2.0 generates a dense caption for each image.
  • Videos:
    1. InternVL2.0 generates keyframe captions
    2. Tarsier2 generates a video-wide caption
      • Tarsier2 can naturally describe camera-motion types (zoom in, pan right, etc.), so the paper explains that no separate motion-type predictor is needed.
    3. Qwen2 merges the keyframe captions and the video-wide caption into the final caption
    4. The RAFT-based motion score is appended to the caption, strengthening motion controllability in the form of specifying a motion score in the prompt

4.4 Training Data Balancing

๋…ผ๋ฌธ์€ “๋น„๋””์˜ค ๋ฐ์ดํ„ฐ ๋ถ„ํฌ”๊ฐ€ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ์ค€๋‹ค๊ณ  ์ „์ œํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด internal video classification model๋กœ semantic tag๋ฅผ ์ƒ์„ฑํ•˜๊ณ , tag ๋ถ„ํฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ๋ฅผ ์žฌ์กฐ์ •ํ•œ๋‹ค.

  • semantic tag generation: classification using 4 evenly sampled keyframes
  • taxonomy: 9 primary classes (e.g., human, scenery, animals, food) + 86 subcategories (e.g., half-selfie, kid, dinner, wedding)
  • observed distribution: humans/scenery/food/urban life/animals are relatively dominant

The balancing strategy has two aims.

  • Human-related content has high appearance diversity and is hard to model, so the human share is relatively emphasized.
  • At the same time, equitable representation is guaranteed so that subcategories within each primary category are not skewed.

The concrete adjustments are described as:

  • overrepresented subcategories: selective down-sampling
  • underrepresented subcategories: artificial data generation + oversampling

Through this process, the paper explains, the balanced distribution of Figure 3b is obtained.

5. Experiments

๋…ผ๋ฌธ์€ Goku๋ฅผ T2I / T2V / I2V ๊ด€์ ์—์„œ ํ‰๊ฐ€ํ•˜๋ฉฐ, ์ •๋Ÿ‰ ๋ฒค์น˜๋งˆํฌ์™€ ์ •์„ฑ ๋น„๊ต + ablation์œผ๋กœ ๊ตฌ์„ฑํ•œ๋‹ค. ์ด ์ ˆ์€ ์ˆ˜์น˜ ํ•ด์„์— ํ•„์š”ํ•œ ํฌ์ธํŠธ๋งŒ ๊ฐ„์†Œํ™”ํ•ด ์ •๋ฆฌํ•œ๋‹ค.

5.1 Text-to-Image (T2I)

Goku-T2I is built on dense-generative-caption-centric training and strongly emphasizes text-image alignment.

  • GenEval: evaluates both the original (short) prompts and rewritten prompts that expand them in detail while preserving the original meaning (the paper uses ChatGPT-4o for rewriting).
    • Goku-T2I (2B) scores strongly even on the original prompts, and reports the best score, 0.76, on the rewritten prompts.
    • This result dovetails with the reading that caption-centric training, which is strong on detailed prompts, pays off on real alignment metrics.
  • T2I-CompBench / DPG-Bench: jointly examine compositional attributes (color/shape/texture) and human-preference-based quality (or prompt-image alignment).
    • In Table 5, Goku-T2I (2B) posts competitive numbers on the color/shape/texture axes of T2I-CompBench, and reports 83.65 on DPG-Bench (the average score in the table).

In short, the message of the T2I experiments is less the RF objective itself than (a) dense-caption-based training and (b) an alignment advantage on detailed prompts.

5.2 Text-to-Video (T2V)

  • UCF-101 zero-shot (FVD↓ / IS↑)
    • In Table 6, Goku-2B reports FVD 246.17 / IS 45.77 (±1.10) at 256×256.
    • Numbers for the same model generated at other resolutions (240×360, 128×128) are also given, indirectly showing the trade-off among resolution, token length, and training stability.
  • VBench (summary of the 16-dimension evaluation)
    • In Table 7, Goku (ours) reports the best Overall score among the compared models, 84.85.
    • Quality 85.60 and Semantic 81.87 are reported alongside, underscoring that not just visual quality but also semantic alignment and dynamics (e.g., multiple objects, dynamic degree) are lifted in balance.
  • Qualitative comparison
    • Open models (CogVideoX, Open-Sora-Plan, etc.) and commercial products (Pika, DreamMachine, Vidu, Kling v1.5, etc.) are compared together.
    • The paper cites cases where, on complex prompts, some commercial models drop key elements (e.g., a specific object or composition fails) or break motion consistency, and emphasizes that Goku-8B is strong at reflecting fine details and keeping motion consistent.

5.3 Image-to-Video (I2V)

I2V extends the trained T2V model by adding reference-image conditioning.

  • Starting from the T2V initialization, the model is finetuned on about 4.5M text-image-video triplets.
  • Although finetuning is set relatively short at 10k steps, the paper reports that the model preserves the reference image's identity while generating motion that follows the text condition.

5.4 Ablation

The paper briefly ablates two axes.

  • Model scaling (2B → 8B): increasing the parameter count tends to alleviate "local geometry artifacts" such as distorted structures (arms, wheels, etc.).
  • With vs. without joint training: starting from the same pretrained Goku-T2I (8B) and finetuning on 480p video for the same number of steps,
    • without joint image+video training, frame quality degrades and photorealism tends to break;
    • with joint training, photorealistic frames are reported to hold up more stably.

 

Goku์˜ ํ•ต์‹ฌ ๊ธฐ์—ฌ๋Š” “RF๋กœ ๋น„๋””์˜ค๋ฅผ ๋งŒ๋“ ๋‹ค”๊ฐ€ ์•„๋‹ˆ๋ผ, ํ˜„์‹ค์ ์ธ ์Šค์ผ€์ผ์—์„œ T2V/I2V ๋ชจ๋ธ์„ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํ˜•ํƒœ๋กœ ํŒจํ‚ค์ง•ํ–ˆ๋‹ค๋Š” ์ ์ด๋‹ค. ํ˜„์‹œ์ ์—์„œ ์‹œ์‚ฌ์ ์€ ํฌ๊ฒŒ ์„ธ ๊ฐ€์ง€๋กœ ์ •๋ฆฌ๋œ๋‹ค.

  1. Video generation is converging on a "data + systems" problem more than an algorithms problem.
    • In a setting like longest sequence >220K tokens, keeping full attention means SP/FSDP/AC/checkpointing/failure recovery become part of the algorithm.
    • Copying the model architecture alone does not reproduce the results; training-infrastructure design becomes the substantive determinant of performance.
  2. Curation is performance, and curation is the curriculum.
    • As in Tables 3/4, explicitly designing DINO-similarity/aesthetic/OCR/motion thresholds per resolution stage, with the data refined 36M→24M→7M as a result, makes the flow itself act as a training curriculum.
    • In particular, motion-score-based filtering plus caption injection reads as a design that goes beyond "data quality control" and connects directly to motion controllability.
  3. Joint image+video training is a practical differentiator in video generation.
    • High-quality video data is always scarce and its distribution noisy.
    • Goku confronts this head-on: stage-1 (T2I) pins down semantics, stage-2 transfers the visual diversity and quality of images into video, and stage-3 refines per modality.

In sum, Goku is less a paper proposing a single best-performing technique than a reference that shows, with concrete numbers and pipelines, where engineering investment must go when building a large-scale video foundation model. Going forward, the competition will likely be decided less by parameter count than by (a) systems that make long-sequence training sustainable in cost, (b) curation recipes that define and control data quality, and (c) joint training designs that advance images and video together.

๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Imageโ€ขVideo Generation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[Video Gen] HunyuanVideo:A Systematic Framework For Large Video Generative Models ๋ฆฌ๋ทฐ  (0) 2025.11.30
[Omni] OmniGen2: Exploration to Advanced Multimodal Generation | ํ†ตํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ƒ์„ฑ ๋ชจ๋ธ  (1) 2025.11.30
[T2I] Back to Basics: Let Denoising Generative Models Denoise | Just image Transformers (JiT) ๋ฆฌ๋ทฐ  (0) 2025.11.29
[Gen AI] T2I & TI2I ๋ฐ์ดํ„ฐ์…‹ ๋ฐ ๋ฒค์น˜๋งˆํฌ ์ •๋ฆฌ | ์ด๋ฏธ์ง€ ์ƒ์„ฑ & ํŽธ์ง‘ ๋ฐ์ดํ„ฐ์…‹  (0) 2025.11.01
[Gen AI] BAGEL: Unified Multimodal Design - ์ดํ•ด์™€ ์ƒ์„ฑ์˜ ํ†ตํ•ฉ ๊ตฌ์กฐ  (0) 2025.10.31
'๐Ÿ› Research/Image•Video Generation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [Video Gen] HunyuanVideo:A Systematic Framework For Large Video Generative Models ๋ฆฌ๋ทฐ
  • [Omni] OmniGen2: Exploration to Advanced Multimodal Generation | ํ†ตํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ƒ์„ฑ ๋ชจ๋ธ
  • [T2I] Back to Basics: Let Denoising Generative Models Denoise | Just image Transformers (JiT) ๋ฆฌ๋ทฐ
  • [Gen AI] T2I & TI2I ๋ฐ์ดํ„ฐ์…‹ ๋ฐ ๋ฒค์น˜๋งˆํฌ ์ •๋ฆฌ | ์ด๋ฏธ์ง€ ์ƒ์„ฑ & ํŽธ์ง‘ ๋ฐ์ดํ„ฐ์…‹
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    moovzi’s Doodle
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (213)
      • ๐Ÿ“– Fundamentals (34)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (16)
        • NLP (2)
        • etc. (1)
      • ๐Ÿ› Research (75)
        • Deep Learning (7)
        • Perception (19)
        • OCR (7)
        • Multi-modal (5)
        • Image•Video Generation (18)
        • 3D Vision (4)
        • Material • Texture Recognit.. (8)
        • Large-scale Model (7)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (8)
        • Distributed Training & Infe.. (5)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (92)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Distributed Computing (6)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • C++ (1)
        • etc. (6)
      • ๐Ÿ’ฌ ETC (4)
        • ์ฑ… ๋ฆฌ๋ทฐ (4)
  • ๋งํฌ

    • ๋ฆฌํ‹€๋ฆฌ ํ”„๋กœํ•„ (๋ฉ˜ํ† ๋ง, ๋ฉด์ ‘์ฑ…,...)
    • ใ€Ž๋‚˜๋Š” AI ์—”์ง€๋‹ˆ์–ด์ž…๋‹ˆ๋‹คใ€
    • Instagram
    • Brunch
    • Github
  • ์ธ๊ธฐ ๊ธ€

  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[T2V] Goku: Flow Based Video Generative Foundation Models ๋ฆฌ๋ทฐ
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”