[MLLM] Gemma 3 Technical Report Review

2026. 2. 18. 14:39ยท๐Ÿ› Research/Multi-modal
๋ฐ˜์‘ํ˜•

https://arxiv.org/abs/2503.19786

1. Introduction

Gemma 3๋Š” Google DeepMind๊ฐ€ 2025๋…„ 3์›” ๊ณต๊ฐœํ•œ ๊ฒฝ๋Ÿ‰ ์˜คํ”ˆ ๋ชจ๋ธ ์‹œ๋ฆฌ์ฆˆ์— ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋น„์ „ ๋Šฅ๋ ฅ์„ ์ถ”๊ฐ€ํ•œ ๋ชจ๋ธ์ด๋‹ค. Pan and Scan (P&S) ๋ฐฉ๋ฒ•์œผ๋กœ ์œ ์—ฐํ•œ ์ด๋ฏธ์ง€ ํ•ด์ƒ๋„๋ฅผ ์ง€์›ํ•˜๋ฉฐ, Local/Global Attention ํ˜ผํ•ฉ ๊ตฌ์กฐ๋กœ 128K ํ† ํฐ ์ปจํ…์ŠคํŠธ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•œ๋‹ค.

Google Gemma ์‹œ๋ฆฌ์ฆˆ๋Š” ์˜คํ”ˆ์†Œ์Šค ๊ฒฝ๋Ÿ‰ LLM์œผ๋กœ ์ถœ๋ฐœํ–ˆ๋‹ค. Gemma 2๊นŒ์ง€๋Š” ํ…์ŠคํŠธ ์ „์šฉ ๋ชจ๋ธ์ด์—ˆ์ง€๋งŒ, ์‹ค์ œ ์‘์šฉ์—์„œ๋Š” ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ํ•จ๊ป˜ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Šฅ๋ ฅ์ด ํ•„์š”ํ•˜๊ธฐ์— MLLM์œผ๋กœ ๋ฐœ์ „ํ–ˆ๋‹ค.

 

๊ฒฝ๋Ÿ‰ model์— ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋Šฅ๋ ฅ์„ ์ถ”๊ฐ€ํ•  ๋•Œ์˜ ์ฃผ์š” ๊ณผ์ œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ์ด๋‹ค. Vision encoder๋Š” ๋งŽ์€ token์„ ์ƒ์„ฑํ•˜๋ฉฐ, ๊ธด context๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ ค๋ฉด KV-cache ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๊ธ‰๊ฒฉํžˆ ์ฆ๊ฐ€ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, 128K token context๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ ค๋ฉด KV-cache๋งŒ์œผ๋กœ๋„ ์ˆ˜ GB์˜ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค. Gemma 3๋Š” ์ด ๋ฌธ์ œ๋ฅผ Local/Global Attention ํ˜ผํ•ฉ ๊ตฌ์กฐ๋กœ ํ•ด๊ฒฐํ–ˆ๋‹ค.

 

2. Technical Approach

Gemma 3๋Š” ๊ฒฝ๋Ÿ‰ model์— ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋Šฅ๋ ฅ์„ ํšจ์œจ์ ์œผ๋กœ ์ถ”๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ๋ช‡ ๊ฐ€์ง€ ํ•ต์‹ฌ ์„ค๊ณ„๋ฅผ ์ฑ„ํƒํ–ˆ๋‹ค.

2.1. SigLIP Vision Encoder

SigLIP Vision Encoder๋Š” ์ด๋ฏธ์ง€๋ฅผ 256๊ฐœ์˜ ๊ณ ์ •๋œ soft token์œผ๋กœ encodingํ•œ๋‹ค. ์ด๋Š” ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ์„ ๋ณด์žฅํ•˜๋ฉด์„œ๋„ ์ถฉ๋ถ„ํ•œ ์‹œ๊ฐ ์ •๋ณด๋ฅผ ์ „๋‹ฌํ•˜๋Š” ํ•ต์‹ฌ ์„ค๊ณ„๋‹ค.

 

ํ•ต์‹ฌ ์•„์ด๋””์–ด

  • ๊ณ ์ •๋œ token ์ˆ˜(256๊ฐœ)๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ์˜ˆ์ธก ๊ฐ€๋Šฅ
  • Contrastive learning์œผ๋กœ ํ•™์Šต๋œ vision encoder
  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์ •๋ ฌ์— ํšจ๊ณผ์ 

์„ธ๋ถ€์‚ฌํ•ญ

  • ์ด๋ฏธ์ง€ ์ •๊ทœํ™”: 896x896 resolution์œผ๋กœ ์ •๊ทœํ™”
  • Token ์ˆ˜ ์ œํ•œ: 256๊ฐœ์˜ ๊ณ ์ •๋œ soft token
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ์ œํ•œ๋œ token ์ˆ˜๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ์ตœ์†Œํ™”
  • ์ •๋ณด ๋ณด์กด: ์ถฉ๋ถ„ํ•œ ์‹œ๊ฐ ์ •๋ณด ์ „๋‹ฌ

SigLIP Vision Encoder๋Š” ๊ณ ์ •๋œ token ์ˆ˜๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์˜ˆ์ธก ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜์—ฌ ๊ฒฝ๋Ÿ‰ model์— ์ ํ•ฉํ•œ ์„ค๊ณ„๋‹ค. 256๊ฐœ์˜ soft token์œผ๋กœ ์ œํ•œํ•˜๋ฉด์„œ๋„ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์ •๋ ฌ ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•˜์—ฌ, ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ์˜ ๊ท ํ˜•์„ ์ž˜ ๋‹ฌ์„ฑํ•œ๋‹ค.

2.2. Pan and Scan (P&S)

Pan and Scan (P&S) ๋ฐฉ๋ฒ•์€ LLaVA์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ ์œ ์—ฐํ•œ ์ด๋ฏธ์ง€ resolution์„ ์ง€์›ํ•œ๋‹ค. ๊ณ ์ •๋œ resolution ๋Œ€์‹  ๋‹ค์–‘ํ•œ ์ข…ํšก๋น„์˜ ์ด๋ฏธ์ง€๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค€๋‹ค.

 

Pan and Scan์€ ๊ณ ํ•ด์ƒ๋„๋‚˜ ๋น„์ •์‚ฌ๊ฐํ˜• ์ด๋ฏธ์ง€๋ฅผ ๊ณ ์ •๋œ resolution์œผ๋กœ ๋ฆฌ์‚ฌ์ด์ฆˆํ•˜๋Š” ๋Œ€์‹ , adaptive windowing ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ตํ•ด ์ด๋ฏธ์ง€๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ฒน์น˜์ง€ ์•Š๋Š” crop์œผ๋กœ ๋ถ„ํ• ํ•œ๋‹ค. ๊ฐ crop์€ ์›๋ณธ ์ด๋ฏธ์ง€์˜ aspect ratio๋ฅผ ์œ ์ง€ํ•œ ์ฑ„๋กœ ๋…๋ฆฝ์ ์œผ๋กœ vision encoder์— ์˜ํ•ด ์ฒ˜๋ฆฌ๋˜๋ฉฐ, ์ดํ›„ language model์—์„œ ํ†ตํ•ฉ๋œ๋‹ค. ์ด ๋ฐฉ์‹์€ ์ด๋ฏธ์ง€๋ฅผ ๊ฐ•์ œ๋กœ ์ •์‚ฌ๊ฐํ˜•์œผ๋กœ ๋ณ€ํ˜•ํ•˜๊ฑฐ๋‚˜ ํ•ด์ƒ๋„๋ฅผ ๋‚ฎ์ถ”์ง€ ์•Š์•„๋„ ๋˜๋ฏ€๋กœ, ์™œ๊ณก ์—†์ด ๋‹ค์–‘ํ•œ ์ข…ํšก๋น„์˜ ์ด๋ฏธ์ง€๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๊ฐ€๋กœ๋กœ ๊ธด ํŒŒ๋…ธ๋ผ๋งˆ ์ด๋ฏธ์ง€๋‚˜ ์„ธ๋กœ๋กœ ๊ธด ํฌ์Šคํ„ฐ ์ด๋ฏธ์ง€๋„ ์›๋ณธ ๋น„์œจ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ crop์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค.

 

ํ•ต์‹ฌ ์•„์ด๋””์–ด

  • ์‹ค์ œ ์‘์šฉ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์™€ ๋น„์œจ์˜ ์ด๋ฏธ์ง€๊ฐ€ ์ž…๋ ฅ๋จ
  • ๊ณ ์ •๋œ resolution์€ ์ด๋ฏธ์ง€ ์™œ๊ณก์ด๋‚˜ ์ •๋ณด ์†์‹ค ๋ฐœ์ƒ ๊ฐ€๋Šฅ
  • ์œ ์—ฐํ•œ resolution ์ฒ˜๋ฆฌ๋กœ ๋‹ค์–‘ํ•œ ์ด๋ฏธ์ง€ ํšจ์œจ์  ์ฒ˜๋ฆฌ

์„ธ๋ถ€์‚ฌํ•ญ

  • ์ข…ํšก๋น„ ์œ ์ง€: ์›๋ณธ ์ด๋ฏธ์ง€์˜ ์ข…ํšก๋น„ ๋ณด์กด
  • ํšจ์œจ์  ์ฒ˜๋ฆฌ: ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  • ์ •๋ณด ์†์‹ค ์ตœ์†Œํ™”: ์ด๋ฏธ์ง€ ์™œ๊ณก ์ตœ์†Œํ™”

2.3. Local/Global Attention

Local/Global Attention ํ˜ผํ•ฉ ๊ตฌ์กฐ๋Š” ๊ธด ์ปจํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉด์„œ๋„ KV-cache ๋ฉ”๋ชจ๋ฆฌ ์ฆ๊ฐ€๋ฅผ ๊ด€๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ํ•ต์‹ฌ ์„ค๊ณ„๋‹ค. ์ด๋Š” ๊ฒฝ๋Ÿ‰ ๋ชจ๋ธ์—์„œ ๊ธด ์ปจํ…์ŠคํŠธ๋ฅผ ์ง€์›ํ•˜๋Š” ํ˜์‹ ์ ์ธ ๋ฐฉ๋ฒ•์ด๋‹ค.

ํ•ต์‹ฌ ์•„์ด๋””์–ด

  • ๊ธด ์ปจํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ ค๋ฉด KV-cache ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๊ธ‰๊ฒฉํžˆ ์ฆ๊ฐ€
  • ๋ชจ๋“  ํ† ํฐ์— ๋Œ€ํ•ด global attention์„ ์ˆ˜ํ–‰ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ
  • Local attention๊ณผ Global attention์„ ํ˜ผํ•ฉํ•˜์—ฌ ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ ๊ท ํ˜•

์„ธ๋ถ€์‚ฌํ•ญ

  • Local Attention Layer:
    • 1024 token ๋ฒ”์œ„์—์„œ ์ž‘๋™
    • ๊ทผ์ฒ˜ token๋งŒ ์ฐธ์กฐํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ์ œํ•œ์ 
    • ๋น ๋ฅธ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ
  • Global Attention Layer:
    • ์ „์ฒด ์‹œํ€€์Šค์— attention ์ˆ˜ํ–‰
    • ์ „์ฒด ๋งฅ๋ฝ ์œ ์ง€
    • 5๊ฐœ์˜ local layer๋งˆ๋‹ค 1๊ฐœ์˜ global layer ๋ฐฐ์น˜ (5:1 ๋น„์œจ)
  • ํ˜ผํ•ฉ ๊ตฌ์กฐ:
    • Local layer: ๋น ๋ฅธ ์ฒ˜๋ฆฌ, ์ œํ•œ๋œ ๋ฉ”๋ชจ๋ฆฌ
    • Global layer: ์ „์ฒด ๋งฅ๋ฝ ์œ ์ง€
    • ๋‘ ๊ฐ€์ง€๋ฅผ ์ ์ ˆํžˆ ํ˜ผํ•ฉํ•˜์—ฌ ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ ํ™•๋ณด

Local/Global Attention ํ˜ผํ•ฉ ๊ตฌ์กฐ๋Š” 128K token context๋ฅผ ์ง€์›ํ•˜๋ฉด์„œ๋„ KV-cache ๋ฉ”๋ชจ๋ฆฌ ์ฆ๊ฐ€๋ฅผ ํฌ๊ฒŒ ์ œํ•œํ•œ๋‹ค. ๋ชจ๋“  layer์—์„œ full attention์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒฝ์šฐ์™€ ๋น„๊ตํ•˜๋ฉด, local layer๋Š” 1024 token window๋งŒ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ KV cache ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ํฌ๊ฒŒ ๊ฐ์†Œํ•œ๋‹ค. Global layer๋Š” ์ „์ฒด 128K token์— attentionํ•˜์ง€๋งŒ ์ „์ฒด layer์˜ ์•ฝ 1/6(5:1 ๋น„์œจ)๋งŒ global layer์ด๋ฏ€๋กœ, ์ „์ฒด์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์€ full attention ๋Œ€๋น„ ํ˜„์ €ํžˆ ๋‚ฎ๋‹ค.

 

2.4. RoPE Rescaling for Long Context

RoPE Rescaling์€ 128K token์˜ long context๋ฅผ ์ง€์›ํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ์ˆ ์ด๋‹ค. ๋ชจ๋ธ์€ ์ฒ˜์Œ๋ถ€ํ„ฐ 128K sequence๋กœ ํ•™์Šตํ•˜๋Š” ๋Œ€์‹ , 32K sequence๋กœ pre-training ํ›„ RoPE rescaling์„ ํ†ตํ•ด 4B, 12B, 27B model์„ 128K token์œผ๋กœ ํ™•์žฅํ•œ๋‹ค.

 

RoPE๋Š” ๊ฐ ์œ„์น˜์— rotation angle์„ ํ• ๋‹นํ•˜์—ฌ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ธ์ฝ”๋”ฉํ•œ๋‹ค. ํ‘œ์ค€ RoPE๋Š” ํ•™์Šต ์‹œ ๋ณธ ์  ์—†๋Š” ๊ธด ์œ„์น˜์— ๋Œ€ํ•ด extrapolation์„ ์‹œ๋„ํ•˜์ง€๋งŒ, ๋จผ ๊ฑฐ๋ฆฌ์˜ ๋‹จ์–ด๋“ค์ด ๋งค์šฐ ์œ ์‚ฌํ•œ embedding ๊ฐ’์„ ๊ฐ€์ง€๊ฒŒ ๋˜์–ด ์ƒ๋Œ€์  ์œ„์น˜๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์–ด๋ ค์›Œ์ง„๋‹ค. 

 

"Extending Context Window of Large Language Models via Positional Interpolation" (Chen et al., 2023)์—์„œ ์ œ์•ˆํ•œ Position Interpolation์€ ์ด๋ฅผ interpolation ๋ฌธ์ œ๋กœ ์žฌ๊ตฌ์„ฑํ•œ๋‹ค. ์œ„์น˜ ์ธ๋ฑ์Šค๋ฅผ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๋ชจ๋ธ์ด ํ•™์Šตํ•œ ์›๋ž˜ ๋ฒ”์œ„ ๋‚ด์— ์œ ์ง€ํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, 2,048 token์œผ๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ์ด 4,096 token์„ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•  ๋•Œ, ์œ„์น˜ 4,096์„ ์œ„์น˜ 2,048๋กœ ๋งคํ•‘ํ•˜์—ฌ rotation angle์ด ๋ชจ๋ธ์ด ์ต์ˆ™ํ•œ ๋ฒ”์œ„ ๋‚ด์— ์žˆ๋„๋ก ํ•œ๋‹ค. Scaling factor๋Š” s = new_context_length / old_context_length๋กœ ๊ณ„์‚ฐ๋˜๋ฉฐ, ์ด๋Š” RoPE์˜ rotation angle ๊ณ„์‚ฐ์— ์ ์šฉ๋œ๋‹ค.

 

ํ•ต์‹ฌ ์•„์ด๋””์–ด

  • 32K sequence๋กœ pre-training ํ›„ RoPE rescaling์œผ๋กœ 128K๋กœ ํ™•์žฅ
  • positional interpolation๊ณผ ์œ ์‚ฌํ•œ ๊ณผ์ • ์‚ฌ์šฉ
  • Scaling factor 8์ด ์‹ค์šฉ์ ์œผ๋กœ ์ž˜ ์ž‘๋™ํ•จ (128K / 32K = 4์ด์ง€๋งŒ, ์‹ค์ œ๋กœ๋Š” 8 ์‚ฌ์šฉ)

 

์„ธ๋ถ€์‚ฌํ•ญ

  • RoPE Base Frequency ์กฐ์ •: Global self-attention layer์˜ RoPE base frequency๋ฅผ Gemma 2์˜ 10k์—์„œ 1M์œผ๋กœ ์ฆ๊ฐ€์‹œํ‚ด. ์ด๋Š” ๊ธด ์ปจํ…์ŠคํŠธ์—์„œ ๋” ์ •๋ฐ€ํ•œ ์œ„์น˜ ์ธ์ฝ”๋”ฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•œ๋‹ค.
  • Local Layer Frequency ์œ ์ง€: Local self-attention layer๋Š” 10k frequency ์œ ์ง€
  • Positional Interpolation: Global self-attention layer์˜ span์„ ํ™•์žฅํ•˜๊ธฐ ์œ„ํ•ด positional interpolation ๊ณผ์ • ์ ์šฉ. ์œ„์น˜ ์ธ๋ฑ์Šค๋ฅผ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ํ•™์Šต ๋ฒ”์œ„ ๋‚ด์— ์œ ์ง€
  • Scaling Factor: ์‹คํ—˜์ ์œผ๋กœ scaling factor 8์ด ํšจ๊ณผ์ ์ž„์„ ํ™•์ธ
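Position interpolation itself is simple to sketch. The snippet below uses the 1M base frequency mentioned above for global layers; the head dimension of 128 is an illustrative assumption.

```python
# A minimal sketch of RoPE position interpolation: dividing position
# indices by the scaling factor keeps rotation angles inside the range
# seen during training. Base 1M follows the text (global layers); the
# head dim of 128 is an illustrative assumption.
def rope_angles(position: float, head_dim: int = 128,
                base: float = 1_000_000.0, scale: float = 1.0):
    """Rotation angles for one position, with interpolation factor `scale`."""
    pos = position / scale  # interpolation: map long positions inward
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# With scale=8, position 128_000 produces exactly the angles the model
# saw at position 16_000 during 32K pre-training:
assert rope_angles(128_000, scale=8.0) == rope_angles(16_000)
```

This is why interpolation needs only light fine-tuning rather than full long-context pre-training: the model never sees an out-of-range rotation angle.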

2.5. Knowledge Distillation

Gemma 3๋Š” ์‚ฌ์ „ ํ•™์Šต ๋‹จ๊ณ„์—์„œ Knowledge Distillation์„ ์ ์šฉํ•˜์—ฌ ๊ฒฝ๋Ÿ‰ model์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด Gemma3-4B-IT๊ฐ€ Gemma2-27B-IT์™€ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์—ˆ์œผ๋ฉฐ, ์ž‘์€ model์ด ํฐ model์˜ ์„ฑ๋Šฅ์— ๊ทผ์ ‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ์–ด ๊ฒฝ๋Ÿ‰ model์˜ ์‹ค์šฉ์„ฑ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค. ๋˜ํ•œ novel post-training recipe์™€ ๊ฒฐํ•ฉํ•˜์—ฌ ์ˆ˜ํ•™, ์ถ”๋ก , ์ฑ„ํŒ…, ์ง€์‹œ ์ˆ˜ํ–‰, ๋‹ค๊ตญ์–ด ๋Šฅ๋ ฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผœ ๊ฒฝ๋Ÿ‰ model๋„ ์ถฉ๋ถ„ํžˆ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ–ˆ๋‹ค.

 

3. Experimental Results

Gemma 3๋Š” ๊ฒฝ๋Ÿ‰ ๋ชจ๋ธ์ž„์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. Gemma3-4B-IT๋Š” Gemma2-27B-IT์™€ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์œผ๋ฉฐ, ์ด๋Š” Knowledge Distillation๊ณผ ํšจ์œจ์ ์ธ ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„์˜ ํšจ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. ์ž‘์€ model์ด ํฐ model์˜ ์„ฑ๋Šฅ์— ๊ทผ์ ‘ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์€ ๊ฒฝ๋Ÿ‰ model์˜ ์‹ค์šฉ์„ฑ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค.

 

Gemma3-27B-IT๋Š” Gemini-1.5-Pro์™€ ๋น„๊ต ๊ฐ€๋Šฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ํŠนํžˆ ์ˆ˜ํ•™, ์ถ”๋ก , ์ฑ„ํŒ…, ์ง€์‹œ ์ˆ˜ํ–‰ ๋Šฅ๋ ฅ์—์„œ ํฌ๊ฒŒ ํ–ฅ์ƒ๋˜์—ˆ๋‹ค. ์ด๋Š” novel post-training recipe์˜ ํšจ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๊ฒฝ๋Ÿ‰ ๋ชจ๋ธ๋„ ์ถฉ๋ถ„ํžˆ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Œ์„ ์ž…์ฆํ•œ๋‹ค.

 

Extended Context ๋Šฅ๋ ฅ์€ ๊ธด ๋ฌธ์„œ ์ฒ˜๋ฆฌ์™€ ๋ฉ€ํ‹ฐํ„ด ๋Œ€ํ™”์—์„œ ์‹ค์งˆ์ ์ธ ์ด์ ์„ ์ œ๊ณตํ–ˆ๋‹ค. 128K token context๋กœ ๊ธด ๋ฌธ์„œ๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์—ˆ์œผ๋ฉฐ, ๋ฌธ์„œ ์ „์ฒด์˜ ๋งฅ๋ฝ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ํŠน์ • ๋ถ€๋ถ„์— ๋Œ€ํ•œ ์งˆ๋ฌธ์— ์ •ํ™•ํ•˜๊ฒŒ ๋‹ตํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋˜ํ•œ ๋ฉ€ํ‹ฐํ„ด ๋Œ€ํ™”์—์„œ๋„ ๋Œ€ํ™” ๋งฅ๋ฝ์„ ์˜ค๋ž˜ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์–ด, ์‚ฌ์šฉ์ž์™€์˜ ๊ธด ๋Œ€ํ™”์—์„œ๋„ ์ผ๊ด€์„ฑ ์žˆ๋Š” ์‘๋‹ต์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

 

Local/Global Attention ํ˜ผํ•ฉ ๊ตฌ์กฐ๋Š” ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค. ๊ธด ์ปจํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉด์„œ๋„ KV-cache ๋ฉ”๋ชจ๋ฆฌ ์ฆ๊ฐ€๋ฅผ ์ œํ•œํ•˜์—ฌ, ์†Œ๋น„์ž์šฉ ํ•˜๋“œ์›จ์–ด์—์„œ๋„ ์‹คํ–‰ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ๋‹ค.

 

4. Conclusion

Gemma 3์˜ ๊ธฐ์ˆ ์  ํŠน์ด์ ์€ ๊ฒฝ๋Ÿ‰ model์— ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋Šฅ๋ ฅ์„ ํšจ์œจ์ ์œผ๋กœ ์ถ”๊ฐ€ํ•œ ์„ค๊ณ„์— ์žˆ๋‹ค. Local/Global Attention ํ˜ผํ•ฉ ๊ตฌ์กฐ๋Š” 5:1 ๋น„์œจ๋กœ local layer(1024 token span)์™€ global layer๋ฅผ ๋ฐฐ์น˜ํ•˜์—ฌ, 128K token context๋ฅผ ์ง€์›ํ•˜๋ฉด์„œ๋„ KV-cache ๋ฉ”๋ชจ๋ฆฌ ์ฆ๊ฐ€๋ฅผ ์ œํ•œํ•œ๋‹ค. ์ด๋Š” ๋ชจ๋“  token์— ๋Œ€ํ•ด global attention์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ด๋‹ค. SigLIP Vision Encoder๋Š” ์ด๋ฏธ์ง€๋ฅผ 256๊ฐœ์˜ ๊ณ ์ •๋œ soft token์œผ๋กœ encodingํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์˜ˆ์ธก ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉฐ, Pan and Scan ๋ฐฉ๋ฒ•์œผ๋กœ ๋‹ค์–‘ํ•œ ์ข…ํšก๋น„์˜ ์ด๋ฏธ์ง€๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•œ๋‹ค. Knowledge Distillation์€ ์‚ฌ์ „ ํ•™์Šต ๋‹จ๊ณ„์—์„œ ์‚ฌ์šฉ๋˜์–ด Gemma3-4B-IT๊ฐ€ Gemma2-27B-IT์™€ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ–ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์  ์„ค๊ณ„๋ฅผ ํ†ตํ•ด ๊ฒฝ๋Ÿ‰ model(1B ~ 27B parameter)์ด ์†Œ๋น„์ž์šฉ ํ•˜๋“œ์›จ์–ด์—์„œ๋„ ์‹คํ–‰ ๊ฐ€๋Šฅํ•˜๋ฉด์„œ๋„ ๊ฐ•๋ ฅํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค.

๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Multi-modal' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[MLLM] GLM-4.5V ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ  (0) 2026.02.18
[MLLM] InternVL3.5 ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ  (0) 2026.02.18
Qwen3-VL ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ | VLM | MLLM  (2) 2026.01.10
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Visual Instruction Tuning | LLaVA Model  (1) 2024.12.04
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models  (0) 2024.12.04
'๐Ÿ› Research/Multi-modal' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [MLLM] GLM-4.5V ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ
  • [MLLM] InternVL3.5 ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ
  • Qwen3-VL ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ | VLM | MLLM
  • [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Visual Instruction Tuning | LLaVA Model
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    moovzi’s Doodle
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (216)
      • ๐Ÿ“– Fundamentals (34)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (16)
        • etc. (3)
      • ๐Ÿ› Research (78)
        • Deep Learning (7)
        • Perception (19)
        • OCR (7)
        • Multi-modal (8)
        • Image•Video Generation (18)
        • 3D Vision (4)
        • Material • Texture Recognit.. (8)
        • Large-scale Model (7)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (8)
        • Distributed Training & Infe.. (5)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (92)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Distributed Computing (6)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • etc. (7)
      • ๐Ÿ’ฌ ETC (4)
        • ์ฑ… ๋ฆฌ๋ทฐ (4)
  • ๋งํฌ

    • ๋ฆฌํ‹€๋ฆฌ ํ”„๋กœํ•„ (๋ฉ˜ํ† ๋ง, ๋ฉด์ ‘์ฑ…,...)
    • ใ€Ž๋‚˜๋Š” AI ์—”์ง€๋‹ˆ์–ด์ž…๋‹ˆ๋‹คใ€
    • Instagram
    • Brunch
    • Github
  • ์ธ๊ธฐ ๊ธ€

  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[MLLM] Gemma 3 ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”