[MLLM] InternVL3.5 Technical Report Review

2026. 2. 18. 14:33 · Research/Multi-modal

https://arxiv.org/abs/2508.18265

1. Introduction

InternVL3.5 is an open-source multimodal model that OpenGVLab released in August 2025, improving efficiency and performance at the same time. After the Qwen series, it is probably the model that comes up most often.

Earlier multimodal models focused on raising accuracy and paid comparatively little attention to inference speed and memory efficiency. InternVL3.5 shows that improving both together is essential for practical deployment. The roughly 4x inference speedup in particular makes a big difference in real-time applications.

The previous InternVL3 performed well across a wide range of multimodal tasks but left room for improvement in inference speed and memory efficiency. When deploying large models, the memory limit of a single GPU was a major problem. Processing every image at the same resolution is also wasteful: the background of a document is fine at low resolution, while regions containing text need high resolution, so the uniform approach spends computation where it is not needed.

The trade-off between efficiency and performance is a core challenge for multimodal models. In real-time applications such as GUI interaction and embodied agency, inference speed determines the user experience. If a user asks the model to click a GUI element and it takes several seconds to respond, it is not practical. InternVL3.5 tackles this problem with three techniques.

2. Technical Approach

InternVL3.5 proposes three core techniques to improve efficiency and performance together.

2.1. Cascade Reinforcement Learning

Cascade RL is a two-stage training framework: Offline RL first achieves stable convergence, and Online RL then performs fine-grained alignment. This coarse-to-fine strategy substantially improves reasoning performance while keeping training stable.

Stage 1: Offline RL (Mixed Preference Optimization)

  • Goal: reach stable convergence
  • Method: Mixed Preference Optimization (MPO)
  • Loss function:
    • Preference loss: learns from human preference data
    • Quality loss: evaluates response quality
    • Generation loss: improves generation ability
  • Effect: stable policy learning in the early stage
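
The three loss terms listed above can be read as a weighted sum. The sketch below is a minimal pure-Python illustration for a single preference pair. It assumes the preference term is a DPO-style loss, the quality term is a BCO-style binary loss with a reward shift `delta`, and the generation term is the per-token negative log-likelihood of the chosen response. The function names, the weights, `beta`, and `delta` are illustrative assumptions, not values taken from the report.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def mpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             num_chosen_tokens, beta=0.1, delta=0.0,
             w_pref=1.0, w_qual=1.0, w_gen=1.0):
    """Weighted sum of preference, quality, and generation terms (assumed form)."""
    # Implicit rewards: scaled log-ratio between the policy and a frozen reference.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)

    # Preference term (DPO-style): the chosen response should beat the rejected one.
    pref = -log_sigmoid(r_chosen - r_rejected)
    # Quality term (BCO-style): each response is judged on its own against delta.
    qual = -log_sigmoid(r_chosen - delta) - log_sigmoid(-(r_rejected - delta))
    # Generation term: per-token negative log-likelihood of the chosen response.
    gen = -logp_chosen / num_chosen_tokens

    total = w_pref * pref + w_qual * qual + w_gen * gen
    return {"preference": pref, "quality": qual, "generation": gen, "total": total}
```

When the policy equals the reference, both implicit rewards are zero, so the preference term is log 2 and only the generation term carries signal. As the policy raises the likelihood of the chosen response, the preference term falls.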

Stage 2: Online RL (Group Sequence Policy Optimization)

  • Goal: fine-grained alignment
  • Method: the GSPO (Group Sequence Policy Optimization) algorithm
  • Core mechanism:
    • Generate multiple candidate responses
    • Compute a normalized advantage for each candidate
    • Refine the policy based on the advantages
  • Effect: improves the policy by directly using responses generated by the model itself
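
The mechanism above (sample a group of candidates, normalize their rewards into advantages, then update the policy) can be sketched as follows. This is a simplified illustration under stated assumptions: advantages are standardized within the group using the population standard deviation, the importance ratio is computed per sequence and normalized by length, and the clipping range `clip_eps` is an arbitrary illustrative value. The dictionary keys and function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    # Standardize rewards within one group of candidates for the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def sequence_ratio(logp_new, logp_old, num_tokens):
    # Sequence-level importance ratio, normalized by response length.
    return math.exp((logp_new - logp_old) / num_tokens)

def gspo_objective(samples, clip_eps=0.2):
    """Clipped surrogate objective averaged over one group (to be maximized)."""
    advantages = group_advantages([s["reward"] for s in samples])
    terms = []
    for s, adv in zip(samples, advantages):
        ratio = sequence_ratio(s["logp_new"], s["logp_old"], s["num_tokens"])
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)
```

Because the advantage is relative to the other candidates in the group, no separate value model is needed in this sketch. A candidate only receives a positive update signal if it scored above the group mean.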

Details

  • Coarse-to-fine strategy: coarse alignment → fine-grained alignment
  • Training stability: Offline RL secures initial stability before Online RL is applied
  • Advantage normalization: stable learning through comparison across multiple candidates
  • Performance gain: up to +16.0% on reasoning tasks such as MMMU and MathVista

2.2. Visual Resolution Router (ViR)

The Visual Resolution Router (ViR) dynamically adjusts the resolution of visual tokens without degrading performance. It evaluates each image patch to decide its compression rate, selecting an appropriate resolution according to the complexity of the input image.

Key idea

  • Not every region of an image needs the same resolution
  • Semantically important regions (text, objects) keep high resolution
  • Less important regions (background) are compressed to low resolution

Details

  • Patch evaluation: score the importance of each image patch
  • Dynamic compression: choose the compression rate based on importance
    • Less important patches are compressed down to 64 tokens
    • Important patches are preserved at up to 256 tokens
  • Performance preservation: minimize loss of important information during compression
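
The routing step above can be illustrated with a small sketch. It assumes each patch already has an importance score in [0, 1] and uses a fixed threshold to pick between the two token budgets mentioned in the list (256 and 64). The `Patch` class, the threshold value, and the scores are hypothetical stand-ins. How the actual router scores patches is not shown here.

```python
from dataclasses import dataclass

FULL_TOKENS = 256        # budget for an important patch (from the list above)
COMPRESSED_TOKENS = 64   # budget for a less important patch (from the list above)

@dataclass
class Patch:
    name: str
    importance: float  # hypothetical router score in [0, 1]

def route_patches(patches, threshold=0.5):
    """Map each patch name to a token budget using a simple threshold rule."""
    return {
        p.name: FULL_TOKENS if p.importance >= threshold else COMPRESSED_TOKENS
        for p in patches
    }

def token_savings(budgets):
    """Fraction of visual tokens saved versus keeping every patch at 256 tokens."""
    baseline = FULL_TOKENS * len(budgets)
    return 1.0 - sum(budgets.values()) / baseline

# Toy document page: two text-heavy patches and two mostly empty ones.
page = [Patch("title_text", 0.9), Patch("body_text", 0.8),
        Patch("margin", 0.1), Patch("background", 0.05)]
budgets = route_patches(page)
```

In this toy page, half of the patches are compressed, so the total drops from 1024 to 640 tokens, a saving of 37.5%.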

On document and OCR tasks, ViR cuts the number of tokens by about 50% with no measurable performance drop, which greatly reduces computation cost. It is especially effective for document processing: text regions are kept at high resolution and backgrounds are compressed to low resolution, securing efficiency and performance together.
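
As a rough sanity check on the 50% figure, the short calculation below asks what share of patches would have to be compressed to reach a given token reduction. It assumes only two budgets exist per patch (256 or 64 tokens) and ignores any other tokens in the sequence. The function names are illustrative.

```python
def average_tokens_per_patch(compressed_fraction, full=256, compressed=64):
    # Expected tokens per patch when a given fraction of patches is compressed.
    return (1 - compressed_fraction) * full + compressed_fraction * compressed

def fraction_needed_for_reduction(target_reduction, full=256, compressed=64):
    # Solve (1 - p) * full + p * compressed = (1 - target_reduction) * full for p.
    return target_reduction * full / (full - compressed)

p = fraction_needed_for_reduction(0.5)
```

Under these assumptions, about two thirds of the patches must be routed to the 64-token path to halve the visual token count, and the upper bound (every patch compressed) is a 75% reduction.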

2.3. Decoupled Vision-Language Deployment (DvD)

Decoupled Vision-Language Deployment (DvD) places the vision encoder and the LLM on different GPUs to balance the computational load. This makes it possible to deploy large models efficiently even in memory-constrained environments.

Key idea

  • The vision encoder and the LLM have different computational loads
  • Placing both on a single GPU leads to out-of-memory errors or inefficient resource use
  • Placing them on separate GPUs balances the load

Details

  • Vision encoder placement: on a dedicated GPU
    • Performs image encoding
    • Produces visual tokens
  • LLM placement: on a different GPU
    • Processes visual tokens together with text
    • Generates the final response
  • Communication optimization: minimize data transfer between GPUs
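
The split described above is essentially a two-stage producer/consumer pipeline. The sketch below imitates it with two threads and a bounded queue. The threads stand in for the separate vision and language servers, and the queue stands in for the GPU-to-GPU link. All names and the fake "visual tokens" are illustrative assumptions. A real deployment would use separate processes or machines and an actual transport layer.

```python
import queue
import threading

def vision_server(images, out_q):
    # Stand-in for the vision encoder running on its own GPU.
    for request_id, image in images:
        visual_tokens = [f"<img:{image}:{i}>" for i in range(2)]  # fake features
        out_q.put((request_id, visual_tokens))
    out_q.put(None)  # sentinel: no more requests

def language_server(in_q, prompts, results):
    # Stand-in for the LLM running on a different GPU.
    while True:
        item = in_q.get()
        if item is None:
            break
        request_id, visual_tokens = item
        prompt = prompts[request_id]
        results[request_id] = f"answer({prompt}, {len(visual_tokens)} visual tokens)"

def run_decoupled(images, prompts):
    link = queue.Queue(maxsize=4)  # bounded buffer models the inter-GPU link
    results = {}
    producer = threading.Thread(target=vision_server, args=(images, link))
    consumer = threading.Thread(target=language_server, args=(link, prompts, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

Because the two stages run concurrently, the vision stage can start encoding the next image while the language stage is still generating the previous answer. Only the compact visual tokens cross the link, not the raw image.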

DvD speeds up inference by up to 4x and doubles throughput. It allows large models to be deployed in memory-constrained environments, and by balancing the load between the vision encoder and the LLM it improves resource utilization.

3. Experimental Results

InternVL3.5 shows large gains in both reasoning performance and efficiency. With Cascade RL it achieves up to +16.0% over InternVL3 on reasoning benchmarks such as MMMU and MathVista, which shows that the two-stage training strategy substantially improves reasoning ability.

With ViR, the number of tokens drops by about 50% with no performance loss, which is especially effective on document and OCR tasks. Less important image regions are processed at low resolution, while regions with text or important objects are preserved at high resolution.

With DvD, inference is 4x faster and throughput doubles. This is the result of balancing the computational load by placing the vision encoder and the LLM on different GPUs, and it also makes efficient deployment of large models possible in memory-constrained environments.

It also performs well on GUI interaction tasks and reaches SOTA among open-source MLLMs. In particular, InternVL3.5-241B-A28B delivers the best results among open-source MLLMs on general multimodal, reasoning, text, and agentic tasks, narrowing the gap with commercial models such as GPT-5.

4. Conclusion

The technical distinction of InternVL3.5 lies in three core techniques that improve efficiency and performance together. Cascade RL combines Offline RL and Online RL in two stages to achieve both stable convergence and fine-grained alignment, yielding up to +16.0% on reasoning tasks such as MMMU and MathVista. ViR scores the importance of each image patch and adjusts the token count dynamically, cutting tokens by about 50% on document and OCR tasks with no performance loss. DvD places the vision encoder and the LLM on different GPUs, speeding up inference by up to 4.05x and doubling throughput.

In particular, DvD is a new deployment strategy that overcomes the memory limit of a single GPU, and ViR is a dynamic resolution mechanism that removes the inefficiency of processing every image at the same resolution.
