[MLLM] GLM-4.5V Technical Report Review

2026. 2. 18. 14:36 · 🍛 Research/Multi-modal
๋ฐ˜์‘ํ˜•

https://arxiv.org/abs/2507.01006

1. Introduction

GLM-4.5V is a VLM released around August 11, 2025, built on the scalable multimodal RL recipe, including RLCS (Reinforcement Learning with Curriculum Sampling), introduced by Zhipu AI and Tsinghua University in their July 1, 2025 technical report. GLM-4.5V is based on GLM-4.5-Air (MoE, 106B total / 12B active parameters) and strengthens multimodal reasoning through a multimodal RL stack that combines RLCS with RLVR + RLHF, a unified reward system, dynamic sampling expansion, and more.

 

Existing multimodal models train on data from diverse domains but treat tasks the model can already handle and tasks it still struggles with identically. RLCS dynamically selects which tasks to train on according to the model's current capability, enabling efficient and stable training. In essence, it applies the idea of curriculum learning to reinforcement learning.

 

The earlier GLM-V series steadily improved multimodal understanding, but complex reasoning tasks, especially STEM problems and coding, still left room for improvement. Multimodal reasoning goes beyond generating text from an image: it requires analyzing the image's content and reasoning about it logically. For example, solving a math problem from its graph or finding a bug from a code screenshot demands reasoning abilities quite different from simple image captioning.

 

Reinforcement Learning์€ LLM์—์„œ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐ ํšจ๊ณผ์ ์ด์ง€๋งŒ, ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ชจ๋ธ์— ์ ์šฉํ•  ๋•Œ๋Š” ์—ฌ๋Ÿฌ ๋„๋ฉ”์ธ(STEM, ์ฝ”๋”ฉ, GUI agent ๋“ฑ)์—์„œ์˜ ์„ฑ๋Šฅ์„ ๊ท ํ˜•์žˆ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์ด ์–ด๋ ต๋‹ค. ๋ชจ๋ธ์ด ํ•œ ๋„๋ฉ”์ธ์—์„œ๋Š” ์ž˜ ํ•™์Šตํ•˜์ง€๋งŒ ๋‹ค๋ฅธ ๋„๋ฉ”์ธ์—์„œ๋Š” ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง€๋Š” ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค. RLCS๋Š” ์ด ๋ฌธ์ œ๋ฅผ curriculum sampling์œผ๋กœ ํ•ด๊ฒฐํ•œ๋‹ค. ๋ชจ๋ธ์˜ ํ˜„์žฌ ์—ญ๋Ÿ‰์„ ํ‰๊ฐ€ํ•˜๊ณ , ๊ทธ์— ๋งž๋Š” ๋‚œ์ด๋„์˜ ํƒœ์Šคํฌ๋ฅผ ์„ ํƒํ•˜์—ฌ ์ ์ง„์ ์œผ๋กœ ์–ด๋ ค์šด ์ž‘์—…์œผ๋กœ ํ™•์žฅํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค.

2. Technical Approach

GLM-4.5V์˜ ํ•ต์‹ฌ์€ Reinforcement Learning with Curriculum Sampling (RLCS)๋ฅผ ํฌํ•จํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ RL ์Šคํƒ์ด๋‹ค. RLCS๋Š” multi-domain reinforcement learning์„ ํ†ตํ•ด ๋ชจ๋ธ์˜ ์—ญ๋Ÿ‰์— ๋”ฐ๋ผ ๋™์ ์œผ๋กœ ํƒœ์Šคํฌ๋ฅผ ์„ ํƒํ•˜๋Š” ๋ฐฉ์‹์ด๋ฉฐ, RLVR + RLHF, unified reward system, dynamic sampling expansion ๋“ฑ๊ณผ ํ•จ๊ป˜ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ RL ๋ ˆ์‹œํ”ผ์˜ ํ•œ ๊ตฌ์„ฑ์š”์†Œ๋กœ ์ž‘๋™ํ•œ๋‹ค.

 

2.1. RLCS Framework Overview

RLCS is a framework that applies the idea of curriculum learning to reinforcement learning. Just as humans learn by starting with easy material and gradually moving to harder material, the model starts with tasks it can currently handle and progressively expands to more difficult ones. Whereas conventional RL treats every domain and difficulty level identically, RLCS paces training around the model's current competence.

Key Ideas

  • Apply the idea of curriculum learning to reinforcement learning
  • Start from tasks the model can currently handle and progressively expand to harder ones
  • Dynamically select which tasks to train on according to the model's capability

ํ•™์Šต์€ ์—ฌ๋Ÿฌ ๋‹จ๊ณ„๋กœ ์ง„ํ–‰๋œ๋‹ค. ์ฒซ์งธ, ๋‹ค์–‘ํ•œ ์ง€์‹ ์ง‘์•ฝ์  ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ฐ์ดํ„ฐ๋กœ vision foundation model์„ ๋Œ€๊ทœ๋ชจ pre-training์„ ์ง„ํ–‰ํ•œ๋‹ค. ๋‘˜์งธ, long-context, ๊ณ ํ•ด์ƒ๋„, ๋น„๋””์˜ค ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ continual training์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์…‹์งธ, SFT(Supervised Fine-Tuning) ๋‹จ๊ณ„์—์„œ long CoT(Chain-of-Thought) ์Šคํƒ€์ผ ์ •๋ ฌ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ RLCS๋ฅผ ํฌํ•จํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ RL ์Šคํƒ(RLVR + RLHF, unified reward system, dynamic sampling expansion ๋“ฑ)์„ ์ ์šฉํ•œ๋‹ค. ์ด ๋‹จ๊ณ„์—์„œ model์˜ ์—ญ๋Ÿ‰์„ ํ‰๊ฐ€ํ•˜๊ณ  ๊ทธ์— ๋งž๋Š” ๋‚œ์ด๋„์˜ task๋ฅผ ์„ ํƒํ•˜์—ฌ ์ ์ง„์ ์œผ๋กœ ์–ด๋ ค์šด ์ž‘์—…์œผ๋กœ ํ™•์žฅํ•œ๋‹ค.

 

Training Stages

  1. Pre-training: large-scale pre-training of the vision foundation model on diverse, knowledge-intensive multimodal data
  2. Continual Training: continued training for long-context, high-resolution, and video processing
  3. SFT: Supervised Fine-Tuning for long-CoT style alignment
  4. Multi-modal RL: the RL stack including RLCS (RLVR + RLHF, unified reward system, dynamic sampling expansion, etc.)
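The four stages above can be sketched as an ordered pipeline. The stage names and callables here are illustrative placeholders, not APIs from the report:

```python
# Hypothetical sketch of the four-stage training order; the model is
# threaded through each stage function in sequence.
STAGES = [
    ("pretrain", "knowledge-intensive multimodal pre-training"),
    ("continual", "long-context / high-resolution / video continual training"),
    ("sft", "long chain-of-thought supervised fine-tuning"),
    ("multimodal_rl", "RLCS + RLVR/RLHF with unified reward system"),
]

def run_pipeline(model, stage_fns):
    """Apply each stage function in order, passing the model along."""
    for name, _description in STAGES:
        model = stage_fns[name](model)
    return model
```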

2.2. Dynamic Task Selection Mechanism

The core of RLCS is to continuously evaluate the model's performance in each domain (STEM, coding, GUI agents, etc.) and adjust the training strategy based on those results.

 

๋„๋ฉ”์ธ๋ณ„ ์„ฑ๋Šฅ ํ‰๊ฐ€

  • Model์ด ๊ฐ ๋„๋ฉ”์ธ(STEM, ์ฝ”๋”ฉ, GUI agent ๋“ฑ)์—์„œ์˜ ์„ฑ๋Šฅ์„ ์ง€์†์ ์œผ๋กœ ํ‰๊ฐ€
  • ์„ฑ๋Šฅ ํ‰๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ํ•™์Šต ์ „๋žต ์กฐ์ •

Model์ด ์ˆ˜ํ•™ ๋ฌธ์ œ๋Š” ์ž˜ ํ’€์ง€๋งŒ ์ฝ”๋”ฉ ๋ฌธ์ œ๋Š” ์–ด๋ ค์›Œํ•œ๋‹ค๋ฉด, ์ฝ”๋”ฉ task์˜ ๋‚œ์ด๋„๋ฅผ ๋‚ฎ์ถ”๊ณ  ๋” ๋งŽ์€ ์ฝ”๋”ฉ ์ƒ˜ํ”Œ์„ ์ œ๊ณตํ•œ ํ›„, ์ ์ง„์ ์œผ๋กœ ์ฝ”๋”ฉ task์˜ ๋‚œ์ด๋„๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค. ๋ฐ˜๋Œ€๋กœ ์„ฑ๋Šฅ์ด ์ข‹์€ ๋„๋ฉ”์ธ์—์„œ๋Š” ๋” ์–ด๋ ค์šด task๋กœ ํ™•์žฅํ•˜์—ฌ model์˜ ๋Šฅ๋ ฅ์„ ๋”์šฑ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค.

 

ํƒœ์Šคํฌ ์„ ํƒ ์ „๋žต

  • ์„ฑ๋Šฅ์ด ์ข‹์€ ๋„๋ฉ”์ธ: ๋” ์–ด๋ ค์šด task๋กœ ํ™•์žฅ
  • ์„ฑ๋Šฅ์ด ๋‚ฎ์€ ๋„๋ฉ”์ธ: ๋‚œ์ด๋„๋ฅผ ๋‚ฎ์ถ”๊ฑฐ๋‚˜ ๋” ๋งŽ์€ ์ƒ˜ํ”Œ ์ œ๊ณต
  • ์˜ˆ์‹œ: Model์ด ์ˆ˜ํ•™ ๋ฌธ์ œ๋Š” ์ž˜ ํ’€์ง€๋งŒ ์ฝ”๋”ฉ ๋ฌธ์ œ๋Š” ์–ด๋ ค์›Œํ•œ๋‹ค๋ฉด
    • ์ฝ”๋”ฉ task์˜ ๋‚œ์ด๋„๋ฅผ ๋‚ฎ์ถค
    • ๋” ๋งŽ์€ ์ฝ”๋”ฉ ์ƒ˜ํ”Œ ์ œ๊ณต
    • ์ ์ง„์ ์œผ๋กœ ์ฝ”๋”ฉ task์˜ ๋‚œ์ด๋„ ์ฆ๊ฐ€

์ด๋Ÿฌํ•œ ๋™์  ํƒœ์Šคํฌ ์„ ํƒ์€ multi-domain RL์„ ํ†ตํ•ด ์—ฌ๋Ÿฌ ๋„๋ฉ”์ธ์—์„œ ๋™์‹œ์— ํ•™์Šตํ•˜๋ฉด์„œ๋„, ๊ฐ ๋„๋ฉ”์ธ์˜ ์„ฑ๋Šฅ์— ๋”ฐ๋ผ ๋‚œ์ด๋„๋ฅผ ์กฐ์ •ํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋„๋ฉ”์ธ ๊ฐ„ ์„ฑ๋Šฅ ๋ถˆ๊ท ํ˜•์„ ๋ฐฉ์ง€ํ•˜๊ณ , ์ ์ง„์  ๋‚œ์ด๋„ ์ฆ๊ฐ€๋กœ ํ•™์Šต ์•ˆ์ •์„ฑ์„ ํ™•๋ณดํ•œ๋‹ค.

 

Details

  • Multi-domain RL: train on multiple domains simultaneously
  • Dynamic difficulty adjustment: tune task difficulty to the model's current capability
  • Balanced training: prevent performance imbalance across domains
  • Stable training: secure training stability through gradual difficulty increases

2.3. Training Efficiency and Performance Gains

RLCS lets the model focus on difficult tasks instead of wasting time on ones it already handles well, making training more efficient. The multimodal RL stage (comprising RLCS along with several stabilization and sampling recipes) delivered up to a +10.6% performance gain, with balanced improvements across domains (STEM, coding, GUI agents, video understanding). It also works well on the smaller GLM-4.1V-9B-Thinking, showing that curriculum sampling efficiently unlocks a model's potential.

 

3. Experimental Results

GLM-4.5V achieves SOTA performance among open-source models on 42 public benchmarks. In STEM problem solving it shows strong results on math, physics, and chemistry problems, reasoning accurately even on complex problems involving graphs or diagrams. In coding tasks it excels at generating code from images or videos and at analyzing code, including producing runnable code from a screenshot.

 

In GUI agent tasks, the model is competitive with the closed-source Gemini-2.5-Flash at recognizing GUI elements from screenshots and carrying out actions, showing that the multimodal RL stack including RLCS improved performance evenly across diverse domains.

Notably, GLM-4.1V-9B-Thinking outperforms the 72B-parameter Qwen2.5-VL-72B on 29 benchmarks with only 9B parameters, striking a strong balance between efficiency and performance. This demonstrates that the multimodal RL stack including RLCS works effectively even for small models, and that curriculum sampling efficiently unlocks a model's potential.

 

4. Conclusion

GLM-4.5V์˜ ๊ธฐ์ˆ ์  ํŠน์ด์ ์€ RLCS (Reinforcement Learning with Curriculum Sampling)๋ฅผ ํฌํ•จํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ RL ์Šคํƒ์ด๋‹ค. RLCS๋Š” curriculum learning์˜ ์•„์ด๋””์–ด๋ฅผ reinforcement learning์— ์ ์šฉํ•˜์—ฌ, model์˜ ์—ญ๋Ÿ‰์— ๋”ฐ๋ผ ๋™์ ์œผ๋กœ ํ•™์Šตํ•  task๋ฅผ ์„ ํƒํ•˜๋Š” ๋ฐฉ์‹์ด๋‹ค. ๊ธฐ์กด RL ๋ฐฉ์‹์ด ๋ชจ๋“  ๋„๋ฉ”์ธ์„ ๋™์ผํ•˜๊ฒŒ ํ•™์Šตํ•˜์—ฌ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ–ˆ๋˜ ๊ฒƒ๊ณผ ๋‹ฌ๋ฆฌ, RLCS๋Š” ๋„๋ฉ”์ธ๋ณ„ ์„ฑ๋Šฅ์„ ์ง€์†์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ณ  ๊ทธ์— ๋งž๋Š” ๋‚œ์ด๋„์˜ task๋ฅผ ์„ ํƒํ•˜์—ฌ ์ ์ง„์ ์œผ๋กœ ์–ด๋ ค์šด ์ž‘์—…์œผ๋กœ ํ™•์žฅํ•œ๋‹ค. RLCS๋Š” RLVR + RLHF, unified reward system, dynamic sampling expansion ๋“ฑ๊ณผ ํ•จ๊ป˜ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ RL ๋ ˆ์‹œํ”ผ์˜ ํ•œ ๊ตฌ์„ฑ์š”์†Œ๋กœ ์ž‘๋™ํ•œ๋‹ค.

 

As a result, the multimodal RL stage (including several stabilization and sampling recipes) achieved up to a +10.6% performance gain, and in particular the small GLM-4.1V-9B-Thinking outperformed the much larger Qwen2.5-VL-72B, demonstrating that curriculum sampling efficiently unlocks a model's potential.

๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Multi-modal' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[MLLM] Gemma 3 ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ  (0) 2026.02.18
[MLLM] InternVL3.5 ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ  (0) 2026.02.18
Qwen3-VL ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ | VLM | MLLM  (2) 2026.01.10
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Visual Instruction Tuning | LLaVA Model  (1) 2024.12.04
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models  (0) 2024.12.04
'๐Ÿ› Research/Multi-modal' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [MLLM] Gemma 3 ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ
  • [MLLM] InternVL3.5 ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ
  • Qwen3-VL ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ | VLM | MLLM
  • [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Visual Instruction Tuning | LLaVA Model
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    moovzi’s Doodle
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (216)
      • ๐Ÿ“– Fundamentals (34)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (16)
        • etc. (3)
      • ๐Ÿ› Research (78)
        • Deep Learning (7)
        • Perception (19)
        • OCR (7)
        • Multi-modal (8)
        • Image•Video Generation (18)
        • 3D Vision (4)
        • Material • Texture Recognit.. (8)
        • Large-scale Model (7)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (8)
        • Distributed Training & Infe.. (5)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (92)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Distributed Computing (6)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • etc. (7)
      • ๐Ÿ’ฌ ETC (4)
        • ์ฑ… ๋ฆฌ๋ทฐ (4)
  • ๋งํฌ

    • ๋ฆฌํ‹€๋ฆฌ ํ”„๋กœํ•„ (๋ฉ˜ํ† ๋ง, ๋ฉด์ ‘์ฑ…,...)
    • ใ€Ž๋‚˜๋Š” AI ์—”์ง€๋‹ˆ์–ด์ž…๋‹ˆ๋‹คใ€
    • Instagram
    • Brunch
    • Github
  • ์ธ๊ธฐ ๊ธ€

  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[MLLM] GLM-4.5V ํ…Œํฌ๋‹ˆ์ปฌ ๋ฆฌํฌํŠธ ๋ฆฌ๋ทฐ
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”