๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๐Ÿ› Research/Multi-modal

[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

by ๋ญ…์ฆค 2024. 12. 4.
๋ฐ˜์‘ํ˜•

๐Ÿ’ก BLIP-2

1. ์—ฐ๊ตฌ ์ฃผ์ œ์™€ ์ฃผ์š” ๊ธฐ์—ฌ

 

BLIP-2 ๋…ผ๋ฌธ์€ Multi-modal Vision Language Pre-training(VLP)์— ๋Œ€ํ•œ ๋น„์šฉ ํšจ์œจ์ ์ธ ์ƒˆ๋กœ์šด ์ ‘๊ทผ๋ฒ•์„ ์ œ์•ˆํ–ˆ์–ด์š”. ๊ธฐ์กด์˜ ํฐ ๋ชจ๋ธ์„ end-to-end ๋กœ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ์‹์˜ ๋†’์€ ๊ณ„์‚ฐ ๋น„์šฉ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ์ด๋ฏธ ํ•™์Šต๋œ ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”์™€ ๋Œ€ํ˜• ์–ธ์–ด ๋ชจ๋ธ(LLM)์„ ๊ณ ์ •(frozen)ํ•œ ์ฑ„๋กœ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ณ ์•ˆํ–ˆ์–ด์š”.

 

  • Querying Transformer (Q-Former): a lightweight module proposed to effectively close the modality gap (the mismatch between images and text).
  • Two-stage pre-training: a Representation Learning plus Generative Learning strategy that combines the strengths of existing models, achieving both performance and efficiency.
  • State-of-the-art performance with 54× fewer trainable parameters than models such as Flamingo.

 

2. Comparing CLIP, BLIP, and BLIP-2

ํŠน์ง• CLIP BLIP BLIP-2
Pre-training ๋ฐฉ์‹ Image-Text
Contrastive Learning
Contrastive Learning +
Generative Learning
Two-stage Learning
(Representation + Generative)
๋ชจ๋ธ ๊ตฌ์กฐ Dual-Encoder Encoder-Decoder Q-Former +
Frozen Image/Language Models
Trainable
Parameters
์•ฝ 428M ์•ฝ 583M ์•ฝ 188M
์ฃผ์š” ํŠน์ง• ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ๊ฐ„ Representation Alignment Image Captioning ๋ฐ VQA ๊ฐ€๋Šฅ ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ์˜ ๊ท ํ˜•
์žฅ์  ๋น ๋ฅธ ํ•™์Šต ์†๋„ ์ƒ์„ฑ ๊ธฐ๋ฐ˜ ์ž‘์—…์— ์œ ๋ฆฌ ์ตœ์†Œํ•œ์˜ ๊ณ„์‚ฐ ๋น„์šฉ์œผ๋กœ SOTA ์„ฑ๋Šฅ
ํ•œ๊ณ„ Generative ๋Šฅ๋ ฅ ๋ถ€์กฑ ๊ณ„์‚ฐ ๋น„์šฉ์ด ํฌ๋‹ค Frozen Models ์˜์กด์„ฑ

 

3. ์—ฐ๊ตฌ ๋ฐฐ๊ฒฝ ๋ฐ ๋™ํ–ฅ 

Vision-language research learns joint representations of images and text, and has advanced rapidly on tasks such as image captioning, visual question answering (VQA), and image-text retrieval.

 

๊ธฐ์กด CLIP(Radford et al., 2021)๋Š” ํšจ์œจ์ ์ธ Contrastive Learning ๋ฐฉ์‹์„ ํ†ตํ•ด ๊ฐ•๋ ฅํ•œ Zero-shot ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์คฌ์ง€๋งŒ, Generative Task์—์„œ๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์—ˆ๊ณ , BLIP(Li et al., 2022)๋Š” Contrastive์™€ Generative Task๋ฅผ ๋ชจ๋‘ ์ง€์›ํ–ˆ์ง€๋งŒ, ๊ณ„์‚ฐ ๋น„์šฉ์ด ๋งŽ์ด ๋“œ๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์—ˆ์–ด์š”.

 

์ตœ๊ทผ์—๋Š” Flamingo(Alayrac et al., 2022)์ฒ˜๋Ÿผ Frozen Models๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํšจ์œจ์„ฑ์„ ๋†’์ด๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ์–ด์š”.

 

4. ์ฃผ์š” ์ œ์•ˆ

 

BLIP-2์˜ ํ•ต์‹ฌ์€ Q-Former์™€ ์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ Two-stage Pre-training์ด์—์š”. ์ด ๊ตฌ์กฐ๋Š” Frozen Image Encoder์™€ Frozen LLM ๊ฐ„์˜ Modality Gap(๋ชจ๋‹ฌ๋ฆฌํ‹ฐ ๊ฒฉ์ฐจ)์„ ํšจ์œจ์ ์œผ๋กœ ํ•ด์†Œํ•˜๊ณ , ๊ณ„์‚ฐ ์ž์›์„ ์•„๋ผ๋ฉด์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์—ˆ์–ด์š”.

4.1. Q-Former

Q-Former๋Š” Frozen Image Encoder์™€ Frozen LLM์„ ์—ฐ๊ฒฐํ•˜๋Š” ๊ฒฝ๋Ÿ‰ Transformer ๋ชจ๋“ˆ๋กœ, ๋‘ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ ๊ฐ„์˜ ์ •๋ณด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๊ตํ™˜ํ•˜๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋˜์—ˆ์–ด์š”. ์ด ๋ชจ๋“ˆ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฃผ์š” ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

 

4.1.1. Learnable Query Vectors 

Q-Former๋Š” Learnable Query Vectors๋ผ๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋ฒกํ„ฐ๋ฅผ ํ†ตํ•ด ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”์˜ ๊ณ ์ •๋œ ์‹œ๊ฐ์  ํ‘œํ˜„์—์„œ ๊ฐ€์žฅ ์œ ์šฉํ•œ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•ด์š”. ์˜ˆ๋ฅผ ๋“ค์–ด, ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”์—์„œ 257๊ฐœ์˜ ์ด๋ฏธ์ง€ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ์ถœ๋ ฅํ–ˆ๋‹ค๋ฉด, Q-Former๋Š” 32๊ฐœ์˜ Query Vectors๋ฅผ ํ•™์Šตํ•ด ์ด ์ค‘ ํ…์ŠคํŠธ ์ƒ์„ฑ์— ํ•„์š”ํ•œ ํ•ต์‹ฌ ์ •๋ณด๋งŒ ์š”์•ฝํ•ด์„œ ๊ฐ€์ ธ์™€์š”.

 

์ด ๊ณผ์ •์—์„œ Cross-Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ํ™œ์šฉํ•ด Query์™€ ์ด๋ฏธ์ง€ ํŠน์ง• ๊ฐ„์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ์ •๋ณด์˜ ์ค‘์š”๋„๋ฅผ ํŒ๋‹จํ•ด ํ•„์ˆ˜์ ์ธ ์‹œ๊ฐ์  ๋‹จ์„œ๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ํ…์ŠคํŠธ ์ƒ์„ฑ๊ณผ ๊ฐ™์€ ์–ธ์–ด ์ž‘์—…์— ํ•„์š”ํ•œ ์ •๋ณด๋งŒ ์„ ๋ณ„์ ์œผ๋กœ ์ „๋‹ฌํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋ชจ๋ธ์˜ ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ์„ ๋™์‹œ์— ํ™•๋ณดํ•  ์ˆ˜ ์žˆ์–ด์š”.

 

4.1.2. The Bottleneck Role

Q-Former๋Š” ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”์™€ LLM ์‚ฌ์ด์—์„œ ์ •๋ณด bottleneck ์—ญํ• ์„ ํ•ด์š”. ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”๊ฐ€ ์ถœ๋ ฅํ•˜๋Š” ๋Œ€๊ทœ๋ชจ์˜ ์‹œ๊ฐ ์ •๋ณด๋ฅผ LLM์ด ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•„์š”ํ•œ ํ•ต์‹ฌ ์ •๋ณด๋กœ ๊ฐ„์†Œํ™”ํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘ก๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์—์„œ Learnable Query Vectors๋ฅผ ํ™œ์šฉํ•ด ํ•„์ˆ˜์ ์ธ ์‹œ๊ฐ์  ๋‹จ์„œ๋ฅผ ์ถ”์ถœํ•˜๊ณ , ์ด๋ฅผ LLM์— ์ „๋‹ฌํ•˜์—ฌ ํ…์ŠคํŠธ ์ƒ์„ฑ์ด๋‚˜ ์งˆ๋ฌธ ์‘๋‹ต๊ณผ ๊ฐ™์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ํ•„์š”ํ•œ ์ตœ์†Œํ•œ์˜ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋„๋ก ์„ค๊ณ„๋ผ์š”. ์ด๋ ‡๊ฒŒ ์ •๋ณด์˜ ์–‘์„ ์ค„์ด๋ฉด์„œ๋„ ์ค‘์š”ํ•œ ๋‚ด์šฉ์„ ๋†“์น˜์ง€ ์•Š๋„๋ก ์ตœ์ ํ™”ํ•จ์œผ๋กœ์จ ๋ชจ๋ธ์˜ ๊ณ„์‚ฐ ์ž์›์„ ์ ˆ์•ฝํ•˜๊ณ  ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์–ด์š”.

 

4.1.3. Transformer์˜ ๋‘ ๊ฐ€์ง€ ๋ชจ๋“œ

Q-Former๋Š” ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ๊ฐ„์˜ ์ •๋ณด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ๋‚ด๋ถ€์ ์œผ๋กœ ๋‘ ๊ฐ€์ง€ ๋ชจ๋“œ์˜ Transformer๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์–ด์š”.

  • Image transformer: the learnable query vectors interact with the frozen image encoder's output through cross-attention, selectively extracting the visual information needed for text generation.
  • Text transformer: shares its self-attention layers with the image transformer and can act as both a text encoder and a text decoder, letting the queries and the text interact during pre-training so the selected visual information can be grounded in natural language.

์ด ๋‘ ๊ฐ€์ง€ ๋ชจ๋“œ๋Š” ๊ฐ๊ฐ ์‹œ๊ฐ ์ •๋ณด์˜ ์ดํ•ด์™€ ์–ธ์–ด์  ํ‘œํ˜„ ๊ฐ„์˜ ๋‹ค๋ฆฌ ์—ญํ• ์„ ํ•˜๋ฉฐ, ๋‘ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ์˜ ๊ฒฉ์ฐจ๋ฅผ ์ค„์ด๊ณ  ์ •๋ณด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์„ค๊ณ„๋Š” BLIP-2๊ฐ€ ๋‹ค์–‘ํ•œ Vision-Language ์ž‘์—…์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜๋Š” ๋ฐ ํ•ต์‹ฌ์ ์ธ ์—ญํ• ์„ ํ•ด์š”.

 

4.2. Two-stage Pre-training

BLIP-2๋Š” Q-Former๋ฅผ ํ•™์Šต์‹œํ‚ค๊ณ  Vision-to-Language ์ž‘์—…์„ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด Representation Learning๊ณผ Generative Learning์˜ ๋‘ ๋‹จ๊ณ„ Pre-training ์ „๋žต์„ ์‚ฌ์šฉํ•ด์š”. ์ด๋Ÿฌํ•œ ์ „๋žต์€ ๊ณ„์‚ฐ ์ž์›์„ ํšจ์œจ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋ฉด์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๊ตฌํ˜„ํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ๋‹ต๋‹ˆ๋‹ค.

 

4.2.1. Representation Learning

Representation Learning์€ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ๊ฐ„ ๊ฐ•๋ ฅํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ‘œํ˜„์„ ํ•™์Šตํ•˜๋Š” ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„์˜ˆ์š”. ์ด ๋‹จ๊ณ„๋Š” ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ์ •๋ ฌํ•˜๊ณ , ํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ์ •ํ™•ํžˆ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ๋„๋ก Q-Former๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘ก๋‹ˆ๋‹ค. ์ฃผ์š” ํ•™์Šต ๋ชฉํ‘œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์•„์š”.

 

1) Image-Text Contrastive Learning (ITC)
ITC aligns images and text in a shared embedding space. The Q-Former's query vectors extract the image's visual features and compare them with the text representation, learning to pull positive pairs close together and push negative pairs apart.

 

์˜ˆ๋ฅผ ๋“ค์–ด, "๊ณ ์–‘์ด๊ฐ€ ์„ ๊ธ€๋ผ์Šค๋ฅผ ์“ฐ๊ณ  ์žˆ๋Š” ์ด๋ฏธ์ง€"์™€ "๊ณ ์–‘์ด๊ฐ€ ์„ ๊ธ€๋ผ์Šค๋ฅผ ์“ฐ๊ณ  ์žˆ๋‹ค"๋Š” ํ…์ŠคํŠธ๋Š” positive pair๋กœ ๊ฐ„์ฃผ๋ผ์š”. ๋ฐ˜๋ฉด, "๊ฐ•์•„์ง€๊ฐ€ ๋›ฐ์–ด๋†€๊ณ  ์žˆ๋Š” ์ด๋ฏธ์ง€"๋Š” negative pair๋กœ ์ฒ˜๋ฆฌ๋ฉ๋‹ˆ๋‹ค.

 

์ด ์ž‘์—…์„ ํ†ตํ•ด ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ ๊ฐ„ ์ „๋ฐ˜์ ์ธ ์˜๋ฏธ์  ์ผ์น˜๋ฅผ ํ•™์Šตํ•ด์š”.

 

2) Image-Text Matching (ITM)
ITM learns fine-grained alignment between images and text. The model is trained as a binary classifier that predicts whether a given image-text pair matches, which helps it understand the detailed relationship between the two.

 

์˜ˆ๋ฅผ ๋“ค์–ด, "์‚ฌ๊ณผ"๋ผ๋Š” ํ…์ŠคํŠธ์™€ "์‚ฌ๊ณผ ์‚ฌ์ง„"์€ positive pair๋กœ ํ•™์Šต๋˜์ง€๋งŒ, "๋ฐ”๋‚˜๋‚˜"๋ผ๋Š” ํ…์ŠคํŠธ์™€ "์‚ฌ๊ณผ ์‚ฌ์ง„"์€ negative pair๋กœ ํ•™์Šต๋ผ์š”.

 

์ด ๊ณผ์ •์€ ๋ชจ๋ธ์ด ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ๊ฐ„์˜ ๊ตฌ์ฒด์ ์ด๊ณ  ์„ธ๋ฐ€ํ•œ ๋งฅ๋ฝ์„ ์ดํ•ดํ•˜๋„๋ก ๋งŒ๋“ญ๋‹ˆ๋‹ค.

 

3) Image-grounded Text Generation (ITG)
ITG trains the Q-Former to extract the visual information needed for text generation. The query vectors selectively pull the essential information from the image and pass it to the text transformer, which learns to generate fluent text from it.

 

์˜ˆ๋ฅผ ๋“ค์–ด, ์‚ฌ์ง„ ์„ค๋ช… ์ƒ์„ฑ ์ž‘์—…์—์„œ "์ด ์‚ฌ์ง„์€ ์„ ๊ธ€๋ผ์Šค๋ฅผ ์“ด ๊ณ ์–‘์ด๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค"์™€ ๊ฐ™์€ ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•˜๋„๋ก ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

 

Representation Learning์˜ ์˜๋ฏธ 
Representation Learning์€ LLM์ด ์ฒ˜๋ฆฌํ•ด์•ผ ํ•  ์‹œ๊ฐ ์ •๋ณด์˜ ์–‘์„ ์ค„์ด๊ณ , Q-Former๊ฐ€ ์ตœ์ ํ™”๋œ ์‹œ๊ฐ ํ‘œํ˜„์„ LLM์— ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•™์Šตํ•˜๋Š” ๊ธฐ์ดˆ ๋‹จ๊ณ„์˜ˆ์š”. ์ด๋ฅผ ํ†ตํ•ด Frozen Image Encoder์™€ Q-Former ๊ฐ„ ํ˜‘๋ ฅ์„ ๊ฐ•ํ™”ํ•˜๊ณ , ํ…์ŠคํŠธ ์ƒ์„ฑ์— ํ•„์š”ํ•œ ์‹œ๊ฐ์  ์ •๋ณด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ •๋ ฌํ•ฉ๋‹ˆ๋‹ค.

 

4.2.2. Generative Learning: Optimizing Text Generation

 

Generative Learning์€ Representation Learning ์ดํ›„ ์ง„ํ–‰๋˜๋Š” ๋‘ ๋ฒˆ์งธ ๋‹จ๊ณ„๋กœ, LLM๊ณผ์˜ ์—ฐ๊ฒฐ์„ ์ตœ์ ํ™”ํ•ด ํ…์ŠคํŠธ ์ƒ์„ฑ ๋Šฅ๋ ฅ์„ ๊ฐ•ํ™”ํ•˜๋Š” ๋ฐ ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ์–ด์š”. ์ด ๋‹จ๊ณ„๋Š” LLM์ด ์‹œ๊ฐ ์ •๋ณด๋ฅผ ์ž์—ฐ์–ด๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์„ ์ •๊ตํ•˜๊ฒŒ ์กฐ์œจํ•ฉ๋‹ˆ๋‹ค.

 

1) Soft Prompting with Q-Former
Rather than wiring the Q-Former directly into the frozen LLM, its query representations are converted into soft prompts and fed to the LLM as input. This strengthens the model's ability to express visual information as text without changing the LLM's architecture or weights.

 

์˜ˆ๋ฅผ ๋“ค์–ด, ์ด๋ฏธ์ง€์—์„œ ์ถ”์ถœํ•œ Query Representation์„ "์ด๋ฏธ์ง€ ์„ค๋ช…:"์ด๋ผ๋Š” ํ…์ŠคํŠธ์™€ ๊ฒฐํ•ฉํ•ด ์ž…๋ ฅํ•˜๋ฉด, LLM์ด "๊ณ ์–‘์ด๊ฐ€ ์„ ๊ธ€๋ผ์Šค๋ฅผ ์“ฐ๊ณ  ์žˆ๋‹ค"์™€ ๊ฐ™์€ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•˜๋„๋ก ์œ ๋„ํ•ด์š”.

 

2) Decoder-based LLMs (OPT) vs. Encoder-Decoder LLMs (FlanT5)

  • OPT (decoder-only)
    The query representations act like extra tokens prepended to the LLM's input text.
    (Example: "Image description: a cat wearing sunglasses.")
  • FlanT5 (encoder-decoder)
    The query representations are combined with the encoder's input to guide text generation.
    (Example: "generate image description -> output text.")

The role of Generative Learning
Generative Learning optimizes how the visual information extracted by the Q-Former is passed to the frozen LLM and turned into natural text. This approach achieves strong vision-to-language performance without modifying the LLM's architecture.

 

 

4.2.3. Strengths of Two-Stage Pre-training

Representation Learning๊ณผ Generative Learning์˜ ์กฐํ•ฉ์€ BLIP-2๊ฐ€ ์ ์€ Trainable Parameters๋กœ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ด์ค๋‹ˆ๋‹ค.

  • Frozen models let BLIP-2 exploit the strengths of state-of-the-art pretrained models while saving compute.
  • The Q-Former passes along only the information essential for vision-to-language tasks, maximizing efficiency.
  • The approach is optimized for turning visual data into linguistic expression, yielding strong results across diverse vision-language tasks.

BLIP-2์˜ Two-Stage Pre-training ์ „๋žต์€ Vision-Language ์—ฐ๊ตฌ์˜ ์ƒˆ๋กœ์šด ๊ธฐ์ค€์„ ์ œ์‹œํ•˜๋ฉฐ, ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ์˜ ์™„๋ฒฝํ•œ ๊ท ํ˜•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

 

5. ์‹คํ—˜ ๊ฒฐ๊ณผ

 

BLIP-2๋Š” ๋‹ค์–‘ํ•œ Vision-Language ์ž‘์—…์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ์ž…์ฆํ•˜๋ฉฐ, ํšจ์œจ์„ฑ๊ณผ ์ •ํ™•์„ฑ์„ ๋™์‹œ์— ๋ณด์—ฌ์คฌ์–ด์š”. Zero-shot VQA ์ž‘์—…์—์„œ Flamingo80B ๋Œ€๋น„ 8.7% ๋” ๋†’์€ ์ •ํ™•๋„(65.0%)๋ฅผ ๊ธฐ๋กํ–ˆ์œผ๋ฉฐ, ์ด๋Š” 54๋ฐฐ ์ ์€ Trainable Parameters๋กœ ๋‹ฌ์„ฑ๋œ ๊ฒฐ๊ณผ๋กœ ๋ชจ๋ธ์˜ ๊ฒฝ๋Ÿ‰ํ™”์™€ ํšจ์œจ์„ฑ์„ ์ž˜ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ์„ค๋ช… ์ƒ์„ฑ์—์„œ๋Š” COCO์™€ NoCaps ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ฐ๊ฐ CIDEr ์ ์ˆ˜ 145.8๊ณผ 121.6์œผ๋กœ ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๊ธฐ๋กํ•˜๋ฉฐ, ์ด๋ฏธ์ง€์˜ ์‹œ๊ฐ์  ์ •๋ณด๋ฅผ ์ž์—ฐ์Šค๋Ÿฌ์šด ํ…์ŠคํŠธ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋Šฅ๋ ฅ์„ ์ž…์ฆํ–ˆ์–ด์š”. ๋˜ํ•œ, ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๊ฒ€์ƒ‰์—์„œ๋Š” Flickr30K์™€ COCO ๋ฐ์ดํ„ฐ์…‹์—์„œ Recall@1 ๊ธฐ์ค€ ๊ฐ๊ฐ 97.6%์™€ 85.4%๋ฅผ ๋‹ฌ์„ฑํ•˜๋ฉฐ, ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ •๊ตํ•˜๊ฒŒ ์ดํ•ดํ•˜๋Š” ๋ชจ๋ธ์˜ ๊ฐ•์ ์„ ๋ณด์—ฌ์คฌ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๋Š” BLIP-2๊ฐ€ ์ ์€ ์ž์›์œผ๋กœ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๊ตฌํ˜„ํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ Vision-Language ์ž‘์—…์—์„œ ๋‹ค๋ชฉ์ ์„ฑ๊ณผ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ˜ ๋ชจ๋ธ์ž„์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

 

6. ๊ฒฐ๋ก 

BLIP-2๋Š” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ AI์—์„œ ๊ณ„์‚ฐ ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ์„ ํ˜์‹ ์ ์œผ๋กœ ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ๋กœ, Q-Former๋ฅผ ํ™œ์šฉํ•ด ๊ณ ์ •๋œ ์ด๋ฏธ์ง€ ๋ฐ ์–ธ์–ด ๋ชจ๋ธ ๊ฐ„์˜ ๊ฐ„๊ทน์„ ํšจ๊ณผ์ ์œผ๋กœ ํ•ด์†Œํ•˜๋ฉฐ ์ƒˆ๋กœ์šด ๊ฐ€๋Šฅ์„ฑ์„ ์—ด์–ด์คฌ์–ด์š”. ๊ณ ์ •๋œ ์–ธ์–ด ๋ชจ๋ธ(OPT, FlanT5 ๋“ฑ)์— ์˜์กดํ•˜๋Š” ๋งŒํผ ์„ ํƒํ•œ ์–ธ์–ด ๋ชจ๋ธ์˜ ํ’ˆ์งˆ์ด ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๊ณ , ๋‹ค์ค‘ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์‹œํ€€์Šค ๋ฐ์ดํ„ฐ ์„ธํŠธ ๋ถ€์กฑ์œผ๋กœ ์ธํ•ด in-context learning ์„ฑ๋Šฅ์ด ์ œํ•œ์ ์ด๋ผ๋Š” ํ•œ๊ณ„๋„ ์žˆ์ง€๋งŒ, ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋Š” ๋” ํ’๋ถ€ํ•œ ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋“ˆ ๊ฐœ์„ ์„ ํ†ตํ•ด ์ถฉ๋ถ„ํžˆ ๋ณด์™„ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

 

BLIP-2๋Š” CLIP๊ณผ BLIP์˜ ๊ฐ•์ ์„ ๋ชจ๋‘ ์ด์–ด๋ฐ›์œผ๋ฉด์„œ๋„ ํšจ์œจ์„ฑ์„ ๊ทน๋Œ€ํ™”ํ•œ ์ ์ด ๋‹๋ณด์ด๋ฉฐ, ์‹ค์šฉ์„ฑ๊ณผ ํ™•์žฅ์„ฑ ์ธก๋ฉด์—์„œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ AI์˜ ์ƒˆ๋กœ์šด ๊ธฐ์ค€์„ ์ œ์‹œํ–ˆ์–ด์š”.

๋ฐ˜์‘ํ˜•