
[Paper Review] Learning Transferable Visual Models From Natural Language Supervision / CLIP / Multi-modal network

by ๋ญ…์ฆค 2022. 2. 26.
๋ฐ˜์‘ํ˜•

Open AI์—์„œ ๊ฒŒ์žฌํ•œ(ICML2021) Contrastive Language-Image Pre-training(CLIP)๋ฅผ ์ œ์•ˆํ•œ ๋…ผ๋ฌธ์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

 

Introduction & Motivation

 

๋”ฅ๋Ÿฌ๋‹์ด computer vision์˜ ๊ฑฐ์˜ ๋ชจ๋“  ๋ถ„์•ผ์—์„œ ๊ต‰์žฅํžˆ ์ž˜ ํ™œ์šฉ๋˜์ง€๋งŒ ํ˜„์žฌ ์ ‘๊ทผ ๋ฐฉ์‹์—๋Š” ๋ช‡๊ฐ€์ง€ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ vision model๋“ค์€ ํ•™์Šต๋œ task์—๋Š” ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•˜์ง€๋งŒ ์ƒˆ๋กœ์šด task์— ์ ์šฉ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ƒˆ๋กœ ํ•™์Šต์„ ์‹œํ‚ค์•ผ ํ•˜๋Š”(๊ทธ๋Ÿฌ๋ฉด ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹๊ณผ ์ถ”๊ฐ€ ๋ ˆ์ด๋ธ”๋ง์ด ํ•„์š”..) ๋ฒˆ๊ฑฐ๋กœ์›€(?) ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฒค์น˜๋งˆํฌ์—์„œ ์ž˜ ์ˆ˜ํ–‰๋˜๋Š” ๋ช‡๋ช‡ model๋“ค์€ stress test์—์„œ ์ข‹์ง€ ์•Š์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

 

As an alternative, there is the approach of training on pairs of raw text and images. This paper proposes a network that learns from a wide variety of images through the diverse natural language supervision that is abundantly available on the internet.

 

๊ฐ„๋‹จํ•˜๊ฒŒ ์–˜๊ธฐํ•˜๋ฉด, ๊ธฐ์กด์˜ image classification model์€ ์‚ฌ์šฉํ•˜๋Š” dataset์ด ๊ณ ์ •๋œ label์„ ๊ฐ–์Šต๋‹ˆ๋‹ค. ๋•Œ๋ฌธ์— image๋งŒ feature space ๋กœ embeddingํ•˜์—ฌ predictionํ•˜๊ณ , ์ •๋‹ต label๊ณผ ๋น„๊ตํ•˜์—ฌ loss๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์›น์ƒ์—์„œ image-text ์Œ์œผ๋กœ ๊ฐ€์ ธ์˜ค๊ธฐ ๋•Œ๋ฌธ์— ๊ณ ์ •๋œ label์ด ์•„๋‹ˆ๊ณ  ์‹ฌ์ง€์–ด text๋Š” ๋ฌธ์žฅ์˜ ํ˜•ํƒœ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ๋•Œ๋ฌธ์— text์™€ image๋ฅผ ๋ชจ๋‘ feature space๋กœ embeddingํ•˜๊ณ  ์ด๋“ค์˜ similarity๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. 

 

After pre-training, natural language is used only to reference learned visual concepts, which makes zero-shot transfer to downstream tasks possible. The model transfers successfully to many tasks and even outperforms existing supervised models in some cases, but it still shows limitations on complex tasks, which remain hard for it.

 

I do not know the NLP field well (maybe up to "Attention Is All You Need"?), but according to this paper, models pre-trained on raw text collected from the web surpass the performance of conventional supervised models trained on human-labeled datasets. Until now, however, pre-training on raw text had not been used in the vision field; this paper attempts it and achieves good results.

 

image-text pair์˜ multi-modal ์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๋Š” ์˜ค๋žซ๋™์•ˆ ์ง€์†๋˜์–ด ์™”์ง€๋งŒ, ์ž์—ฐ์–ด๋ฅผ ์ด๋ฏธ์ง€ representation learning์— ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์€ ์—ฌ์ „ํžˆ ๋“œ๋ฌผ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์ด์œ ๋Š” ์„ฑ๋Šฅ์ด ๊ทธ๋งŒํผ ์ž˜ ์•ˆ๋‚˜์™”๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๋Œ€์‹  2018๋…„ ๋…ผ๋ฌธ์—์„œ๋Š” weakly supervised learning์œผ๋กœ ์ธ์Šคํƒ€๊ทธ๋žจ ์ด๋ฏธ์ง€์—์„œ ImageNet๊ณผ ๊ด€๋ จ๋œ ํ•ด์‰ฌํƒœ๊ทธ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด ํšจ๊ณผ์ ์ธ pre-training ๋ฐฉ๋ฒ•์ž„์„ ๋ณด์—ฌ์ฃผ๊ธด ํ–ˆ์Šต๋‹ˆ๋‹ค. 

 

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ weakly supervised approach์™€ zero shot learning using raw text approcah ๊ฐ„์˜ ๊ฐ„๊ทน์„ ์ค„์ด๊ธฐ ์œ„ํ•ด Contrastive Language-Image Pre-training(CLIP) ๋ผ๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.  

 

* Why use natural language supervision?

์ž์—ฐ์–ด ๊ด€๋ จ ์ง€์‹์ด ๋ถ€์กฑํ•˜์—ฌ ์™œ ์ž์—ฐ์–ด๋กœ supervision learning์„ ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ๋…ผ๋ฌธ์— ์–ธ๊ธ‰๋œ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

 

1) It scales more easily than the labels traditionally used for vision tasks.

๊ธฐ์กด์— ์‚ฌ๋žŒ์ด ์ง์ ‘ ๋ ˆ์ด๋ธ”๋ง์„ ํ–ˆ์ง€๋งŒ, ๊ทธ๋Ÿฐ ๊ณผ์ •์ด ํ•„์š”์—†๊ณ  ์ž์—ฐ์–ด๋ฅผ ์ด์šฉํ•œ ํ•™์Šต์€ ์ธํ„ฐ๋„ท์˜ ๋ฐฉ๋Œ€ํ•œ ์ž๋ฃŒ์— ํฌํ•จ๋œ ํ…์ŠคํŠธ๋ฅผ supervision์œผ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. 

 

2) The model learns a representation of language as well.

์ž์—ฐ์–ด๋ฅผ ์ด์šฉํ•˜๋Š” ํ•™์Šต ๋ฐฉ๋ฒ•์€ un/semi/self-supervised learning ๋ฐฉ๋ฒ•๊ณผ๋Š” ๋‹ฌ๋ฆฌ ์ด๋ฏธ์ง€ representation ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์–ธ์–ด representation์„ ๊ฐ€์ง€๊ธฐ ๋•Œ๋ฌธ์— ์กฐ๊ธˆ ๋” ์œ ์—ฐํ•˜๊ณ  robustํ•œ ์žฅ์ ์„ ๊ฐ€์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

 

CLIP

1. Contrastive pre-training

CLIP is pre-trained on a newly built dataset of 400 million image-text pairs collected from data publicly available on the internet. Given N image-text pairs, CLIP builds an NxN cosine similarity map from the image embedding features and the text embedding features. The image encoder and text encoder are then trained to maximize the similarity of the N positive (correct) pairs and minimize the similarity of the remaining N^2 - N negative (incorrect) pairs, thereby learning a multi-modal embedding space.
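Below is a minimal PyTorch sketch of this symmetric contrastive objective, loosely following the pseudocode in the paper. The encoders are assumed to already produce [N, d] features, and the fixed temperature value here is a placeholder (the paper learns it as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N image-text pairs.

    image_features, text_features: [N, d] embeddings from the two encoders.
    The diagonal of the similarity map holds the N positive pairs.
    """
    # L2-normalize so the dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] cosine-similarity map, scaled by the temperature.
    logits = image_features @ text_features.T / temperature

    # The i-th image matches the i-th text, so the targets are 0..N-1.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image).
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2
```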

 

Also, in the proposed method the image is paired with the whole text rather than with a single exact word during training.

A ResNet-50 and a Vision Transformer (ViT) are used as image encoders, and a Transformer is used as the text encoder.

 

2. Create dataset classifier from label text / Use for zero-shot prediction

CLIP๋Š” ์ด๋ฏธ์ง€์˜ ์ •ํ™•ํ•œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šตํ•˜์ง€ ์•Š๊ณ  ํ…์ŠคํŠธ ์ „์ฒด๋กœ ์ง์„ ์ง€์–ด ํ•™์Šต์‹œํ‚ต๋‹ˆ๋‹ค. ๋•Œ๋ฌธ์— ํ…Œ์ŠคํŠธ ์‹œ์—๋„ A photo of {object} ๋˜๋Š” A photo of {object}, a type of food ๋“ฑ์˜ ํ…์ŠคํŠธ์— ๋ชจ๋“  class label์„ ๋„ฃ๊ณ  text encoder์— ๋„ฃ์–ด text imbedding feature๋ฅผ ๋งŒ๋“ค๊ณ , image๋Š” image encoder๋ฅผ ํ†ต๊ณผ์‹œ์ผœ image imbedding feature 1๊ฐœ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ด ๋•Œ ๋ชจ๋“  text feature๋“ค๊ณผ 1๊ฐœ์˜ image feature ๊ฐ„์˜ similarity๋ฅผ ๋ณด๊ณ  ๊ฐ€์žฅ ๋†’์€ similarity๋ฅผ ๊ฐ€์ง€๋Š” ํ…์ŠคํŠธ๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.

 

์•„๋ž˜์™€ ๊ฐ™์€ ๋ฌธ์žฅํ˜•์‹์œผ๋กœ prediction์„ ํ•˜๋Š” ์ด์œ ๋Š”, ์• ์™„ ๋™๋ฌผ์˜ ์ข…๋ฅ˜ ์ค‘ ํ•˜๋‚˜, ์Œ์‹ ์ค‘ ํ•˜๋‚˜, ๋“ฑ๋“ฑ์˜ ์ถ”๊ฐ€ ์ •๋ณด๋ฅผ text encoding์—์„œ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

  • "A photo of a {label}, a type of pet."
  • "a satellite photo of a {label}."
  • "A photo of big {label}"
  • "A photo of small {label}"

 

๋ฐ˜์‘ํ˜•

 

Experiments

๋ณธ ๋…ผ๋ฌธ์—์„œ ๋งํ•˜๋Š” zero-shot ์€ ํ•œ๋ฒˆ๋„ ๋ณด์ง€ ๋ชปํ•œ datasets์— ๋Œ€ํ•ด ๋ถ„๋ฅ˜๋ฅผ ํ•˜๋Š” ์ž‘์—…์„ ๋งํ•ฉ๋‹ˆ๋‹ค.  ์ €์ž๋“ค์€ zero-shot transfer ๋ฅผ task-learning capabilties๋ฅผ ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค. 

 

CLIP shows performance competitive with fully supervised learning across a variety of datasets and tasks. The paper also includes many other experiments, such as comparisons with few-shot learning and with state-of-the-art vision models.

 

* Tasks: fine-grained object classification, geo-localization, action recognition, OCR, etc.

Zero-shot transfer performance is very strong on object-recognition tasks, but not on slightly more complex tasks such as counting objects or estimating how close an object is. Performance is also poor on tasks such as fine-grained recognition, where classes are separated by subtle visual differences.

 

* ๋…ผ๋ฌธ์—๋Š” ๋‚˜์™€์žˆ์ง€ ์•Š์ง€๋งŒ, ์•„๋งˆ ์›น์—์„œ ์–ป์–ด์ง„ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์—์„œ ํ…์ŠคํŠธ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ด๋ฏธ์ง€๋ฅผ ๊ตฌ์ฒด์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋Š”์ง€๊ฐ€ ์ค‘์š”ํ•  ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋Œ€๋ถ€๋ถ„์˜ ํ…์ŠคํŠธ๊ฐ€ ์ด๋ฏธ์ง€๋ฅผ ๋Ÿฌํ”„ํ•˜๊ฒŒ ๋ฌ˜์‚ฌํ•  ๊ฒƒ์ด๋ฏ€๋กœ ํ˜„์žฌ ๋ฐฉ๋ฒ•์€ ์›น์—์„œ ํ”ํžˆ ๋ณผ ์ˆ˜ ์žˆ๋Š” ๋ฌผ์ฒด๋ฅผ ์ธ์‹ํ•˜๋Š” task์—์„œ๋Š” ํƒ์›”ํ•œ transfer ์„ฑ๋Šฅ์„ ๊ฐ€์ง€์ง€๋งŒ, ํŠน์ • ๋ถ„์•ผ์— ํ•œ์ •๋œ ๊ฐ์ฒด๋ฅผ ๋‹ค๋ฃจ๊ฑฐ๋‚˜ ๋งค์šฐ ์„ธ๋ฐ€ํ•œ ์ฐจ์ด๊ฐ€ ์ค‘์š”ํ•œ task์—์„œ๋Š” ์„ฑ๋Šฅ์ด ์ข‹์„ ์ˆ˜ ์—†์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์—ฌ๋‹ด์ธ๋ฐ, main ์‹คํ—˜์€ 2์ฃผ๋™์•ˆ 256๊ฐœ์˜ V100 GPU๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๊ณ ๋กœ... ์ผ๋ฐ˜์ ์ธ ๋Œ€๋ถ€๋ถ„์˜ ํ™˜๊ฒฝ์—์„œ๋Š” ์‹คํ—˜์ด ๋ถˆ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. GPU ๊ฐ€๊ฒฉ๋งŒ 1๊ฐœ๋‹น 600์žก๊ณ  15์–ต์› ๊ฐ€๋Ÿ‰ํ•ฉ๋‹ˆ๋‹ค... 

 

์œ„ 2๊ฐœ์˜ ์‹คํ—˜์€ ๋ฐ์ดํ„ฐ์˜ distribution shift์— ์–ผ๋งˆ๋‚˜ robustํ•œ์ง€ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ์‹คํ—˜(Fig 12)์€ ๋‹ค๋ฅธ task๋กœ transfer ํ–ˆ์„ ๋•Œ์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅธ method ๋“ค์— ๋น„ํ•ด CLIP๊ฐ€ ์›”๋“ฑํžˆ ์ข‹์€ ๊ฒƒ์œผ๋กœ ๋ณด์•„ task shift์— robustํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ ์‹คํ—˜(Fig 13)์€ ImageNet์œผ๋กœ pre-trainingํ•œ ResNet101๊ณผ CLIP์˜ distriution shift ์— ๋Œ€ํ•œ ์‹คํ—˜์ด๊ณ  ๊ฒฐ๊ณผ๋Š” ์—ญ์‹œ CLIP๊ฐ€ ์›”๋“ฑํžˆ ์ข‹์Šต๋‹ˆ๋‹ค. 

 

์ด์™€ ๊ฐ™์€ ๊ฒฐ๊ณผ๋Š”.... CLIP๊ฐ€ ์›น์ƒ์—์„œ ๊ฐ€์ ธ์˜จ 4์–ต๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต์„ ์‹œํ‚ค๋‹ˆ๊นŒ ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ distribution์ด ํฌ๊ณ  ๋งŽ์€ case์˜ ์ด๋ฏธ์ง€(์‹ค์ œ ์ด๋ฏธ์ง€, ๊ทธ๋ฆผ, ์Šค์ผ€์น˜ ๋“ฑ๋“ฑ)๋“ค์„ ํฌํ•จํ•˜๊ธฐ ๋•Œ๋ฌธ์— distribution shift์— robustํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

* Deep learning models can look like they perform extremely well while actually being highly dependent on the data they were trained on. Even when results on the test set are good, the train and test distributions are often nearly identical, so good test results do not mean the model is robust to distribution shift. Domain adaptation/generalization are related tasks that try to address this problem.

 

 

Actual results

 

 

Conclusion

CLIP๋Š” ์ด๋ฏธ์ง€-์ž์—ฐ์–ด ์Œ์œผ๋กœ task agnosticํ•œ pre-trainingํ•˜์—ฌ ๋‹ค๋ฅธ task์˜ ๋”ฅ๋Ÿฌ๋‹ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œ์ผœ์ค„ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์ž์—ฐ์–ด๋ฅผ vision task์— ์„ฑ๊ณต์ ์œผ๋กœ ํ™œ์šฉํ•œ ๋ฐฉ๋ฒ•์ด๋ฉฐ ์•„์ฃผ ๋งŽ์€ ๋ฐ์ดํ„ฐ์™€ ๊ฐ•๋ ฅํ•œ computing power๋กœ ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผœ ์˜ฌ์ธ์› ๋ชจ๋ธ์„ ๋งŒ๋“  ๋Š๋‚Œ์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ด๋“ค์ด ๋ชจ๋“  task๋ฅผ ์ปค๋ฒ„ํ•  ์ˆ˜์žˆ๋Š” ๋ชจ๋ธ์€ ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— transfer/knowledge distillation ๋“ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ํŠน์ •ํ•œ task์˜ model์„ fine-tuningํ•˜์—ฌ ๋น„๊ต์  ์‰ฝ๊ฒŒ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. 

๋ฐ˜์‘ํ˜•