
๐Ÿ› Research/Multi-modal4

[Paper Review] Visual Instruction Tuning | LLaVA Model 💡 LLaVA 1. Research Topic and Key Contributions This work proposes LLaVA, a multimodal model that can understand and process text and images together. In particular, the model is designed, via Visual Instruction Tuning, to follow user instructions in multimodal tasks and to carry out complex image- and text-based tasks. Going a step beyond training on existing image-text pair data (e.g., COCO), the authors used GPT-4 to generate new training data in question-and-answer form from image captions. New dataset generation method: they built a data generation pipeline that uses GPT-4 to automatically convert existing image-text pairs into multimodal instruction-response data, usable for a variety of multimodal.. 2024. 12. 4.
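
As a supplement to the preview above: the core of the LLaVA data pipeline is prompting a text-only GPT-4 with an image's caption (the paper also feeds it bounding boxes) and asking it to write instruction-following Q&A about the image. Here is a minimal sketch of that idea, assuming the OpenAI Python client; the function name and prompt wording are my own illustrations, not the paper's actual prompts.

```python
# Sketch of the LLaVA-style data generation idea: a text-only GPT-4 writes
# a question-answer pair from an image caption alone. Model name and prompt
# are illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def caption_to_instruction_data(caption: str) -> str:
    """Turn one image caption into a multimodal instruction-response pair."""
    prompt = (
        "You are given a description of an image (you cannot see the image).\n"
        f"Description: {caption}\n"
        "Write one question a user might ask about this image, followed by "
        "a detailed answer, phrased as if you could see the image."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(caption_to_instruction_data("A dog catching a frisbee in a park."))
```
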
[Paper Review] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models 💡 BLIP-2 1. Research Topic and Key Contributions The BLIP-2 paper proposes a new, cost-efficient approach to multimodal vision-language pre-training (VLP). To avoid the high computational cost of training large models end-to-end, it uses an already-trained image encoder and a large language model (LLM) while keeping both frozen. Querying Transformer (Q-Former): a lightweight module proposed to effectively close the modality gap (the mismatch between images and text). Two-stage pre-training: a representation learning and generative learning strategy that combines the strengths of the existing models.. 2024. 12. 4.
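
To make the Q-Former bridging idea concrete, here is a minimal PyTorch sketch under my own assumptions about dimensions and modules (the real Q-Former is a BERT-style transformer; this only shows the mechanism): a small set of learnable query tokens cross-attends to frozen image features and is projected into the frozen LLM's embedding space, so only the bridge is trained.

```python
# Minimal sketch (not the released BLIP-2 code) of learnable queries bridging
# a frozen image encoder and a frozen LLM. All sizes are illustrative.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=4096):
        super().__init__()
        # Learnable query tokens: the only new parameters besides the projection.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        # Cross-attention lets the queries pull information out of image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Linear projection into the frozen LLM's embedding space.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, image_feats):            # image_feats: (B, N_patches, dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q, _ = self.cross_attn(q, image_feats, image_feats)
        return self.to_llm(q)                  # (B, num_queries, llm_dim)

# image_encoder and llm stay frozen; only TinyQFormer's parameters train:
#   soft_prompts = TinyQFormer()(image_encoder(pixels).detach())
# The soft prompts are then prepended to the LLM's input embeddings.
```
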
[Paper Review] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation 💡 BLIP 1. Research Topic and Key Contributions BLIP is a new framework for Vision-Language Pre-training (VLP), designed to handle both understanding-based and generation-based image-text tasks effectively. It improves on the limitations of earlier VLP models as follows: it remedies the weakness of existing models that specialize in either understanding-based tasks (e.g., image-text retrieval) or generation-based tasks (e.g., image captioning), and it proposes a data bootstrapping method to maximize what can be learned from noisy data collected from the web. BLIP set SOTA performance and showed strong results across a variety of vision-language tasks. 2. Research Background and Trends Vision-Language Pre-training (VLP) Visio.. 2024. 12. 4.
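
The data bootstrapping mentioned above (CapFilt in the paper) can be summarized in a few lines. The function and model names below are hypothetical placeholders, not BLIP's actual API: a captioner writes a synthetic caption for each web image, and a filter keeps only image-caption pairs it scores as matching.

```python
# Schematic sketch of CapFilt-style bootstrapping (hypothetical callables; the
# real captioner and filter are finetuned from the same pre-trained model).

def bootstrap_dataset(web_pairs, captioner, filter_model, threshold=0.5):
    """web_pairs: list of (image, noisy_web_caption) tuples."""
    clean_pairs = []
    for image, web_caption in web_pairs:
        synthetic = captioner(image)  # generate a fresh caption for the image
        for caption in (web_caption, synthetic):
            # filter_model returns an image-text match score in [0, 1];
            # noisy web captions and bad synthetic captions are dropped.
            if filter_model(image, caption) >= threshold:
                clean_pairs.append((image, caption))
    return clean_pairs
```
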
[Paper Review] Learning Transferable Visual Models From Natural Language Supervision / CLIP / Multi-modal network This post introduces the OpenAI paper (ICML 2021) that proposed Contrastive Language-Image Pre-training (CLIP). Introduction & Motivation Deep learning works remarkably well in almost every area of computer vision, but the current approach has a few problems. Existing vision models perform well on the tasks they were trained for, but applying them to a new task requires retraining (which in turn requires a new dataset and additional labeling), which is cumbersome. Some models that do well on benchmarks also show poor results under stress tests. As an alternative, a method of training on pairs of raw text and images.. 2022. 2. 26.
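
The raw text-image pairing alluded to above is trained with a symmetric contrastive objective: within a batch, matched image-text pairs are pulled together and all mismatched pairs pushed apart. A minimal sketch of that loss follows, with illustrative shapes and temperature; this is not OpenAI's implementation.

```python
# Minimal sketch of CLIP's symmetric contrastive loss over a batch.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (B, D) embeddings from the two encoders.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine-similarity logits between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched (i, i) pairs are positives; all other pairs are negatives.
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2
```
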