[Gen AI] Diffusion Models and DDPM: Concepts Explained
·
🛠 Research/Generative AI
Among generative models, diffusion models have drawn attention as a core technique for generating high-resolution images. The idea is to synthesize an image by progressively removing noise, and it underpins models such as Stable Diffusion and DALL·E 2. This post explains diffusion models starting from the basic concept, focusing on the training and sampling process of DDPM (Denoising Diffusion Probabilistic Model), the most fundamental variant, and emphasizes conceptual explanation over equations. 1. What is a Diffusion Model? A diffusion model gradually adds Gaussian noise to data until it becomes completely random, then restores the original image from noise through the reverse process. This process can be divided into two stages. ..
[Paper Review] DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION
·
🛠 Research/Generative AI
1. Research Topic and Key Contributions DreamFusion proposes a text-to-3D synthesis method that generates 3D objects using a 2D text-to-image diffusion model. ✅ Key Contributions Builds an end-to-end pipeline that generates 3D scenes from a 2D diffusion model alone, with no 3D data or 3D training at all. Devises Score Distillation Sampling (SDS), a new optimization-based sampling technique that uses a pretrained image diffusion model as a loss for 3D training. Parameterizes the 3D volume with a NeRF, enabling consistent images to be rendered from arbitrary viewpoints. 2. Research Background and Related Work ✅ Text-to-Image Synthesis Recently..
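For intuition on how a pretrained diffusion model becomes a loss, below is a hedged PyTorch sketch of one SDS step under stated assumptions: `render` stands in for a differentiable NeRF renderer, `diffusion_eps` for a frozen text-conditioned noise predictor, and the timestep weighting w(t) is dropped for brevity.

```python
import torch

def sds_step(render, diffusion_eps, nerf_params, text_emb, alpha_bar):
    """One Score Distillation Sampling update (sketch): the frozen diffusion
    model scores a noised render, and (eps_pred - eps) is injected as the
    gradient of the rendered image. `render` and `diffusion_eps` are
    hypothetical stand-ins for the paper's components."""
    x = render(nerf_params)                # differentiable render, (1, 3, H, W)
    t = torch.randint(20, 980, (1,))       # avoid extreme timesteps
    eps = torch.randn_like(x)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * eps
    with torch.no_grad():                  # never backprop through the diffusion model
        eps_pred = diffusion_eps(x_t, t, text_emb)
    x.backward(gradient=(eps_pred - eps))  # gradients flow into nerf_params only
```

An optimizer step on `nerf_params` follows each call; repeating this from randomly sampled camera poses is what drives the NeRF toward a 3D-consistent object.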
[Paper Review] Zero-1-to-3: Zero-shot One Image to 3D Object | Single-view object reconstruction
·
🛠 Research/Generative AI
1. Research Topic and Key Contributions Zero-1-to-3 is a zero-shot framework that synthesizes images of an object from novel camera viewpoints given a single RGB image, and can go further to full 3D reconstruction. Where prior work required multi-view images or 3D supervision, its key differentiator is leveraging a large-scale pretrained model such as Stable Diffusion to generalize even to data it was never trained on. ✅ Key Contributions Uses Stable Diffusion to learn conditional image-to-image translation with camera viewpoint control. Proposes a viewpoint-conditioned diffusion model for zero-shot 3D reconstruction. After training on Objaverse..
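To picture the viewpoint conditioning, here is a small sketch of the relative-pose feature described by the paper, (Δelevation, sin Δazimuth, cos Δazimuth, Δradius), concatenated with an image embedding to condition the diffusion model; the embedding size and variable names are illustrative assumptions, not the official code.

```python
import math
import torch

def viewpoint_embedding(d_elev, d_azim, d_radius):
    """Relative camera transform as described in Zero-1-to-3; azimuth wraps
    around the object, so it is embedded with sin/cos."""
    return torch.tensor([d_elev, math.sin(d_azim), math.cos(d_azim), d_radius])

clip_emb = torch.randn(768)  # placeholder for the CLIP embedding of the input view
# Pose-aware conditioning c(x, R, T) fed to the latent diffusion model
cond = torch.cat([clip_emb, viewpoint_embedding(0.3, 1.2, 0.0)])
```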
[Paper Review] Visual Instruction Tuning | LLaVA Model
·
🛠 Research/Multi-modal
💡 LLaVA 1. Research Topic and Key Contributions This work proposes LLaVA, a multimodal model that understands and processes text and images together. The model is designed around Visual Instruction Tuning so that it can follow user instructions on multimodal tasks and carry out complex image- and text-based work. Going a step beyond training on existing image-text pair data (e.g., COCO), the authors used GPT-4 to generate new question-and-answer training data from image caption descriptions. New dataset generation method: a data generation pipeline that uses GPT-4 to automatically convert existing image-text pairs into multimodal instruction-response data, making it applicable to a wide range of multimodal tasks..
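The data pipeline is easy to picture in code. The sketch below, with a hypothetical `call_llm` client standing in for GPT-4, shows the core trick: the language model never sees pixels, only captions and bounding boxes, yet is prompted to write a conversation as if it could see the image.

```python
def build_instruction_sample(captions, boxes, call_llm):
    """Convert one image's annotations into an instruction-response sample,
    in the spirit of LLaVA's pipeline. `call_llm` is a hypothetical
    text-only LLM client; the prompt wording is illustrative."""
    context = "\n".join(captions + [f"{name}: {bbox}" for name, bbox in boxes])
    prompt = (
        "You are given textual descriptions of an image.\n"
        f"{context}\n"
        "Write a conversation between a user asking about the image and an "
        "assistant that answers as if it can see the image."
    )
    return {"context": context, "conversation": call_llm(prompt)}
```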
[Paper Review] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
·
🛠 Research/Multi-modal
💡 BLIP-2 1. Research Topic and Key Contributions The BLIP-2 paper proposes a new, cost-efficient approach to multimodal Vision-Language Pre-training (VLP). To avoid the high computational cost of training large models end-to-end, it keeps an already-trained image encoder and a large language model (LLM) frozen. Querying Transformer (Q-Former): a lightweight module proposed to effectively bridge the modality gap between images and text. Two-stage Pre-training: a Representation Learning and Generative Learning strategy that combines the strengths of the frozen models..
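A minimal sketch of the Q-Former idea, assuming illustrative dimensions: a small set of learnable query tokens cross-attends to the frozen image encoder's patch features and emits a fixed number of soft tokens for the frozen LLM. The real Q-Former is a full BERT-style transformer; this collapses it to a single attention block.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """One cross-attention block standing in for BLIP-2's Q-Former."""
    def __init__(self, n_queries=32, dim=768):
        super().__init__()
        # Learnable queries: the only trainable interface between frozen models
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # stand-in for the projection into the LLM

    def forward(self, image_feats):      # (B, n_patches, dim), frozen ViT features
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)            # (B, n_queries, dim) soft prompts

tokens = TinyQFormer()(torch.randn(2, 197, 768))  # e.g., ViT patch features
```

Because only the queries, attention, and projection are trained, the gradient never touches the image encoder or the LLM, which is where the cost savings come from.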
[Paper Review] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
·
🛠 Research/Multi-modal
💡 BLIP 1. Research Topic and Key Contributions BLIP is a new framework for Vision-Language Pre-training (VLP), designed to perform both understanding-based and generation-based image-text tasks effectively. It improves on the limitations of existing VLP models as follows. Overcomes the weakness of prior models specialized in either understanding-based tasks (e.g., image-text retrieval) or generation-based tasks (e.g., image captioning). Proposes a data bootstrapping method to get the most out of noisy web-collected training data. BLIP recorded SOTA performance and strong results across a wide range of vision-language tasks. 2. Research Background and Trends Vision-Language Pre-training (VLP) Visio..
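The bootstrapping loop (CapFilt in the paper) can be sketched in a few lines; `captioner` and `filter_fn` are hypothetical stand-ins for BLIP's fine-tuned caption decoder and its image-text matching head.

```python
def bootstrap_dataset(pairs, captioner, filter_fn):
    """CapFilt-style bootstrapping (sketch): propose a synthetic caption for
    each web image, then keep only the (image, caption) pairs the filter
    judges as matched, discarding noisy web text."""
    clean = []
    for image, web_caption in pairs:
        synthetic = captioner(image)
        for caption in (web_caption, synthetic):
            if filter_fn(image, caption):  # image-text matching decision
                clean.append((image, caption))
    return clean
```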