
Computer Vision Trends in 2023 with CVPR 2023 | Diffusion model, NeRF, Multi-modal

by 뭅즤 2023. 5. 28.
๋ฐ˜์‘ํ˜•

This post looks at trends in computer vision based on the papers accepted to CVPR 2023. The analysis is not my own; I wrote this post from the page linked below, so please refer to the original for more detail.

 

- https://voxel51.com/blog/cvpr-2023-and-the-state-of-computer-vision/

 

 

CVPR 2023 Analysis Summary

CVPR 2023 word cloud

- 2,359 papers accepted out of 9,155 submissions
- Accepted papers have 5.4 authors on average
- 63% of titles use acronyms (abbreviations formed from the initial letters of words)
- Diffusion models up 573%
- Multi-modal and cross-modal are the future
- CNN down 68%
- Masks are used everywhere
- Point clouds: a shift from depth & stereo to native 3D
- Most used datasets: ImageNet, COCO, KITTI
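
As a quick check on the first bullet, the acceptance rate implied by those two counts can be computed directly. This is a minimal sketch: only the two counts come from the summary above, and the variable names are my own.

```python
# Counts quoted in the summary above; the variable names are illustrative.
submissions = 9155
accepted = 2359

# Share of submitted papers that were accepted.
acceptance_rate = accepted / submissions
print(f"Acceptance rate: {acceptance_rate:.1%}")
```

With these counts the rate comes out to roughly one paper in four.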

 

 

์š”์•ฝ๋œ ๋‚ด์šฉ์„ ๋ณด๋‹ˆ ํ™•์‹คํžˆ generative model, NeRF, multi-modal ๋ถ„์•ผ๊ฐ€ ์ธ๊ธฐ๊ฐ€ ๋งŽ๋‹ค. ๋˜ํ•œ CNN ๊ด€๋ จ ์—ฐ๊ตฌ๋Š” ๊ฐ์†Œํ•˜๊ณ  ์žˆ์œผ๋ฉฐ 2022๋…„์— ์ด์–ด Transformer ๋ชจ๋ธ์€ ์—ฌ์ „ํžˆ ๊ฐ•์„ธ์ด๋‹ค. 

 

 

CVPR 2023 Detailed Analysis

 

Models

  • Diffusion Models
    • Research on diffusion models, such as image generation models, has increased.
    • They are also used for denoising, image editing, and style transfer.
  • Radiance Fields
    • With NeRF's growing popularity, use of the word "radiance" rose 80%, and "NeRF" itself rose 39%.
    • NeRF research has moved beyond proof of concept into editing, applications, and more.
  • Transformers
    • The decline in "Transformer" and "ViT" does not mean Transformer models have gone out of fashion; it means research on these models was dominant in 2022.
    • The word "Transformer" appeared in 37 papers in 2021 and in 201 papers in 2022.
  • Changing of the guard
    • CNN, down 68%, appears to be falling out of favor.
    • More paper titles now mention CNN and Transformer together.
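
The percentages in this post read as relative changes in keyword counts between years. The source page's exact formula is not restated here, so the sketch below assumes the usual definition of percent change and applies it to the two "Transformer" counts quoted above. The function name `percent_change` is my own.

```python
def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent (assumed definition)."""
    if old == 0:
        raise ValueError("percent change is undefined when the old count is 0")
    return (new - old) / old * 100.0


# "Transformer" counts quoted in the list above.
transformer_2021 = 37
transformer_2022 = 201

change = percent_change(transformer_2021, transformer_2022)
print(f"Transformer: {change:+.0f}%")
```

Under the same definition, a keyword that falls from 100 titles to 32 would show as a 68% decrease, which is the form the CNN figure takes.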

 

Tasks

  • Generative
    • ๊ฐ์ง€, ๋ถ„๋ฅ˜ ๋ฐ ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ๊ฐ™์€ ๊ธฐ์กด์˜ task๋Š” ์ธ๊ธฐ๋ฅผ ๋Œ์ง€ ๋ชปํ•จ
    • ํ•˜์ง€๋งŒ 'Editing'์— ๋Œ€ํ•œ ์ฆ๊ฐ€์œจ์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ถ„์•ผ์— ๋Œ€ํ•œ ๊ด€์‹ฌ์ด ๋†’์•„์ง€๊ณ  ์žˆ์Œ
  • Masks
    • The keyword "Mask" rose 263% year over year.
    • It occurs in the context of segmentation.
    • However, the majority (63%) actually refer to "masked" tasks.
  • Zero vs Few
    • Zero-shot learning is gaining attention through transfer learning, generative approaches, prompting, and similar techniques.
    • Few-shot declined compared with last year, but in absolute numbers few-shot still outnumbers zero-shot.
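
The Masks item above splits "mask" titles into those that say "masked" and those that do not. Here is a rough sketch of how such a split could be counted from a list of titles. The titles are placeholders invented for illustration (not real CVPR papers), and the helper `titles_matching` is hypothetical, not the source analysis.

```python
import re


def titles_matching(titles, pattern):
    """Return the titles that contain a case-insensitive regex match."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [t for t in titles if rx.search(t)]


# Placeholder titles, invented for this example only.
titles = [
    "Masked Pretraining for Widget Images",
    "Mask-Guided Segmentation of Widget Scenes",
    "A Masked Modeling Baseline for Widgets",
    "Zero-Shot Widget Classification",
]

# Titles with any word starting with "mask", then the subset using "masked".
mask_titles = titles_matching(titles, r"\bmask")
masked_titles = titles_matching(mask_titles, r"\bmasked\b")

share = len(masked_titles) / len(mask_titles)
print(f"{len(masked_titles)} of {len(mask_titles)} 'mask' titles say 'masked' ({share:.0%})")
```

Matching on word boundaries keeps "mask" from matching inside unrelated words, and filtering the second pattern over `mask_titles` gives the share relative to mask-related titles rather than to all titles.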

 

 

 

 

Modalities

  • Multi-modal
    • The statistics show the boundary between CV and NLP becoming increasingly blurred.
    • The frequency of keywords such as image and video stayed relatively unchanged, while keywords such as text, language, and audio rose steadily.
    • This is most pronounced in vision-language tasks, as shown by the sharp rise of the keywords Open, Prompt, and Vocabulary.
  • Point Cloud
    • 3D computer vision applications are shifting from inferring 3D information from 2D images toward learning directly from 3D point cloud data.
๋ฐ˜์‘ํ˜•