[Paper Introduction] DINOv2 - Self-supervised Vision Transformer | Meta AI | A Vision AI model that delivers strong performance without labeled data

2023. 4. 29. 20:19 · 🏛 Research/Detection & Segmentation

DINOv2

  • Paper title: DINOv2: Learning Robust Visual Features without Supervision
  • GitHub
  • Demo

 

In April 2023, Meta AI released DINOv2, a new method for training high-performance computer vision models with self-supervised learning. Self-supervised learning, which is also used to train LLMs (Large Language Models), does not require large amounts of labeled data, which makes it a powerful and flexible way to train AI models.

 

According to the paper, the multimodal approach of training on image-text pairs, which has been the standard approach to computer vision tasks in recent years, relies on image captions, so any information not explicitly mentioned in the caption is ignored. Because DINOv2 uses self-supervised learning, it does not depend on captions, and it has the advantage that even data that is hard to describe in words can be used for training.

 

DINOv2 does not rely on labeled data and does not require fine-tuning, yet it reportedly shows strong performance on many computer vision tasks, which makes it well suited as a backbone for a wide range of them. In addition, because DINOv2 uses self-supervision, it can learn from any collection of images, and it can learn capabilities such as depth estimation that earlier approaches could not.

 

Training AI models usually requires a vast amount of labeled training data, and producing those labels takes a great deal of time and money. But what if a model needed no labeled data for training and still delivered strong performance?

 

 

Self-supervised computer vision models like DINOv2 are expected to be useful in a wide range of applications. Meta collaborated with the World Resources Institute to use AI to map forests tree by tree across continent-sized areas. Although the model was trained on data from forests in North America, it reportedly generalizes well enough to achieve high mapping accuracy in other locations around the world.


DINOv2 Demo

 

DINOv2 was trained on a dataset of 142M unlabeled images, and its applications include depth estimation, image retrieval, and semantic segmentation. A demo is available, so let's take a look.

 

Depth Estimation

Depth Estimation sample
Depth Estimation result on a road-view image near Gangnam Station

 

The DINOv2 model can perform depth estimation from a single image on both in-distribution and out-of-distribution data. With only a linear model on top, it reportedly achieved SOTA on both NYU Depth and SUN RGB-D. The first example is a sample result provided by Meta AI, and the second is the result of a test on a road-view image near Gangnam Station, which shows quite good performance.
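To make the idea of "only a linear model on top" more concrete, here is a minimal sketch in plain Python. It assumes that per-patch feature vectors have already been extracted from a frozen backbone. The function name `linear_depth_head` and the toy features, weights, and bias are all hypothetical, and this is a simple per-patch regression chosen for illustration, so the head actually used in the paper's evaluation may differ in detail.

```python
def linear_depth_head(patch_features, weights, bias, grid_h, grid_w):
    """Map each frozen patch feature to one depth value, then reshape to a grid.

    patch_features: list of feature vectors, one per image patch (row-major order).
    weights, bias: parameters of a single linear layer (hypothetical values here).
    """
    if len(patch_features) != grid_h * grid_w:
        raise ValueError("expected grid_h * grid_w patch features")
    # One dot product per patch: depth = w . f + b
    depths = [
        sum(w * f for w, f in zip(weights, feat)) + bias
        for feat in patch_features
    ]
    # Reshape the flat list into grid_h rows of grid_w values.
    return [depths[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Hypothetical 2x2 patch grid with 3-dimensional features.
# Real backbone features have hundreds of dimensions.
features = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
weights = [0.5, 1.0, 2.0]
bias = 0.1
depth_map = linear_depth_head(features, weights, bias, grid_h=2, grid_w=2)
```

Only `weights` and `bias` would be trained in such a setup; the backbone that produces `features` stays frozen, which is why the approach needs no fine-tuning.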

 

What is interesting is that depth estimation also works on OOD (out-of-distribution) data. In the examples from the paper, the model separates background and objects well and estimates depth even for paintings in a variety of artistic styles.

 

 

Image Retrieval

Image Retrieval sample
Image Retrieval result for Gwanghwamun

Image Retrieval finds artworks similar to a given image in a large collection of art images. For the Eiffel Tower photo used as the example on the demo site, it does find paintings of the Eiffel Tower, but for a photo of Gwanghwamun that I tested myself, it returns works that depict East Asian architecture in general. This is probably because Gwanghwamun does not have as unique a shape as the Eiffel Tower.
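Retrieval of this kind is commonly done by comparing image embeddings, so a rough sketch may help. The code below assumes each image has already been converted to an embedding vector by a vision backbone and ranks a gallery by cosine similarity. The gallery names, the 4-dimensional vectors, and the `retrieve` helper are hypothetical, and the demo's actual pipeline is not described in this post.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, top_k=3):
    """Return the top_k (name, score) pairs, most similar to the query first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; real image features have hundreds of dimensions.
gallery = {
    "eiffel_painting": [0.9, 0.1, 0.0, 0.2],
    "palace_painting": [0.1, 0.8, 0.3, 0.0],
    "pagoda_print": [0.2, 0.7, 0.4, 0.1],
}
query_gate_photo = [0.1, 0.9, 0.3, 0.0]
results = retrieve(query_gate_photo, gallery, top_k=2)
```

In this toy gallery the query lands closest to the two East Asian architecture entries, which mirrors the Gwanghwamun behavior described above: without an exact match, the nearest neighbors are images that merely share a general look.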

 

 

Semantic Segmentation

Semantic Segmentation sample
Semantic Segmentation results for Gwanghwamun and Gangnam Station

Many models already do segmentation very well, so the results are not surprising in themselves, but it is still remarkable if they come from self-supervised learning alone. The evaluation reportedly shows competitive results on the ADE20K and Cityscapes datasets. (In other words, it does not set a new SOTA.)
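As a rough illustration of how segmentation can sit on top of self-supervised features, the sketch below scores every patch feature against one weight vector per class and keeps the best class. It assumes patch features from a frozen backbone are already available. The class names, weights, features, and the `segment_patches` helper are hypothetical and are not taken from the paper or the demo.

```python
def segment_patches(patch_features, class_weights, class_names, grid_w):
    """Assign each patch the class whose weight vector scores highest.

    patch_features: list of feature vectors, one per patch (row-major order).
    class_weights: one weight vector per class (a linear classifier, no bias).
    """
    labels = []
    for feat in patch_features:
        scores = [sum(w * f for w, f in zip(row, feat)) for row in class_weights]
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        labels.append(class_names[best])
    # Reshape the flat label list into rows of grid_w patches.
    return [labels[i:i + grid_w] for i in range(0, len(labels), grid_w)]


# Hypothetical 2x2 patch grid, 3-dimensional features, 3 classes.
class_names = ["sky", "building", "road"]
class_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
patch_features = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
label_map = segment_patches(patch_features, class_weights, class_names, grid_w=2)
```

The output is a coarse label grid at patch resolution; producing a full-resolution mask would require an upsampling step, which is left out here.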

 


 

Following the Segment Anything Model (SAM), Meta AI has released another new vision model. Large language models have recently been hugely popular in the AI industry, and powerful all-in-one models are now emerging in the vision field as well. As all-in-one models trained on large-scale unlabeled data gain more and more attention in academia, I wonder what impact they will have on industry. Given the technology gap, will companies keep using AI models that depend on labeled training data, use the APIs of global companies, or develop their own models?

 
