[Object Detection] Understanding the DETR Model! | An End-to-End Object Detection Model

2024. 8. 10. 14:27 · 🏛 Research/Detection & Segmentation
๋ฐ˜์‘ํ˜•

 

๊ฐ์ฒด ๊ฒ€์ถœ(Object Detection)์€ ์ด๋ฏธ์ง€๋‚˜ ์˜์ƒ์—์„œ ์–ด๋–ค ๊ฐ์ฒด๊ฐ€ ์–ด๋””์— ์žˆ๋Š”์ง€๋ฅผ ์‹๋ณ„ํ•˜๋Š” ์ปดํ“จํ„ฐ ๋น„์ „์˜ ํ•ต์‹ฌ ๊ณผ์ œ ์ค‘ ํ•˜๋‚˜๋‹ค. ์ตœ๊ทผ๊นŒ์ง€๋„ ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ์€ R-CNN ๊ณ„์—ด์ด๋‚˜ YOLO ๊ณ„์—ด์ฒ˜๋Ÿผ ๋ณต์žกํ•œ ๊ตฌ์กฐ์™€ ํ›„์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ํฌํ•จํ•œ ๋ฐฉ์‹์ด ์ฃผ๋ฅผ ์ด๋ค˜์ง€๋งŒ, DETR(Detection Transformer)์€ ์ด ํ๋ฆ„์— ํฐ ์ „ํ™˜์ ์„ ๋งŒ๋“ค์–ด๋ƒˆ๋‹ค.

DETR์€ Transformer ๊ตฌ์กฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์•ต์ปค ๋ฐ•์Šค ์—†์ด, ํ›„์ฒ˜๋ฆฌ ์—†์ด, ๊ฐ์ฒด ๊ฒ€์ถœ์„ End-to-End๋กœ ํ•™์Šตํ•˜๊ณ  ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“  ๋ชจ๋ธ์ด๋‹ค.


1. Core Ideas of DETR

 

๊ธฐ์กด์˜ ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ๋“ค์€ ์ˆ˜๋งŽ์€ ์•ต์ปค ๋ฐ•์Šค๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ›„๋ณด ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค๋ฅผ ๋งŒ๋“ค๊ณ , ๊ทธ ์ค‘์—์„œ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ๊ฒƒ๋งŒ ๋‚จ๊ธฐ๋Š” ํ›„์ฒ˜๋ฆฌ ๊ณผ์ •(NMS)์ด ํ•„์š”ํ•˜๋‹ค. ์ด๋Ÿฌํ•œ ๊ณผ์ •์€ ๋ณต์žกํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์ด ๊นŒ๋‹ค๋กญ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค.

 

DETR์€ ์ด ๋ชจ๋“  ๊ฑธ ์—†์•ด๋‹ค. ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์ด ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์ง์ ‘ ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•˜๊ณ , ์–ด๋–ค ์˜ˆ์ธก์ด ์–ด๋–ค ์‹ค์ œ ๊ฐ์ฒด์— ๋Œ€์‘ํ•˜๋Š”์ง€๋ฅผ ์Šค์Šค๋กœ ํ•™์Šตํ•˜๋„๋ก ์„ค๊ณ„๋œ ๊ตฌ์กฐ๋‹ค. ์ด๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“  ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ ์„ธ ๊ฐ€์ง€๋‹ค.

 

  • Object Query: a fixed set of queries, one for each object the model can predict
  • Transformer Decoder: predicts objects based on the global context of the image
  • Hungarian matching algorithm: finds the optimal matching between predictions and ground truth

 

2. DETR Architecture

DETR consists of three main parts:

a backbone, a Transformer encoder-decoder, and a prediction head.

2.1. Backbone

In DETR, a CNN such as ResNet extracts the basic features from the image. ResNet-50 or ResNet-101 is typically used.

The backbone's role is as follows:

  • It downsamples the input image while extracting the main visual features.
  • While shrinking the spatial size, it represents meaningful patterns (object contours, edges, and so on) as vectors.
  • The feature map from the last convolution layer has a shape such as [batch, C, H, W]; it is then flattened into a sequence of H×W vectors of dimension C (a 2D array per image) that the Transformer can process.

์ถ”์ถœ๋œ feature map์€ ์ดํ›„ flatten๋˜์–ด Transformer Encoder์— ์ „๋‹ฌ๋œ๋‹ค. ์—ฌ๊ธฐ์„œ ๊ฐ ์œ„์น˜์˜ feature vector๋Š” ํฌ์ง€์…”๋„ ์ธ์ฝ”๋”ฉ์ด ์ถ”๊ฐ€๋œ ํ›„ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉ๋œ๋‹ค. ํฌ์ง€์…”๋„ ์ธ์ฝ”๋”ฉ์€ CNN์˜ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋ณด์™„ํ•ด์ฃผ๊ธฐ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•˜๋‹ค.

 

2.2. Transformer Encoder-Decoder

The core of DETR is the Transformer. Unlike the Vision Transformer (ViT), DETR first extracts features with a CNN and then places a Transformer on top.

 

📌 Encoder

The Transformer encoder takes the features extracted by the CNN as input and learns the global context.

  • The input is the flattened CNN feature map with a positional encoding added at every location.
  • Several layers of self-attention and feed-forward networks then learn how the features at different locations relate to one another.
  • This strengthens the connections between features that may belong to the same object.
  • The structure is almost identical to the Transformer used in NLP and focuses on interactions between features.
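The self-attention step in the list above can be sketched in a few lines of plain Python. This is a deliberately stripped-down illustration: a single head, no learned query/key/value projections, and made-up 2-dimensional tokens, all of which are my assumptions rather than DETR's actual configuration.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)  # how strongly this position attends to every position
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Four toy tokens: positions 0 and 1 share one pattern, positions 2 and 3 another.
tokens = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
out, w = attention(tokens, tokens, tokens)  # self-attention: q = k = v = tokens

print([round(x, 3) for x in w[0]])  # position 0 attends almost only to 0 and 1
```

Positions with similar features end up with large mutual attention weights, which is the mechanism behind "strengthening the connections between features of the same object".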

 

📌 Decoder

The decoder is one of the most important parts of DETR and has the following characteristics.

  • Inputs
    • Object Queries: a fixed number of learnable query vectors (e.g., 100)
    • Encoder Output: the global representation of the image features
  • Two attention stages
    • Self-Attention: learns the relationships among the object queries. For example, the queries interact so that two of them do not end up predicting the same object.
    • Cross-Attention: each query vector attends to the encoder output (the feature map) and looks for where in the image the object it is responsible for lies.

As a result, each object query ends up acting as an embedding that predicts a single object based on the global information of the image.

 

2.3. Prediction Head

The output of the Transformer decoder is a set of embedding vectors of shape (num_queries, hidden_dim). Each vector holds the information needed to predict one object. This output is passed to two feed-forward networks (FFNs) that produce the final predictions.

  • Class Prediction Head
    • Outputs the class of the object predicted by each query.
    • The output shape is (num_queries, num_classes + 1), where the +1 stands for the "no object" class.
    • Trained with a cross-entropy loss.
  • BBox Prediction Head
    • Predicts the bounding box of the object for each query.
    • The output has the format [center_x, center_y, width, height], with every value normalized to the range 0 to 1.
    • The prediction is compared with the ground-truth box and trained with an L1 loss and a GIoU loss.

These prediction heads are very simple FFNs made up of only two or three fully connected layers. They can be this simple because most of the learning happens inside the Transformer, and the final output only has to read that information out.
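As a rough sketch of how the two head outputs turn into usable detections, the snippet below takes per-query class logits (with the last entry as "no object") and raw box values, and produces pixel-space boxes. The function names, the 0.5 score threshold, and the 640×480 image size are hypothetical choices for illustration; applying a sigmoid to the raw box values is one common way to get the 0 to 1 range described above.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cxcywh_to_xyxy_pixels(box, img_w, img_h):
    """Normalized (center_x, center_y, width, height) -> pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return [(cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h]

def decode(class_logits, box_logits, img_w, img_h, threshold=0.5):
    """class_logits: per query, num_classes + 1 scores (last one = "no object")."""
    detections = []
    for logits, raw_box in zip(class_logits, box_logits):
        probs = softmax(logits)
        label = max(range(len(probs) - 1), key=lambda k: probs[k])  # best real class
        if probs[label] < threshold:
            continue  # this query effectively says "no object"
        box = [sigmoid(v) for v in raw_box]  # squash raw outputs into [0, 1]
        detections.append((label, probs[label],
                           cxcywh_to_xyxy_pixels(box, img_w, img_h)))
    return detections

# Two queries, two real classes plus "no object" (toy numbers, not model output).
class_logits = [[5.0, 0.0, 0.0],   # confident about class 0
                [0.0, 0.0, 5.0]]   # confident about "no object"
box_logits = [[0.0, 0.0, 0.0, 0.0],
              [1.0, -1.0, 0.5, 0.5]]

dets = decode(class_logits, box_logits, img_w=640, img_h=480)
print(len(dets))   # 1
print(dets[0][2])  # [160.0, 120.0, 480.0, 360.0]
```

Only the first query survives, and no NMS is involved: queries that predict "no object" are simply dropped.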

 

3. Object Query

The object query is a component unique to DETR.

  • Each query is a learnable vector for predicting one object.
  • The queries are fed into the Transformer decoder, where they interact with the image features and pull out the relevant object information.

In other words, an object query plays the role of a request: "this query should predict some object in the image." The number of queries is fixed, and through training each query naturally specializes in detecting a particular kind of object.

 

4. Hungarian Matching Algorithm and Matching Cost

One of DETR's most distinctive features is that it makes the object detection pipeline trainable end-to-end. The key ingredient that makes this possible is Hungarian matching.

 

ํ—๊ฐ€๋ฆฌ์•ˆ ๋งค์นญ์˜ ์—ญํ• 

DETR์—์„œ๋Š” ๋””์ฝ”๋”๊ฐ€ ๊ณ ์ •๋œ ์ˆ˜์˜ ์˜ค๋ธŒ์ ํŠธ ์ฟผ๋ฆฌ(Object Queries)๋ฅผ ์ž…๋ ฅ๋ฐ›์•„, ๊ฐ ์ฟผ๋ฆฌ์— ๋Œ€ํ•ด ๊ฐ์ฒด์˜ ํด๋ž˜์Šค์™€ ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, 100๊ฐœ์˜ ์ฟผ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ๋ผ๋ฉด ๋งค ์˜ˆ์ธก ์‹œ์ ๋งˆ๋‹ค 100๊ฐœ์˜ ๊ฐ์ฒด ํ›„๋ณด๊ฐ€ ์ƒ์„ฑ๋œ๋‹ค. ํ•˜์ง€๋งŒ ์‹ค์ œ ์ด๋ฏธ์ง€์—๋Š” ๊ฐ์ฒด๊ฐ€ 5๊ฐœ๋งŒ ์žˆ์„ ์ˆ˜๋„ ์žˆ๊ณ , 12๊ฐœ์ผ ์ˆ˜๋„ ์žˆ๋‹ค. ์ฆ‰, ์˜ˆ์ธก ๊ฒฐ๊ณผ์™€ ์‹ค์ œ ์ •๋‹ต์˜ ์ˆ˜๊ฐ€ ๋‹ค๋ฅด๋ฉฐ, ์ˆœ์„œ๋„ ์ „ํ˜€ ์ผ์น˜ํ•˜์ง€ ์•Š๋Š”๋‹ค. ์ด์ฒ˜๋Ÿผ ์˜ˆ์ธก๋œ ๊ฐ’๊ณผ ์‹ค์ œ ๊ฐ’์˜ ์ˆ˜๊ฐ€ ๋‹ค๋ฅด๊ณ  ์ผ๋Œ€์ผ ๋Œ€์‘์ด ๋ถˆ๋ถ„๋ช…ํ•œ ์ƒํ™ฉ์—์„œ, ๊ฐ ์˜ˆ์ธก๊ฐ’์ด ์–ด๋–ค ์‹ค์ œ ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•˜๋ ค ํ•œ ๊ฒƒ์ธ์ง€ ๋งค์นญํ•ด์ฃผ๋Š” ๊ณผ์ •์ด ํ•„์š”ํ•˜๋‹ค.

 

์ด๋ฅผ ์œ„ํ•ด DETR์€ ํ—๊ฐ€๋ฆฌ์•ˆ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•œ๋‹ค. ํ—๊ฐ€๋ฆฌ์•ˆ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์˜ˆ์ธก๊ฐ’๊ณผ ์ •๋‹ต ๊ฐ„์˜ ๋งค์นญ ๋น„์šฉ(Matching Cost)์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ, ๊ฐ€์žฅ ํšจ์œจ์ ์ธ 1:1 ๋งค์นญ์„ ์ฐพ์•„์ฃผ๋Š” ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค.

 

์™œ ๋งค์นญ์ด ํ•„์š”ํ•œ๊ฐ€?

๊ณ ์ •๋œ ์ˆ˜์˜ ์˜ค๋ธŒ์ ํŠธ ์ฟผ๋ฆฌ๋Š” ๋ชจ๋ธ์ด ํ•ญ์ƒ ์ผ์ •ํ•œ ๊ฐœ์ˆ˜์˜ ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ๊ฐ•์ œํ•œ๋‹ค. ํ•˜์ง€๋งŒ ์‹ค์ œ ์ •๋‹ต์€ ์œ ๋™์ ์ด๊ธฐ ๋•Œ๋ฌธ์—, ํ•™์Šต ์‹œ์— "์–ด๋–ค ์˜ˆ์ธก๊ฐ’์ด ์–ด๋–ค ์ •๋‹ต๊ณผ ๋Œ€์‘๋˜๋Š”์ง€"๊ฐ€ ๋ช…ํ™•ํ•˜์ง€ ์•Š์œผ๋ฉด Loss๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์—†๋‹ค.

 

์˜ˆ๋ฅผ ๋“ค์–ด, 100๊ฐœ์˜ ์˜ˆ์ธก ์ค‘ 5๊ฐœ๋งŒ ์‹ค์ œ ๊ฐ์ฒด์™€ ๊ด€๋ จ์ด ์žˆ๊ณ , ๋‚˜๋จธ์ง€๋Š” ๋ชจ๋‘ "no object"์ธ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. ์ด ๋•Œ ํ—๊ฐ€๋ฆฌ์•ˆ ๋งค์นญ์€ ๊ฐ€์žฅ ์ ์ ˆํ•œ ์˜ˆ์ธก๊ฐ’ 5๊ฐœ๋ฅผ ์„ ํƒํ•ด ์‹ค์ œ ๊ฐ์ฒด์™€ ๋งค์นญ์‹œ์ผœ์ค€๋‹ค. ๋‚˜๋จธ์ง€ 95๊ฐœ๋Š” "๋ฐฐ๊ฒฝ"์œผ๋กœ ๋ถ„๋ฅ˜๋˜์–ด ๋ณ„๋„์˜ Loss๋กœ ์ฒ˜๋ฆฌ๋œ๋‹ค.

 

Components of the Matching Cost

For the Hungarian algorithm to perform the matching, a "cost" between each prediction and each ground-truth object has to be defined. This cost reflects not just a simple distance but the overall accuracy of the predicted object. In DETR, the matching cost consists of the following three terms:

  1. Class Matching Cost
    Reflects the difference between the predicted class and the actual class. It is usually described as a cross-entropy or negative log-likelihood term; the original DETR matcher uses the negative predicted probability of the ground-truth class directly.
  2. BBox Matching L1 Cost
    Computes the L1 distance between the predicted and ground-truth bounding box coordinates.
    It reflects localization accuracy and takes both the center coordinates and the size into account.
  3. GIoU Matching Cost (Generalized IoU)
    Evaluates how much the bounding boxes overlap. GIoU is a more refined measure than plain IoU: even when two boxes do not overlap, it accounts for how far apart they are and therefore provides a better gradient.

The final matching cost is a weighted sum of these three terms, and the Hungarian algorithm finds the set of prediction/ground-truth pairs that minimizes it.
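Here is a small sketch of how the three terms could be combined for a single prediction/ground-truth pair. The function names are hypothetical, the weights (1, 5, 2) are assumed defaults rather than values taken from this post, and the class term uses the negative predicted probability of the true class; a negative log-likelihood could be swapped in for that term.

```python
def box_to_xyxy(box):
    """(center_x, center_y, width, height) -> (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def giou(box_a, box_b):
    """Generalized IoU of two (cx, cy, w, h) boxes; the value lies in [-1, 1]."""
    ax0, ay0, ax1, ay1 = box_to_xyxy(box_a)
    bx0, by0, bx1, by1 = box_to_xyxy(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest box enclosing both; this is what keeps GIoU informative
    # even when the boxes do not overlap at all.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def matching_cost(pred_probs, pred_box, gt_class, gt_box,
                  w_class=1.0, w_l1=5.0, w_giou=2.0):  # weights are assumptions
    cost_class = -pred_probs[gt_class]                  # higher probability = lower cost
    cost_l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    cost_giou = -giou(pred_box, gt_box)                 # higher overlap = lower cost
    return w_class * cost_class + w_l1 * cost_l1 + w_giou * cost_giou

gt_box = (0.25, 0.25, 0.5, 0.5)
far_box = (0.75, 0.75, 0.5, 0.5)   # touches gt_box only at a corner

print(giou(gt_box, gt_box))    # 1.0
print(giou(gt_box, far_box))   # -0.5

good = matching_cost([0.1, 0.8, 0.1], gt_box, gt_class=1, gt_box=gt_box)
bad = matching_cost([0.1, 0.8, 0.1], far_box, gt_class=1, gt_box=gt_box)
print(good < bad)              # True
```

Evaluating `matching_cost` for every (ground-truth, query) pair gives the cost matrix that the matching step minimizes over.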

 

ํ—๊ฐ€๋ฆฌ์•ˆ ๋งค์นญ์˜ ์žฅ์ 

  • ํ›„์ฒ˜๋ฆฌ ํ•„์š” ์—†์Œ: ์ „ํ†ต์ ์ธ ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ๊ณผ ๋‹ฌ๋ฆฌ, NMS(Non-Maximum Suppression) ๊ฐ™์€ ํ›„์ฒ˜๋ฆฌ ๊ณผ์ •์ด ํ•„์š” ์—†๋‹ค.
  • ๋‹จ์ˆœํ•œ ๊ตฌ์กฐ: ์•ต์ปค ๋ฐ•์Šค๊ฐ€ ํ•„์š” ์—†๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ ๊ตฌ์กฐ๊ฐ€ ํ›จ์”ฌ ๊ฐ„๊ฒฐํ•ด์ง„๋‹ค.
  • ์ผ๊ด€๋œ ํ•™์Šต ๊ฐ€๋Šฅ: ๊ฐ ์˜ˆ์ธก์— ๋Œ€ํ•ด ์ •๋‹ต๊ณผ์˜ ๋งค์นญ์ด ๋ช…ํ™•ํžˆ ์ •์˜๋˜๋ฏ€๋กœ, ํ•™์Šต ๊ณผ์ •์ด ์•ˆ์ •์ ์ด๊ณ  ์ผ๊ด€๋œ๋‹ค.
  • ๋‹ค๋Œ€๋‹ค ์˜ˆ์ธก์—์„œ 1:1 ๋งค์นญ์œผ๋กœ: ๊ณ ์ •๋œ ์ˆ˜์˜ ์ฟผ๋ฆฌ์—์„œ ๋‹ค์ˆ˜์˜ ์˜ˆ์ธก์„ ํ•˜๋˜, ์‹ค์ œ ๊ฐ์ฒด์™€๋Š” ์ •ํ™•ํ•œ 1:1 ๋Œ€์‘์„ ์ฐพ๊ธฐ ๋•Œ๋ฌธ์— ๋ถˆํ•„์š”ํ•œ ์ค‘๋ณต ์˜ˆ์ธก์ด ์ค„์–ด๋“ ๋‹ค.

 

5. Strengths and Limitations of DETR

Strengths

  • No NMS: the predictions can be used directly, without post-processing
  • Compact structure: anchor boxes, region proposals, and the like are gone, so the structure is simpler
  • Consistent training: query-based prediction makes matching and training intuitive

Weaknesses

  • Weak on small objects: the Transformer operates on a low-resolution feature map, so local detail is lost
  • Slow convergence: training takes longer than for CNN-based models

Variants that address these weaknesses, such as Deformable DETR, DINO (DINO-DETR), and H-DETR, were proposed later. Deformable DETR in particular greatly improved both small-object detection and training speed.

 

6. Summary

  • DETR is the first model to achieve end-to-end object detection by combining a CNN backbone with a Transformer.
  • It predicts objects directly with object queries and matches them to the ground truth automatically with Hungarian matching.
  • It detects objects without post-processing, and its simple structure makes the model easier to interpret.
  • The many DETR-based models that followed have created a new trend in object detection.
๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Detection & Segmentation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[Object Detection] ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ (2) : Fast RCNN, Faster RCNN  (0) 2024.08.11
[Object Detection] ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ (1) : RCNN, SPPNet  (0) 2024.08.11
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Fast Segment Anything | Fast SAM | SAM์˜ ๊ฒฝ๋Ÿ‰ํ™”  (0) 2023.07.02
[๋…ผ๋ฌธ ์†Œ๊ฐœ] TAM (Track Anything Model) | ์–ด๋–ค ๊ฒƒ์ด๋“  ์ถ”์ ํ•˜๋Š” Vision AI ๋ชจ๋ธ | Sagment Anything ๋น„๋””์˜ค ๋ฒ„์ „  (0) 2023.04.30
[๋…ผ๋ฌธ ์†Œ๊ฐœ] DINOv2 - Self-supervised Vision Transformer | Meta AI | ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ์—†์ด ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” Vision AI ๋ชจ๋ธ  (0) 2023.04.29
'๐Ÿ› Research/Detection & Segmentation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [Object Detection] ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ (2) : Fast RCNN, Faster RCNN
  • [Object Detection] ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ (1) : RCNN, SPPNet
  • [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Fast Segment Anything | Fast SAM | SAM์˜ ๊ฒฝ๋Ÿ‰ํ™”
  • [๋…ผ๋ฌธ ์†Œ๊ฐœ] TAM (Track Anything Model) | ์–ด๋–ค ๊ฒƒ์ด๋“  ์ถ”์ ํ•˜๋Š” Vision AI ๋ชจ๋ธ | Sagment Anything ๋น„๋””์˜ค ๋ฒ„์ „
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    CV DOODLE
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (202) N
      • ๐Ÿ“– Fundamentals (33)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (15)
        • NLP (2)
        • etc. (1)
      • ๐Ÿ› Research (67) N
        • Deep Learning (7)
        • Image Classification (2)
        • Detection & Segmentation (17)
        • OCR (7)
        • Multi-modal (4)
        • Generative AI (8) N
        • 3D Vision (3)
        • Material & Texture Recognit.. (8)
        • NLP & LLM (11)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (7)
        • Distributed Training (4)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (86)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • C++ (1)
        • etc. (6)
      • ๐Ÿ’ฌ ETC (3)
        • ์ฑ… ๋ฆฌ๋ทฐ (3)
  • ๋งํฌ

  • ์ธ๊ธฐ ๊ธ€

  • ํƒœ๊ทธ

    OCR
    ๊ฐ์ฒด ๊ฒ€์ถœ
    ChatGPT
    ์ปดํ“จํ„ฐ๋น„์ „
    nlp
    ๋„์ปค
    pytorch
    CNN
    Python
    Computer Vision
    object detection
    ํŒŒ์ด์ฌ
    ml
    material recognition
    OpenCV
    ๋”ฅ๋Ÿฌ๋‹
    airflow
    VLP
    3D Vision
    pandas
    ํ”„๋กฌํ”„ํŠธ์—”์ง€๋‹ˆ์–ด๋ง
    deep learning
    generative ai
    LLM
    segmentation
    ๊ฐ์ฒด๊ฒ€์ถœ
    OpenAI
    AI
    multi-modal
    Text recognition
  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[Object Detection] DETR ๋ชจ๋ธ ์ดํ•ดํ•˜๊ธฐ! | End-to-end ๊ฐ์ฒด ๊ฒ€์ถœ ๋ชจ๋ธ
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”