[Paper Review] End-to-End Object Detection with Transformers | DETR Explained

by 뭅즤 2023. 11. 25.

Today I want to review DETR (ECCV 2020), the model Meta released in 2020. It has been cited close to 9,000 times, and DETR-based work shows up regularly among recently published object detection papers: Deformable DETR, Conditional DETR, Group DETR, Co-DETR, ...


DETR (DEtection TRansformer)

 

DETR introduces a new detection scheme built on a Transformer and bipartite matching, and the authors describe it as an architecture that needs no hand-crafted engineering such as RPN or NMS. The structure is very simple, it extends well to other tasks, and because it relies on the attention mechanism, the authors report that it detects large objects better than Faster RCNN.

 

 

Looking at the architecture, you can tell how simple it is. The input image is passed through a CNN, then through a Transformer encoder-decoder, and the resulting outputs go through an FFN that predicts each object's class and bbox location. To compute the loss between the predictions and the GT without any hand-crafted rules, the two sets have to be matched one-to-one. The method used for this is bipartite matching. Put simply, DETR uses Hungarian matching, which finds the optimal assignment with no duplicates. The usual example is a problem where 4 workers have to do 4 tasks, and workers are matched to tasks so that the work as a whole is done most efficiently.
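
To make the worker/task example concrete, here is a minimal sketch. The cost matrix is made up for illustration, and `best_assignment` is a hypothetical helper that simply tries every permutation. This is brute force, not the Hungarian algorithm itself, which reaches the same optimum in polynomial time. In DETR the "workers" would be prediction slots, the "tasks" would be GT objects, and the cost would combine class probability and box distance.

```python
from itertools import permutations

# Hypothetical cost matrix: cost[i][j] = hours worker i needs for task j.
cost = [
    [9, 2, 7, 8],
    [6, 4, 3, 7],
    [5, 8, 1, 8],
    [7, 6, 9, 4],
]


def best_assignment(cost):
    """Return (total_cost, assignment), where assignment[i] is the task for worker i.

    Every one-to-one assignment is a permutation of the task indices,
    so we enumerate all of them and keep the cheapest one.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


total, assignment = best_assignment(cost)
print(total, assignment)
```

Note that the greedy choice (each worker takes their cheapest task) fails here because workers 1 and 2 both prefer task 2, which is exactly the kind of duplicate that one-to-one matching rules out.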

 

 

- Converting the image into Transformer input

  • A CNN (ResNet in the paper) is used to obtain a feature map of a given dimension
  • A 1x1 conv reduces the number of channels to the preset token embedding dimension (d)
  • Finally, the result is flattened to d*HW to form the sequence used as the Transformer input
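
The three steps above can be sketched with plain nested lists. The sizes (C=3, H=2, W=2, d=2) and the weights are toy values I picked for illustration, and `conv1x1` / `flatten_to_tokens` are hypothetical helper names, not functions from the DETR code. The point is only that a 1x1 conv is a per-pixel linear projection of the channels, and that flattening turns the H x W grid into HW tokens of dimension d.

```python
def conv1x1(feature_map, weight):
    """Project C input channels to d output channels independently at every pixel.

    feature_map: C x H x W nested lists, weight: d x C nested lists.
    Returns a d x H x W nested list.
    """
    C = len(feature_map)
    H, W = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[sum(w_row[c] * feature_map[c][y][x] for c in range(C)) for x in range(W)]
         for y in range(H)]
        for w_row in weight
    ]


def flatten_to_tokens(projected):
    """Turn d x H x W into a sequence of H*W tokens, each a d-dimensional vector."""
    d = len(projected)
    H, W = len(projected[0]), len(projected[0][0])
    return [[projected[k][y][x] for k in range(d)] for y in range(H) for x in range(W)]


# Toy backbone output with C=3 channels on a 2x2 grid.
feature_map = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
# Toy 1x1 conv weights mapping C=3 channels to d=2 channels.
weight = [
    [1, 0, 0],  # output channel 0 copies input channel 0
    [0, 1, 1],  # output channel 1 sums input channels 1 and 2
]

tokens = flatten_to_tokens(conv1x1(feature_map, weight))
print(len(tokens), len(tokens[0]))  # number of tokens (H*W), token dimension (d)
```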

 

- Positional Encoding

  • A 2D sine positional encoding is used
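
As a rough sketch of what a 2D sine positional encoding can look like: half of the d channels encode the row (y) and the other half encode the column (x), each with the alternating sin/cos scheme of the original Transformer. The function name, the `temperature` default, and the exact channel layout are my assumptions for illustration; the official implementation differs in details (for example coordinate normalization) that I do not reproduce here.

```python
import math


def sine_position_encoding_2d(height, width, d, temperature=10000.0):
    """Return an H x W grid of d-dimensional position encodings.

    The first d/2 channels depend only on the row index y,
    the last d/2 channels depend only on the column index x.
    """
    assert d % 4 == 0, "d must be divisible by 4 (sin/cos pairs for both axes)"
    half = d // 2

    def encode_1d(pos):
        out = []
        for i in range(0, half, 2):
            # Frequencies are spaced geometrically, as in the 1D sinusoidal encoding.
            freq = temperature ** (i / half)
            out.append(math.sin(pos / freq))
            out.append(math.cos(pos / freq))
        return out

    return [[encode_1d(y) + encode_1d(x) for x in range(width)] for y in range(height)]


pe = sine_position_encoding_2d(height=3, width=4, d=8)
print(len(pe), len(pe[0]), len(pe[0][0]))  # H, W, d
```

Because the flattened token sequence has lost its grid layout, this encoding is what lets attention tell apart two tokens that have the same content but sit at different (y, x) positions.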

 

- Encoder-Decoder

  • Self-attention learns the relationships between query slots
  • An object query is a slot for holding information; through encoder-decoder attention it learns which part of the image to focus on
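
To illustrate how a query slot ends up "looking at" part of the image, here is a stripped-down sketch of encoder-decoder attention: single head, scaled dot-product, and no learned projection matrices. All of these simplifications, the function names, and the toy vectors are my assumptions; the real model uses multi-head attention with learned weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention without learned projections.

    queries: N object-query vectors, memory: HW encoder-output vectors (same dimension d).
    Returns (outputs, weights); weights[n][p] is how much query n attends to position p.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory]
        w = softmax(scores)
        # Each output is the attention-weighted average of the memory vectors.
        out = [sum(w[p] * memory[p][j] for p in range(len(memory))) for j in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy encoder memory with 3 image positions and 2 query slots (d=2).
memory = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
queries = [[3.0, 0.0], [0.0, 3.0]]
outputs, weights = cross_attention(queries, memory)

for n, w in enumerate(weights):
    focus = max(range(len(w)), key=w.__getitem__)
    print(f"query {n} attends mostly to position {focus}")
```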

 

- FFN (Feed Forward Network)

  • Normalizes the decoder output and helps training
  • The decoder embeddings are fed into the FFN, which outputs, for each slot, whether it predicts an object and where that object is
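
A small sketch of how one slot's output could be read, assuming the FFN gives class logits whose last entry means "no object" and a box as normalized (cx, cy, w, h). `decode_slot`, the class names, and the numbers are hypothetical and only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_slot(class_logits, box, img_w, img_h, class_names):
    """Turn one slot's raw outputs into a detection, or None for 'no object'.

    class_logits: len(class_names) + 1 values; the last one is the 'no object' class.
    box: (cx, cy, w, h), each normalized to [0, 1] relative to the image size.
    """
    probs = softmax(class_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if best == len(class_names):
        return None  # this slot predicts that it holds no object
    cx, cy, w, h = box
    # Convert the normalized center format to pixel corner coordinates.
    x1, y1 = (cx - w / 2) * img_w, (cy - h / 2) * img_h
    x2, y2 = (cx + w / 2) * img_w, (cy + h / 2) * img_h
    return class_names[best], probs[best], (x1, y1, x2, y2)


names = ["cat", "dog"]
det = decode_slot([0.1, 3.0, 0.2], (0.5, 0.5, 0.2, 0.4), 640, 480, names)
empty = decode_slot([0.0, 0.0, 5.0], (0.5, 0.5, 0.2, 0.4), 640, 480, names)
print(det)
print(empty)
```

Since every slot either reports an object or reports "no object", the number of slots is fixed while the number of detections varies, and no NMS step is needed afterward.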

 

 

Experimental Results

 

  • The results are compared against Faster RCNN, because Faster RCNN relies on many hand-crafted components and is not an end-to-end detector
  • Compared with Faster RCNN, DETR is a much simpler end-to-end detector, and its performance is also higher
  • Thanks to the attention mechanism it performs well on large objects, but since it has nothing like FPN to account for object scale, its performance on small objects is worse

 

 

Conclusion

The main contribution is proposing an uncomplicated end-to-end object detector, but the work has drawbacks: training takes a very long time and small objects are not detected well. These drawbacks were later addressed by follow-up studies such as Deformable DETR, and given that the SOTA for object detection on the COCO dataset is, as of this writing, also a DETR-based model, this is a meaningful piece of research.
