[Paper Review] Unified Perceptual Parsing for Scene Understanding / UperNet / Multi-task learning

2021. 12. 4. 20:37 · Research/Detection & Segmentation
๋ฐ˜์‘ํ˜•

๋ณธ ๋…ผ๋ฌธ์€ ECCV 2018์— ๊ฒŒ์žฌ๋œ ๋…ผ๋ฌธ์œผ๋กœ ๋‹ค์–‘ํ•œ visual concepts ์ธ์‹ํ•˜๋Š”(multi-task learning) Unified Perceptual Parsing ์ด๋ผ๋Š” ์ƒˆ๋กœ์šด task ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

 

Introduction

 

์œ„ ๊ทธ๋ฆผ์€ ๊ฑฐ์‹ค(scene)์— ํ…Œ์ด๋ธ”, ๊ทธ๋ฆผ, ๋ฒฝ๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ๊ฐ์ฒด(object)๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ๊ณ  ๋™์‹œ์— ํ…Œ์ด๋ธ”์€ ํ…Œ์ด๋ธ” ๋‹ค๋ฆฌ, ์ƒํŒ, apron(part) ๋“ฑ์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ํ…Œ์ด๋ธ”์€ ๋‚˜๋ฌด(material)๋กœ ๋งŒ๋“ค์–ด์กŒ๊ณ  ์†ŒํŒŒ ํ‘œ๋ฉด์€ kinitted(texture) ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์นดํ…Œ๊ณ ๋ฆฌ๋“ค์€ scene understanding, object/material/part/texture recognition task์—์„œ ๊ฐ๊ฐ ๋…๋ฆฝ์ ์œผ๋กœ ์ˆ˜ํ–‰๋˜์–ด ์™”์Šต๋‹ˆ๋‹ค.

 

To perform all of these at once, the paper proposes a new task called UPP (Unified Perceptual Parsing) and a new training method to solve it. Two problems have to be addressed first:

 

1) No dataset is labeled with visual information at every level

Separate datasets exist for each task, such as ADE20K for scene parsing, DTD for texture recognition, and OpenSurfaces for material recognition.

 

2) Annotations at different perceptual levels are heterogeneous

For example, ADE20K is labeled pixel-wise, whereas DTD is labeled image-wise.
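To make the mismatch concrete, here is a minimal sketch of how samples from the two kinds of sources might sit side by side in one training set. The `Sample` class, its field names, and the toy label values are hypothetical and do not come from the paper or from any dataset loader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One training sample; all field names are illustrative assumptions."""
    source: str                                      # e.g. "ADE20K" or "DTD"
    pixel_labels: Optional[List[List[int]]] = None   # H x W class ids (pixel-wise)
    image_label: Optional[int] = None                # one class id (image-wise)


def annotation_kind(sample: Sample) -> str:
    """Report which granularity of supervision a sample carries."""
    if sample.pixel_labels is not None:
        return "pixel-wise"
    if sample.image_label is not None:
        return "image-wise"
    return "unlabeled"


# Toy samples: one with a per-pixel label map, one with a single image label.
ade_like = Sample(source="ADE20K", pixel_labels=[[0, 0], [1, 2]])
dtd_like = Sample(source="DTD", image_label=7)
print(annotation_kind(ade_like), annotation_kind(dtd_like))
```

A single network trained on both has to cope with the fact that any given sample supervises only some of its outputs.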

 

To solve these problems, the paper proposes a framework that overcomes the heterogeneity of the different datasets and learns visual concepts at multiple levels simultaneously.

 

Datasets

The paper uses the Broadly and Densely Labeled Dataset (Broden), which covers a wide range of visual concepts. Broden combines ADE20K, Pascal-Context, Pascal-Part, OpenSurfaces, and DTD. Objects, object parts, and materials are labeled pixel-wise, while scenes and textures are labeled image-wise. Because the number of samples per class is imbalanced, the authors apply several modifications, such as merging similar classes, to build the Broden+ dataset.
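As a rough illustration of what merging similar classes does to the class distribution, the sketch below relabels a toy list of annotations with a merge map. The label names and the merge rule are made up for illustration; the actual Broden+ merge rules are not given here.

```python
from collections import Counter


def merge_classes(labels, merge_map):
    """Map each label to its merged class; labels not in merge_map stay unchanged."""
    return [merge_map.get(label, label) for label in labels]


# Hypothetical labels and merge rule, for illustration only.
labels = ["sofa", "couch", "table", "couch", "desk"]
merge_map = {"couch": "sofa"}

before = Counter(labels)
after = Counter(merge_classes(labels, merge_map))
print(before)
print(after)
```

After merging, the rare class disappears and its samples count toward the class it was merged into, which evens out the per-class sample counts.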

 

Below is an example from the Broden+ dataset.

 

 

Designing Networks for Unified Perceptual Parsing

 

์œ„ ๊ทธ๋ฆผ์€ ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•˜๋Š” UperNet(Unified Perceptual Parsing Network) ์ด๋ฉฐ Feature Pyramid Network(FPN) ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ํšจ๊ณผ์ ์ธ global prior representation์„ ์ถ”์ถœํ•˜๋Š” PSPNet ์˜ PPM(Pyramid Pooling Module) ์„ backbone net ์˜ ๋งˆ์ง€๋ง‰ layer์— ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค.

 

To learn several visual tasks simultaneously, a Head containing conv layers is attached for each task to produce that task's prediction. The backbone net, which holds most of the parameters, is shared, and only a lightweight Head is added per task. The number of output channels of each Head equals the number of classes of its task.
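To get a feel for how light such a head is, the sketch below counts the parameters of a head made of a 3x3 conv followed by a classifier. It assumes the classifier is a 1x1 conv, and the feature width and class counts are hypothetical values chosen for illustration, not figures from the paper.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch


def head_params(in_ch, mid_ch, num_classes):
    """3x3 conv followed by a 1x1 classifier (assumed) with num_classes output channels."""
    return conv_params(in_ch, mid_ch, 3) + conv_params(mid_ch, num_classes, 1)


# Hypothetical feature width and class counts, for illustration only.
FEATURE_CH = 256
NUM_CLASSES = {"object": 150, "part": 50, "material": 20}

for task, n in NUM_CLASSES.items():
    print(task, head_params(FEATURE_CH, FEATURE_CH, n))
```

Under these assumptions each head stays well under a million parameters, while a ResNet-50 class backbone has tens of millions, so adding a task costs little next to the shared part.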

 

- Scene : Because an image-wise prediction is needed, the scene is predicted after passing through backbone net → PPM Head → Scene Head. The Scene Head consists of 3x3 conv + GAP (global average pooling) + classifier.

 

- Object / Part : The authors found experimentally that objects and parts are predicted best when the feature maps from all levels are combined, so the low- to high-level features are fused and a separate head is attached for object and for part. The object, part, and material heads consist of 3x3 conv + classifier. (GAP is not used, because segmentation must not lose spatial information.)

 

- Material : Only low-level features are used. The paper itself stresses that context information matters for material recognition, so using only low-level features seems a little odd (?).

(My guess: tasks such as object and part, where shape information matters, need very different high-level information from material, so training the material head on all low- to high-level features would probably hurt network performance.)

 

- Texture : The features needed for texture differ greatly from those learned for tasks such as scene and object, so only low-level features are taken from the backbone net. During training, each texture image is treated as if it were annotated pixel-wise (every DTD sample is filled with its class across the entire field of view). In addition, no gradient is propagated to the backbone net (so its training is unaffected); only the texture head, made of four stacked 3x3 convs, is trained.
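Two of the operations described in the list above are easy to show on toy data: GAP in the Scene Head collapses a feature map into one vector, and the texture trick turns a single image-wise label into a pixel-wise target. This is a minimal pure-Python sketch on nested lists; the function names and the toy values are hypothetical, and a real implementation would operate on tensors.

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W nested list into a length-C vector (spatial layout is lost)."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def expand_image_label(class_id, height, width):
    """Turn one image-wise label into an H x W map that repeats it at every pixel."""
    return [[class_id] * width for _ in range(height)]


# Hypothetical 2-channel, 2x2 feature map.
features = [[[1.0, 3.0], [5.0, 7.0]],
            [[0.0, 0.0], [2.0, 2.0]]]
scene_vector = global_average_pool(features)
texture_target = expand_image_label(class_id=4, height=2, width=3)
print(scene_vector)
print(texture_target)
```

The first function shows why GAP suits scene classification but not the segmentation heads: one value per channel remains, and the position of every activation is gone.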

 

Pyramid Pooling Module (PPM)

https://mvje.tistory.com/33
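For readers who do not follow the link, the core idea of pyramid pooling can be sketched as average-pooling the same feature map into grids of several sizes. The sketch below handles a single channel in pure Python and assumes the bin sizes (1, 2, 3, 6) commonly cited for PSPNet. It leaves out the 1x1 convs, upsampling, and concatenation that a full PPM applies afterwards, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def adaptive_avg_pool(channel, bins):
    """Average-pool an H x W grid into bins x bins cells (cells may overlap)."""
    h, w = len(channel), len(channel[0])
    out = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, ceil_div((i + 1) * h, bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, ceil_div((j + 1) * w, bins)
            cell = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def pyramid_pool(channel, bin_sizes=(1, 2, 3, 6)):
    """Pool one channel at several grid sizes, from global (1x1) to fine."""
    return {b: adaptive_avg_pool(channel, b) for b in bin_sizes}


# Toy 6x6 channel holding the values 0..35.
grid = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pooled = pyramid_pool(grid)
print(pooled[1], pooled[2])
```

The 1x1 level is a global summary of the whole map, and the finer levels keep progressively more spatial detail.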

 

Experiments

์‹คํ—˜์€ training data๋ฅผ ์—ฌ๋Ÿฌ task๋ฅผ ํ•˜๋‚˜์”ฉ ์ถ”๊ฐ€ํ•ด๊ฐ€๋ฉฐ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‹น์—ฐํ•œ ๊ฒฐ๊ณผ์ด์ง€๋งŒ task ์ˆ˜๊ฐ€ ๋งŽ์•„์งˆ ์ˆ˜๋ก ์„ฑ๋Šฅ์€ ์กฐ๊ธˆ์”ฉ ๋–จ์–ด์ง‘๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์•„์ฃผ ๋ฏธ๋น„ํ•œ ์„ฑ๋Šฅ์ €ํ•˜์ด๊ธฐ ๋•Œ๋ฌธ์— scene, object, part, material ๋“ฑ์˜ visual task๋“ค์ด ํ•˜๋‚˜์˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์ด์šฉํ•ด์„œ ํšจ๊ณผ์ ์œผ๋กœ ์ˆ˜ํ–‰๋  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. 

 

However, UperNet does not use the predictions of the individual tasks to complement one another, even though scene and object, object and material, and material and texture are strongly correlated in the real world. Exploiting these relationships might create synergy when the tasks are learned together, and could even make per-task performance better rather than worse.

 

Below are the visualized experiment results.

๋ฐ˜์‘ํ˜•

'๐Ÿ› Research > Detection & Segmentation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] FaPN: Feature-aligned Pyramid Network for Dense Image Prediction  (0) 2022.01.19
[๊ฐ„๋‹จ ์„ค๋ช…] Semi-Supervised Semantic Segmentation / Segmentation์—์„œ unlabeled ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•  (0) 2022.01.13
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Feature Pyramid Networks for Object Detection / FPN / ๊ฐ์ฒด์˜ ์Šค์ผ€์ผ์— invariantํ•œ ๋„คํŠธ์›Œํฌ  (0) 2022.01.13
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis / RGB-D ์˜์ƒ์—์„œ์˜ segementation  (0) 2022.01.12
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Pyramid Scene Parsing Network / PSPNet / Pyramid Pooling  (0) 2021.12.05
'๐Ÿ› Research/Detection & Segmentation' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
  • [๊ฐ„๋‹จ ์„ค๋ช…] Semi-Supervised Semantic Segmentation / Segmentation์—์„œ unlabeled ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•
  • [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Feature Pyramid Networks for Object Detection / FPN / ๊ฐ์ฒด์˜ ์Šค์ผ€์ผ์— invariantํ•œ ๋„คํŠธ์›Œํฌ
  • [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis / RGB-D ์˜์ƒ์—์„œ์˜ segementation
  • [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Pyramid Scene Parsing Network / PSPNet / Pyramid Pooling
๋ญ…์ฆค
๋ญ…์ฆค
AI ๊ธฐ์ˆ  ๋ธ”๋กœ๊ทธ
    ๋ฐ˜์‘ํ˜•
  • ๋ญ…์ฆค
    CV DOODLE
    ๋ญ…์ฆค
  • ์ „์ฒด
    ์˜ค๋Š˜
    ์–ด์ œ
  • ๊ณต์ง€์‚ฌํ•ญ

    • โœจ About Me
    • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (203) N
      • ๐Ÿ“– Fundamentals (33)
        • Computer Vision (9)
        • 3D vision & Graphics (6)
        • AI & ML (15)
        • NLP (2)
        • etc. (1)
      • ๐Ÿ› Research (68) N
        • Deep Learning (7)
        • Image Classification (2)
        • Detection & Segmentation (17)
        • OCR (7)
        • Multi-modal (4)
        • Generative AI (9) N
        • 3D Vision (3)
        • Material & Texture Recognit.. (8)
        • NLP & LLM (11)
        • etc. (0)
      • ๐Ÿ› ๏ธ Engineering (7)
        • Distributed Training (4)
        • AI & ML ์ธ์‚ฌ์ดํŠธ (3)
      • ๐Ÿ’ป Programming (86)
        • Python (18)
        • Computer Vision (12)
        • LLM (4)
        • AI & ML (18)
        • Database (3)
        • Apache Airflow (6)
        • Docker & Kubernetes (14)
        • ์ฝ”๋”ฉ ํ…Œ์ŠคํŠธ (4)
        • C++ (1)
        • etc. (6)
      • ๐Ÿ’ฌ ETC (3)
        • ์ฑ… ๋ฆฌ๋ทฐ (3)
  • ๋งํฌ

  • ์ธ๊ธฐ ๊ธ€

  • ํƒœ๊ทธ

    segmentation
    airflow
    Text recognition
    object detection
    diffusion
    ๊ฐ์ฒด๊ฒ€์ถœ
    ๋„์ปค
    AI
    nlp
    material recognition
    OCR
    ์ปดํ“จํ„ฐ๋น„์ „
    deep learning
    OpenCV
    OpenAI
    Python
    pytorch
    ml
    ํ”„๋กฌํ”„ํŠธ์—”์ง€๋‹ˆ์–ด๋ง
    multi-modal
    3D Vision
    CNN
    ChatGPT
    ํŒŒ์ด์ฌ
    LLM
    ๊ฐ์ฒด ๊ฒ€์ถœ
    generative ai
    pandas
    Computer Vision
    ๋”ฅ๋Ÿฌ๋‹
  • ์ตœ๊ทผ ๋Œ“๊ธ€

  • ์ตœ๊ทผ ๊ธ€

  • hELLOยท Designed By์ •์ƒ์šฐ.v4.10.3
๋ญ…์ฆค
[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Unified Perceptual Parsing for Scene Understanding / UperNet / Multi-task learning
์ƒ๋‹จ์œผ๋กœ

ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”