[Paper Review] Unified Perceptual Parsing for Scene Understanding / UperNet / Multi-task learning

by 뭅즤 2021. 12. 4.
๋ฐ˜์‘ํ˜•

This paper was published at ECCV 2018. It proposes a new task called Unified Perceptual Parsing, which recognizes many kinds of visual concepts at once (multi-task learning).

 

Introduction

 

In the figure above, the living room (scene) is made up of various objects such as a table, a painting, and a wall, and the table in turn consists of legs, a top, and an apron (parts). The table is made of wood (material), and the surface of the sofa is knitted (texture). Until now these categories have each been handled independently, in separate scene understanding and object/material/part/texture recognition tasks.

 

To perform all of these at the same time, the paper proposes a new task called UPP (Unified Perceptual Parsing) together with a new training method for solving it. There are a few problems to get past first:

 

1) There is no dataset in which visual information at every level is labeled

Datasets exist separately for each task, for example ADE20K for scene parsing, DTD for texture recognition, and OpenSurfaces for material recognition.

 

2) Annotations at different perceptual levels are heterogeneous

For example, ADE20K has pixel-wise labels, while DTD has image-wise labels.
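To make the heterogeneity concrete, here is a minimal sketch of how one sample record could carry either kind of label. The `Sample` class and its field names are hypothetical and chosen only for illustration; they do not come from the paper or from any dataset loader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One training sample; the field names are hypothetical."""
    source: str                                      # dataset name, e.g. "ADE20K" or "DTD"
    pixel_labels: Optional[List[List[int]]] = None   # H x W class ids (pixel-wise)
    image_label: Optional[int] = None                # one class id for the whole image

    def granularity(self) -> str:
        """Report which kind of annotation this sample carries."""
        if self.pixel_labels is not None:
            return "pixel-wise"
        if self.image_label is not None:
            return "image-wise"
        return "unlabeled"


# Made-up label values, just to show the two shapes of annotation.
ade_sample = Sample(source="ADE20K", pixel_labels=[[3, 3], [7, 7]])
dtd_sample = Sample(source="DTD", image_label=12)

for sample in (ade_sample, dtd_sample):
    print(sample.source, sample.granularity())
```

A training loop has to check the granularity before it can decide which loss a sample can feed, which is exactly the difficulty described in 2).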

 

To solve the problems above, the paper proposes a framework that overcomes the heterogeneity between datasets and learns visual concepts at multiple levels simultaneously.

 

Datasets

The paper builds on the Broadly and Densely Labeled Dataset (Broden), which covers a wide range of visual concepts. Broden combines ADE20K, Pascal-Context, Pascal-Part, OpenSurfaces, and DTD. In this dataset, objects, object parts, and materials are labeled pixel-wise, while scenes and textures are labeled image-wise. However, the number of samples per class is imbalanced, so the authors apply several modifications, such as merging similar classes, to build the Broden+ dataset.
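The class-merging step can be pictured with a small sketch. The alias table and the label names below are made up for illustration; the actual merge rules used for Broden+ are not given in this post.

```python
from collections import Counter

# Hypothetical alias table: raw label -> merged label.
MERGE_MAP = {
    "sofa": "couch",
    "tv": "television",
}


def merge_label(label: str) -> str:
    """Map a raw label to its merged class; unknown labels pass through."""
    return MERGE_MAP.get(label, label)


# Made-up raw labels coming from different source datasets.
raw_labels = ["sofa", "couch", "tv", "television", "table", "sofa"]

raw_counts = Counter(raw_labels)
merged_counts = Counter(merge_label(label) for label in raw_labels)

print(raw_counts)
print(merged_counts)
```

After merging, samples that were split across near-duplicate class names are pooled into one class, which reduces the number of classes with very few samples.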

 

์•„๋ž˜๋Š” Broden+ dataset์˜ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

 

 

Designing Networks for Unified Perceptual Parsing

 

The figure above shows UperNet (Unified Perceptual Parsing Network), the network proposed in the paper, which is built on the Feature Pyramid Network (FPN). In addition, the PPM (Pyramid Pooling Module) from PSPNet, which extracts an effective global prior representation, is attached to the last layer of the backbone network.
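As a rough illustration of what the pooling part of a PPM does, the sketch below average-pools a single-channel feature map into several grid sizes. The bin sizes (1, 2, 3, 6) are the ones usually associated with PSPNet and are an assumption here, since this post does not state them. The 1x1 convolutions, upsampling, and concatenation that follow the pooling in a real module are left out.

```python
from typing import Dict, List

Grid = List[List[float]]


def adaptive_avg_pool(feature: Grid, bins: int) -> Grid:
    """Average-pool an H x W map down to bins x bins cells."""
    h, w = len(feature), len(feature[0])
    pooled = []
    for i in range(bins):
        # Row range of this cell: floor for the start, ceiling for the end.
        r0 = (i * h) // bins
        r1 = ((i + 1) * h + bins - 1) // bins
        row = []
        for j in range(bins):
            c0 = (j * w) // bins
            c1 = ((j + 1) * w + bins - 1) // bins
            cells = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(feature: Grid, bin_sizes=(1, 2, 3, 6)) -> Dict[int, Grid]:
    """Pool the same map at several scales, coarse (global) to fine."""
    return {b: adaptive_avg_pool(feature, b) for b in bin_sizes}


# A toy 6 x 6 single-channel feature map holding the values 0..35.
feature = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pyramid = pyramid_pool(feature)

for bins, grid in pyramid.items():
    print(bins, grid[0])
```

The 1 x 1 level is a single global average, which is where the "global prior" comes from; the finer levels keep progressively more spatial detail.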

 

To learn several visual tasks at once, a Head containing conv layers is attached for each task and performs the segmentation. The backbone network, which holds most of the parameters, is shared, and only a lightweight Head is added per task. The number of channels in each Head's output feature equals the number of classes of that task.
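The relationship between class count, output channels, and head size can be checked with some quick arithmetic. Everything numeric below is hypothetical: the class counts, the 512-channel width, and the use of a 1x1 convolution as the classifier are assumptions for illustration, not values taken from the paper.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of one 2D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def head_params(c_in: int, c_mid: int, num_classes: int) -> int:
    """A 3x3 conv followed by a classifier (assumed here to be a 1x1 conv)."""
    return conv_params(c_in, c_mid, 3) + conv_params(c_mid, num_classes, 1)


def head_output_shape(num_classes: int, height: int, width: int) -> tuple:
    """A segmentation head emits one score map per class."""
    return (num_classes, height, width)


# Hypothetical class counts and channel width, for illustration only.
NUM_CLASSES = {"object": 100, "part": 50, "material": 20}
FEATURE_CHANNELS = 512

for task, n in NUM_CLASSES.items():
    shape = head_output_shape(n, 64, 64)
    params = head_params(FEATURE_CHANNELS, FEATURE_CHANNELS, n)
    print(task, shape, params)
```

Adding a task adds only one head's worth of parameters, while the backbone stays shared across all tasks.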

 

- Scene : this needs an image-wise prediction, so the scene is predicted after passing through backbone net → PPM Head → Scene Head. The Scene Head consists of 3x3 conv + GAP + classifier.

 

- Object / Part : the authors found experimentally that objects and parts are predicted best when the feature maps from all levels are combined, so the low- to high-level features are fused and a separate head is attached for object and for part. The object, part, and material heads consist of 3x3 conv + classifier. (GAP is not used, because segmentation must not lose spatial information.)

 

- Material : only low-level features are used. The paper itself stresses that context information matters for material recognition, so using only low-level features seems a little odd (?).

(My guess is that tasks such as object and part, where shape information matters, need very different high-level information from material, so training material on all of the low- to high-level features would hurt the network's performance...)

 

- Texture : the features needed for texture are very different in nature from those learned for tasks such as scene and object, so only low-level features are taken from the backbone network. During training, each texture image is treated as if it were annotated pixel-wise (because in DTD samples the entire field of view consists of the labeled class). Also, no gradient is passed back to the backbone network (so it does not affect the backbone's training), and only the texture head, made of four connected 3x3 convs, is trained.
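The difference between the heads listed above can be sketched with plain lists. This is a simplified illustration with made-up scores: it skips the conv layers and the stop-gradient on the backbone, and only shows (a) how GAP turns score maps into one image-wise score per class, (b) how a segmentation head keeps a prediction per pixel, and (c) how an image-wise texture label can be expanded into a pixel-wise target. The function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def global_average_pool(score_maps: List[Grid]) -> List[float]:
    """Collapse each class score map (H x W) to one score (image-wise)."""
    pooled = []
    for score_map in score_maps:
        total = sum(sum(row) for row in score_map)
        pooled.append(total / (len(score_map) * len(score_map[0])))
    return pooled


def per_pixel_argmax(score_maps: List[Grid]) -> List[List[int]]:
    """Keep the spatial layout and pick the best class at every pixel."""
    h, w = len(score_maps[0]), len(score_maps[0][0])
    result = []
    for r in range(h):
        row = []
        for c in range(w):
            scores = [score_map[r][c] for score_map in score_maps]
            row.append(scores.index(max(scores)))
        result.append(row)
    return result


def expand_image_label(label: int, h: int, w: int) -> List[List[int]]:
    """Treat an image-wise label as if every pixel carried it."""
    return [[label] * w for _ in range(h)]


# Two classes on a 2 x 2 map (made-up scores).
scores = [
    [[0.9, 0.2], [0.8, 0.1]],  # class 0
    [[0.1, 0.8], [0.2, 0.9]],  # class 1
]

scene_scores = global_average_pool(scores)    # one score per class
seg_map = per_pixel_argmax(scores)            # one class id per pixel
texture_target = expand_image_label(5, 2, 3)  # 2 x 3 target filled with class 5

print(scene_scores)
print(seg_map)
print(texture_target)
```

In this toy example both classes end up with nearly the same pooled score even though each one clearly wins on half of the pixels, which is why the segmentation heads skip GAP.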

 

Pyramid Pooling Module (PPM)

https://mvje.tistory.com/33

 

Experiments

The experiments were run by adding the training data for the tasks one at a time. As you would expect, performance drops a little as the number of tasks grows. However, the drop is very small, which shows that visual tasks such as scene, object, part, and material can be performed effectively with a single network.

 

However, UperNet does not use the classification results of the tasks to complement one another. That feels like a missed opportunity, because scene and object, object and material, and material and texture are strongly correlated in the real world. Exploiting these relationships might create a synergy from learning the tasks together and actually improve the performance of each task.

 

์•„๋ž˜๋Š” ์‹œ๊ฐํ™”๋œ ์‹คํ—˜ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.

๋ฐ˜์‘ํ˜•