
[Paper Review] Efficient RGB-D Semantic Segmentation for Indoor Scene Analysis / Segmentation on RGB-D images

by 뭅즤 2022. 1. 12.

This paper was published at the 2021 International Conference on Robotics and Automation (ICRA). I review it here to introduce research on semantic segmentation from RGB + depth images.

 

A depth image encodes the distance from the observer (the camera). A spot that looks like an object boundary in the RGB image, because of lighting or shadows, can still appear as one continuous object in the depth image. We can therefore expect segmentation performance to improve when RGB and depth images are used together.

(The paper describes this as depth images providing complementary geometric information to RGB images.)

 

The simplest approach one can think of is to extract RGB and depth features with an rgb-encoder and a depth-encoder, then merge the features once, just before passing them to the decoder.

In the figure below, the RGB and depth images are fed into separate encoders, and the features extracted by the depth-encoder are passed over to the rgb-encoder at several intermediate layers, where RGB-D Fusion is performed.
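To make the data flow concrete, here is a minimal NumPy sketch of two encoders running in lockstep, with depth features injected into the RGB branch after every stage. It is not the paper's implementation: `encoder_stage` is a hypothetical stand-in (2x2 average pooling instead of real residual blocks), `fuse` is reduced to plain addition, and both inputs are assumed to already have the same number of channels.

```python
import numpy as np


def encoder_stage(feat):
    """Hypothetical stand-in for one encoder stage: 2x2 average pooling."""
    c, h, w = feat.shape
    return feat.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def fuse(rgb_feat, depth_feat):
    """Placeholder fusion: plain element-wise addition."""
    return rgb_feat + depth_feat


def two_branch_encoder(rgb, depth, num_stages=3):
    """Run both encoders stage by stage and fuse depth into RGB each time."""
    rgb_feat, depth_feat = rgb, depth
    for _ in range(num_stages):
        rgb_feat = encoder_stage(rgb_feat)
        depth_feat = encoder_stage(depth_feat)
        # The depth branch stays depth-only; only the RGB branch receives the sum.
        rgb_feat = fuse(rgb_feat, depth_feat)
    return rgb_feat


rgb = np.ones((4, 16, 16))    # assumed: already projected to 4 channels
depth = np.ones((4, 16, 16))
out = two_branch_encoder(rgb, depth)
print(out.shape)  # (4, 2, 2)
```

The point of the sketch is only the loop structure: fusion happens at several resolutions rather than once before the decoder.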

 

- RGB-D Fusion

The RGB and depth features each go through their own SE block for channel-wise attention, and the two results are then added element-wise. Since RGB and depth are encoded by different networks, the intent seems to be to calibrate the channels before merging, so that the information from both is combined in a balanced way.
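The fusion step described above can be sketched in NumPy as follows. This is a rough illustration under my own assumptions, not the authors' code: the weights are random placeholders produced by a hypothetical `init_se_params` helper, the reduction ratio of 4 is arbitrary, and feature maps are single (C, H, W) arrays without a batch dimension.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(feat, w1, b1, w2, b2):
    """Squeeze-and-Excitation on a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r).
    """
    squeezed = feat.mean(axis=(1, 2))             # squeeze: global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)  # excitation, first FC layer + ReLU
    scale = sigmoid(w2 @ hidden + b2)             # per-channel weights in (0, 1)
    return feat * scale[:, None, None]            # reweight every channel


def rgbd_fusion(rgb_feat, depth_feat, rgb_params, depth_params):
    """Recalibrate each modality with its own SE block, then add element-wise."""
    return se_block(rgb_feat, *rgb_params) + se_block(depth_feat, *depth_params)


def init_se_params(rng, channels, reduction=4):
    """Hypothetical helper: random weights where a trained network would have learned ones."""
    hidden = channels // reduction
    return (
        rng.standard_normal((hidden, channels)) * 0.1,
        np.zeros(hidden),
        rng.standard_normal((channels, hidden)) * 0.1,
        np.zeros(channels),
    )


rng = np.random.default_rng(0)
C, H, W = 8, 4, 6
rgb_feat = rng.standard_normal((C, H, W))
depth_feat = rng.standard_normal((C, H, W))
fused = rgbd_fusion(rgb_feat, depth_feat, init_se_params(rng, C), init_se_params(rng, C))
print(fused.shape)  # (8, 4, 6)
```

Note that each modality gets separate SE parameters, so the network can learn a different channel weighting for RGB and for depth before the sum.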

 

- Context Module

Similar to the Pyramid Pooling Module of PSPNet, this module uses several branches to aggregate features at different scales.
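As a rough sketch of the pyramid-pooling idea, the NumPy code below pools a feature map to a few coarse grids, upsamples each back to the input size, and concatenates everything along the channel axis. The bin sizes `(1, 2, 4)` and nearest-neighbour upsampling are my assumptions for illustration, and the per-branch 1x1 convolutions that a real module would have are left out.

```python
import numpy as np


def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) map down to (C, bins, bins)."""
    c, h, w = feat.shape
    out = np.empty((c, bins, bins))
    for i in range(bins):
        r0 = (i * h) // bins              # floor of the window start
        r1 = -(-((i + 1) * h) // bins)    # ceil of the window end
        for j in range(bins):
            c0 = (j * w) // bins
            c1 = -(-((j + 1) * w) // bins)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out


def upsample_nearest(feat, size):
    """Nearest-neighbour upsampling of (C, h, w) to (C, size[0], size[1])."""
    _, h, w = feat.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return feat[:, rows[:, None], cols[None, :]]


def context_module(feat, bin_sizes=(1, 2, 4)):
    """Concatenate the input with pooled-and-upsampled copies at several scales."""
    _, h, w = feat.shape
    branches = [feat]
    for bins in bin_sizes:
        pooled = adaptive_avg_pool(feat, bins)
        branches.append(upsample_nearest(pooled, (h, w)))
    return np.concatenate(branches, axis=0)


rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 8, 8))
ctx = context_module(feat)
print(ctx.shape)  # (12, 8, 8)
```

The branch with a single bin carries the global average of each channel, while the finer grids keep progressively more local context.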

 

To reduce computation, the ResNet basic block is replaced with a spatially factorized version (NBt1D). As with MobileNet, the goal is a lighter model: each 3x3 conv is decomposed into a 3x1 conv followed by a 1x3 conv, a block design first proposed in ERFNet.
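A quick back-of-the-envelope count shows where the savings come from. The snippet assumes bias-free convolutions with stride 1 and "same" padding, so the output size equals the input size; the 64 channels and the 60x80 feature map are arbitrary numbers I picked, not values from the paper.

```python
def conv_cost(kernel_h, kernel_w, c_in, c_out, out_h, out_w):
    """Weights and multiply-accumulates of one bias-free convolution."""
    params = kernel_h * kernel_w * c_in * c_out
    macs = params * out_h * out_w
    return params, macs


c, h, w = 64, 60, 80  # assumed channel count and feature-map size

# One full 3x3 convolution.
full = conv_cost(3, 3, c, c, h, w)

# A 3x1 convolution followed by a 1x3 convolution.
fact = tuple(
    a + b for a, b in zip(conv_cost(3, 1, c, c, h, w), conv_cost(1, 3, c, c, h, w))
)

print(full)               # (36864, 176947200)
print(fact)               # (24576, 117964800)
print(fact[0] / full[0])  # 0.666...
```

With equal input and output channels the factorized pair needs 6·C² weights instead of 9·C², so both parameters and multiply-accumulates drop by one third, independent of the channel count.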

ESANet

 

- Experimental Results

 

My thoughts

The network sensibly combines several existing methods to encode both RGB and depth images for semantic segmentation, but it still has the drawback of needing two encoders in exchange for a slight performance gain.

 

Also, the feature fusion module simply uses SE blocks, and I doubt whether this approach really merges RGB and depth information in a balanced, appropriate way.

(It feels like the balancing is simply left to the network. The ablation study does show a gain from using SE blocks, but an SE block is an attention module, so it tends to give a slight gain wherever it is attached..)
