
[Paper Review] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation / DeepLab v3+ / The Basics of Semantic Segmentation

by 뭅즤 2022. 5. 15.

If object detection has YOLO, then in segmentation DeepLab seems to be the really famous one. This paper, presented at ECCV 2018, proposes DeepLabv3+. It teaches the key ingredients of segmentation, it is still widely used as a baseline in experiments, and it is a network I have used in my own research, so I want to write it up here.

 

Abstract

Spatial pyramid pooling modules and encoder-decoder structures are both used in deep neural networks for semantic segmentation. The former can encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple effective fields of view (FoV), while the latter can capture sharper object boundaries by gradually recovering spatial information. The paper proposes combining the advantages of both. The proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module that refines segmentation results, especially along object boundaries. The authors further explore the Xception model and apply depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, yielding a faster and stronger encoder-decoder network. The proposed architecture achieves SoTA performance on the PASCAL VOC 2012 and Cityscapes datasets.

 

From DeepLab V1 through V3+, the series adopted atrous convolution, Atrous Spatial Pyramid Pooling, depthwise separable convolution, and more. I'll go through them one at a time.

 

Atrous Convolution

Atrous convolution is a convolution that inserts gaps between the filter elements used in the computation. With the kernel size fixed at 3x3, a rate of 1 gives an ordinary convolution, while a rate of 6 places the sampled (red) pixels 6 apart, as the figure illustrates. As the rate grows, the holes inside the filter widen and the filter captures correlations over a larger area, but these are coarse, large-area correlations rather than fine detail. In other words, atrous convolution enlarges the receptive field while keeping the number of trainable parameters fixed, which helps a segmentation model pick up large-scale objects. The downside is that fine-grained information is reduced.
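
To make the rate concrete, here is a minimal sketch in plain Python, in 1-D rather than 2-D for brevity. The helper names `effective_kernel_size` and `atrous_conv1d` are my own illustrative choices, not from the paper or any library. It shows that a 3-tap filter at rate 6 spans 13 input positions while still having only 3 weights.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Input span covered by one filter application along a single axis."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, weights, rate=1):
    """Valid-mode 1-D atrous cross-correlation (no padding, stride 1).

    Hypothetical helper for illustration: filter taps sit `rate` samples apart.
    """
    span = effective_kernel_size(len(weights), rate)
    return [
        sum(w * signal[start + i * rate] for i, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


signal = list(range(10))  # 0, 1, ..., 9
weights = [1, 1, 1]       # 3 trainable weights regardless of the rate

dense = atrous_conv1d(signal, weights, rate=1)    # taps at offsets 0, 1, 2
dilated = atrous_conv1d(signal, weights, rate=2)  # taps at offsets 0, 2, 4

print(effective_kernel_size(3, 1), effective_kernel_size(3, 6))
print(dense)
print(dilated)
```

The dilated output is shorter because no padding is used: each application needs a wider window of the input, which is exactly the larger receptive field described above.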

 

 

Atrous Spatial Pyramid Pooling

DeepLabv3+ runs Atrous Spatial Pyramid Pooling (ASPP) at the end of the encoder, before handing information to the decoder. ASPP is Spatial Pyramid Pooling with atrous convolution added: it applies atrous convolutions with different rates to the feature map in parallel, concatenates the resulting feature maps, and fuses them with a 1x1 conv.

This makes it easier to pick up objects of different scales in an image. It is a pattern used more or less by default in object detection and segmentation.
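
The parallel-branches-then-fuse idea can be sketched as a toy 1-D example in plain Python. Everything here is an assumption for illustration: the names `atrous_conv1d_same` and `aspp_1d`, the shared branch weights, and the small rates. The real module works on 2-D feature maps with learned weights per branch and has further branches that this sketch omits.

```python
def atrous_conv1d_same(signal, weights, rate):
    """1-D atrous cross-correlation, zero-padded so the output length
    equals the input length (assumes an odd number of weights)."""
    half = (len(weights) // 2) * rate
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[pos + i * rate] for i, w in enumerate(weights))
        for pos in range(len(signal))
    ]


def aspp_1d(signal, weights, rates, fuse_weights):
    """Toy ASPP: one atrous branch per rate, then a 1x1 'conv' across branches."""
    branches = [atrous_conv1d_same(signal, weights, r) for r in rates]
    # Concatenate along a channel axis: one tuple of branch values per position.
    stacked = zip(*branches)
    # A 1x1 conv is a per-position weighted sum over the channels.
    return [
        sum(f * c for f, c in zip(fuse_weights, channels))
        for channels in stacked
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
out = aspp_1d(signal, weights=[1, 1, 1], rates=(1, 2, 3), fuse_weights=[1, 1, 1])
print(out)
```

Each output position mixes a narrow view (rate 1) with wider views (rates 2 and 3) of the same input, which is how the module sees several scales at once.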

 

 

Depthwise Separable Convolution

Depthwise separable convolution splits a standard convolution into two steps (Depthwise → Pointwise). Because it cuts the parameter count substantially, it is widely used to make networks lightweight.

 

For example, applying a standard 3x3 conv. that maps a 64*32*32 (C*H*W) input from 64 to 128 channels requires a convolution filter of size 3*3*64*128, which is 73,728 weights.

Depthwise separable convolution, however, splits the operation into a spatial step and a channel step, so the filter size becomes 3*3*64 + 64*128 = 8,768 weights, far fewer than the standard convolution.
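
The arithmetic above can be checked with a few lines of Python. This is just a sketch with function names of my own choosing, and it ignores bias terms as the example in the text does.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1x1 conv that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
ratio = standard / separable

print(standard, separable, round(ratio, 2))
```

For the 64 → 128 channel case this works out to roughly 8.4 times fewer weights.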

 

This makes the network much lighter, but because the spatial and channel correlations of the feature map are no longer extracted jointly in one step, its embedding quality tends to fall below that of a standard convolution.

 

In my view, any feature has a saturation point beyond which it is already sufficiently embedded. That threshold is hard to find in practice, though. For a feature that is already well embedded, further conv operations may be unnecessary and can even act as added noise, so I think lightweight structures such as depthwise separable convolution are the efficient choice once features are reasonably embedded. This can't be settled by looking at the network alone, because you also have to weigh the complexity of the task against the network. So it is hard to say exactly where lightweight layers belong. What matters is placing them with the trade-off between accuracy and computation in mind.

 

DeepLabv3+ 

DeepLabv3+ applies the pieces described above (atrous convolution, Atrous Spatial Pyramid Pooling, and depthwise separable convolution) to an encoder-decoder structure. Like other segmentation networks, it also passes an intermediate encoder feature to the decoder. This adds high-resolution information to the heavily encoded, low-resolution feature and so improves boundary prediction. V3 simply upsampled its output bilinearly, whereas V3+ upsamples the encoder output, reduces the channel count of the low-level encoder feature with a 1x1 conv, and concatenates the two (similar to U-Net).
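
To follow the tensor shapes through the decoder, here is a small shape-tracing sketch in Python. The function name `decoder_shapes` and all default values (input size, channel counts, strides, class count) are assumptions I picked for illustration, so treat the numbers as one plausible configuration rather than the definitive one.

```python
def decoder_shapes(input_hw=(512, 512), num_classes=21, aspp_channels=256,
                   low_level_channels=256, reduced_channels=48,
                   output_stride=16, low_level_stride=4):
    """Trace (C, H, W) shapes through a DeepLabv3+-style decoder.

    All defaults are illustrative assumptions, not values from this post.
    """
    h, w = input_hw
    enc_hw = (h // output_stride, w // output_stride)
    low_hw = (h // low_level_stride, w // low_level_stride)

    shapes = {}
    shapes["encoder_out"] = (aspp_channels, *enc_hw)
    # Bilinear upsampling brings the encoder output to the low-level resolution.
    shapes["encoder_upsampled"] = (aspp_channels, *low_hw)
    shapes["low_level"] = (low_level_channels, *low_hw)
    # A 1x1 conv shrinks the low-level feature so it does not dominate.
    shapes["low_level_reduced"] = (reduced_channels, *low_hw)
    # Concatenation stacks the two features along the channel axis.
    shapes["concat"] = (aspp_channels + reduced_channels, *low_hw)
    # Refinement convs and a classifier produce per-class logits.
    shapes["logits"] = (num_classes, *low_hw)
    # A final upsampling restores the input resolution.
    shapes["output"] = (num_classes, h, w)
    return shapes


shapes = decoder_shapes()
for stage, shape in shapes.items():
    print(f"{stage:18s} {shape}")
```

Concatenating instead of adding keeps the two sources separate until the refinement convs, at the cost of a wider tensor (304 channels under these assumed defaults).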

 

V3+ also uses a slightly modified Xception as the encoder. Xception builds on the Inception module, which applies conv filters of several sizes to a feature map in parallel and merges the results, and recasts the convolution as point-wise conv. → depth-wise conv. (the reverse order of depthwise separable conv). On top of this, V3+ replaces every max-pooling layer with a strided depthwise separable convolution and adds some extra BN and ReLU.

Using the modified Xception as the encoder improved performance by about 2%.

 

My Thoughts

Overall, this network performs well by neatly combining techniques for robustness across object scales with techniques for making the model lightweight. It pinpoints what matters in segmentation, so I found the paper very helpful.

 
