๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
๐Ÿ› Research/Detection & Segmentation

[๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

by ๋ญ…์ฆค 2022. 8. 9.
๋ฐ˜์‘ํ˜•

๋ณธ ๋…ผ๋ฌธ์€ NeurIPS 2021 ์— ๊ณต๊ฐœ๋˜์—ˆ๊ณ , ์‹ฌํ”Œํ•˜๊ณ  ๊ฐ•๋ ฅํ•œ semantic segmentation task ์šฉ Transformer ์ธ SegFormer ๋ฅผ ์ œ์•ˆํ•˜๋Š” ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค.

 

Abstract

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํšจ์œจ์ ์ธ Segmentation task ์ˆ˜ํ–‰์„ ์œ„ํ•œ ๊ฐ„๋‹จํ•˜๊ณ  ํšจ์œจ์ ์ด๋ฉด์„œ ๊ฐ•๋ ฅํ•œ semantic segmentation ํ”„๋ ˆ์ž„์›Œํฌ์ธ SegFormer ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. SegFormer ๋Š” 1) multi-scale feature ๋ฅผ ์ถ”์ถœํ•˜๋Š” ์ƒˆ๋กœ์šด hierarchically structured Transformer encoder ๋กœ ๊ตฌ์„ฑ๋˜๊ณ , positional encoding์ด ํ•„์š”ํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ํ…Œ์ŠคํŠธ ์ด๋ฏธ์ง€์˜ ํ•ด์ƒ๋„๊ฐ€ ํ•™์Šต ์ด๋ฏธ์ง€์˜ ํ•ด์ƒ๋„์™€ ๋‹ค๋ฅผ ๋•Œ ์„ฑ๋Šฅ์ด ์ €ํ•˜๋˜๋Š” positional code์˜ interpolation์„ ํ”ผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 2) ๋˜ํ•œ SegFormer ๋Š” ๋ณต์žกํ•˜์ง€ ์•Š์€ ๊ฐ„๋‹จํ•œ MLP ๊ตฌ์กฐ์˜ decoder๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ์ œ์•ˆ๋œ MLP decoder ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๊ณ„์ธต์˜ ์ •๋ณด๋ฅผ ์ง‘๊ณ„ํ•˜์—ฌ ๊ฐ•๋ ฅํ•œ representation ์„ ๋ Œ๋”๋งํ•˜๊ธฐ ์œ„ํ•ด local attention ๊ณผ global attention ์„ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค. 

 

์•„๋ž˜ figure๋ฅผ ๋ณด๋ฉด Ade20k ๋ฐ์ดํ„ฐ์…‹์—์„œ SegFormer ๋ชจ๋ธ์ด ๋‹ค๋ฅธ ๋ชจ๋ธ์— ๋น„ํ•ด ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋Š” ์ ๊ณ  mIOU๋Š” ์›”๋“ฑํžˆ ๋†’์€ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋ณธ ์—ฐ๊ตฌ์˜ novelty ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • A hierarchical Transformer encoder that requires no positional encoding
  • A lightweight All-MLP decoder that yields powerful representations without complex, computationally heavy modules
  • SegFormer achieves SoTA efficiency, accuracy, and robustness on three semantic segmentation benchmark datasets

 

Method

 

Figure 2 ์— ํ‘œ์‹œ๋œ ๊ฒƒ์ฒ˜๋Ÿผ SegFormer๋Š” Hierarchical Transformer Encdoer ์™€ Lightweight All-MLP Decoder ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ํฌ๊ธฐ๊ฐ€ H x W x 3 ์ธ ์ž…๋ ฅ ์ด๋ฏธ์ง€๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ๋จผ์ € 4x4 ํฌ๊ธฐ์˜ ํŒจ์น˜๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค. 16x16 ํŒจ์น˜๋กœ ๋‚˜๋ˆ„๋Š” ViT ์— ๋น„ํ•ด ์„ธ๋ถ„ํ™”๋œ ํŒจ์น˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด segmentation ์— ์œ ๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ด ํŒจ์น˜๋“ค์„ ์ธ์ฝ”๋”์˜ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ์›๋ณธ ์ด๋ฏธ์ง€์˜ ํ•ด์ƒ๋„๊ฐ€ {1/4, 1/8, 1/16, 1/32} ์ธ multi-level  feature ๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ multi-level feature ๋ฅผ ๋””์ฝ”๋”์— ์ „๋‹ฌํ•˜์—ฌ (H/4) x (W/4) x Ncls ํ•ด์ƒ๋„๋กœ segmentation mask ๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. (Ncls : ์นดํ…Œ๊ณ ๋ฆฌ ์ˆ˜) 

 

Hierarchical Transformer Encoder

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์•„ํ‚คํ…์ฒ˜๋Š” ๋™์ผํ•˜์ง€๋งŒ ํฌ๊ธฐ๊ฐ€ ๋‹ค๋ฅธ MiT(Mix Transformer Encoder) ์‹œ๋ฆฌ์ฆˆ MiT-B0 ~ MiT-B5 ๋ฅผ ์„ค๊ณ„ํ•ฉ๋‹ˆ๋‹ค. 

 

 

# Hierarchical Feature Representation

ViT ์™€ ๋‹ฌ๋ฆฌ SegFormer ์˜ ์ธ์ฝ”๋”๋Š” ์ž…๋ ฅ ์ด๋ฏธ์ง€๊ฐ€ ์ฃผ์–ด์ง€๋ฉด multi-level multi-scale feature๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ์ด๋Ÿฌํ•œ feature๋Š”  segmenation ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ณ ํ•ด์ƒ๋„์˜ coarse ํ•œ feature์™€ ์ €ํ•ด์ƒ๋„์˜ fine-grained ํ•œ feature ๋ฅผ ๋ชจ๋‘ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. 

 

 

# Efficient Self-Attention

๊ธฐ์กด์˜ multi-head self-attention ์—์„œ ๊ฐ head Q, K, V ๋Š” ๋™์ผํ•œ dimension NxC ๋ฅผ ๊ฐ€์ง€๋ฉฐ, ์—ฌ๊ธฐ์„œ N = H x W ์€ ์‹œํ€€์Šค์˜ ๊ธธ์ด์ด๋ฉฐ self-attention์€ ์•„๋ž˜์˜ ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. 

๊ธฐ์กด Multi-Head Self-Attention

 

SegFormer์˜ ์ธ์ฝ”๋”๋Š” 4x4 ํฌ๊ธฐ์˜ ํŒจ์น˜๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ViT ์— ๋น„ํ•ด ํŒจ์น˜ ์ˆ˜๊ฐ€ ๋งŽ์•„์ง€๊ณ  self-attention ์—ฐ์‚ฐ์ด ๋งค์šฐ ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค. ์ €์ž๋Š” ์ด๋ฅผ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด  sequence reduction process ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” reduction ratiro R ๋ฅผ ๋ฏธ๋ฆฌ ์ง€์ •ํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‹œํ€€์Šค์˜ ๊ธธ์ด๋ฅผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. 

 

K ๋Š” ๊ธธ์ด๋ฅผ ์ค„์ผ ์‹œํ€€์Šค์ด๊ณ , Reshape() ์€ (N/R) x (C·R) ์˜ ํ˜•ํƒœ๋กœ K๋ฅผ reshape ํ•˜๊ณ , Linear() ๋Š”  Cin ์ฑ„๋„ ์‚ฌ์ด์ฆˆ๋ฅผ ์ž…๋ ฅ์œผ๋กœ Cout ์ฑ„๋„ ์‚ฌ์ด์ฆˆ๋กœ ์ถœ๋ ฅํ•˜๋Š” linear layer ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ƒˆ๋กœ์šด K์˜ ์ฐจ์›์€ (N/R) x C ์ด๊ณ , ๊ฒฐ๊ณผ์ ์œผ๋กœ self-attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ๋ณต์žก์„ฑ์„ O(N^2) ์—์„œ O(N^2/R) ๋กœ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” R์„ stage-1์—์„œ 4๊นŒ์ง€ [64, 16, 4,1] ๋กœ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. 

 

* ๊ฐ„๋‹จํ•˜๊ฒŒ ์ƒ๊ฐํ•˜๋ฉด linear layer ๋กœ ์ฑ„๋„ ์ˆ˜๋ฅผ ์ค„์ธ ๋‹ค์Œ์— multi-head self-attention์„ ์ˆ˜ํ–‰ํ•œ๋‹จ ๋ง์ธ๋ฐ, ๊ต‰์žฅํžˆ ์žฅํ™ฉํ•˜๊ฒŒ ์„ค๋ช…๋˜์–ด ์žˆ๋Š” ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

Sequence Reduction Process

 

# Overlapped Patch Merging

์ž…๋ ฅ ์ด๋ฏธ์ง€๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ViT์˜ patch merging ํ”„๋กœ์„ธ์Šค๋Š” N x N x 3 ํŒจ์น˜๋ฅผ 1 x 1 x C ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๊ฒƒ์€ 2 x 2 x Ci ํฌ๊ธฐ์˜ feature ๋ฅผ 1 x 1 x Ci+1 ๋ฒกํ„ฐ๋กœ ํ†ตํ•ฉํ•˜์—ฌ ๊ณ„์ธต์ ์ธ feature map์„ ์–ป๊ธฐ ์œ„ํ•ด ํ™•์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ์•„๋ž˜ ์‹์ฒ˜๋Ÿผ F1 ์—์„œ F2 ๋กœ ์ถ•์†Œํ•˜๊ณ  ๋‹ค์Œ ๊ณ„์ธต์˜ ์–ด๋–ค ๋‹ค๋ฅธ feature map์— ๋Œ€ํ•ด์„œ๋„ ๋ฐ˜๋ณตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ์ด๋Ÿฌํ•œ ํ”„๋กœ์„ธ์Šค๋Š” ์ฒ˜์Œ์— ๊ฒน์น˜์ง€ ์•Š๋Š” ์ด๋ฏธ์ง€ ๋˜๋Š” feature patch๋ฅผ ๊ฒฐํ•ฉํ•˜๋„๋ก ์„ค๊ณ„๋œ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ํ•ด๋‹น ํŒจ์น˜ ์ฃผ๋ณ€์˜ ๋กœ์ปฌ ์—ฐ์†์„ฑ์„ ์œ ์ง€ํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค. 

 

๋•Œ๋ฌธ์— SegFormer ์—์„œ๋Š” Overlapped Patch Merging ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด K = ํŒจ์น˜ ํฌ๊ธฐ, S = ์ธ์ ‘ํ•œ ๋‘ ํŒจ์น˜ ์‚ฌ์ด์˜ stride, P = padding ํฌ๊ธฐ ๋ฅผ ์ •์˜ํ•˜์—ฌ patch merging์„ ์ค‘์ฒฉ๋˜๋Š” ์˜์—ญ์ด ์กด์žฌํ•˜๋„๋ก ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์‹คํ—˜์—์„œ๋Š” overlapping patch merging ์ˆ˜ํ–‰์‹œ non-overlapping ํ”„๋กœ์„ธ์Šค์™€ ๋™์ผํ•œ ์‚ฌ์ด์ฆˆ์˜ feature๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด  K = 7, S = 4, P =3 ๋ฐ K = 3, S = 2, P = 1 ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. 

 

์ •๋ฆฌํ•˜๋ฉด ๊ธฐ์กด์˜ patch merging ๋ฐฉ๋ฒ•์—์„œ๋Š” ํŒจ์น˜ ๊ฐ„์˜ overlap ๋˜๋Š” ๋ถ€๋ถ„์ด ์—†์—ˆ๋Š”๋ฐ ์ด๋Š” ๋กœ์ปฌ ์—ฐ์†์„ฑ์„ ์œ ์ง€ํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์— stride์™€ padding ์‚ฌ์ด์ฆˆ๋ฅผ ์ •ํ•ด์„œ ํŒจ์น˜๋ฅผ overlap ์‹œ์ผœ์„œ ์ถ”์ถœํ•œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค. 

 

 

# Positional-Encoding-Free Design

ViT ์—์„œ Positional Encoding (PE) ์˜ ํ•ด์ƒ๋„๋Š” ๊ณ ์ •๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ…Œ์ŠคํŠธ ํ•ด์ƒ๋„๊ฐ€ ํ•™์Šต ์‹œ์™€ ๋‹ค๋ฅธ ๊ฒฝ์šฐ PE ๋ฅผ interpolation ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” segmentation task ์—์„œ ํ•ด์ƒ๋„๊ฐ€ ๋‹ค๋ฅธ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— ์„ฑ๋Šฅ ์ €ํ•˜๋กœ ์ด์–ด์ง‘๋‹ˆ๋‹ค. 

 

SegFormer ์—์„œ๋Š” Feed Forward Network (FFN) ์—์„œ 3x3 Conv ๋ฅผ ์ง์ ‘ ์‚ฌ์šฉํ•˜์—ฌ leak location information์— ๋Œ€ํ•œ zero padding ์— ๋Œ€ํ•œ ์˜ํ–ฅ์„ ๊ณ ๋ คํ•˜๋Š” Mix-FFN ์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.

xin ์€ self-attention ๋ชจ๋“ˆ์˜ feature ์ด๊ณ  Mix-FFN ์€ 3x3 Conv ์™€ MLP ๋ฅผ ๊ฐ FFN์— ํ˜ผํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ 3x3 Conv ๊ฐ€ Transformer ์— ๋Œ€ํ•œ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜๊ธฐ์— ์ถฉ๋ถ„ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค(์‹คํ—˜ ๊ฒฐ๊ณผ). ๋˜ํ•œ ํšจ์œจ์„ฑ์„ ์œ„ํ•ด depth-wise convolution์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. 

 

์ด์™€ ๊ฐ™์€ ์„ค๊ณ„๋กœ ์ €์ž๋Š” semantic segmentation ์ž‘์—…์—์„œ feature map ์— PE ๋ฅผ ์ถ”๊ฐ€ํ•  ํ•„์š”๊ฐ€ ์—†๋‹ค๊ณ  ์ฃผ์žฅํ•ฉ๋‹ˆ๋‹ค.

 


Lightweight All-MLP Decoder

SegFormer ๋Š” MLP ๋ ˆ์ด์–ด๋กœ๋งŒ ๊ตฌ์„ฑ๋œ ๊ฒฝ๋Ÿ‰ ๋””์ฝ”๋”๋ฅผ ํ†ตํ•ฉํ•˜๋ฉฐ ์ด๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์—์„œ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ์ˆ˜์ž‘์—… ๋ฐ ๊ณ„์‚ฐ ์š”๊ตฌ ์‚ฌํ•ญ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฐ„๋‹จํ•œ ๋””์ฝ”๋”๋ฅผ ์„ค๊ณ„ํ•˜๋Š” ํ•ต์‹ฌ์€ ๊ณ„์ธต์  ํŠธ๋žœ์Šคํฌ๋จธ ์ธ์ฝ”๋”๊ฐ€ ๊ธฐ์กด CNN ์ธ์ฝ”๋”๋ณด๋‹ค ๋” ํฐ Effective Receptive Field (ERF) ๋ฅผ ๊ฐ–๋Š”๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. 

 

์ œ์•ˆํ•˜๋Š” ALL-MLP ๋””์ฝ”๋”๋Š” ํฌ๊ฒŒ ์•„๋ž˜์™€ ๊ฐ™์€ 4 ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.

  1. The multi-level features Fi from the MiT encoder are unified to the same channel size via an MLP
  2. The features are upsampled to 1/4 of the original image size and concatenated
  3. An MLP reduces the concatenated, 4x-larger channels back to the unified channel size
  4. The segmentation mask is predicted
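The four steps above can be sketched shape-by-shape in NumPy (random matrices stand in for the learned MLPs; the MiT-B0-style channel sizes and the nearest-neighbor upsampling via np.repeat are simplifications):

```python
import numpy as np

H, W, C_unified, n_cls = 256, 256, 64, 19              # assumed sizes
stage_channels = [32, 64, 160, 256]                    # MiT-B0-style stage channels
strides = [4, 8, 16, 32]

fused = []
for C_i, s in zip(stage_channels, strides):
    F_i = np.random.randn(H // s, W // s, C_i)         # multi-level encoder feature
    F_i = F_i @ np.random.randn(C_i, C_unified)        # 1) unify channel size with an MLP
    r = s // 4                                         # 2) upsample to 1/4 of input size
    F_i = F_i.repeat(r, axis=0).repeat(r, axis=1)
    fused.append(F_i)

F = np.concatenate(fused, axis=-1)                     # 2) concat -> (H/4, W/4, 4*C_unified)
F = F @ np.random.randn(4 * C_unified, C_unified)      # 3) MLP back to unified channels
mask = F @ np.random.randn(C_unified, n_cls)           # 4) predict the segmentation mask
print(mask.shape)                                      # (64, 64, 19)
```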

 

 

# Effective Receptive Field Analysis

Semantic Segmentation ์—์„œ๋Š” context ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋„๋ก ํฐ receptive field ๋ฅผ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์€ ํ•ต์‹ฌ ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ํˆด์„ ์‚ฌ์šฉํ•˜์—ฌ MLP ๋””์ฝ”๋” ์„ค๊ณ„๊ฐ€ ํŠธ๋žœ์Šคํฌ๋จธ์—์„œ ์™œ ๊ทธ๋ ‡๊ฒŒ ํšจ๊ณผ์ ์ธ์ง€ ์‹œ๊ฐํ™”ํ•˜์—ฌ ํ•ด์„ํ•ฉ๋‹ˆ๋‹ค. Figure 3 ์—์„œ DeepLabv3+ ์™€ SegFormer ์— ๋Œ€ํ•œ ์ธ์ฝ”๋” ์Šคํ…Œ์ด์ง€์™€ ๋””์ฝ”๋” ํ•ด๋“œ์˜ ERF ๋ฅผ ์‹œ๊ฐํ™” ํ•ฉ๋‹ˆ๋‹ค. 

 

์œ„์˜ Figure 3 ์„ ๋ณด๋ฉด DeepLabv3+ ์˜ ERF ๋Š” ๊ฐ€์žฅ ๊นŠ์€ stage-4 ์—์„œ๋„ ์ƒ๋Œ€์ ์œผ๋กœ ์ž‘๊ณ , SegFormer ์˜ ์ธ์ฝ”๋”๋Š” ํ•˜์œ„ stage ์—์„œ local attention์„ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ƒ์„ฑํ•˜๋Š” ๋™์‹œ์— stage-4 ์—์„œ context๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์บก์ฒ˜ํ•˜๋Š” non-local attention ์„ ์ถœ๋ ฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์คŒ์ธ ํŒจ์น˜์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด MLP ํ•ด๋“œ(ํŒŒ๋ž€์ƒ‰ ์ƒ์ž) ์˜ ERF๋Š” stage-4(๋นจ๊ฐ„์ƒ‰ ์ƒ์ž)์™€ ๋‹ค๋ฅด๋ฉฐnon-local attention ๋ณด๋‹ค local-attention์ด ๋” ๊ฐ•ํ•ฉ๋‹ˆ๋‹ค.

 

CNN ์˜ ์ œํ•œ๋œ receptive field ๋Š” receptive field ๋ฅผ ํ™•์žฅํ•˜๊ธด ํ•˜์ง€๋งŒ ํ•„์—ฐ์ ์œผ๋กœ ๋ฌด๊ฑฐ์›Œ์ง€๋Š” ASPP ์™€ ๊ฐ™์€ context ๋ชจ๋“ˆ์— ์˜์กดํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ €์ž๋Š” SegFormer ์˜ ๋””์ฝ”๋” ์„ค๊ณ„๊ฐ€ non-local attention ์„ ํ™œ์šฉํ•˜๊ณ  ๋ณต์žกํ•˜์ง€ ์•Š์œผ๋ฉด์„œ ๋” ํฐ receptive field๋ฅผ ๊ฐ€์ง„๋‹ค๊ณ  ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ SegFormer ์˜ ๋””์ฝ”๋” ๋””์ž์ธ์€ ๋ณธ์งˆ์ ์œผ๋กœ highly local / non-local attention ์„ ๋ชจ๋‘ ์ƒ์„ฑํ•˜๋Š” ํŠธ๋žœ์Šคํฌ๋จธ์˜ ์ด์ ์„ ์ทจํ•ฉ๋‹ˆ๋‹ค. 

 

 

๋ฐ˜์‘ํ˜•

 

Experiments

 

 

์ •๋ฆฌ

SegFormer ๋Š” Transformer ์˜ Encoder ์™€ Decoder ์—์„œ ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•๋“ค์„ ์‚ฌ์šฉํ•˜์—ฌ Semantic Segmentation ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋„คํŠธ์›Œํฌ์ž…๋‹ˆ๋‹ค. 

๋ณธ ๋…ผ๋ฌธ์—์„œ ์ฆ๋ช…๋œ ์‚ฌ์‹ค ์ค‘ Transformer ๋ฅผ semantic segmentation ์— ์ด์šฉํ•  ๋•Œ ์ฐธ๊ณ ํ•˜๋ฉด ์ข‹์€ ๋‚ด์šฉ๋“ค์ด ๋งŽ์•„ ์ˆ™์ง€ํ•ด๋‘๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. 

 

Hierarchical Transformer Encoder: semantic segmentation performance ↑ + computational efficiency ↑

  • Hierarchical Feature Representation: multi-scale/level features → semantic segmentation performance ↑
  • Efficient Self-Attention: sequence reduction → computational efficiency ↑
  • Overlapped Patch Merging: preserves local continuity → semantic segmentation performance ↑
  • Positional-Encoding-Free Design: no PE → no performance drop even when train/test resolutions differ

Lightweight All-MLP Decoder: no complex computation → computational efficiency ↑

๋ฐ˜์‘ํ˜•