
[Paper Review] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows / An Improved Form of ViT

by ๋ญ…์ฆค, January 8, 2022
๋ฐ˜์‘ํ˜•

NLP ๋ถ„์•ผ์—์„œ ์ด์Šˆ๊ฐ€ ๋˜์—ˆ๋˜ transformer('Attention Is All You Need/NIPS2017')๊ตฌ์กฐ๋ฅผ vision task์— ์ ‘๋ชฉํ•œ Vision Transformer(ViT)์™€ ViT์—์„œ ๊ฐœ์„ ๋œ ๊ตฌ์กฐ์ธ Swin Transformer์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

 

* Papers

A.    AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE / ICLR2021

B.    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows / ICCV2021

 

1.    Vision Transformer (ViT)

In computer vision, self-attention was previously applied at the bottleneck of a CNN (e.g., the non-local network), but ViT shows that applying a transformer encoder to a sequence of image patches achieves excellent performance on image classification. The architecture of ViT is described below.

1.1  Architecture of ViT

1.1.1  Image to Patches

Figure 1. Example of Image to Patches

The input image in this example is 48x48 RGB data. It is cut into non-overlapping 16x16 patches, producing 9 patches in total. (x: the image, x_p: a p x p patch)
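Below is a minimal PyTorch sketch of this patch splitting (not the paper's code), using the example dimensions above (a 48x48 RGB image and 16x16 patches):

```python
import torch

# Split a 48x48 RGB image into nine non-overlapping 16x16 patches.
x = torch.randn(1, 3, 48, 48)                # (batch, channels, H, W)
p = 16                                       # patch size
patches = x.unfold(2, p, p).unfold(3, p, p)  # (1, 3, 3, 3, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5)  # (batch, grid_h, grid_w, C, p, p)
patches = patches.reshape(1, -1, 3 * p * p)  # (1, 9, 768): 9 flattened patches
```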

 

1.1.2  Linear Projection

์ƒ์„ฑ๋œ patch๋“ค์€ linear projection์„ ํ†ตํ•ด 1-d vector๋กœ embedding๋˜๊ณ (16x16x3 = 768 -> 768), ์ด๋“ค์„ patch embedding์ด๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค.

 

1.1.3  Class token and Position embedding

Class token์€ ๋ชจ๋“  patch๊ฐ„์˜ attention์ด ์ˆ˜ํ–‰๋œ ์ •๋ณด๊ฐ€ ํฌํ•จ๋œ output์„ ์ถœ๋ ฅํ•˜๊ธฐ ์œ„ํ•œ ์ˆ˜๋‹จ์ด๋ฉฐ, position embedding์€ patch์˜ ์œ„์น˜์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ๋Š” embedding์ž…๋‹ˆ๋‹ค. ์ด๋“ค์€ ๋ชจ๋‘ patch embedding๊ณผ ๊ฐ™์€ ์ฐจ์›์ธ 768์ฐจ์›์ด๋ฉฐ, ์ฒซ ๋ฒˆ์งธ input์€ class token + position embedding ์ด๊ณ , ๋‚˜๋จธ์ง€ input๋“ค์€ ๊ฐ๊ฐ์˜ patch embedding + position embedding ์ž…๋‹ˆ๋‹ค. ๋•Œ๋ฌธ์— ์˜ˆ์‹œ ์ด๋ฏธ์ง€์ธ 48x48 ์‚ฌ์ด์ฆˆ์˜ input์€ 9๊ฐœ์˜ patch๋กœ ๋‚˜๋ˆ„์–ด์ง€๊ณ , class token์ด ์ถ”๊ฐ€๋˜์–ด ์ด 10๊ฐœ์˜ 768์ฐจ์›์˜ transformer encoder input(zlayer,sequence :  z0,0,z0,1,…,z0,9 )์ด ์ค€๋น„๋ฉ๋‹ˆ๋‹ค.

 

1.1.4  Transformer Encoder : Multi-head Self Attention (MSA)

ViT์—์„œ๋Š” NLP transformer์™€ ๋‹ค๋ฅด๊ฒŒ layer normalization์˜ ์œ„์น˜๊ฐ€ multi-head attention์˜ ์•ž์ชฝ์— ์œ„์น˜ํ•ฉ๋‹ˆ๋‹ค. Multi-head attention์˜ ํ•˜๋‚˜์˜ head์—์„œ๋Š” input(patch embedding)์— ๊ฐ๊ฐ์˜ weight๋ฅผ ์ทจํ•ด Query(Q), Key(K), Value(V) ๋กœ embedding ์‹œํ‚ค๊ณ (768 -> 64 size) Q์™€ K์˜ dot product์˜ softmax๋กœ similarity๋ฅผ ๊ตฌํ•˜๊ณ  V๋ฅผ ๊ณฑํ•ด self attention ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. Multi-head ์ด๋ฏ€๋กœ ์ด๋Ÿฌํ•œ ์—ฐ์‚ฐ์„ ๋ณ‘๋ ฌ๋กœ ์—ฌ๋Ÿฌ ๊ฐœ(ํ˜„์žฌ ์˜ˆ์‹œ์—์„œ๋Š” 12๊ฐœ) ์ˆ˜ํ–‰ํ•ด์„œ 64d * 12 = 768d์˜ tensor๊ฐ€ ์ถœ๋ ฅ๋˜๋ฏ€๋กœ encoder์˜ input ์‚ฌ์ด์ฆˆ์™€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

 

1.1.5  Transformer Encoder : MLP

The attention output is added to the encoder input (a skip connection), then passed through layer normalization and an MLP (768 -> 3072 -> 768).

Because both multi-head attention and the MLP preserve the input size, skip connections are straightforward, and stacking several of these transformer encoder blocks (12 in this example) makes the network deep.
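Putting the two sublayers together, a pre-LN encoder block might look like the following sketch (using PyTorch's built-in nn.MultiheadAttention in place of the hand-rolled head above):

```python
import torch.nn as nn

# One pre-LN transformer encoder block: LayerNorm before MSA and before
# the MLP, with a skip connection around each sublayer.
class EncoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_mlp=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # skip connection 1
        z = z + self.mlp(self.ln2(z))                     # skip connection 2
        return z
```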

 

1.1.6  MLP Head and classification

After the 12 transformer encoder layers, z_{12,0} (the 0th sequence position of the 12th layer's output; the 0th position corresponds to the class token, while the remaining positions are embeddings of individual patches, so the class token is used to represent an embedding of the whole image) is passed through an MLP to perform the classification task.
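Continuing the sketch (the 1000-way head is an assumption for ImageNet-style classification):

```python
import torch
import torch.nn as nn

# Stack 12 encoder blocks, normalize the output, and classify from the
# class-token position z[:, 0] with a linear head.
blocks = nn.Sequential(*[EncoderBlock() for _ in range(12)])  # from above
norm, head = nn.LayerNorm(768), nn.Linear(768, 1000)

z0 = torch.randn(1, 10, 768)            # stand-in encoder input
logits = head(norm(blocks(z0))[:, 0])   # (1, 1000) class scores
```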

Figure 2. Architecture of Vision Transformer (ViT)

 

1.2  Discussion

๊ฒฐ๋ก ์ ์œผ๋กœ, ViT๋Š” input image๋ฅผ ๊ฒน์น˜์ง€ ์•Š๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ patch๋“ค๋กœ ๋‚˜๋ˆ„๊ณ , ๊ฐ patch๋“ค์— position embedding์„ ํ†ตํ•ด ๊ณต๊ฐ„์ •๋ณด๋ฅผ ์œ ์ง€ํ•œ ์ƒํƒœ๋กœ Multi-head Self Attention(non-local operation)์„ ์Œ“์•„์„œ classification์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋„คํŠธ์›Œํฌ์ž…๋‹ˆ๋‹ค.

Compared to a CNN (whose inductive bias is locality), the transformer has a weak inductive bias, so its performance is poor when data is scarce. However, after pre-training on a massive dataset, it transfers very well.

์ฆ‰, CNN์€ localํ•œ ์ •๋ณด๊ฐ€ ์ค‘์š”ํ•˜๋‹ค๋Š” ์ ์„ ์ด์šฉํ•˜๊ฒŒ ๋˜๋ฏ€๋กœ ์—ฌ๋Ÿฌ vision task์—์„œ ์ด์ ์„ ๊ฐ€์ง€์ง€๋งŒ, ํŠน์ • ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ ์ด๋ฏธ์ง€์˜ localํ•œ ๋ถ€๋ถ„๋ณด๋‹ค๋Š” global ํ•œ context๊ฐ€ ์ค‘์š”ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. Transformer๋Š” ํ•™์Šต์˜ ์ž์œ ๋„๊ฐ€ ๋†’๊ณ  locality๊ฐ€ ๊ฐ•์กฐ๋˜๋Š” ๊ตฌ์กฐ๊ฐ€ ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ ์€ dataset์œผ๋กœ๋Š” ํ•™์Šต์ด ํž˜๋“ค ์ˆ˜ ์žˆ์ง€๋งŒ, ๋†’์€ ์ž์œ ๋„๋ฅผ ๊ฐ€์ง€๊ณ  ํ•™์Šต์„ ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฐฉ๋Œ€ํ•œ dataset์—์„œ ๋” ํฐ ์ด์ ์„ ๊ฐ€์ง„๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

2.  Swin Transformer

2.1  Introduction

ViT grafted the transformer architecture onto vision tasks, but it does not account for the variation in scale and resolution that characterizes images, and it performs self-attention between all patches, which makes its computation cost large; it can hardly be called a transformer optimized for vision. This paper improves on ViT by proposing the Shifted Window Transformer (Swin Transformer), a hierarchical transformer whose representations are computed with shifted windows.

Figure 3. Swin Transformer and ViT

Figure 3์—์„œ ๋™์ผํ•œ patch ์‚ฌ์ด์ฆˆ๋งŒ ์‚ฌ์šฉ๋˜๊ณ , ์ด๋ฏธ์ง€ ์ „์ฒด์˜์—ญ์—์„œ self attention์ด ๊ณ„์‚ฐ๋˜๋Š” ViT์— ๋น„ํ•ด, Swin Transformer ๋Š” hierarchicalํ•œ local window์™€ patch๋ฅผ ์ ์šฉํ•˜๊ณ  ์ „์ฒด ์˜์—ญ์ด ์•„๋‹Œ window ์•ˆ์— ํฌํ•จ๋œ patch๋“ค๊ฐ„์˜ self-attention๋งŒ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ Inductive bias๊ฐ€ ๊ฑฐ์˜ ์—†์—ˆ๋˜ ViT ๊ตฌ์กฐ์— locality inductive bias๋ฅผ ๊ฐ€ํ•ด์ค€ ๊ฒƒ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (์ž‘์€ ์˜์—ญ์—์„œ ์ ์  ํฐ์˜์—ญ์œผ๋กœ self-attention์„ ์ทจํ•จ)

 

Vision task์—์„œ ์œ„์™€ ๊ฐ™์€ hierarchicalํ•œ feature๋ฅผ ์ถ”์ถœํ•˜๋Š” ๋ฐฉ์‹์€ ์ด๋ฏธ ๋งŽ์ด ์‚ฌ์šฉ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. Object detection, segmentation์—์„œ๋Š” ์„œ๋กœ๋‹ค๋ฅธ ์ด๋ฏธ์ง€์˜ object๊ฐ€ resolution๊ณผ scale์ด ๋‹ค๋ฅด์ง€๋งŒ ๋™์ผํ•œ object์ผ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— hierarchicalํ•œ ์ •๋ณด๊ฐ€ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด Feature Pyramid Network(FPN)์—์„œ๋Š” ์—ฌ๋Ÿฌ ์‚ฌ์ด์ฆˆ์˜ pooling ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ„์ธต์  ์ •๋ณด๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

Figure 4. Feature Pyramid Pooling 

 

 

2.2  Architecture

Figure 5. Architecture of Swin Transformer

 

Swin Transformer์˜ ์ „์ฒด์ ์ธ ๊ตฌ์กฐ๋ฅผ ๋ณด๋ฉด HxWx3 ์‚ฌ์ด์ฆˆ์˜ input image๋ฅผ patch partition์„ ํ†ตํ•ด ๊ฒน์น˜์ง€ ์•Š๋Š” 4x4x3 ์‚ฌ์ด์ฆˆ์˜ patch๋กœ ๋ถ„ํ• (ViT ๋ณด๋‹ค ํ›จ์”ฌ ์ž‘์€ patch ์‚ฌ์ด์ฆˆ)ํ•ด์„œ H/4xW/4x48 ์‚ฌ์ด์ฆˆ์˜ feature๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ดํ›„์— ViT์ฒ˜๋Ÿผ Linear projection์„ ํ†ตํ•ด transformer encoder์— ์ฃผ์ž…๋ฉ๋‹ˆ๋‹ค. Stage2๋ถ€ํ„ฐ๋Š” stage ์•ž ๋‹จ์— patch merging ๋‹จ๊ณ„๊ฐ€ ์žˆ๋Š”๋ฐ, ์ด๋Š” ์ธ์ ‘ํ•œ 2x2์˜ patch๋“ค์„ ํ•˜๋‚˜์˜ patch๋กœ ํ•ฉ์ณ์„œ window size๊ฐ€ ์ปค์ง€๋”๋ผ๋„ window ๋‚ด๋ถ€์˜ patch ๊ฐœ์ˆ˜๋Š” ์ผ์ •ํ•˜๊ฒŒ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” patch size๊ฐ€ ์ ์  ์ปค์ง€๋ฉด์„œ CNN์ฒ˜๋Ÿผ hierarchical ํ•œ ์ •๋ณด๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ์ด๋กœ ์ธํ•ด ๊ณ„์‚ฐ๋Ÿ‰์ด ์„ ํ˜•์ ์œผ๋กœ๋งŒ ์ฆ๊ฐ€ํ•˜์—ฌ ViT์— ๋น„ํ•ด ๊ณ„์‚ฐ๋Ÿ‰์ด ํ˜„์ €ํžˆ ์ค„์–ด๋“ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. (ViT ๋Š” ๋ชจ๋“  MSA์—์„œ ๋ชจ๋“  patch ๊ฐ„์˜ self attention์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐ๋Ÿ‰์ด ๋งŽ์Šต๋‹ˆ๋‹ค.)

 

 

2.2.1  W-MSA, SW-MSA

Figure 6. Example of 'Windows -> Shifted Windows'

 

๊ฐ Swin Transformer block์€ Windows Multi-head Self Attention(W-MSA)๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” block๊ณผ Shifted Windows Multi-head Self Attention(SW-MSA)๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” block์ด ์—ฐ์†์ ์œผ๋กœ ์—ฐ๊ฒฐ๋ฉ๋‹ˆ๋‹ค. W-MSA ๋Š” local window ๋‚ด๋ถ€์— ์žˆ๋Š” patch๋“ค๋ผ๋ฆฌ๋งŒ self-attention์„ ์ˆ˜ํ–‰ํ•˜๊ณ , SW-MSA๋Š” shifted๋œ window์—์„œ self-attention์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ณ ์ •๋œ ์œ„์น˜๋ฟ๋งŒ์ด ์•„๋‹ˆ๋ผ ์—ฌ๋Ÿฌ ์˜์—ญ์—์„œ์˜ self-attention์ด ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค. ์œ„ ๊ทธ๋ฆผ์˜ layer1์—์„œ๋Š” ์ด๋ฏธ์ง€ ์ „์ฒด๊ฐ€ ํฌ๊ฒŒ 4๊ฐœ์˜ window๋กœ ๋‚˜๋ˆ„์–ด ์ง€๊ณ , ๊ฐ window ๋‚ด๋ถ€์˜ patch๋“ค๋ผ๋ฆฌ self-attention์ด ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค. Layer1+1์—์„œ๋Š” window๊ฐ€ shift ๋˜๋ฏ€๋กœ window ๊ฒฝ๊ณ„ ๋•Œ๋ฌธ์— self attention์ด ๊ณ„์‚ฐ๋˜์ง€ ์•Š์•˜๋˜ ๋ถ€๋ถ„๋“ค์˜ self attention์ด ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

 

2.2.2  Cyclic shift and Masked MSA

Figure 7. Cyclic shift

 

์œ„์˜ ์˜ˆ์‹œ๋ฅผ ๊ธฐ์ค€์œผ๋กœ W-MSA๋Š” 4๊ฐœ์˜ window์—์„œ self attention์„ ๊ฐ๊ฐ ์ˆ˜ํ–‰ํ•˜๊ณ , SW-MSA๋Š” 9๊ฐœ์˜ window์—์„œ self attention์„ ๊ฐ๊ฐ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 9๊ฐœ๋ฅผ ๊ฐ๊ฐ ์ˆ˜ํ–‰ ์‹œ padding์„ ์ด์šฉํ•  ์ˆ˜ ์žˆ์ง€๋งŒ computation cost๊ฐ€ ์ฆ๊ฐ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ‘cyclic shift’(figure 7) ๋ผ๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์œˆ๋„์šฐ๋ฅผ window size//2 ๋งŒํผ(์˜ˆ์‹œ์—์„œ๋Š” 2๋งŒํผ) ์šฐ์ธก ํ•˜๋‹จ์œผ๋กœ ์ด๋™์‹œํ‚ค๊ณ  ์ขŒ์ธก ์ƒ๋‹ด์˜ A, B, C ๊ตฌ์—ญ์„ ์šฐ์ธกํ•˜๋‹จ์œผ๋กœ ์ด๋™์‹œํ‚ต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  4๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด์ง„ window ์—์„œ ๊ฐ๊ฐ self attention์„ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ 2์‚ฌ๋ถ„๋ฉด์˜ window ๋ฅผ ์ œ์™ธํ•˜๊ณ ๋Š” ์ด๋ฏธ์ง€ space์—์„œ ์—ฐ๊ฒฐ๋œ ๋ถ€๋ถ„์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ๊ฐ ๋‹ค๋ฅธ mask๋ฅผ ์”Œ์›Œ์„œ ์ด๋ฏธ์ง€ space์—์„œ ์—ฐ๊ฒฐ๋œ patch๋“ค๊ฐ„์˜ self attention์„ ์ˆ˜ํ–‰ํ•˜์—ฌ computation cost๋ฅผ ์ž‘๊ฒŒ ๋งŒ๋“œ๋Š” ํšจ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ์•„๋ž˜ figure 8๋Š” cyclic shift + Masked MSA์˜ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

Figure 8. Example of Cyclic shift and Masked MSA

 

2.2.3  Relative Position Bias

Unlike ViT, the Swin Transformer does not add a position embedding at the encoder input; instead it adds a relative position bias B inside the self-attention computation: Attention(Q, K, V) = SoftMax(QK^T / sqrt(d) + B)V. Whereas the position embedding encoded absolute coordinates, the relative position bias adds the relative coordinates between patches, and this turned out to perform better. Presumably, relative positions between an object's parts help identify the object more than their absolute positions in the image.
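A sketch of how the bias lookup can be constructed for an M x M window, following the construction described in the paper (the table has (2M-1)^2 entries per head; the 12 heads here are an assumption from the running example):

```python
import torch
import torch.nn as nn

# Relative coordinates of every patch pair in an M x M window index into
# a learnable (2M-1)^2-entry bias table; the looked-up bias B is added to
# the attention logits QK^T / sqrt(d) before the softmax.
M = 7
coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij")).flatten(1)  # (2, M*M)
rel = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0)  # (M*M, M*M, 2)
rel += M - 1                                                      # shift to >= 0
index = rel[..., 0] * (2 * M - 1) + rel[..., 1]                   # (M*M, M*M)

bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, 12))      # 12 heads
B = bias_table[index.view(-1)].view(M * M, M * M, -1)             # (49, 49, 12)
```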

 

2.3  Experimental Results

In the ImageNet classification experiments, the Swin Transformer (blue) reaches accuracy more than 3% higher than ViT (red) with less than half the parameters, though it is still below the CNN-based SOTA (EfficientNet). On object detection and segmentation tasks, models using the Swin Transformer as a backbone achieved SOTA across the board.

 

 

3.  My Thoughts

The Swin Transformer fixes ViT's problems, namely its heavy computation and its failure to account for variation in image resolution and scale (a weak inductive bias), and it can serve as a backbone for many object detection and segmentation methods. Looking at the benchmark for the ADE20K segmentation dataset, nearly all of the top-10 methods are transformer-based, and the current SOTA is Swin Transformer version 2.

ADE20K dataset benchmark

ViT๋Š” ์ด๋ฏธ์ง€๋ฅผ patch ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„๊ณ , patch ๋‹จ์œ„์—์„œ multi-head๋กœ non-local operation์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ตฌ์กฐ์ธ๋ฐ, ๋„คํŠธ์›Œํฌ๊ฐ€ ๊นŠ์–ด์ง€๋”๋ผ๋„ ์ฒ˜์Œ ๋‚˜๋‰˜์–ด์ง„ patch๋ฅผ ๊ธฐ์ค€์œผ๋กœ self attention์ด ์ˆ˜ํ–‰๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋™์ผํ•œ ๊ฐ์ฒด์ผ์ง€๋ผ๋„ ์ด๋ฏธ์ง€ space์—์„œ shift ๋˜๋Š” ๊ฒฝ์šฐ์—๋Š” ์กฐ๊ธˆ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๊ทธ์— ๋น„ํ•ด, Swin Transformer๋Š” ์ฒ˜์Œ์—๋Š” ์ž‘์€ window size๋กœ ์ด๋ฏธ์ง€๋ฅผ ๋ถ„ํ• ํ•˜๊ณ  window ๋‚ด๋ถ€ patch๋“ค๊ฐ„์˜ self attention์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๋„คํŠธ์›Œํฌ๊ฐ€ ๊นŠ์–ด์งˆ์ˆ˜๋ก ์ ์  window size์™€ patch size๋ฅผ ํ‚ค์›Œ์„œ self attention์„ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ, ์ด๋Š” ์ด๋ฏธ์ง€ space์—์„œ ์ ์  ๋” ํฐ ์˜์—ญ(window) ๋‚ด๋ถ€์—์„œ ๋” ๋ฉด์ ์ด ํฐ ์˜์—ญ(patch)๊ฐ„์˜ self attention์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

3.1  CNN vs. Transformer

In fact, I think the Swin Transformer's architecture and operations are quite similar to the hierarchical structure of a CNN: the Swin Transformer reduces image resolution by enlarging the window and patch sizes, while a CNN creates hierarchical representations by reducing resolution through convolution. This made me wonder whether inserting W-MSA and SW-MSA between the stages of a CNN could bring a similar effect without patch merging or changes in window and patch size.

 

3.2  Making Use of the Transformer Architecture

Because ViT has such a weak inductive bias, it feels like a blank canvas that can be used anywhere, but the corresponding drawback is that it struggles to perform well without pre-training on a massive dataset. The Swin Transformer injected an inductive bias using the hierarchical nature of images; likewise, I think designing architectures that feed other task-relevant characteristics into ViT as inductive biases could produce good results.

For example, texture images are well described not only by local structural features but also by global statistical features. So if an inductive bias could be injected by exploiting the statistical properties of features, I think applying the transformer architecture to texture recognition could yield good results.

 

 

๋ฐ˜์‘ํ˜•