
[Paper Review] Non-local Neural Networks / The Origin of the Vision Transformer

by 뭅즤 2021. 12. 12.

Notes on the Non-local network...

 

A CNN can be viewed as a local operator: shallow layers extract correlations within local regions of the spatial domain, while deeper layers extract correlations over relatively more global regions. Even so, however deep the layers get, this is different from a non-local operation, which extracts correlations over the entire region in a single operation. As a result, the structure of a CNN makes it hard to extract correlations between features that are far apart in the spatial or temporal domain.
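To make the "local operator" point concrete, here is a minimal sketch of how slowly the receptive field grows with depth. It assumes a plain stack of stride-1, undilated convolutions with no pooling; the helper name `receptive_field` is hypothetical, and real networks with strides or pooling grow faster than this.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels, per axis) of stacked stride-1 convolutions.

    Assumption: no stride, dilation, or pooling, so each layer adds
    (kernel_size - 1) pixels to the field of view.
    """
    return 1 + num_layers * (kernel_size - 1)


for n in (1, 5, 20):
    print(n, receptive_field(n))
# prints:
# 1 3
# 5 11
# 20 41
```

Under these assumptions, even 20 stacked 3x3 layers let one output position see only a 41x41 window, whereas a single non-local operation relates every position to every other position.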

This paper proposes the non-local operation to address this limitation.

 

The figure below visualizes the highest-weighted arrows once a non-local block has been trained. When one object is correlated with another, the response is strongly activated even if the two are far apart spatially or temporally.

 

Examples of the behavior of a non-local block

 

Motivation

 

Inspired by the Non-local Means filter (NLM filter), an image denoising filter, the goal is to compute the relationship between one point and every other point. In the equation below, yi is obtained by computing the similarity between the i-th feature (a specific spatial point in the feature map) and the feature at another position j, multiplying that similarity by the embedded value of the j-th feature, and summing the results over all positions.

Non-local operation
Similarity computation
Residual operation that adds the non-local output yi to the original feature xi (Wz: 1x1 conv)

- f(xi, xj): a function that computes the similarity between xi and xj. The paper proposes four variants, and the performance gain is similar whichever one is used, so computing similarity at all appears to be what matters in the non-local operation.

- g(xj): a function that embeds the feature at position j by multiplying it by a weight. Rather than simply reading out the feature at that position, it applies a weight to extract meaningful information. If the function f that multiplies it is viewed as a weight, then g(xj) is activated more strongly when the similarity between i and j is high.

 

- 1/C(x): normalization term
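Put together, the three terms above can be written out as follows. This is a reconstruction from the definitions in the text, not a copy of the original figures. The Gaussian form of f shown here is just one of the four variants the paper mentions, and the symbol W_g for the embedding weight is an assumed name.

```latex
\begin{align}
y_i &= \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \\
f(x_i, x_j) &= e^{x_i^{\top} x_j}, \qquad C(x) = \sum_{\forall j} f(x_i, x_j) \\
g(x_j) &= W_g\, x_j \\
z_i &= W_z\, y_i + x_i
\end{align}
```

With this choice of f and C(x), the weights over j for a fixed i are a softmax of the dot products between x_i and every x_j.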

 

In the end, I think the meaning of the non-local operation is this: compute the relationship (similarity, correlation) between the feature of some region A and the features over the entire input (spatial and temporal), and the stronger that relationship is (say the strongly related region is B), the more strongly the embedded feature of region B is activated.
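The region A / region B intuition can be sketched for a single position. This is a toy example, assuming a Gaussian (exponentiated dot-product) similarity normalized by its sum; the function name `non_local_at`, the 2-dimensional features, and the identity embedding `W_g` are all made up for illustration.

```python
import numpy as np


def non_local_at(x, i, W_g):
    """Non-local response at position i for features x of shape (N, C).

    Assumption: f(x_i, x_j) = exp(x_i . x_j), with C(x) as the sum of f over j.
    """
    scores = x @ x[i]                    # (N,) dot product of x_i with every x_j
    f = np.exp(scores - scores.max())    # shifted for numerical stability
    weights = f / f.sum()                # the 1/C(x) normalization
    y_i = weights @ (x @ W_g)            # weighted sum of embedded features g(x_j)
    return y_i, weights


x = np.array([
    [2.0, 0.0],    # position 0: "region A"
    [0.0, 1.0],    # position 1: unrelated neighbour
    [-1.0, 0.0],   # position 2: unrelated neighbour
    [2.0, 0.1],    # position 3: "region B", far away but similar to A
])
W_g = np.eye(2)    # identity embedding, purely for readability

y0, w0 = non_local_at(x, 0, W_g)
print(np.round(w0, 3))
```

Position 3 receives a much larger weight than positions 1 and 2 even though it is the farthest index from position 0, because the weight depends only on feature similarity and not on distance.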

 

So, in a single image (it depends on the task, but take classification as an example), if the co-occurrence of two or more objects helps classify the image, a CNN has difficulty extracting their relationship directly when the objects are not spatially close, whereas the non-local operation can learn it. For video frames, non-local relationships can be learned along both the spatial and temporal axes.

 

 

Implementation

 

Running this non-local operation pixel-wise would be too wasteful, so it is performed at the feature level as follows.

The feature x (HW×1024) is multiplied by its transpose (1024×HW), which computes the similarity f(xi, xj) as an HW×HW matrix. The feature x is also passed through a 1x1 conv, g(xj), to embed it (HW×512). Matrix-multiplying this embedding with the similarity map (HW×HW) from the previous step completes the non-local operation.
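The matrix steps above can be sketched in NumPy. This is a minimal sketch, not the paper's implementation. It assumes a softmax over each row of the similarity map as the normalization, treats each 1x1 conv as a per-position matrix multiply on flattened features, and uses random `W_g` and `W_z` with small sizes in place of 1024 and 512 channels. The name `non_local_block` is hypothetical.

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # stabilize the exponentials
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def non_local_block(x, W_g, W_z):
    """x: (HW, C) flattened features; W_g: (C, C//2); W_z: (C//2, C)."""
    sim = x @ x.T                  # (HW, HW) pairwise similarity f(x_i, x_j)
    attn = softmax(sim, axis=1)    # assumed normalization, one row per position i
    g = x @ W_g                    # (HW, C//2): 1x1 conv as a per-position linear map
    y = attn @ g                   # (HW, C//2): non-local output
    return y @ W_z + x             # (HW, C): W_z restores channels, then residual add


rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
x = rng.standard_normal((H * W, C))
W_g = rng.standard_normal((C, C // 2)) * 0.1
W_z = rng.standard_normal((C // 2, C)) * 0.1

z = non_local_block(x, W_g, W_z)
print(z.shape)  # (16, 8)
```

The HW×HW similarity map grows quadratically with the number of positions, which is why the operation is applied to downsampled feature maps rather than to raw pixels.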
