
[Paper Review] SHAPE-TEXTURE DEBIASED NEURAL NETWORK TRAINING / The relationship between shape and texture in neural networks

by 뭅즤 2021. 12. 4.

This paper, published at ICLR 2021, examines how objects relate to shape and texture, and proposes a shape-texture debiased neural network that improves performance on vision tasks such as object recognition by learning from both shape and texture information.

 

Introduction

Shape and texture are both important cues for recognizing objects. Earlier object recognition research already showed that combining shape and texture appropriately can raise recognition performance. The paper 'ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness' demonstrated experimentally that a CNN trained on ImageNet is biased toward texture rather than shape (a consequence of the training data). In the simple example above, the trained CNN correctly recognizes a local patch of elephant skin as an elephant and a full cat image as a cat, but an image that combines the cat's shape with elephant skin is predicted as an elephant, the class that matches the texture.

Overview of AdaIN style-transfer
Examples of Stylized-ImageNet(SIN)

IN denotes ImageNet and SIN denotes Stylized-ImageNet (images whose style (texture) has been replaced with AdaIN style transfer while the shape is preserved). The SIN examples in the figure above show that one image can yield several images by swapping its texture. In the experiments, training on IN and testing on SIN gives the worst result, which suggests that a CNN trained on IN (the original ImageNet) is heavily biased toward texture.

 

*AdaIN style transfer: a method that exploits the fact that feature-space statistics (mean, std) are a key factor in representing style. It removes the style from the content image (shape) and applies the style of the style image (texture).

 

The paper shows that a CNN whose learned representation is biased toward either shape or texture performs worse. Shape-biased and texture-biased models are complementary, and leaning on only one of the two cues can limit model performance. For example, without texture information it can be hard to tell an orange from a lemon, since their shapes are similar.

 

To learn better representations, the paper proposes a shape-texture debiased neural network training framework. The idea is to let the CNN find a suitable representation on its own instead of becoming biased toward shape or texture by the training samples. To augment the original training data, style transfer is applied to generate cue conflict images that break the correlation between shape and texture. Supervision is then provided from both the shape and the texture of each image (which requires data carrying both a shape label and a texture label).

 

์•„๋ž˜์˜ figure์—์„œ fur coat๋Š” ๋Œ€๋ถ€๋ถ„์˜ ์ฝ”ํŠธ ๋˜๋Š” ์˜ท ์ƒ์˜๊ฐ€ ๊ฐ€์ง€๋Š” shape ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Œ๊ณผ ๋™์‹œ์— ํŠน์œ ์˜ texture๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ shape-biased ๋œ ๋ชจ๋ธ์€ ์˜ท์˜ shape์—๋งŒ ์ง‘์ค‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค๋ฅธ ์ข…๋ฅ˜์˜ ์˜ท์œผ๋กœ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๊ณ , texture-biased๋œ ๋ชจ๋ธ์—์„œ๋Š” fur coat์˜ texture์—๋งŒ ์ง‘์ค‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋น„์Šทํ•œ local ํŒจํ„ด์„ ๊ฐ€์ง€๋Š” ๊ณ ์–‘์ด ๋˜๋Š” ๋‹ค๋ฅธ ์‚ฌ๋ฌผ๋กœ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ ์˜ท ๊ฐ™์€ ๊ฒฝ์šฐ shape์ด ์ผ์ •ํ•˜์ง€ ์•Š๊ณ  ์ ‘ํžˆ๊ฑฐ๋‚˜ ๊ตฌ๊ฒจ์ ธ ์žˆ์„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— texture ์ •๋ณด๊ฐ€ ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค. ๋•Œ๋ฌธ์— object๋ฅผ ๋”์šฑ ์ •ํ™•ํ•˜๊ฒŒ ํ•™์Šตํ•˜๊ณ  ํŒ๋ณ„ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” shape ์ •๋ณด์™€ texture ์ •๋ณด๋ฅผ ์ ์ ˆํžˆ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

Shape-biased Model / Texture-biased Model / Debiased Model(proposed)

 

SHAPE/TEXTURE BIASED NEURAL NETWORKS

As the figure below shows, images fall into those that the shape-biased model classifies better and those that the texture-biased model classifies better. The images that the texture-biased model handles well are mostly ones where the object has no canonical shape and can take many forms.

Stylized-ImageNet is a dataset built by applying style transfer to ImageNet so that the shape is preserved while the texture changes. A CNN trained on Stylized-ImageNet will therefore be biased toward shape. The earlier paper combined training on the original dataset with training on Stylized-ImageNet to steer the previously texture-biased model toward shape.

 

- Data generation

The paper obtains shape-biased or texture-biased models by training on images in which shape and texture information are combined. A cue conflict image is created by selecting a pair of images uniformly at random and applying style transfer, so that the shape of one image is mixed with the texture of the other. For example, one can generate a conflict image that combines the shape of a chimpanzee with the texture of a lemon.

 

- Label assignment

How labels are assigned to cue conflict images controls the bias of the trained model. To make the model focus on texture, assign the label of the texture source (lemon) from the pair (chimpanzee, lemon) used to build the cue conflict image. To make it focus on shape, assign the label of the shape source (chimpanzee).
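The pairing and hard-label steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `make_cue_conflict_pairs` and `assign_hard_label` are hypothetical, pairing is done with a random permutation of the indices, and the style-transfer step that would actually render each pair is left out.

```python
import random


def make_cue_conflict_pairs(num_images, rng):
    """Pair each shape-source index with a randomly chosen texture-source index.

    Assumption: a uniform random permutation is used to pick texture sources,
    so every image serves once as shape source and once as texture source.
    """
    shape_indices = list(range(num_images))
    texture_indices = shape_indices[:]
    rng.shuffle(texture_indices)
    return list(zip(shape_indices, texture_indices))


def assign_hard_label(shape_label, texture_label, bias):
    """Return the label that pushes the model toward the requested cue."""
    if bias == "shape":
        return shape_label
    if bias == "texture":
        return texture_label
    raise ValueError("bias must be 'shape' or 'texture'")


labels = ["chimpanzee", "lemon", "cat", "elephant"]
pairs = make_cue_conflict_pairs(len(labels), random.Random(0))
for s, t in pairs:
    # Each line: shape source, texture source, label used for a texture-biased model.
    print(labels[s], labels[t], assign_hard_label(labels[s], labels[t], "texture"))
```

Switching the last argument to "shape" gives the labels for the shape-biased variant. The images themselves are identical in both cases, only the supervision differs.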

 

๋ฐ˜์‘ํ˜•

SHAPE-TEXTURE DEBIASED NEURAL NETWORK TRAINING

To use both shape and texture information for prediction, the paper proposes a simple soft-labeling scheme inspired by Mixup. Given the one-hot label y_s of the shape-source image and the one-hot label y_t of the texture-source image, the new label assigned to the cue conflict image is:

ỹ = γ · y_s + (1 − γ) · y_t

 

The shape-texture coefficient γ is a hyperparameter in the range 0 to 1 that controls the relative importance of shape and texture. Setting γ to 0 or 1 trains a texture-biased or a shape-biased model, respectively. The two extremes produce biased models, so an optimal point should exist somewhere between 0 and 1. (Empirically the sweet spot is 0.7, which means shape information is relatively more important for recognition.) The paper names this method shape-texture debiased neural network training.
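As a rough illustration of the soft labeling, the sketch below mixes two one-hot labels with γ and computes a cross-entropy loss against the resulting soft target. It assumes the Mixup-style rule γ · y_s + (1 − γ) · y_t, which matches the behavior described for γ = 0 and γ = 1. The function names and the toy logits are hypothetical, and the paper's actual training code is not reproduced here.

```python
import math


def soft_label(y_s, y_t, gamma):
    """Mix the shape label y_s and the texture label y_t with coefficient gamma."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma * s + (1.0 - gamma) * t for s, t in zip(y_s, y_t)]


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for v, t in zip(logits, target))


y_shape = [1.0, 0.0, 0.0]    # e.g. chimpanzee
y_texture = [0.0, 1.0, 0.0]  # e.g. lemon
target = soft_label(y_shape, y_texture, gamma=0.7)
loss = soft_cross_entropy([2.0, 1.0, 0.1], target)
print(target, loss)
```

With gamma=1.0 the target collapses to the shape label and with gamma=0.0 to the texture label, which are the two biased settings.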

 

๋ฐฉ๋ฒ•์€ ๋งค์šฐ ๊ฐ„๋‹จํ•ด ๋ณด์ด์ง€๋งŒ, ํ•œ ์Œ์˜ ์ด๋ฏธ์ง€๋ฅผ ๊ตฌ์„ฑํ•  ๋•Œ shape-source ์ด๋ฏธ์ง€์™€ texture-source ์ด๋ฏธ์ง€๋ฅผ ์–ด๋–ป๊ฒŒ ๋‚˜๋ˆ„์–ด ๋’€๋Š”์ง€ ์˜๋ฌธ์ด ์ƒ๊น๋‹ˆ๋‹ค. ํ•œ ์ด๋ฏธ์ง€์—์„œ shape ๊ณผ texture ์ •๋ณด ์ค‘ ์–ด๋–ค ๊ฒƒ์ด dominantํ•œ์ง€ handcrafted ํ•˜๊ฒŒ ๋‚˜๋ˆ„์—ˆ๋Š”์ง€ ์˜๋ฌธ์ž…๋‹ˆ๋‹ค.

 

Debiased Model (proposed)

The method can also be applied to semantic segmentation. As in the figure below, training data can be created by segmenting the texture-source object and style-transferring its texture onto the shape-source image, which keeps its shape label.

 

 

RESULTS

The activation maps of the shape model and the texture model in the figure below show that the shape model activates on the faces of the cat and the lion, whereas the texture model activates evenly across the whole object.

 

 

Table 1

 

Table 1 reports results for vanilla training, vanilla training with twice as many epochs, and the shape-biased, texture-biased, and debiased models. Biasing toward shape or texture lowers performance, and for the debiased model the improvement grows as the network gets larger. The comparison with 2x epochs shows that simply training longer does not explain the gain.

 

Table 2

Table 2 checks whether the debiased model has learned to represent both shape and texture well, using the shape datasets ImageNet-Sketch and ImageNet-R and the texture datasets Kylberg Texture and Flickr Material. On the shape datasets the debiased model outperforms even the S-biased model. On the texture datasets it trails the T-biased model but still improves over vanilla.

 

Table 3
Table 4

 

Tables 3 and 4 report generalization on ImageNet-A (natural adversarial examples), ImageNet-C (75 visual corruptions applied), and Stylized-ImageNet, together with robustness to the FGSM adversarial attack. The improvements from the other SOTA methods are not consistent across benchmarks, whereas the debiased model is the only method that consistently outperforms the vanilla training baseline.

 

๋‚ด ์ƒ๊ฐ.. 

When a CNN is trained on mostly unprocessed real-world data, it will probably not become extremely biased toward shape or texture. I think a CNN often drifts toward texture naturally because the objects being classified deform a lot, so shape is not consistent within a class (for animals it changes with viewing angle and pose, and for objects a class splits into subcategories), while instances of the class still share broadly similar texture. In other words, if a CNN focuses on shape, the texture variation in the training data is probably large, and if it focuses on texture, the shape variation is probably large. So if the training data were vast enough to represent all data, becoming biased toward shape or texture would actually be a natural and desirable outcome of training. But training data can never represent all data, and I think that is why the paper steers training to use both shape and texture information appropriately.

 

Overview of six domains in ImageNet-D

๋ณธ ๋…ผ๋ฌธ์˜ proposed method๋Š” ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด shape ์ •๋ณด๋Š” ์œ ์‚ฌํ•˜์ง€๋งŒ texture ์ •๋ณด๊ฐ€ ๋‹ค๋ฅธ(domain์ด ๋‹ค๋ฅธ) dataset์—์„œ๋„ ์„ฑ๋Šฅํ–ฅ์ƒ์ด ์žˆ์„ ๊ฒƒ์ด๋ผ ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ๋ฌธ์ œ ์ •์˜์— ๋”ฐ๋ผ real ์ด๋ฏธ์ง€๊ฐ€ ์•„๋‹Œ ์ด๋ฏธ์ง€(sketch, painting,…)๋“ค์€ anomaly data๋กœ ์ทจ๊ธ‰ํ•˜๊ณ  real ์ด๋ฏธ์ง€๋“ค๋งŒ ๋ถ„๋ฅ˜ํ•˜๊ธธ ๋ฐ”๋ผ๋Š” task๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์„ ํƒ์ ์œผ๋กœ ์‚ฌ์šฉํ•ด์•ผ ํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค.  

 

Below are two questions that came up while reading the paper.

 

First, I wonder how performance improved when a pair of images was picked at random and the style (texture) of one was applied to the other, instead of manually assigning images to shape source or texture source according to class characteristics (which cue is more dominant). It seemed more reasonable to me to use images where shape information matters more as shape sources and images where texture information matters more as texture sources.

Second, I have a question about the use of style transfer. The paper uses AdaIN style transfer to generate the shape+texture composite images, and it works according to the following formula.

AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y)

(x: content image, y: style image; μ and σ are the channel-wise mean and standard deviation of the features)

 

์œ„ ์‹์€ feature space์—์„œ statistics(mean, std)๊ฐ€ ์ด๋ฏธ์ง€์˜ style(texture)๊ณผ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, x ์ด๋ฏธ์ง€์—์„œ style์„ ์ œ๊ฑฐํ•˜๊ณ  y ์ด๋ฏธ์ง€์˜ style์„ ์ ์šฉ์‹œํ‚ค๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. Style-transfer task์—์„œ๋Š” AdaIN ๊ณ„์‚ฐ ์ดํ›„์— ์ด๋ฏธ์ง€ space๋กœ decodingํ•˜๊ธฐ ์œ„ํ•ด ๋„คํŠธ์›Œํฌ๊ฐ€ ๋” ๋‚จ์•„์žˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋ฏธ์ง€ space์—์„œ visual ์ ์œผ๋กœ ํ•ฉ์„ฑ๋œ ์ด๋ฏธ์ง€๋ฅผ training์— ์ง์ ‘ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, training ์‹œ conflict ์ด๋ฏธ์ง€ ์Œ ์›๋ณธ์„ ๋„คํŠธ์›Œํฌ์— ์ฃผ์ž…ํ•˜๊ณ  AdaIN ์‹์œผ๋กœ feature ํ•ฉ์„ฑ ํ›„ class๋ฅผ predictionํ•˜๊ณ  soft label๋กœ loss๋ฅผ ์ค˜์„œ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ๊ณผ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅผ์ง€ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค. ํ•ฉ์„ฑ๋œ ์ด๋ฏธ์ง€๋Š” feature level์—์„œ decoding ๋œ image ์ด๊ธฐ ๋•Œ๋ฌธ์— decoding ๋˜๊ธฐ ์ „์˜ feature level์—์„œ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด shape๊ณผ texture๋ฅผ ๋ชจ๋‘ ๋” ์ž˜ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์ง€ ์•Š์„๊นŒ? ๋ผ๋Š” ์ƒ๊ฐ์ด ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.

 

๋ฐ˜์‘ํ˜•