
[DL] ๋”ฅ๋Ÿฌ๋‹์—์„œ์˜ Regularization : Weight Decay, Batch Normalization, Early Stopping

by ๋ญ…์ฆค 2022. 3. 23.
๋ฐ˜์‘ํ˜•

๋”ฅ๋Ÿฌ๋‹์—์„œ Regularization์€ ๋ชจ๋ธ์˜ overfitting์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ํŠน์ •ํ•œ ๊ฒƒ์— ๊ทœ์ œ๋ฅผ ํ•˜๋Š” ๋ฐฉ๋ฒ•๋“ค์„ ์ด์นญํ•˜๊ณ , ๋Œ€ํ‘œ์ ์œผ๋กœ ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•๋“ค์ด ์žˆ๋‹ค.

 

*Overfitting : one of the most common problems with machine learning models, where the model fits the training dataset so closely that its generalization performance degrades.

 

  1. Weight Decay - L1, L2
  2. Batch Normalization
  3. Early Stopping

 

Weight Decay

  • Neural network์˜ ํŠน์ • weight๊ฐ€ ๋„ˆ๋ฌด ์ปค์ง€๋Š” ๊ฒƒ์€ ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๋–จ์–ด๋œจ๋ ค overfitting ๋˜๊ฒŒ ํ•˜๋ฏ€๋กœ, weight์— ๊ทœ์ œ๋ฅผ ๊ฑธ์–ด์ฃผ๋Š” ๊ฒƒ์ด ํ•„์š”.
  • L1 regularization, L2 regularization ๋ชจ๋‘ ๊ธฐ์กด Loss function์— weight์˜ ํฌ๊ธฐ๋ฅผ ํฌํ•จํ•˜์—ฌ weight์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์•„์ง€๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šตํ•˜๋„๋ก ๊ทœ์ œ
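The post names no framework, so as a minimal sketch assuming PyTorch: L2 weight decay is most often applied through the optimizer's weight_decay argument, which folds the gradient of λ‖w‖² into every update (the 1e-4 strength here is an arbitrary example value).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# weight_decay adds an L2 penalty (lambda * ||w||^2) to every update,
# pulling all weights toward zero as training proceeds.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```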

 

L1 Regularization vs L2 Regularization

  • L1 Regularization : on each weight update, a constant amount is subtracted regardless of the weight's magnitude (differentiate the loss function to see this), so small weights converge to 0 and only a few important weights remain. L1 regularization is therefore effective when you want a sparse model that produces just a few meaningful values. However, the absolute-value penalty is not differentiable at 0, so care is needed in gradient-based learning.
  • L2 Regularization : on each weight update, the weight's own magnitude directly scales the penalty gradient, which makes it more effective for weight decay (both variants are sketched in code below).
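To make the comparison concrete, a minimal sketch (again assuming PyTorch; the penalty strengths are arbitrary example values) that adds both penalties to a base loss by hand:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

base_loss = criterion(model(x), y)

# L1 penalty: sum of |w|. Its gradient is sign(w), a constant magnitude
# regardless of the weight's size, which drives small weights to exactly 0.
l1_penalty = sum(p.abs().sum() for p in model.parameters())

# L2 penalty: sum of w^2. Its gradient is proportional to w itself,
# so the largest weights are shrunk the most (classic weight decay).
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

loss = base_loss + 1e-4 * l1_penalty + 1e-4 * l2_penalty
loss.backward()
```

In practice L2 is usually handled by the optimizer as shown earlier, while a manual term like this is the common way to get L1.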


Batch Normalization

  • A method that stabilizes the training process itself in order to prevent gradient vanishing/exploding
  • During training, the input distribution of each layer or activation in the network keeps shifting, a phenomenon called "internal covariate shift"; batch normalization adjusts the input distribution to resolve it
  • The mean/variance adjustment is built into the neural network itself, so during training each batch's mean and variance are used for normalization
  • Scale and shift (bias) are then adjusted via learnable gamma and beta parameters
  • At inference time, batch-level statistics cannot be computed, so the mean and variance accumulated during training with a moving (or exponential) average are used as fixed values (see the sketch below)

 

Effects of Batch Normalization
  • It mitigates gradient vanishing/exploding, so a higher learning rate can be used and training speeds up
  • It frees the network from careful weight initialization
  • Regularization effect : the batch mean and variance keep changing throughout BN, which perturbs the weight updates and keeps any single weight from growing very large.

 

Batch Normalization ์ฃผ์˜ ์‚ฌํ•ญ
  • Batch size ๊ฐ€ ๋„ˆ๋ฌด ํฌ๊ฑฐ๋‚˜ ์ž‘์œผ๋ฉด ํšจ๊ณผ๋ฅผ ๊ธฐ๋Œ€ํ•˜๊ธฐ ์–ด๋ ค์›€
  • ์‚ฌ์šฉ ์ˆœ์„œ : Convolution - BN - Activation - Pooling - ... (BN์˜ ๋ชฉ์ ์ด ๋„คํŠธ์›Œํฌ ์—ฐ์‚ฐ ๊ฒฐ๊ณผ๊ฐ€ ์›ํ•˜๋Š” ๋ฐฉํ–ฅ์˜ ๋ถ„ํฌ๋Œ€๋กœ ๋‚˜์˜ค๊ฒŒ ํ•˜๋Š” ๊ฒƒ์ด๋ฏ€๋กœ conv ์—ฐ์‚ฐ ๋ฐ”๋กœ ๋’ค์— ์ฃผ๋กœ ์‚ฌ์šฉ/ ์•„๋‹Œ ๊ฒฝ์šฐ๋„ ์žˆ์Šต๋‹ˆ๋‹ค.)
  • Multi GPU training ์‹œ ์ฃผ๋กœ "Synchronized Batch Normalization" ์‚ฌ์šฉ
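An illustration of the ordering above, assuming PyTorch:

```python
import torch.nn as nn

# The ordering discussed above: Conv -> BN -> Activation -> Pooling.
# bias=False is common in the conv layer because BN's beta parameter
# makes a separate conv bias redundant.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
)
```

For the multi-GPU case, PyTorch provides torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) to swap regular BN layers for synchronized ones that pool statistics across devices.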


Early Stopping

  • When a deep neural network is trained for too long, overfitting generally sets in after a certain epoch and test performance drops
  • To prevent this, training is stopped before overfitting occurs, e.g. by monitoring performance on a validation set (a minimal sketch follows)
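A minimal early-stopping sketch, assuming PyTorch and synthetic data; patience = 5 is an arbitrary example value:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# Synthetic train/validation splits, for illustration only
x_tr, y_tr = torch.randn(256, 10), torch.randn(256, 1)
x_va, y_va = torch.randn(64, 10), torch.randn(64, 1)

best_val, patience, wait = float("inf"), 5, 0

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_tr), y_tr)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_va), y_va).item()

    if val_loss < best_val:   # improved: reset patience, keep best checkpoint
        best_val, wait = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:                     # no improvement for `patience` epochs: stop
        wait += 1
        if wait >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
```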
๋ฐ˜์‘ํ˜•