
[Paper Review] Bag of Tricks for Image Classification with Convolutional Neural Networks / an image classification paper

by ๋ญ…์ฆค 2022. 2. 21.
๋ฐ˜์‘ํ˜•

Published at CVPR 2019, this paper collects and empirically evaluates a number of training techniques that are worth knowing for vision tasks such as image classification.

 

Introduction

 

To improve accuracy on an image classification task you can simply use a better, bigger network, but beyond the network itself there are many other factors that influence performance.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ResNet50์„ ๊ธฐ์ค€์œผ๋กœ network architecture๋Š” ํฌ๊ฒŒ ๋ณ€๊ฒฝํ•˜์ง€ ์•Š๊ณ  ์—ฌ๋Ÿฌ Trick ๋“ค์„ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ, ์—ฌ๋Ÿฌ trick๋“ค์„ ์ ์šฉํ•˜๋ฉด ์ ์šฉ ์ด์ „๋ณด๋‹ค ImageNet Top-1 accuracy๊ฐ€ 4% ๊ฐ€๋Ÿ‰์ด๋‚˜ ์ฆ๊ฐ€ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค(์œ„์˜ Table 1). 

 

๋ชจ๋“  ์‹คํ—˜์€ ๊ธฐ๋ณธ์ ์œผ๋กœ ์•„๋ž˜์˜ ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

 

Preprocessing pipelines

  1. Randomly sample an image and decode it into 32-bit floating point raw pixel values in [0, 255].
  2. Randomly crop a rectangular region whose aspect ratio is randomly sampled in [3/4, 4/3] and area randomly sampled in [8%, 100%], then resize the cropped region into a 224-by-224 square image.
  3. Flip horizontally with 0.5 probability.
  4. Scale hue, saturation, and brightness with coefficients uniformly drawn from [0.6, 1.4].
  5. Add PCA noise with a coefficient sampled from a normal distribution N(0, 0.1).
  6. Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively.

In short: randomly sample an image → decode to 32-bit float in [0, 255] → random rectangular crop with aspect ratio and area in the given ranges → resize to 224x224 → horizontal flip (prob = 50%) → hue/saturation/brightness augmentation → add PCA noise → normalize.
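Below is a minimal torchvision sketch of this pipeline. Note that `ColorJitter` only approximates the paper's multiplicative hue scaling (its `hue` argument is an additive shift), and the PCA lighting noise has no built-in transform, so it is left out here.

```python
import torch
from torchvision import transforms

# Per-channel ImageNet statistics from step 6, rescaled from [0, 255] to [0, 1]
# because ToTensor() outputs values in [0, 1].
MEAN = [123.68 / 255, 116.779 / 255, 103.939 / 255]
STD = [58.393 / 255, 57.12 / 255, 57.375 / 255]

train_transform = transforms.Compose([
    # aspect ratio in [3/4, 4/3], area in [8%, 100%], resized to 224x224
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.RandomHorizontalFlip(p=0.5),
    # roughly corresponds to coefficients in [0.6, 1.4] for brightness/saturation
    transforms.ColorJitter(brightness=0.4, saturation=0.4, hue=0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean=MEAN, std=STD),
])
```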

 

Additional Settings

  • Weights of convolution filters and FC layers: Xavier initialization
  • Batch normalization: gamma = 1, beta = 0
  • Optimizer: NAG (Nesterov Accelerated Gradient)
  • # of GPUs = 8
  • Batch size = 256
  • Total epochs = 120
  • Initial learning rate = 0.1
  • Learning rate decay: step decay, lr /= 10 every 30 epochs (see the sketch below)
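A rough PyTorch version of this baseline schedule might look as follows; `model` and `train_one_epoch` are placeholders, and the momentum and weight decay values are common ImageNet defaults rather than part of the list above.

```python
import torch

# SGD with nesterov=True is PyTorch's NAG.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
# Step decay: divide the learning rate by 10 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(120):
    train_one_epoch(model, optimizer)  # hypothetical training loop
    scheduler.step()
```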

 

๋…ผ๋ฌธ์—์„œ ์ ์šฉํ•œ ๋ฐฉ๋ฒ•๋“ค์€ ํฌ๊ฒŒ 3๊ฐ€์ง€๋กœ, Efficient Training, Model Tweaks ๊ทธ๋ฆฌ๊ณ  Training Refinement ์ž…๋‹ˆ๋‹ค. ์•„๋ž˜์—์„œ ๊ฐ๊ฐ์˜ ๋ฐฉ๋ฒ•๋ก ์— ๋Œ€ํ•ด ์„ค๋ช…ํ•˜๋ ค ํ•ฉ๋‹ˆ๋‹ค.

 

1. Efficient Training

์ด ํŒŒํŠธ์—์„œ๋Š” ๋”ฅ๋Ÿฌ๋‹ ๋„คํŠธ์›Œํฌ์— ํšจ์œจ์ ์ธ ํ•™์Šต๋ฐฉ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ํ•™์Šต ์†๋„์™€ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ํฌ๊ฒŒ Large-batch Training ๊ณผ Low-precision Training ์„ ์ œ์‹œํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ, batch size๊ฐ€ ํฌ๋ฉด ์ˆ˜๋ ด์ด ๋Š๋ ค์งˆ ์ˆ˜ ์žˆ๋Š”๋ฐ ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

1.1 Linear Scaling Learning Rate

According to 'Accurate, large minibatch SGD: training imagenet in 1 hour', the learning rate should be scaled linearly with the batch size. (In practice the batch size is usually increased until GPU memory is full, so the appropriate batch size and learning rate differ across setups; this rule gives a principled way to adjust one to the other.)
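For example, starting from the paper's baseline of lr = 0.1 at batch size 256:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: lr grows proportionally with the batch size."""
    return base_lr * batch_size / base_batch

print(scaled_lr(1024))  # 0.4
```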

 

1.2 Learning Rate Warmup

์ผ๋ฐ˜์ ์œผ๋กœ learning rate ๋Š” initial setting ๊ฐ’์—์„œ ์กฐ๊ธˆ์”ฉ ์ค„์—ฌ์ฃผ๋Š” ๋ฐฉ์‹์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค(e.g. step decay, cosine decay,...etc). ํ•˜์ง€๋งŒ, learning rate warmup ๋ฐฉ์‹์€ ์ดˆ๊ธฐ learning rate๋ฅผ 0์œผ๋กœ ์„ค์ •ํ•˜๊ณ  ์ผ์ • ๊ธฐ๊ฐ„ ๋™์•ˆ linear ํ•˜๊ฒŒ ํ‚ค์šฐ๊ณ (warmup) ๊ทธ ๋’ค์— learning rate decay๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. (์ด ๋ฐฉ๋ฒ•๋„ ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.)

 

์œ„ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ์ดˆ๊ธฐ์— 5epoch๊นŒ์ง€ ์„ ํ˜•์ ์œผ๋กœ learning rate๋ฅผ initial setting ๊ฐ’๊นŒ์ง€ ํ‚ค์šฐ๊ณ  ๊ทธ ๋’ค์— learning rate decay๋ฅผ ์ ์šฉํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ heuristic์€ training์— ๋„์›€์„ ์ค๋‹ˆ๋‹ค.

 

1.3 Zero Gamma in Batchnorm

In a batch normalization layer, gamma (the value multiplied with the normalized input x) is a trainable parameter just like beta, so it must be initialized before training. The usual initialization is gamma = 1, beta = 0, but for networks with residual connections such as ResNet, initializing the gamma of the last BN layer in each residual block to 0 improves training stability: every block then starts out as an identity mapping.
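In torchvision this is built in (`resnet50(zero_init_residual=True)`); done by hand it looks like this:

```python
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50()

# Zero-init the gamma of the LAST BatchNorm in each residual block, so that
# initially block(x) = x + 0 and the whole network is close to an identity map.
for m in model.modules():
    if isinstance(m, torchvision.models.resnet.Bottleneck):
        nn.init.zeros_(m.bn3.weight)  # bn3 is the last BN inside a Bottleneck
```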

 

1.4 No Bias Decay

According to 'Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes', applying L2 regularization (weight decay) only to the weights is more effective at avoiding overfitting. This paper does the same: decay is applied to nothing but weights, so neither the biases nor the gamma and beta of batch norm layers are decayed.
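A common PyTorch idiom for this, assuming a `model`, is to split parameters by dimensionality; biases and BN gamma/beta are 1-D tensors, while conv/FC weights are not:

```python
import torch

decay, no_decay = [], []
for name, p in model.named_parameters():
    # 1-D tensors are biases or BatchNorm gamma/beta: exclude them from decay.
    (no_decay if p.ndim <= 1 else decay).append(p)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9, nesterov=True)
```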

 

1.5 Low-precision Training

์ผ๋ฐ˜์ ์ธ neural network์—์„œ๋Š” 32-bit floating point(FP32) precision์„ ์ด์šฉํ•˜์—ฌ ํ•™์Šต์„ ์‹œํ‚ค๋Š”๋ฐ, ์ตœ์‹  ํ•˜๋“œ์›จ์–ด์—์„œ๋Š” lower precision(FP16) ๊ณ„์‚ฐ์ด ์ง€์›๋˜๋ฉด์„œ ์†๋„์—์„œ ์ด์ ์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, FP16์œผ๋กœ precision์„ ์ค„์ด๋ฉด ์ˆ˜๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ๋ฒ”์œ„๊ฐ€ ์ค„์–ด๋“ค๋ฉด์„œ ํ•™์Šต ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Mixed precision training์„ ์ ์šฉํ•˜์—ฌ network๋ฅผ ํ•™์Šต์‹œํ‚ต๋‹ˆ๋‹ค.

 

2. Model Tweaks

Model Tweaks์—์„œ๋Š” ResNet architecture์˜ ๋งˆ์ด๋„ˆํ•œ ์ˆ˜์ •์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ขŒ์ธก ๊ทธ๋ฆผ์€ vanilla ResNet ๊ตฌ์กฐ์ด๊ณ , ์šฐ์ธก์˜ ResNet-B, ResNet-C, ResNet-D๋Š” ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•˜๋Š” ๊ตฌ์กฐ์ด๋ฉฐ resnet์˜ input stem ๋˜๋Š” down sampling ๊ตฌ์กฐ๋ฅผ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค. 

- ResNet-B

ResNet์˜ downsampling module์„ ์ผ๋ถ€ ๋ณ€๊ฒฝํ•˜๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.  Path A์—์„œ convolution์ด stride 2์ธ 1x1์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— input feature map์˜ ์ผ๋ถ€๋ฅผ ๋ฌด์‹œํ•ฉ๋‹ˆ๋‹ค.(1x1 ์ธ๋ฐ stride2์ด๋ฉด ํ•œ์นธ์”ฉ ๊ฑด๋„ˆ๋›ฐ๋ฉด์„œ ๋ณด๋‹ˆ๊นŒ) ResNet-B๋Š” path A์˜ ์ฒ˜์Œ ๋‘ convolution ์˜ stride๋ฅผ ๋ณ€๊ฒฝํ•˜์—ฌ ์ด๋ฅผ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค. 1x1์€ stride 1๋กœ, 3x3 ์€ stride 2๋กœ ๋ณ€๊ฒฝํ•˜๋Š”๋ฐ 3x3์€ stride 2๋”๋ผ๋„ filter๊ฐ€ ๊ฒน์น˜๋Š” ๋ถ€๋ถ„์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฌด์‹œ๋˜๋Š” feature map ๊ตฌ๊ฐ„์ด ์—†์–ด์ง‘๋‹ˆ๋‹ค.

 

- ResNet-C

์ด ๋ฐฉ๋ฒ•์€ inception-v2์—์„œ ์ฒ˜์Œ ์ œ์•ˆ๋˜์—ˆ๊ณ , SENet, PSPNet, DeepLabV3, ShuffleNet ๋“ฑ์—์„œ๋„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. convolution์˜ ๊ณ„์‚ฐ๋น„์šฉ์ด kernel width ๋˜๋Š” height์— quadratic์ž…๋‹ˆ๋‹ค.(7x7 kernel์€ 3x3 kernel์— ๋น„ํ•ด 5.4๋ฐฐ์ž…๋‹ˆ๋‹ค.) ๋”ฐ๋ผ์„œ ์ด tweak์—์„œ๋Š” ResNet ์ œ์ผ ์ฒซ convolution์ธ 7x7 kernel์˜ convolution์„ 3๊ฐœ์˜ 3x3 convolution ์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. 

 

- ResNet-D

ResNet-B์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ downsampling ๋ธ”๋ก์˜ path B์— ์žˆ๋Š” 1x1 convolution ๋˜ํ•œ input feature map์˜ ์ผ๋ถ€๋ฅผ ๋ฌด์‹œํ•ฉ๋‹ˆ๋‹ค. feature ๊ฐ€ ๋ฌด์‹œ๋˜์ง€ ์•Š๋„๋ก ๋ณ€๊ฒฝํ•˜๊ธฐ ์œ„ํ•ด stride๊ฐ€ 1๋กœ ๋ณ€๊ฒฝ๋œ 1x1 convolution ์ด์ „์— 2x2 average pooling layer๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ด์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์ด ๊ฒฝํ—˜์ ์œผ๋กœ ์ž˜ ๋™์ž‘ํ•˜๊ณ  computational cost์— ๊ฑฐ์˜ ์˜ํ–ฅ์„ ๋ผ์น˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•ฉ๋‹ˆ๋‹ค. 

Simply changing the 1x1 conv from stride 2 to stride 1 would roughly quadruple that convolution's cost (its output would be twice as large in each spatial dimension), so the average pooling does the spatial downsampling first and the stride-1 1x1 conv then runs on the smaller feature map.
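The ResNet-D shortcut (path B) in code:

```python
import torch.nn as nn

def shortcut_d(in_ch, out_ch):
    # Average-pool first (downsamples using ALL spatial positions),
    # then a stride-1 1x1 conv on the already-halved feature map.
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )
```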

 

Experimental Results of Model Tweaks

In the paper's results, ResNet-D adds a small amount of FLOPs but improves Top-1 accuracy by about 1% over the vanilla model.

 

 

3. Training Refinements

์ด ํŒŒํŠธ์—์„œ๋Š” ์„ฑ๋Šฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ์•„๋ž˜ 4๊ฐ€์ง€ ๋ฐฉ๋ฒ•๋“ค์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

 

3.1 Cosine Learning Rate Decay

This schedule varies the learning rate with a cosine function: eta_t = 1/2 * (1 + cos(t * pi / T)) * eta, where eta is the initial learning rate, t is the current epoch, and T is the total number of epochs. Compared with step decay it decreases slowly at the start and the end, and the paper reports that it can improve final accuracy.
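In code (PyTorch's `CosineAnnealingLR` scheduler implements the same curve):

```python
import math

def cosine_lr(t, total_epochs=120, base_lr=0.1):
    """eta_t = 0.5 * (1 + cos(t * pi / T)) * eta, as in the paper."""
    return 0.5 * (1 + math.cos(t * math.pi / total_epochs)) * base_lr

# Equivalent scheduler, assuming an existing `optimizer`:
# torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)
```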

 

3.2 Label Smoothing

๊ธฐ์กด์—๋Š” classification task์—์„œ network๋ฅผ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ์ •๋‹ต์€ 1 ๋‚˜๋จธ์ง€๋Š” 0์ธ one-hot vector๋ฅผ label๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Label smoothing์€ 1,0 ๋Œ€์‹  smoothing ๋œ label์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.  

 

3.3 Knowledge Distillation

Knowledge distillation์€ ์„ฑ๋Šฅ์ด ์ข‹์€(ํ•™์Šต๋œ) teacher model๋กœ parameter ๊ฐ€ ๋” ์ ๊ณ  ์—ฐ์‚ฐ๋Ÿ‰์ด ์ ์€ student model์„ ํ•™์Šต์‹œ์ผœ์„œ teacher model์˜ ์„ฑ๋Šฅ์„ ๋”ฐ๋ผ๊ฐ€๋„๋ก ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

The paper uses ResNet-152 as the teacher and ResNet-50 as the student.
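The distillation loss adds a temperature-softened teacher/student term to the ordinary cross entropy; the paper sets the temperature to T = 20. A sketch (the KL term below differs from the softened cross entropy only by a constant, the teacher's entropy, so the gradients match):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=20.0):
    hard = F.cross_entropy(student_logits, targets)          # loss on true labels
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean")                   # match the teacher
    return hard + (T * T) * soft  # T^2 keeps the gradient scale comparable
```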

 

3.4 Mixup Training

Mixup augmentation์€ image space์—์„œ ๋‘ image์™€ label์„ weighted linear interpolationํ•˜์—ฌ ์ƒˆ๋กœ์šด ์ƒ˜ํ”Œ์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋ฉฐ weight ๋น„์œจ์— ๋”ฐ๋ผ ์ •๋‹ต label์˜ ๋น„์œจ๋„ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

 

Experimental Results of Training Refinements

Applied cumulatively on top of the tweaked ResNet-50-D, these four refinements account for the rest of the roughly 4% Top-1 improvement quoted in the introduction.

 

 

Transfer Learning - Object detection, Semantic segmentation

์œ„์—์„œ classification์— ์ ์šฉํ–ˆ๋˜ ์—ฌ๋Ÿฌ ๋ฐฉ๋ฒ•๋“ค์„ object detection, semantic segmentation์— ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์‹คํ—˜๊ฒฐ๊ณผ๋Š” ๋Œ€๋ถ€๋ถ„ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์ง€๋งŒ segmentation task์—์„œ๋Š” label smoothing ๊ณผ mixup ๋“ฑ์€ ์˜คํžˆ๋ ค ์•ˆ์ข‹์€ ํšจ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.

 

 

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฌ๋Ÿฌ ๋งŽ์€ ๋…ผ๋ฌธ์—์„œ ์†Œ๊ฐœ๋˜์—ˆ๋˜ deep neural network ์˜ training ๊ธฐ๋ฒ•๋“ค์„ ํ•˜๋‚˜ํ•˜๋‚˜ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๋ก ์ ์œผ๋กœ classification task์—์„œ network๋ฅผ develop ํ•˜๋Š” ๊ฒƒ ์ด์™ธ์˜ ์—ฌ๋Ÿฌ ๋ฐฉ๋ฒ•๋ก ๋“ค์ด network์˜ ์„ฑ๋Šฅ ๋ฐ ํšจ์œจ์„ฑ์„ ๋†’์—ฌ์ค€๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋˜ํ•œ classification ์ด ์•„๋‹Œ ๋‹ค๋ฅธ task์—๋„ ์ ์šฉ๋˜๋Š” ๋ฐฉ๋ฒ•๋ก ๋“ค์ด ๋งŽ์œผ๋‹ˆ, ๋”ฅ๋Ÿฌ๋‹์œผ๋กœ ๋ชจ๋ธ์„ training ์‹œ์ผœ์•ผํ•˜๋Š” ๊ฒฝ์šฐ ๊ผญ ์ฐธ๊ณ ํ•˜๋ฉด ์ข‹์€ ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค.

๋ฐ˜์‘ํ˜•