
[Paper Review] Character Region Awareness for Text Detection / CRAFT / Text Detection

by 뭅즤, 2023. 3. 13.

This is a text detection paper that Naver Clova presented at CVPR 2019, and it proposes a model called CRAFT. It is very well known in the text detection field, and I personally find it an appealing piece of research because it makes efficient use of both the properties of text and the way deep learning models learn. Other blogs already explain the details well, so here I only summarize the parts that matter most for training the model.

 

Key Ideas of the CRAFT Model

  • Instead of predicting word bboxes directly, CRAFT predicts two per-pixel maps: a region score that marks where each character is, and an affinity score that marks the space between adjacent characters (that is, how characters link together into a word).
  • This requires character-level annotation, but drawing a bbox around every single character is painfully slow, so the model is trained with a weakly-supervised learning scheme that generates pseudo-GT.
  • When character-level bboxes are available, the region score and affinity score labels are generated from them with the algorithm illustrated in the paper's figure.
  • Of course, you still need to know which characters together form one word. In the figure's example, you need to know that p, e, a, c, e make up the single word "peace". (Strictly speaking the text itself is not needed, only the information that a particular group of character bboxes forms one word.)
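To make the label generation a bit more concrete, here is a minimal pure-Python sketch under simplifying assumptions: boxes are axis-aligned `(x0, y0, x1, y1)` tuples, the Gaussian width (a quarter of the box size) is an arbitrary choice, and the affinity box is approximated by the bounding rectangle of the upper and lower triangle centers of two neighboring character boxes. The paper handles arbitrary quadrilaterals by warping the Gaussian with a perspective transform, which this sketch skips. All function names are made up for illustration.

```python
import math


def gaussian_patch(height, width):
    """2D Gaussian sampled on a height x width grid, peaking at the grid center."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    # Assumption: sigma is a quarter of the box size along each axis.
    sy, sx = max(height, 1) / 4.0, max(width, 1) / 4.0
    return [
        [math.exp(-(((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2) / 2.0)
         for x in range(width)]
        for y in range(height)
    ]


def draw_box(score_map, box):
    """Paint a Gaussian into score_map over an axis-aligned box, keeping the max."""
    x0, y0, x1, y1 = box
    for dy, row in enumerate(gaussian_patch(y1 - y0, x1 - x0)):
        for dx, value in enumerate(row):
            y, x = y0 + dy, x0 + dx
            if 0 <= y < len(score_map) and 0 <= x < len(score_map[0]):
                score_map[y][x] = max(score_map[y][x], value)


def affinity_box(box_a, box_b):
    """Approximate affinity box between two adjacent character boxes.

    The diagonals of each character box split it into four triangles; the
    centers of the upper and lower triangles of both boxes span the affinity
    box. Here we simply take their bounding rectangle (a simplification).
    """
    def triangle_centers(box):
        x0, y0, x1, y1 = box
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        return (cx, (2 * y0 + cy) / 3.0), (cx, (2 * y1 + cy) / 3.0)

    (ax, a_top), (_, a_bottom) = triangle_centers(box_a)
    (bx, b_top), (_, b_bottom) = triangle_centers(box_b)
    return (round(min(ax, bx)), round(min(a_top, b_top)),
            round(max(ax, bx)), round(max(a_bottom, b_bottom)))


def make_score_maps(height, width, words):
    """words: list of words, each a left-to-right list of character boxes."""
    region = [[0.0] * width for _ in range(height)]
    affinity = [[0.0] * width for _ in range(height)]
    for char_boxes in words:
        for box in char_boxes:
            draw_box(region, box)
        # Only characters of the same word are linked, which is why the
        # character-to-word grouping is required.
        for left, right in zip(char_boxes, char_boxes[1:]):
            draw_box(affinity, affinity_box(left, right))
    return region, affinity
```

For example, `make_score_maps(12, 20, [[(0, 0, 10, 12), (10, 0, 20, 12)]])` builds maps for one two-character word: the region map has one peak per character, and the affinity map has a single peak in the gap between them.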

 

 

CRAFT Training Procedure


  1. Train an interim model on a dataset that includes character-level GT (Train with Synthetic Image).
  2. To improve the interim model, use it to generate character-level pseudo-GT annotations for a dataset that only has word-level annotation (Generate Pseudo-GT).
  3. Train the model on the character-level GT and the generated pseudo-GT together. Because pseudo-GT is not exact, training is weighted by a confidence score that reflects whether the number of characters was predicted correctly (weakly supervised learning) (Train with Real Image + Train with Synthetic Image).

 

* In practice, steps 2 and 3 run at the same time. Step 3 can be trained on real data only, or on synthetic + real data.

 

* Note: the paper describes the datasets as Synthetic Image and Real Image, but to be precise, "Synthetic Image" means a dataset with character-level GT and "Real Image" means a dataset with only word-level GT. Synthetic images naturally carry character information because they are generated, but a real-image dataset may have only word-level GT or may also have character-level GT. The paper's wording is accurate for the datasets it uses, but real-world datasets can differ, so don't let the terms confuse you.

 

 

Train with Synthetic Image

  • Train the interim model on the relatively scarce data that has character-level annotation.
  • This can be seen as a pre-training stage whose purpose is to make pseudo-GT generation possible.
  • The model has to learn to predict character positions (region score) and the links between characters (affinity score) reasonably well at this stage, or the later training will not work properly.
    • If the interim model predicts the region and affinity scores badly, the pseudo-GT generated from it afterwards will be even worse.

 

Generate Pseudo-GT
  • Generating pseudo-GT requires the word-level annotation (word bbox) and the text (strictly speaking, only the number of characters in each word).
  • The inference output of the interim model trained on Synthetic Image (with GT) is used as the pseudo-GT.
  • The interim model's output is too error-prone to use directly as a label, so each word gets a confidence score based on the predicted character count versus the actual character count.
    • e.g. for a 5-character word: predicted as 5 characters → confidence score = 5/5; predicted as 3 characters → confidence score = 3/5
    • If the confidence score is < 1/2, the predicted boxes are discarded and the word box is cut into equal-sized cells, which are used as the character bbox GT (in the paper, the confidence for such a word is then set to 0.5).
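The confidence rule above can be sketched in a few lines. This is a minimal sketch, assuming the confidence is (l − min(l, |l − l_c|)) / l, where l is the true character count and l_c the predicted count; this reproduces the 5/5 and 3/5 examples above. The function names, the axis-aligned `(x0, y0, x1, y1)` box format, and resetting the confidence to the threshold value in the fallback case are assumptions made for illustration.

```python
def word_confidence(word_length, predicted_count):
    """Confidence of a word's pseudo-GT from true vs. predicted character count."""
    if word_length <= 0:
        raise ValueError("word_length must be positive")
    miss = min(word_length, abs(word_length - predicted_count))
    return (word_length - miss) / word_length


def split_evenly(word_box, word_length):
    """Fallback: cut the word box into word_length equal-width character boxes."""
    x0, y0, x1, y1 = word_box
    step = (x1 - x0) / word_length
    return [(x0 + i * step, y0, x0 + (i + 1) * step, y1)
            for i in range(word_length)]


def pseudo_gt_for_word(word_box, word_length, predicted_boxes, threshold=0.5):
    """Return (character boxes, confidence) to use as pseudo-GT for one word."""
    confidence = word_confidence(word_length, len(predicted_boxes))
    if confidence < threshold:
        # Prediction is unreliable: ignore it and assume equal character widths.
        # Assumption: the confidence is reset to the threshold value here.
        return split_evenly(word_box, word_length), threshold
    return list(predicted_boxes), confidence
```

With a 5-character word, `word_confidence(5, 3)` gives 0.6, so the predicted boxes are kept; `word_confidence(5, 1)` gives 0.2, so the word falls back to five equal-width boxes.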

 

Train with Real Image & Train with Synthetic Image

For the sake of explanation I separated pseudo-GT generation from the training step, but in practice they run at the same time.

  • The model is trained on both the generated pseudo-GT and the existing GT data.
    • Datasets with character-level annotation -> trained with GT data
    • Datasets with word-level annotation -> trained with pseudo-GT data
    • The official CRAFT training code was not released, but the CRAFT training code published by EasyOCR splits the GPUs in half: one half trains the model on GT data, while the other half generates pseudo-GT and runs weakly-supervised learning.
  • The confidence score is applied to the pseudo-GT.
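One way to picture how the confidence score enters training is as a per-pixel weight on the squared error of both score maps. The sketch below assumes exactly that: pixels inside a word box take that word's confidence, every other pixel (and all of a character-level GT image) gets weight 1, and the loss is a plain weighted sum. It uses nested lists instead of tensors to stay dependency-free, the function names are hypothetical, and it leaves out anything a real training script would add on top, such as hard-negative mining.

```python
def confidence_map(height, width, words):
    """words: list of ((x0, y0, x1, y1), confidence) for pseudo-GT word boxes."""
    # Pixels outside every word box keep weight 1.0.
    conf = [[1.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), score in words:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                conf[y][x] = score
    return conf


def weighted_loss(pred_region, pred_affinity, gt_region, gt_affinity, conf):
    """Sum over pixels of confidence * (region error^2 + affinity error^2)."""
    total = 0.0
    for y in range(len(conf)):
        for x in range(len(conf[0])):
            region_err = (pred_region[y][x] - gt_region[y][x]) ** 2
            affinity_err = (pred_affinity[y][x] - gt_affinity[y][x]) ** 2
            total += conf[y][x] * (region_err + affinity_err)
    return total
```

For an image with character-level GT, passing `confidence_map(h, w, [])` gives all ones, so the same loss function covers both kinds of data.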

 

Experimental Results

  • As training progresses, the region score, which locates the characters, becomes visibly more expressive.
  • The interim model built in the pre-training stage has to reach a reasonably good level of expressiveness for the weakly supervised learning to succeed.

 

 

Korean Experiment Results

Training on Korean text as well, the model predicts the region score and affinity score quite well.
