[NLP] TF-IDF Explained | Computing Statistical Weights for Text Data | How Important a Word Is Within a Document

2023. 9. 22. 14:23 · 📖 Fundamentals/NLP
TF-IDF(Term Frequency-Inverse Document Frequency)

 

TF-IDF is a statistical weighting measure widely used in information retrieval and text mining. It measures how important a specific word is within a document, and it is used in a range of natural language processing tasks such as search engines, document classification, and text summarization.

 

TF-IDF is computed as the product of the following two components, TF and IDF.

1. TF (Term Frequency)

Measures how often a specific word appears within a specific document. The more often a word appears in the document, the higher its TF value. TF is computed as follows.

TF(w) = (number of times word w appears in the document) / (total number of words in the document)
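
As a minimal sketch of the TF formula above, assuming the document has already been split into a list of tokens (the function name `term_frequency` and the sample sentence are made up for illustration):

```python
from collections import Counter

def term_frequency(tokens):
    """Return {word: occurrences / total tokens} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}

# Illustrative document: "the" appears 2 times out of 6 tokens.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
```

Here `tf["the"]` is 2/6 and every other word is 1/6, so the values for one document always sum to 1.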

 

2. IDF (Inverse Document Frequency)

Measures how rare a specific word is across the whole collection of documents. IDF reflects how informative the word is: the more documents a word appears in, the lower its IDF value. It is computed as follows.

IDF(w) = log(total number of documents / number of documents containing word w)
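
The IDF formula can be sketched in the same style. This assumes the natural logarithm and a corpus given as a list of token lists; the function name and the three sample documents are hypothetical:

```python
import math

def inverse_document_frequency(corpus):
    """corpus: list of token lists. Returns {word: log(N / document frequency)}."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for word in set(tokens):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Illustrative corpus of three tiny documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
idf = inverse_document_frequency(corpus)
```

In this corpus "the" appears in all three documents, so its IDF is log(3/3) = 0, while "bird" appears in only one and gets the largest value, log(3).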

 

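The logarithm dampens the IDF value so that very rare words do not dominate the score. Note that the logarithm alone does not prevent a zero denominator when a word appears in none of the documents; in practice this is handled by smoothing, typically by adding 1 to the denominator.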
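
One common way to avoid a zero denominator is to add 1 to the document count. The sketch below shows that variant only as an assumption about how smoothing might be done; it is not a formula given in this post, and libraries differ in the exact form they use. The function name and sample corpus are hypothetical.

```python
import math

def smoothed_idf(word, corpus):
    """log(N / (1 + df)): adding 1 keeps the denominator non-zero."""
    n_docs = len(corpus)
    doc_freq = sum(1 for tokens in corpus if word in tokens)
    return math.log(n_docs / (1 + doc_freq))

corpus = [["the", "cat"], ["the", "dog"], ["a", "bird"]]
unseen = smoothed_idf("zebra", corpus)  # df = 0, denominator becomes 1
common = smoothed_idf("the", corpus)    # df = 2, log(3 / 3) = 0
```

A side effect of this particular variant is that a word appearing in every document gets a slightly negative value, which is one reason other smoothing forms also add a constant to the numerator or to the result.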

TF-IDF is computed by multiplying TF and IDF.

TF-IDF(w) = TF(w) * IDF(w)
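
Putting the two parts together, a minimal sketch might look like the following. It assumes the unsmoothed formulas above, the natural logarithm, and a pre-tokenized corpus; the function name `tf_idf` and the sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: TF * IDF} dict per document in the corpus."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    scores = []
    for tokens in corpus:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(corpus)
```

Although "the" is the most frequent word in the first document, its score is 0 because it occurs in every document, whereas "mat" scores (1/6) * log(3).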

 

TF-IDF assigns a score to each (document, word) pair, indicating how important that word is in that document. For example, if a word appears frequently in one document but rarely in the others, its TF-IDF value in that document is relatively high. These scores can be used to measure word importance and similarity between documents and to improve search results.
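
To make the example concrete, here is a small worked calculation. All numbers (a 100-word document, a collection of 1000 documents, the word counts) are hypothetical and chosen only for illustration; the natural logarithm is assumed.

```python
import math

# Hypothetical numbers, chosen only for illustration.
doc_length = 100    # words in the document
total_docs = 1000   # documents in the collection

def tf_idf(count_in_doc, docs_with_word):
    tf = count_in_doc / doc_length
    idf = math.log(total_docs / docs_with_word)
    return tf * idf

# Appears 5 times here but in only 10 documents overall.
rare_word = tf_idf(count_in_doc=5, docs_with_word=10)
# Appears 5 times here but also in every document of the collection.
common_word = tf_idf(count_in_doc=5, docs_with_word=1000)
```

Both words have the same TF (0.05), but the rare word scores about 0.23 while the word found in every document scores 0.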

 

Computing Document Similarity with TF-IDF


A typical procedure for comparing document similarity with TF-IDF is as follows.

 

1. Collect and preprocess the documents

  • Collect the documents to compare and preprocess the text
  • This step includes tokenization, lowercasing, and removal of punctuation and stopwords
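
The preprocessing step could be sketched as follows for English text. The stopword set here is a tiny made-up list, and the regular expression is only one simple way to tokenize; real projects usually rely on a fuller stopword list and a language-appropriate tokenizer.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "on", "of", "and"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)  # punctuation is discarded here
    return [token for token in tokens if token not in STOPWORDS]

tokens = preprocess("The cat sat on the mat, and the dog barked!")
```

For the sample sentence this leaves `["cat", "sat", "mat", "dog", "barked"]`.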

 

2. Build the TF-IDF vectors

  • Represent each document as a TF-IDF vector
  • Each vector has one feature (dimension) per word in the vocabulary of the collection
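
A minimal sketch of turning tokenized documents into vectors, assuming the unsmoothed TF and IDF formulas from earlier and a vocabulary sorted alphabetically so that every vector uses the same dimension order (the function name and sample corpus are hypothetical):

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Return (vocabulary, list of dense TF-IDF vectors), one per document."""
    vocab = sorted({word for tokens in corpus for word in tokens})
    n_docs = len(corpus)
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in vocab
        ])
    return vocab, vectors

corpus = [["cat", "sat", "mat"], ["dog", "chased", "cat"], ["bird", "sang"]]
vocab, vectors = tfidf_vectors(corpus)
```

Every vector has the same length as the vocabulary, and a word that does not occur in a document contributes 0 in that position. For large collections a sparse representation is usually preferable to dense lists.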

 

3. Compute similarity between documents

  • Choose a similarity measure for comparing the documents
  • Cosine similarity is the usual choice
  • Cosine similarity measures similarity by the angle between two vectors
  • The cosine similarity between two documents A and B is computed as follows (in general the result lies between -1 and 1; when the TF-IDF weights are non-negative, as with the formulas above, it lies between 0 and 1, and the closer it is to 1, the more similar the two documents are judged to be)
Cosine Similarity(A, B) = (dot product of vectors A and B) / (magnitude of vector A * magnitude of vector B)
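
The cosine similarity formula can be sketched directly on plain lists of numbers. Returning 0.0 for an all-zero vector is a convention chosen here for illustration, since the formula is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # no shared terms
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors with no terms in common score 0.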



4. Measure similarity and rank

  • Compute the similarity of each document pair with the chosen measure
  • After computing the similarity for all pairs, look up the similarity between the document of interest and each candidate document
  • Select the candidate with the highest similarity, or rank the candidates by similarity
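
The ranking step can be sketched as below. The vectors are made-up numbers standing in for TF-IDF vectors, and `rank_by_similarity` is a hypothetical helper name; the cosine function repeats the formula from step 3 so the block runs on its own.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up vectors standing in for TF-IDF vectors over a 3-word vocabulary.
doc_vecs = [[0.0, 0.5, 0.1], [0.4, 0.0, 0.0], [0.0, 0.4, 0.2]]
query = [0.0, 0.3, 0.1]
ranking = rank_by_similarity(query, doc_vecs)
best_index = ranking[0][0]
```

With these numbers document 0 ranks first, document 2 second, and document 1 last with a similarity of 0 because it shares no non-zero dimension with the query.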

 

In this way, TF-IDF can be used to compare the similarity of documents, which is useful in applications such as information retrieval, document clustering, and recommender systems.

 

 

Related repositories

- TF-IDF example: https://github.com/mayank408/TFIDF

- Document similarity measurement: https://github.com/malteos/awesome-document-similarity
