[AI/LLM] Transformer Encoder and Decoder Made Easy

2024. 11. 6. 21:49 · 🏛 Research/Large-scale Model

The Transformer architecture splits broadly into an Encoder, which compresses and interprets the input, and a Decoder, which generates a new sequence from that representation. This post summarizes the mathematical design intent and computational characteristics of each component, and the differences among the representative model families built on them (BERT, T5, GPT).


1. Encoder

The encoder works out how each token of the input sequence $X = \{x_1, x_2, \dots, x_n\}$ relates to every other token in the sentence, and produces context-rich latent vectors, one per token.

① Multi-Head Self-Attention (MHSA)

The core of the encoder is a bidirectional operation in which every token attends to all positions at once. MHSA with $h$ heads captures context in $h$ different subspaces.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$
$$\text{where head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
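
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the toy sizes (`n`, `d_model`, `h`), the random weights, and the helper names `softmax`, `attention`, and `multi_head_self_attention` are all assumptions made for the example, and biases, dropout, and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n); each row sums to 1
    return weights @ V, weights

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V are lists of h per-head matrices of shape (d_model, d_k).
    heads = [attention(X @ wq, X @ wk, X @ wv)[0]
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the heads along the feature axis, then mix them with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4          # toy sizes, chosen only for illustration
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
_, weights = attention(X @ W_Q[0], X @ W_K[0], X @ W_V[0])
print(out.shape, weights.shape)   # (5, 16) (5, 5)
```

Each head holds an $n \times n$ weight matrix, and every row of it covers all $n$ positions, which is what makes the encoder bidirectional.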

This operation connects any two positions in the sequence through a path of length $O(1)$, which makes long-range dependencies easier to learn. From a research engineering (RE) standpoint, the attention matrix costs $O(n^2)$ memory, so kernel-level optimizations such as FlashAttention are usually worth considering.

② Residual Connection & Layer Normalization (Pre-LN vs Post-LN)

Each sublayer (attention, FFN) is wrapped in a residual connection and layer normalization. The two common variants differ in where the normalization is applied.

  • Post-LN (original Transformer): $\text{LayerNorm}(x + \text{Sublayer}(x))$. It can reach strong final performance, but gradients can be unstable early in training.
  • Pre-LN (the de facto standard in recent LLMs): $x + \text{Sublayer}(\text{LayerNorm}(x))$. It trains more stably, so it is the usual choice in modern foundation models that stack very deep networks (ViT, GPT-3, etc.).
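
The difference between the two orderings is easiest to see side by side. The sketch below is only illustrative: `layer_norm` omits the learned scale and shift, and `sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # The learned scale and shift are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN: normalize after adding the residual.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize only the sublayer input; the residual path is untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(t):
    # Hypothetical stand-in for attention or the FFN, bounded by tanh.
    return np.tanh(t @ W)

x = rng.normal(size=(4, 8)) * 10.0   # deliberately large-magnitude input
post = post_ln_block(x, sublayer)
pre = pre_ln_block(x, sublayer)
```

In the Post-LN output every token vector is renormalized, while in the Pre-LN output `x` passes through the block unchanged and only the sublayer's contribution is added. That identity path is a commonly cited reason for Pre-LN's more stable gradients in deep stacks.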

 

2. Decoder

The decoder is an auto-regressive model: it predicts the conditional probability distribution of the next token given the tokens generated so far. Unlike the encoder, it has extra structure to preserve causality.

① Masked Multi-Head Attention

The first sublayer of each decoder block applies causal masking so that position $t$ cannot attend to tokens after $t$ (future information).

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right)V$$
  • $M_{ij} = 0$ (if $i \ge j$), $M_{ij} = -\infty$ (if $i < j$)
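
A small NumPy sketch of the masked attention formula follows. As a simplifying assumption it uses $Q = K = V = X$ with no learned projections and a single head; `causal_mask` and `masked_attention` are names made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # M[i, j] = 0 where i >= j, and -inf where i < j (future positions).
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = (Q @ K.T + causal_mask(Q.shape[0])) / np.sqrt(d_k)
    weights = softmax(scores)        # exp(-inf) = 0, so future weights vanish
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
out, weights = masked_attention(X, X, X)

# Change only the last token and recompute.
X2 = X.copy()
X2[-1] = rng.normal(size=d)
out2, _ = masked_attention(X2, X2, X2)

# Outputs at earlier positions are unaffected by the changed future token.
print(np.allclose(out[:-1], out2[:-1]))   # True
```

The weight matrix is lower triangular, so the output at position $i$ depends only on positions $j \le i$.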

With this mask, training can feed the whole target sentence in at once (teacher forcing) while still forcing each position to predict the next word from past information only.

② Encoder-Decoder Cross-Attention

This is the key bridge connecting the encoder and the decoder.

  • Query: computed from the output of the decoder's masked self-attention. ("What information does the context generated so far need?")
  • Key, Value: computed from the encoder's final output vectors (the memory). ("What information does each part of the input sentence hold?")
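
The roles of Query, Key, and Value listed above can be sketched as below. The shapes and random weights are illustrative assumptions, and `cross_attention` is a hypothetical helper, not an API from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, memory, W_Q, W_K, W_V):
    Q = dec_states @ W_Q     # (m, d_k): queries come from the decoder side
    K = memory @ W_K         # (n, d_k): keys come from the encoder output
    V = memory @ W_V         # (n, d_k): values come from the encoder output
    # No causal mask here: every decoder position may see the whole input.
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (m, n)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d_model, d_k = 6, 3, 16, 8     # n input tokens, m output tokens so far
memory = rng.normal(size=(n, d_model))       # encoder final output
dec_states = rng.normal(size=(m, d_model))   # masked self-attention output
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(dec_states, memory, W_Q, W_K, W_V)
print(weights.shape)   # (3, 6): one row of input-alignment weights per output step
```

Because `K` and `V` depend only on `memory`, they stay the same for every decoding step of a given input.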

Through this operation the decoder focuses on (aligns with) specific regions of the input sentence at each step of the output sequence. For an RE trying to slim a model down, the point to keep in mind is that every decoding step attends over the entire encoder output, which is costly. The encoder-side keys and values do not change during generation, so computing them once and keeping them in a KV cache is the first optimization to look at.

③ Feed-Forward Network (FFN) and the Inference Bottleneck

Each decoder layer ends with a position-wise feed-forward network (FFN). Recent models use GELU or SwiGLU instead of ReLU as the activation to increase nonlinear expressiveness. At inference time every newly generated token has to pass through all of these layers in turn, so throughput is lower than for an encoder, which processes the whole sequence in parallel.
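
As a rough sketch of the two activation choices, the block below implements a GELU FFN (using the common tanh approximation) and a SwiGLU FFN. Biases are omitted, and the weight names `W1`, `W2`, `W_gate`, `W_up`, `W_down` and the toy sizes are assumptions made for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU (Swish with beta = 1): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def ffn_gelu(x, W1, W2):
    # Expand to the hidden width, apply GELU, project back.
    return gelu(x @ W1) @ W2

def ffn_swiglu(x, W_gate, W_up, W_down):
    # SwiGLU: a SiLU-activated gate multiplies a second linear projection.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 16, 32         # toy sizes for illustration
X = rng.normal(size=(n, d_model))
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))

y_gelu = ffn_gelu(X, W1, W2)
y_swiglu = ffn_swiglu(X, W_gate, W_up, W_down)

# "Position-wise": running one token alone gives the same result as in the batch.
single = ffn_gelu(X[2:3], W1, W2)[0]
print(np.allclose(single, y_gelu[2]))   # True
```

The FFN itself has no dependence between positions. The sequential cost at inference comes from generating tokens one after another, not from the FFN.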

 

3. Structural Variants and Comparison by Model

Depending on the goal, the Transformer architecture has evolved by specializing in particular parts.

| Category | Encoder-Only (BERT family) | Encoder-Decoder (T5/BART family) | Decoder-Only (GPT family) |
|---|---|---|---|
| Core purpose | Natural language understanding (NLU) | Sequence transduction (Seq2Seq) | Natural language generation (NLG) |
| Training objective | Masked LM (bidirectional) | Text-to-Text (mixed) | Causal LM (unidirectional) |
| Strengths | Very strong contextual understanding | Flexible input/output formats, general-purpose | Excellent long-context retention and generation quality |
| Weaknesses | Not suited to free-form text generation | Potentially lower parameter and compute efficiency | Relatively limited use of bidirectional information |

① Encoder-Only (e.g., BERT)

BERT is trained with Masked Language Modeling (MLM): words inside a sentence are hidden and the model predicts them from the surrounding words. It is well suited to tasks that need a deep grasp of sentence meaning, such as text classification and named entity recognition (NER).
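
A simplified sketch of how MLM training pairs might be built is shown below. It is an assumption-laden toy: `mask_tokens` is a hypothetical helper, tokenization is plain whitespace splitting, and every selected token is replaced with `[MASK]`. The original BERT recipe selects about 15% of tokens and additionally leaves some selected tokens unchanged or swaps them for random tokens, which this sketch omits.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Returns (inputs, labels). labels[i] holds the original token where the
    # input was masked, and None where the position is ignored by the loss.
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

tokens = "the encoder reads the whole sentence in both directions".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)

for original, inp, lab in zip(tokens, inputs, labels):
    # A masked position keeps its answer in labels; other positions are untouched.
    assert (inp == MASK and lab == original) or (inp == original and lab is None)
```

Because the encoder is bidirectional, the prediction at a masked position can use words on both its left and its right.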

② Encoder-Decoder (e.g., T5)

T5 casts every NLP problem into a single "Text-to-Text" framework. The encoder interprets the question and the decoder generates the answer; this division of labor gives strong performance on translation and summarization.

③ Decoder-Only (e.g., the GPT series)

This is the mainstream structure of recent large language models (LLMs). There is no separate encoder; the input prompt itself is processed as the decoder's initial input. The design has scaled well as model size grows, in line with scaling laws, and shows strong few-shot learning ability.

 

4. Computational Efficiency

When designing or serving a model, the most important things to weigh are computational complexity and throughput.

  1. Encoder parallelism: in both training and inference, the whole sequence is processed in a single forward pass, so processing is very fast.
  2. Decoder sequentiality: inference generates one token at a time, so latency grows linearly with the number of generated tokens. Techniques such as the KV caching mentioned above and speculative decoding are commonly used to mitigate this.
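
To make the KV caching idea concrete, here is a minimal single-head sketch that decodes token by token with a cache and checks the result against full causal attention. `KVCache` and `full_causal_attention` are hypothetical names for this example; real serving stacks cache per layer and per head, and handle batching and memory layout, none of which is modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_causal_attention(X, W_Q, W_K, W_V):
    # Reference: recompute everything for the whole sequence at once.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n = X.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores + np.triu(np.full((n, n), -np.inf), k=1)
    return softmax(scores) @ V

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, W_Q, W_K, W_V):
        # Project only the new token; keys/values of earlier tokens are reused.
        self.keys.append(x_t @ W_K)
        self.values.append(x_t @ W_V)
        K = np.stack(self.keys)       # (t, d_k)
        V = np.stack(self.values)     # (t, d_k)
        q = x_t @ W_Q                 # (d_k,)
        w = softmax(K @ q / np.sqrt(q.shape[-1]))   # weights over the t cached positions
        return w @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8
X = rng.normal(size=(n, d_model))     # stand-in for token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

full = full_causal_attention(X, W_Q, W_K, W_V)
cache = KVCache()
cached = np.stack([cache.step(x_t, W_Q, W_K, W_V) for x_t in X])
print(np.allclose(full, cached))      # True
```

The number of sequential steps is still one per token, so the cache does not remove the linear growth in latency. What it removes is the repeated key and value projection of all earlier tokens at every step.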

The choice of Transformer variant should follow from the problem domain and the serving environment. For understanding-centered tasks, an Encoder-based model is efficient; for complex reasoning or creative generation, a large Decoder-based model has the advantage. More recently, Transformer-based backbones such as DiT (Diffusion Transformer) have also reported state-of-the-art (SOTA) results in image and video generation.


