[Paper Introduction] TAM (Track Anything Model) | A Vision AI Model That Tracks Anything | The Video Version of Segment Anything
Research/Detection & Segmentation
Track Anything: Segment Anything Meets Videos. The world moves fast. Meta AI's SAM (Segment Anything Model) has barely been out, and a paper on TAM (Track Anything Model), which applies SAM to video to perform tracking tasks, has already appeared. Track-Anything is a flexible, interactive tool for video object tracking and segmentation. It is built on Segment Anything, and the items to track and segment can be specified with user clicks alone. During tracking, users can flexibly change the object they want to track, or correct the region of interest where there is ambiguity. With these characteristics, Track-Anything can be used for tasks such as..
[Paper Introduction] DINOv2 - Self-supervised Vision Transformer | Meta AI | A Vision AI Model That Performs Strongly Without Labeled Data
Research/Detection & Segmentation
DINOv2. Paper title: DINOv2: Learning Robust Visual Features without Supervision. In April 2023, Meta AI released DINOv2, a new method for training high-performance computer vision models with self-supervised learning. Self-supervised learning, which is also used to train LLMs (Large Language Models), is a powerful and flexible way to train AI models because it does not require large amounts of labeled data. According to the paper, the multimodal approach of training on image-text pairs, which has been the standard approach to computer vision tasks in recent years, relies on the caption information of the image..
[Technology Introduction] 3D Object Scanning | MVS | Object Scanning | Real-Time 3D Object Reconstruction
Research/3D Vision
3D Object Scanning. 3D Object Scanning is a technique that reconstructs the 3D shape of an object using multi-view stereo (MVS). As the video below shows, the company Niantic has added a fast non-lidar scanning tool to its Unity SDK that lets users scan objects in real time. The object is captured from various angles with a smartphone and then reconstructed, and the quality looks quite good. Apps such as RealityScan also make it easy to try 3D scanning on a smartphone. Example of Niantic's Object Scanning. Example result from RealityScan - 3D Scanning App. Source: https://sketchfab.com/3d-models..
[Technology Introduction] Text-to-Image Generation | Image Generation AI | DALL-E | GPT | dVAE
Research/Generative AI
Text to Image Generation. Text-to-image generation is a technique that takes text as input and generates an image corresponding to that text. Driven by advances in deep learning, development began in the mid-2010s, and in 2022 the output of state-of-the-art text-to-image models such as OpenAI's DALL-E 2, Google Brain's Imagen, and StabilityAI's Stable Diffusion began to approach the quality of real photographs and human-made art. Text-to-image generation is generally implemented by training a generative model, such as a GAN (Generative Adversarial Network), on a dataset of paired text and images. For example, "..
[Paper Review] Character Region Awareness for Text Detection / CRAFT / Text Detection
Research/OCR
This paper, presented by Naver Clova at CVPR 2019, is a text detection paper that proposes a model called CRAFT. It is a very well-known paper in the text detection field, and personally I find it an appealing piece of research that makes very efficient use of both the characteristics of text and the learning characteristics of deep learning. Detailed explanations are already available on other blogs, so I will summarize only the parts that are essential for training the model. The core of the CRAFT model: rather than directly predicting word bounding boxes, CRAFT predicts a region score, which indicates the location of each character, and an affinity score, which indicates the spacing between adjacent characters. This requires character-level annotation, but each individual character..
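The region and affinity score maps described above can be sketched as training targets. This is a minimal illustration, not the authors' code: it assumes axis-aligned character boxes given as (x0, y0, x1, y1) tuples, whereas the paper warps a Gaussian onto each character quadrilateral. The function names, the box format, and sigma_ratio are hypothetical choices made for this example. The affinity box follows the paper's construction, using the centers of each character's upper and lower triangles as corners, and is then simplified to its axis-aligned bounds.

```python
import math


def render_score_map(width, height, boxes, sigma_ratio=0.25):
    """Render one Gaussian blob per box into a height x width score map.

    boxes: list of (x0, y0, x1, y1) axis-aligned boxes (assumed format).
    sigma_ratio: blob spread as a fraction of the box size (assumed value).
    """
    score = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        # Guard against zero-size boxes so the division below is safe.
        sx = max((x1 - x0) * sigma_ratio, 1e-6)
        sy = max((y1 - y0) * sigma_ratio, 1e-6)
        for y in range(max(0, math.floor(y0)), min(height, math.ceil(y1) + 1)):
            for x in range(max(0, math.floor(x0)), min(width, math.ceil(x1) + 1)):
                g = math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
                # Keep the strongest response where blobs overlap.
                score[y][x] = max(score[y][x], g)
    return score


def affinity_box(box_a, box_b):
    """Axis-aligned bounds of the affinity box between two adjacent characters."""

    def triangle_centers(box):
        x0, y0, x1, y1 = box
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        # Centroids of the upper and lower triangles cut out by the diagonals.
        return (cx, (2 * y0 + cy) / 3.0), (cx, (2 * y1 + cy) / 3.0)

    corners = [*triangle_centers(box_a), *triangle_centers(box_b)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Two adjacent characters of one word (made-up coordinates).
chars = [(2, 2, 10, 12), (12, 2, 20, 12)]
region_map = render_score_map(24, 16, chars)
affinity_map = render_score_map(24, 16, [affinity_box(chars[0], chars[1])])
```

Each map peaks at the center of its box and falls off toward the edges, so the network is trained to regress a smooth heatmap rather than a hard box boundary.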
[Paper Review] What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels
Research/OCR
This text recognition paper was presented at CVPR 2021 and is by Jeonghun Baek, who also proposed the TRBA model ('What is wrong with scene text recognition model comparisons? dataset and model analysis'). Main content: in Scene Text Recognition (STR) research, real data is scarce, so models are generally trained on large-scale synthetic datasets. As a result, it was reportedly accepted as tacit common sense(?) that training an STR model on real data alone is nearly impossible. The paper, however, argues that this assumption has held STR research back. The paper consolidates recently accumulated real datasets and the labeled real data..