[Paper Review] Zero-1-to-3: Zero-shot One Image to 3D Object | Single-view object reconstruction

2025. 3. 22. 01:38 · 🏛 Research/Image•Video Generation

1. Research Topic and Main Contributions

 

Zero-1-to-3 is a zero-shot framework that, from a single RGB image, generates images of the object from new camera viewpoints and can go on to perform 3D reconstruction. This problem has traditionally required multi-view images or 3D information. The main differentiator is that, by building on a large pretrained model such as Stable Diffusion, it drops those requirements and still generalizes to data it was never trained on.

 

✅ Main contributions

  • Learns conditional image-to-image translation with camera viewpoint control on top of Stable Diffusion
  • Proposes a viewpoint-conditioned diffusion model for zero-shot 3D reconstruction
  • Generalizes well across domains such as in-the-wild images and paintings, even though it is fine-tuned on Objaverse
  • Outperforms prior SOTA both quantitatively and qualitatively (PSNR, SSIM, FID, etc.)

 

2. Background and Trends

Humans can intuitively imagine the 3D structure of an object from a single viewpoint, but existing CV models have relied on rich annotations or on datasets restricted to a few categories. Recent work increasingly uses large-scale 3D datasets such as CO3D, but constraints such as the need for camera poses or stereo views remain. This paper starts from the observation that diffusion models trained on internet-scale data may have indirectly learned 3D priors from 2D images, and sets out to exploit them.

 

*3D prior: roughly, an empirical guess or statistical tendency about what kind of 3D structure objects in the world have.

✅ Related work

  • Text-to-image diffusion: models such as DALL-E and Stable Diffusion acquire rich semantic priors through large-scale training
  • 2D-based 3D generation: DreamFields pairs CLIP with NeRF, and DreamFusion distills a 2D text-to-image diffusion model into a NeRF, to produce implicit 3D representations
  • Single-view 3D reconstruction: predicts 3D shape as meshes, point clouds, voxels, etc., but suffers from weak generalization and pose alignment issues
  • View-conditioned generation: prior work did not demonstrate zero-shot generalization; this paper achieves strong zero-shot performance through controllable viewpoint translation

 

3. Main Proposal

The core goal of Zero-1-to-3 is, given a single RGB image, to generate a new image corresponding to a user-specified camera viewpoint (rotation R, translation T). Written as a formula, with \hat{x}_{R,T} denoting the generated image at the target viewpoint:

\hat{x}_{R,T} = f(x, R, T)

  • x: input RGB image
  • (R, T): relative camera rotation and translation of the desired viewpoint
  • f: the function (model) that generates the new-view image
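To make "relative camera rotation and translation" concrete, here is a minimal sketch of how (R, T) could be derived from the poses of two cameras. It assumes world-to-camera extrinsics of the form x_cam = R · x_world + T, which is a convention chosen for this illustration and not something the post specifies. The helper names (`relative_pose`, `rot_z`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def matvec(a, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_pose(R1, T1, R2, T2):
    """Transform that maps camera-1 coordinates to camera-2 coordinates.

    Assumption: both cameras use world-to-camera extrinsics,
    x_cam = R @ x_world + T.
    """
    R_rel = matmul(R2, transpose(R1))            # R2 * R1^-1
    rotated_T1 = matvec(R_rel, T1)
    T_rel = [T2[i] - rotated_T1[i] for i in range(3)]
    return R_rel, T_rel

def rot_z(deg):
    """Rotation about the z axis, used only to build example poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Two example cameras (made-up values for illustration).
R1, T1 = rot_z(0), [0.0, 0.0, 2.0]
R2, T2 = rot_z(90), [0.0, 0.0, 2.0]
R_rel, T_rel = relative_pose(R1, T1, R2, T2)
```

Applying `R_rel` and `T_rel` to a point expressed in the first camera's frame gives the same point in the second camera's frame, which is the kind of relative viewpoint change that (R, T) denotes in the list above.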

The problem is fundamentally under-constrained. Because there is only one input image, other sides of the object (e.g. the back or the sides) cannot be observed directly, so the answer is not unique: many plausible solutions can exist.

 

To address this, Zero-1-to-3 exploits the latent 3D prior of large pretrained image generation models, such as Stable Diffusion, that were trained on internet-scale data. Through billions of images, Stable Diffusion has already learned how a wide range of objects look from different angles and in different styles. The model has never seen 3D data directly, but it indirectly encodes 3D-like statistical regularities (priors) such as object shape, viewpoint change, and symmetry. The core idea of the paper is therefore that fine-tuning or controlling the diffusion model with the input image x and the camera transformation (R, T) as conditions lets it draw on this latent 3D knowledge to generate realistic images from new viewpoints.

 

 

✅ Learning viewpoint control

Zero-1-to-3 is built on Stable Diffusion's latent diffusion architecture. It is given the desired camera viewpoint as a condition alongside the input image, and learns a control mechanism (viewpoint control) for generating the image from that new viewpoint.

 

The paper keeps Stable Diffusion's basic encoder → U-Net → decoder architecture as is and fine-tunes only the U-Net to add viewpoint control. In other words, the rich image generation ability learned during pretraining is preserved, and only the ability to change the camera viewpoint is learned on top. The training objective is:

\min_\theta \mathbb{E}_{z \sim E(x),\, t,\, \epsilon \sim \mathcal{N}(0, 1)} \| \epsilon - \epsilon_\theta(z_t, t, c(x, R, T)) \|_2^2

  • E(x): latent representation of the input image
  • z_t: noisy latent at diffusion step t
  • ε: Gaussian noise
  • c(x, R, T): conditioning embedding that contains the input image and the camera transformation
  • ε_θ: the U-Net that predicts the noise

 

The loss function alone looks complicated, so here is the training procedure in summary:

  1. Render the image x_{R,T} at the target viewpoint (this acts as the GT)
  2. Encode that image into latent space (E(x_{R,T}))
  3. Mix in noise ε to get z_t
  4. Inputs to the model:
    • z_t: the noisy latent image
    • t: the noise level (which diffusion step)
    • c(x, R, T): joint embedding of the input image and the viewpoint change
  5. The model is trained to predict what that noise was (ε_θ)
  6. Repeating this, the model indirectly learns "what the input image x would look like from the (R, T) direction"
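The steps above can be sketched as a toy loss computation. This is a minimal pure-Python illustration, not the paper's training code. The linear beta schedule, the flat list standing in for the latent E(x_{R,T}), the `cond` dictionary, and the `zero_model` stand-in for the U-Net are all assumptions made for this example.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(z0, eps, alpha_bar_t):
    """Step 3: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def training_loss(z0_target, cond, eps_model, alpha_bars, rng):
    """One stochastic evaluation of the noise-prediction loss (steps 3 to 5)."""
    t = rng.randrange(len(alpha_bars))               # random diffusion step
    eps = [rng.gauss(0.0, 1.0) for _ in z0_target]   # Gaussian noise
    z_t = add_noise(z0_target, eps, alpha_bars[t])   # noisy latent
    eps_pred = eps_model(z_t, t, cond)               # model sees z_t, t, c(x, R, T)
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

def zero_model(z_t, t, cond):
    """Hypothetical untrained stand-in for the U-Net: always predicts zero noise."""
    return [0.0] * len(z_t)

rng = random.Random(0)
alpha_bars = make_alpha_bars()
z0_target = [0.5, -1.0, 0.25, 2.0]                   # stand-in for E(x_{R,T})
cond = {"input_image": "x", "relative_pose": ("R", "T")}  # placeholder for c(x, R, T)
loss = training_loss(z0_target, cond, zero_model, alpha_bars, rng)
print(f"loss for the untrained stand-in: {loss:.4f}")
```

In real training, `eps_model` would be the U-Net being fine-tuned, and this loss would be averaged over batches and minimized with gradient descent.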

 

Once training is done, an image at a new viewpoint can be sampled through iterative denoising. The key point is that the fine-tuning only adds viewpoint control while keeping the semantic and texture representations Stable Diffusion originally learned.

 

✅ View-conditioned Diffusion Architecture

The conditioning in Zero-1-to-3 combines the following two streams, with classifier-free guidance applied on top.

 

1. High-level stream: Posed CLIP Embedding

  • The input image x is embedded with the CLIP encoder
  • The desired camera transformation (R, T) is combined with it to form the posed CLIP embedding
  • This embedding is passed to the denoising U-Net through cross-attention and controls the object's semantic structure and overall shape
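As a rough illustration of "combining" a CLIP embedding with a camera transformation, the sketch below appends a small pose vector to the image embedding. The specific encoding (relative polar angle, sine and cosine of the relative azimuth, relative radius) and the 768-dimensional embedding are assumptions made for this example, not details given in the text above. The function names are hypothetical.

```python
import math

def pose_vector(d_polar_deg, d_azimuth_deg, d_radius):
    """Encode a relative viewpoint change as 4 numbers (assumed encoding).

    Using sin/cos for the azimuth keeps 0 and 360 degrees identical.
    """
    return [
        math.radians(d_polar_deg),
        math.sin(math.radians(d_azimuth_deg)),
        math.cos(math.radians(d_azimuth_deg)),
        d_radius,
    ]

def posed_embedding(clip_embedding, d_polar_deg, d_azimuth_deg, d_radius):
    """Concatenate an image embedding with the pose vector."""
    return list(clip_embedding) + pose_vector(d_polar_deg, d_azimuth_deg, d_radius)

# Placeholder embedding; a real one would come from a CLIP image encoder.
fake_clip_embedding = [0.1] * 768
cond = posed_embedding(fake_clip_embedding, d_polar_deg=30.0,
                       d_azimuth_deg=90.0, d_radius=0.2)
print(len(cond))  # 772
```

In a full model, a learned projection would typically map this concatenated vector back to the width expected by the U-Net's cross-attention layers; that layer is left out here.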

2. Low-level stream: channel concatenation

  • The input image x is concatenated directly along the channel dimension with the image being denoised
  • This helps preserve the object's details, texture, and color
  • Intuitively, it pushes the model toward explicitly producing one specific image rather than a creative answer

3. Classifier-free guidance

  • The condition is dropped with some probability during training, and at inference the conditioning strength is adjusted to trade off quality against diversity of the generated images
  • This is a common technique in diffusion models and is used in this paper as well
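Both halves of classifier-free guidance fit in a few lines. The sketch below uses the standard formulation (random condition dropping during training, and extrapolation from the unconditional toward the conditional noise prediction at inference). The function names, the drop probability, and the guidance scale are illustrative assumptions, not the paper's settings.

```python
import random

def maybe_drop_condition(cond, null_cond, drop_prob, rng):
    """Training side: replace the condition with a null condition with probability drop_prob."""
    return null_cond if rng.random() < drop_prob else cond

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Inference side: eps = eps_uncond + s * (eps_cond - eps_uncond).

    s = 0 ignores the condition, s = 1 is plain conditional sampling,
    and s > 1 pushes harder toward the condition (usually less diversity).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions from the same network with and without the condition.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=3.0)
print(guided)  # [3.0, 1.0]

rng = random.Random(0)
train_cond = maybe_drop_condition({"pose": "R,T"}, None, drop_prob=0.1, rng=rng)
```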

 

✅ 3D Reconstruction method

Zero-1-to-3 is a model that generates new-view images from a single image, but the paper goes a step further and also runs 3D reconstruction experiments to check whether the model has really internalized an understanding of 3D structure.

  • Start from a single input image x
  • Randomly sample various camera viewpoints (R_i, T_i)
  • Generate the image at each viewpoint with Zero-1-to-3
  • Use these images as supervision to optimize a NeRF-style 3D volumetric representation
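The viewpoint sampling step in the list above can be sketched as follows. This covers only the sampling step; image generation and the NeRF-style optimization are left out. It assumes cameras sit on a sphere of fixed radius around the object, sampled uniformly over the sphere, which is one simple choice and not necessarily the distribution the authors use.

```python
import math
import random

def spherical_to_cartesian(polar, azimuth, radius):
    """Camera position for angles in radians; polar is measured from the +z axis."""
    return (
        radius * math.sin(polar) * math.cos(azimuth),
        radius * math.sin(polar) * math.sin(azimuth),
        radius * math.cos(polar),
    )

def sample_viewpoints(n, radius, rng):
    """Sample n camera viewpoints uniformly on a sphere of the given radius."""
    views = []
    for _ in range(n):
        # Uniform over the sphere: cos(polar) is uniform in [-1, 1].
        polar = math.acos(rng.uniform(-1.0, 1.0))
        azimuth = rng.uniform(0.0, 2.0 * math.pi)
        views.append((polar, azimuth, radius))
    return views

rng = random.Random(0)
views = sample_viewpoints(8, radius=1.5, rng=rng)
positions = [spherical_to_cartesian(*v) for v in views]
```

Each sampled viewpoint would then be expressed relative to the input view and passed to the model as the target (R_i, T_i).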

This step uses a technique called Score Jacobian Chaining (SJC), so that the image generation priors carried by the Stable Diffusion-based model are put to work in the 3D reconstruction.

💡 In other words, this is an important experiment: it shows that the model does not stop at generating new images, and that the cross-view relationships and visual consistency it has learned can also contribute to recovering actual 3D structure.

 

✅ Training dataset

Zero-1-to-3 is fine-tuned on Objaverse, a large-scale open 3D object dataset.

  • Contains more than roughly 800K 3D objects
  • Provides high-quality 3D models covering diverse styles, structures, and materials, without class labels
  • Each object is rendered from 12 camera viewpoints with ray tracing
  • From these, (x, x_{R,T}, R, T) tuples are formed and used as training data for viewpoint control
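To see how 12 rendered views turn into training tuples, here is a small sketch that enumerates every ordered (source, target) pair for one object. The view records, the file names, and the use of spherical-coordinate differences as a stand-in for (R, T) are assumptions made for illustration. A real pipeline could equally derive (R, T) from the two extrinsic matrices and sample pairs at random.

```python
from itertools import permutations

def make_training_pairs(views):
    """Build (x, x_target, relative_pose) tuples from one object's rendered views.

    Each view is a dict with 'image', 'polar', 'azimuth' (degrees) and 'radius'.
    """
    pairs = []
    for src, tgt in permutations(views, 2):  # ordered pairs of distinct views
        relative_pose = (
            tgt["polar"] - src["polar"],
            (tgt["azimuth"] - src["azimuth"]) % 360.0,
            tgt["radius"] - src["radius"],
        )
        pairs.append((src["image"], tgt["image"], relative_pose))
    return pairs

# Twelve made-up views of one object, spaced 30 degrees apart in azimuth.
views = [
    {"image": f"view_{i:02d}.png", "polar": 60.0, "azimuth": 30.0 * i, "radius": 1.5}
    for i in range(12)
]
pairs = make_training_pairs(views)
print(len(pairs))  # 132 = 12 * 11 ordered pairs
```

With 12 views per object, each object yields at most 132 ordered pairs, so even a modest number of renders per object gives many viewpoint-change examples.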

 

4. Experiments and Results

 

 

Zero-1-to-3 generates new-view images from a single image, but the experiments are designed to go beyond raw image generation quality and test whether the model has really internalized an understanding of 3D structure. To that end, zero-shot generalization is evaluated on a range of data: standardized object data (GSO), complex scenes (RTMV), and also paintings and generated images.

 

Novel view synthesis is the task of taking one image plus viewpoint information and generating the image from an unseen direction. Here the model produces consistent, high-quality images across domains, showing that it has learned to keep appearance coherent under viewpoint change.

 

The 3D reconstruction experiment generates images at various viewpoints and then recovers a NeRF-style 3D representation from them. This serves as evidence that the model is not simply rotating the image but has an internal grasp of the object's structure.

 

Overall, the paper demonstrates experimentally that Zero-1-to-3, despite being a single-image generative model, carries strong 3D priors and has 3D awareness that generalizes across diverse settings.

 

 


Zero-1-to-3 shows that a pretrained Stable Diffusion can be used to generate images at various viewpoints from a single image and, further, to perform high-quality 3D reconstruction. The method consists of viewpoint control learning, view-conditioned diffusion, and SJC-based 3D optimization, and in the reported quantitative and qualitative experiments it outperforms the prior SOTA baselines.

 

Generation for complex scenes (multi-object scenes), video, and similar settings remains a challenge for future work.

 
