
[Paper Review] MVSNet: Depth Inference for Unstructured Multi-view Stereo / A Deep-Learning Approach to Multi-view Stereo Reconstruction

by 뭅즤 2022. 3. 29.

MVSNet is one of the first works to propose a learning-based method built on a CNN architecture, rather than a traditional pipeline, for multi-view stereo reconstruction with an arbitrary number N of input views, so I want to introduce it here. Many networks now outperform the MVSNet proposed in this paper, but a large share of them still build on its ideas.

(As of early 2022, the SoTA is the Transformer-based TransMVSNet.)

 

Abstract

This paper presents an end-to-end deep learning architecture for depth map inference from multi-view images. The network first extracts deep features from the input images and then builds a 3D cost volume on the reference camera frustum via differentiable homography warping. Next, 3D convolutions regularize the cost volume and regress an initial depth map, which is then refined with the reference image to produce the final output. The network handles an arbitrary number N of input views by using a variance-based cost metric that maps multiple features into a single cost feature. MVSNet is demonstrated on the DTU dataset, where with simple post-processing it significantly outperforms the previous state of the art while running several times faster.

 

Introduction

Multi-view stereo (MVS) estimates a dense representation from overlapping images and has been a core computer vision problem studied for decades. Traditional methods compute dense correspondences with hand-crafted similarity metrics and engineered regularization. They work well in Lambertian scenarios, but dense matching is difficult in low-textured, specular, and reflective regions, which leads to incomplete reconstructions.

 

Recently (as of 2018), the success of CNN research has sparked a lot of interest in improving stereo reconstruction. Conceptually, learning-based methods can bring in global semantic information, such as specular and reflective priors, for more robust matching. There have been several attempts at two-view stereo matching that replace hand-crafted similarity metrics or engineered regularization with learnable networks, and they have shown promising results that surpass traditional methods on stereo benchmarks. In fact, stereo matching is a very good fit for CNN-based methods: the image pairs are rectified in advance, so the problem reduces to estimating horizontal pixel-wise disparity without having to worry about camera parameters.

 

However, extending learned two-view stereo to the multi-view scenario is not straightforward. Unlike stereo matching, the input images for MVS can have arbitrary camera geometry, which makes learning-based methods tricky to apply. Only a few works have tackled this problem directly. SurfaceNet pre-computes Colored Voxel Cubes (CVC), which combine all image pixel colors and camera information into a single volume used as the network input. In contrast, the Learned Stereo Machine (LSM) directly uses differentiable projection/unprojection to enable end-to-end training and inference. Both methods, however, rely on a volumetric representation over a regular grid, so the huge memory consumption of the 3D volume makes the networks very hard to scale up. LSM only handles synthetic objects at low volume resolution, and SurfaceNet takes a long time for large-scale reconstruction. At the time of the paper (2018), the top of the modern MVS benchmarks was still occupied by traditional methods.

 

To address the problems above, the paper proposes an end-to-end deep learning architecture for depth map inference that computes one depth map at a time rather than the whole 3D scene at once. Like other depth-map-based MVS methods, the proposed network, MVSNet, takes one reference image and several source images as input and estimates the depth map for the reference image. The key component is the differentiable homography warping operation, which implicitly encodes the camera geometry in the network, builds the 3D cost volume from 2D image features, and enables end-to-end training. To accept an arbitrary number of source images, the paper proposes a variance-based metric that maps the per-view features into a single cost volume. This cost volume goes through multi-scale 3D convolutions, from which an initial depth map is regressed. Finally, the depth map is refined with the reference image to improve accuracy in boundary regions.

 

The proposed method differs from previous approaches in two ways.

1) For depth map inference, the 3D cost volume is built on the camera frustum instead of in regular Euclidean space.

2) MVS reconstruction is decoupled into per-view depth map estimation, which makes large-scale reconstruction possible.

 

Related work

- MVS Reconstruction

Depending on the output representation, MVS methods are categorized into direct point cloud reconstruction, volumetric reconstruction, and depth map reconstruction. Point-cloud-based methods operate directly on 3D points and usually rely on a propagation strategy to gradually densify the reconstruction. Because the propagation proceeds sequentially, these methods are hard to fully parallelize and usually take a long time. Volumetric methods divide 3D space into a regular grid and then estimate whether each voxel is attached to the surface. The downsides of this representation are the space discretization error and high memory consumption. In contrast, the depth map is the most flexible representation of the three. It decouples the complex MVS problem into relatively small per-view depth estimation problems, each of which focuses on one reference image and a few source images at a time. Depth maps can also be easily fused into a point cloud or a volumetric reconstruction.

 

- Learned Stereo

Recent work on stereo applies deep learning for better pair-wise matching instead of relying on hand-crafted image features and matching metrics. Deep networks that match two image patches and end-to-end algorithms that regularize a 3D cost volume with 3D CNNs have been introduced, and they already substantially outperform classical stereo approaches.

 

- Learned MVS

There have been fewer attempts at learned MVS. A learned multi-patch similarity was proposed to replace the traditional cost metric for MVS reconstruction. The first learning-based pipeline for the MVS problem is SurfaceNet, which pre-computes the cost volume with sophisticated voxel-wise view selection and uses a 3D CNN to regularize it and infer the surface voxels. The approach most closely related to this paper is LSM, where the camera parameters are encoded in the network as a projection operation to form the cost volume, and a 3D CNN classifies whether each voxel belongs to the surface. However, because of the usual drawbacks of volumetric representations, the SurfaceNet and LSM networks are restricted to small-scale reconstruction. In contrast, MVSNet focuses on producing the depth map for one reference image at a time, so it can reconstruct large-scale scenes directly.

 

 


 

MVSNet

1. Image Features

The first step of MVSNet is to extract N deep features from the N input images for dense matching. An eight-layer 2D CNN is applied, with weights shared across all images as is common in matching tasks. The output of the 2D CNN is a 32-channel feature map downsized by a factor of four in each spatial dimension relative to the input image.

 

2. Cost Volume

The next step is to build a 3D cost volume from the extracted feature maps and the input cameras. Previous work partitioned space with a regular grid, but for depth map estimation this paper constructs the cost volume on the reference camera frustum. I_1 denotes the reference image, I_i (i = 2, ..., N) the source images, and {K_i, R_i, t_i} the camera intrinsics, rotations, and translations corresponding to each feature map.

 

Differentiable Homography

To form N feature volumes, every feature map is warped onto different fronto-parallel planes of the reference camera. The coordinate mapping from the warped feature map at depth d to F_i is given by the planar transformation x' ~ H_i(d) · x, where '~' denotes projective equality and H_i(d) is the homography between the i-th feature map and the reference feature map at depth d.

The warping operation is the key step that bridges 2D feature extraction and the 3D regularization network. It is implemented in a differentiable manner, which enables end-to-end training of depth map inference.
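As a rough illustration of the per-depth homography, here is a minimal NumPy sketch. It uses the standard plane-induced homography for the plane z = d in the reference camera frame and assumes world-to-camera extrinsics (X_cam = R · X_world + t). It is not the paper's formula copied verbatim, and the function names and camera values are hypothetical.

```python
import numpy as np

def plane_homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, depth):
    """Homography taking reference pixels to source pixels for the
    fronto-parallel plane z = depth in the reference camera frame.

    Assumes world-to-camera extrinsics: X_cam = R @ X_world + t.
    """
    n = np.array([0.0, 0.0, 1.0])      # plane normal = reference principal axis
    R_rel = R_src @ R_ref.T            # rotation from reference to source frame
    t_rel = t_src - R_rel @ t_ref      # translation from reference to source frame
    return K_src @ (R_rel + np.outer(t_rel, n) / depth) @ np.linalg.inv(K_ref)

def project(K, R, t, X_world):
    """Pinhole projection of a world point to pixel coordinates."""
    x = K @ (R @ X_world + t)
    return x[:2] / x[2]

# Hypothetical cameras, for illustration only.
angle = 0.1
R_src = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(angle), 0.0, np.cos(angle)]])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R_ref, t_ref = np.eye(3), np.zeros(3)
t_src = np.array([-0.2, 0.0, 0.05])
depth = 2.0

pixel = np.array([100.0, 150.0, 1.0])   # homogeneous reference pixel
H = plane_homography(K, R_ref, t_ref, K, R_src, t_src, depth)
warped = H @ pixel
warped = warped[:2] / warped[2]

# Cross-check: back-project the pixel onto the plane, then project into the source view.
X_ref = depth * (np.linalg.inv(K) @ pixel)
X_world = R_ref.T @ (X_ref - t_ref)
direct = project(K, R_src, t_src, X_world)
```

In a network, the warped coordinates would be used to sample the source feature map bilinearly, which keeps the whole operation differentiable.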

 

 

Cost Metric

To aggregate the N feature volumes into a single cost volume C, the paper proposes a variance-based cost metric M for N-view similarity measurement. Let W and H be the input image width and height, D the number of depth samples, F the number of feature channels, and V = W/4 · H/4 · D · F the feature volume size. The cost metric is then defined as

$$C = M(V_1, \dots, V_N) = \frac{\sum_{i=1}^{N} (V_i - \bar{V_i})^2}{N}$$

where $\bar{V_i}$ is the average of all feature volumes and every operation is element-wise.

Most traditional MVS methods aggregate the pairwise costs between the reference image and every source image in a heuristic way. The metric design in this paper instead follows the philosophy, taken from recent work, that all views should contribute equally to the matching cost without giving preference to the reference image. That recent work applies the average operation with multiple CNN layers to infer multi-patch similarity. Here the 'variance' operation is chosen instead, because the 'average' operation by itself provides no information about feature differences and requires pre- and post-CNN layers to help the network infer similarity. The variance-based cost metric explicitly measures the multi-view feature difference, and the experiments show that this explicit difference measure improves validation accuracy.
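The variance metric is easy to sketch in NumPy. The snippet below assumes the N warped feature volumes are stacked in an array of shape (N, D, H, W, F). That layout, the function name, and the random test data are illustrative choices, not something the paper prescribes.

```python
import numpy as np

def variance_cost_volume(feature_volumes):
    """Element-wise variance across views.

    feature_volumes: array of shape (N, D, H, W, F), one volume per view.
    Returns a cost volume of shape (D, H, W, F).
    """
    mean_volume = feature_volumes.mean(axis=0)                     # average volume
    return ((feature_volumes - mean_volume) ** 2).mean(axis=0)     # sum of squares / N

rng = np.random.default_rng(0)
volumes = rng.normal(size=(3, 4, 5, 6, 8))    # N = 3 views, hypothetical sizes
cost = variance_cost_volume(volumes)

# Identical features in every view give zero cost (a perfect match).
same = np.repeat(volumes[:1], 5, axis=0)      # N = 5 copies of one volume
zero_cost = variance_cost_volume(same)
```

Because the reduction runs over the view axis, the output shape does not depend on N, which is what lets the network accept any number of input views.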

 

 

Cost Volume Regularization

The raw cost volume computed from image features can be contaminated by noise (for example from non-Lambertian surfaces or object occlusions), so it has to be combined with smoothness constraints before a depth map can be inferred. The regularization step refines the cost volume C above to produce a probability volume P for depth inference. Inspired by recent work, a multi-scale 3D CNN is applied for cost volume regularization. The four-scale network is similar to a 3D version of UNet: it uses an encoder-decoder structure to aggregate neighboring information from a large receptive field at relatively low memory and computation cost. At the end, a softmax operation is applied along the depth axis for probability normalization.

 

The resulting probability volume is well suited to depth map inference because it can be used not only for per-pixel depth estimation but also for measuring the confidence of the estimate. As discussed later, analyzing the probability distribution makes it easy to judge the quality of the depth reconstruction, which leads to a concise yet effective outlier filtering strategy.
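The final normalization step can be sketched as a softmax over the depth axis. This assumes the regularized volume has shape (D, H, W) and that larger values mean a better match; whether a real implementation negates the cost before the softmax is a detail not covered here.

```python
import numpy as np

def depth_softmax(scores):
    """Normalize a (D, H, W) volume into probabilities along the depth axis."""
    shifted = scores - scores.max(axis=0, keepdims=True)   # avoid overflow in exp
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 4, 5))     # hypothetical regularized volume, D = 8
prob_volume = depth_softmax(scores)
```

After this step every pixel holds a discrete distribution over the D depth hypotheses, which is what the depth regression and the confidence measure both consume.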

 

3. Depth Map

Initial Estimation

The simplest way to obtain the depth map D from the probability volume P is a pixel-wise argmax. However, argmax cannot produce sub-pixel estimates and is not differentiable, so it cannot be trained with backpropagation. Instead, the expectation along the depth axis is computed, i.e. the probability-weighted sum over all hypotheses:

$$D = \sum_{d = d_{min}}^{d_{max}} d \times P(d)$$

Here P(d) is the probability estimate of all pixels at depth d, and the operation corresponds to the soft argmin. It is fully differentiable and approximates the argmax result. The depth hypotheses are sampled uniformly within [d_min, d_max] during cost volume construction, but the expectation yields a continuous depth estimate.

* In Fig. 2 (c), the horizontal axis is the depth hypothesis index, the vertical axis is the probability, and the red line marks the soft argmin result.
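To make the difference between argmax and the expectation concrete, here is a small NumPy sketch. The array layout (D, H, W), the function name, and the toy probabilities are assumptions for illustration.

```python
import numpy as np

def regress_depth(prob_volume, depth_values):
    """Expected depth per pixel.

    prob_volume:  (D, H, W), sums to 1 along axis 0.
    depth_values: (D,) depth hypotheses.
    """
    return np.tensordot(depth_values, prob_volume, axes=(0, 0))

depth_values = np.array([1.0, 2.0, 3.0, 4.0])
prob_volume = np.zeros((4, 1, 2))
prob_volume[:, 0, 0] = [0.0, 0.5, 0.5, 0.0]   # mass split between two hypotheses
prob_volume[:, 0, 1] = [0.0, 0.0, 1.0, 0.0]   # confident single peak

soft = regress_depth(prob_volume, depth_values)        # [[2.5, 3.0]]
hard = depth_values[prob_volume.argmax(axis=0)]        # [[2.0, 3.0]]
```

For the first pixel the expectation lands between two hypotheses (2.5), while argmax is forced to pick one of the sampled depths (2.0). For the sharply peaked second pixel the two agree.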

 

Probability Map

The multi-scale 3D CNN is very good at regularizing the probability into a single-modal distribution, but for falsely matched pixels the probability distribution is scattered and cannot concentrate on one peak (Fig. 2 (c)). Based on this observation, the quality of a depth estimate d_hat is defined as the probability that the ground truth depth lies within a small range around the estimate. Since the depth hypotheses are sampled discretely along the camera frustum, the probability sum over the four nearest depth hypotheses is taken as the quality measure. This probability-sum formulation gives better control over the threshold used for outlier filtering.
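A possible way to compute such a probability map is sketched below. Which four hypotheses count as "nearest" is my assumption: the code takes the expected hypothesis index, floors it, and sums the probabilities at offsets -1, 0, +1, and +2, ignoring indices that fall outside the volume.

```python
import numpy as np

def probability_map(prob_volume):
    """Per-pixel confidence: probability mass of the four hypotheses
    around the expected depth index. prob_volume has shape (D, H, W)."""
    num_depths = prob_volume.shape[0]
    indices = np.arange(num_depths, dtype=np.float64)
    expected_index = np.tensordot(indices, prob_volume, axes=(0, 0))   # (H, W)
    lower = np.floor(expected_index).astype(int)

    confidence = np.zeros_like(expected_index)
    for offset in (-1, 0, 1, 2):
        k = lower + offset
        valid = (k >= 0) & (k < num_depths)
        k_safe = np.clip(k, 0, num_depths - 1)
        p = np.take_along_axis(prob_volume, k_safe[None], axis=0)[0]
        confidence += np.where(valid, p, 0.0)     # skip out-of-range hypotheses
    return confidence

prob_volume = np.zeros((8, 1, 2))
prob_volume[3, 0, 0] = 1.0          # sharply peaked pixel
prob_volume[:, 0, 1] = 1.0 / 8.0    # scattered (uniform) pixel
confidence = probability_map(prob_volume)   # [[1.0, 0.5]]
```

The peaked pixel gets a confidence of 1.0 and the scattered one only 0.5, so a simple threshold on this map separates reliable pixels from likely outliers.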

 

Depth Map Refinement

The depth map retrieved from the probability volume is a valid, regularized output, but the reconstruction boundaries can be oversmoothed because of the large receptive field used in regularization. Since the reference image contains boundary information, it is used as guidance to refine the depth map. Inspired by recent image matting algorithms, a depth residual learning network is applied at the end of MVSNet. The initial depth map and the resized reference image are concatenated into a 4-channel input, which passes through three 32-channel 2D convolutional layers followed by one 1-channel convolutional layer to learn the depth residual. The initial depth map is then added back to produce the refined depth map. In addition, to avoid a bias toward a particular depth scale, the initial depth is pre-scaled to the range [0, 1] and converted back after refinement.
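The data flow around the refinement network (pre-scaling, 4-channel input, residual addition, rescaling) can be sketched as follows. The `residual_fn` callable is a hypothetical stand-in for the small CNN, and scaling by a fixed [d_min, d_max] range is an assumption about how the [0, 1] normalization is done.

```python
import numpy as np

def refine_depth(initial_depth, reference_image, residual_fn, d_min, d_max):
    """initial_depth: (H, W); reference_image: (H, W, 3), already resized
    to the depth map resolution; residual_fn maps (H, W, 4) -> (H, W)."""
    scale = d_max - d_min
    scaled = (initial_depth - d_min) / scale                  # depth in [0, 1]
    net_input = np.concatenate([reference_image, scaled[..., None]], axis=-1)
    residual = residual_fn(net_input)                         # predicted correction
    return (scaled + residual) * scale + d_min                # back to metric depth

def zero_residual(net_input):
    """Placeholder for the CNN: predicts no correction."""
    assert net_input.shape[-1] == 4                           # RGB + depth
    return np.zeros(net_input.shape[:2])

# Hypothetical depth range and inputs, for illustration only.
initial = np.array([[425.0, 600.0], [800.0, 935.0]])
image = np.zeros((2, 2, 3))
refined = refine_depth(initial, image, zero_residual, d_min=425.0, d_max=935.0)
```

With a zero residual the refined map equals the initial one, which is exactly the behaviour a residual design falls back to when the network has nothing to correct.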

 

 

4. Loss

A loss is applied to both the initial depth map and the refined depth map.
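One way to write such a two-term loss is sketched below. It assumes a mean absolute (L1) error over pixels with valid ground truth and a weight `lam` on the refined term. The function name and the toy depth values are made up for illustration.

```python
import numpy as np

def two_term_depth_loss(gt_depth, initial_depth, refined_depth, valid_mask, lam=1.0):
    """Mean L1 error on valid pixels for the initial and the refined depth map."""
    n_valid = max(int(valid_mask.sum()), 1)        # guard against an empty mask
    loss_initial = np.abs(gt_depth - initial_depth)[valid_mask].sum() / n_valid
    loss_refined = np.abs(gt_depth - refined_depth)[valid_mask].sum() / n_valid
    return loss_initial + lam * loss_refined

gt = np.array([[1.0, 2.0], [3.0, 0.0]])    # 0 marks a pixel without ground truth
mask = gt > 0
loss = two_term_depth_loss(gt, gt + 0.5, gt + 0.1, mask)   # 0.5 + 0.1 = 0.6
```

Masking matters in practice because ground-truth depth maps are usually incomplete, and pixels without labels should not contribute to the gradient.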

 

Depth Map Fusion

Similar to other multi-view stereo methods, a depth map fusion step integrates the depth maps from different views into a unified point cloud. A visibility-based fusion algorithm is used, which minimizes depth occlusions and violations across viewpoints. To further suppress reconstruction noise, the visible views for each pixel are determined as in the filtering step, and the average over all reprojected depths is taken as the pixel's final depth estimate. The fused depth maps are then directly reprojected into space to generate the 3D point cloud.
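The last step, reprojecting a fused depth map into space, can be sketched as a plain back-projection. The visibility check and the depth averaging are not shown. The sketch assumes world-to-camera extrinsics (X_cam = R · X_world + t) and treats non-positive depths as invalid; the camera values are hypothetical.

```python
import numpy as np

def depth_to_points(depth, K, R, t):
    """Back-project a depth map (H, W) to world-space points (M, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T          # K^-1 @ pixel, in row-vector form
    X_cam = rays * depth.reshape(-1, 1)         # scale each ray by its depth
    X_world = (X_cam - t) @ R                   # R.T @ (X_cam - t), in row-vector form
    return X_world[depth.reshape(-1) > 0]       # drop invalid pixels

K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
depth = np.array([[2.0, 2.0, 0.0],
                  [2.0, 2.0, 2.0]])
points = depth_to_points(depth, K, np.eye(3), np.zeros(3))
```

Running this for every view and concatenating the results gives the final point cloud. Filtering by the probability map beforehand keeps low-confidence pixels out of it.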

 

 

Experiments

 
