In computer vision, what does MVS do that SFM can't?
I'm a dev with about a decade of enterprise software engineering under my belt, and my hobbyist interests have steered me into the vast and scary realm of computer vision (CV).

One thing that is not immediately clear to me is the division of labor between Structure from Motion (SFM) tools and Multi View Stereo (MVS) tools.

Specifically, CMVS appears to be the best-in-show MVS tool, and Bundler seems to be one of the better open source SFM tools out there.

Taken from CMVS's own homepage:

You should ALWAYS use CMVS after Bundler and before PMVS2

I'm wondering: why?!? My understanding of SFM tools is that they perform the 3D reconstruction for you, so why do we need MVS tools in the first place? What value/processing/features do they add that SFM tools like Bundler can't address? Why the proposed pipeline of:

Bundler -> CMVS -> PMVS2

?

Tradespeople asked 30/8, 2016 at 2:02

Quickly put, Structure from Motion (SfM) and MultiView Stereo (MVS) techniques are complementary: they do not rely on the same assumptions. They also differ slightly in their inputs, with MVS requiring camera parameters to run, which are estimated (output) by SfM. SfM only gives a coarse 3D output, whereas PMVS2 gives a much denser output, and finally CMVS is there to circumvent some limitations of PMVS2.

The rest of this answer provides a high-level overview of how each method works and explains why the pipeline is organized this way.

Structure from Motion

The first step of the 3D reconstruction pipeline you highlighted is an SfM algorithm, which could be done using Bundler, VisualSFM, OpenMVG or the like. This algorithm takes some images as input and outputs the camera parameters of each image (more on this later), as well as a coarse 3D shape of the scene, often called the sparse reconstruction.

Why does SfM output only a coarse 3D shape? Basically, SfM techniques begin by detecting 2D features in every input image and matching those features between pairs of images. The goal is, for example, to tell "this table corner is located at these pixel locations in those images." Those features are described by what we call descriptors (like SIFT or ORB). A descriptor is built to represent a small region (i.e., a bunch of neighboring pixels) of an image. Descriptors can reliably represent highly textured regions or rough geometry (e.g., edges), but those scene features need to be unique (in the sense of distinguishable) throughout the scene to be useful. For example (maybe oversimplified), a wall with a repetitive pattern would not be very useful for the reconstruction, because even though it is highly textured, every region of the wall could potentially match pretty much anywhere else on the wall. Since SfM performs the 3D reconstruction using those features, the vertices of the 3D reconstruction will be located on those unique textures or edges, giving a coarse mesh as output; SfM won't typically produce a vertex in the middle of a surface that lacks precise and distinguishing texture. On the other hand, when many matches are found between two images, one can compute the 3D transformation between them, effectively recovering the relative 3D pose of the two cameras.
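
To make the feature detection, matching, and relative-pose step a bit more concrete, here is a minimal sketch using OpenCV (this is not what Bundler does internally; the image names, ORB settings, and intrinsics values are purely illustrative assumptions):

    import cv2
    import numpy as np

    # Load two overlapping views of the scene (hypothetical file names).
    img1 = cv2.imread("view_01.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("view_02.jpg", cv2.IMREAD_GRAYSCALE)

    # Detect 2D features and compute their ORB descriptors in each image.
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Match descriptors between the two images (brute force, Hamming distance).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    # 2D pixel locations of the matched features in each image.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # With enough matches (and an assumed intrinsics matrix K), recover the
    # relative pose, i.e. rotation R and translation t between the two cameras.
    K = np.array([[1000.0,    0.0, 640.0],
                  [   0.0, 1000.0, 360.0],
                  [   0.0,    0.0,   1.0]])  # illustrative intrinsics
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

The sparse 3D points then come from triangulating the matched pixels using those recovered poses, which is why only distinctive, well-matched features end up as vertices.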

MultiView Stereo

Afterwards, the MVS algorithm is used to refine the mesh obtained by the SfM technique, resulting in what is called a dense reconstruction. This algorithm requires the camera parameters of each image, which are output by the SfM algorithm. Because it works on a more constrained problem (it already has the camera parameters of every image: position, rotation, focal length, etc.), MVS can compute 3D vertices in regions that were not (or could not be) correctly detected by descriptors or matched. This is what PMVS2 does.

How can PMVS work in regions where 2D feature descriptors would struggle to match? Since you know the camera parameters, you know that a given pixel in one image is the projection of a line in another image (this is epipolar geometry). Whereas SfM had to search through the entire 2D image for every descriptor to find a potential match, MVS only has to search along a single 1D line to find a match, which simplifies the problem considerably. As a result, MVS can also take illumination and object materials into account in its optimization, which SfM does not.
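
As a rough illustration of that 1D search (a sketch, not PMVS2 code; the fundamental matrix F below is a placeholder, whereas in practice it would be derived from the camera parameters estimated by SfM):

    import cv2
    import numpy as np

    # Placeholder 3x3 fundamental matrix relating image 1 to image 2; in a real
    # pipeline it is derived from the SfM camera parameters (K, R, t).
    F = np.array([[ 0.0,  -1e-6,  1e-3],
                  [ 1e-6,   0.0, -2e-3],
                  [-1e-3,  2e-3,   1.0]])

    # A pixel in image 1 whose 3D position we want to densify.
    pixel = np.array([[[320.0, 240.0]]], dtype=np.float32)

    # Its epipolar line a*x + b*y + c = 0 in image 2: the true match must lie
    # somewhere on this single line, so the search is 1D instead of 2D, and a
    # photo-consistency score only needs to be evaluated along it.
    a, b, c = cv2.computeCorrespondEpilines(pixel, 1, F).reshape(3)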

There is one issue, though: PMVS2 performs quite a complex optimization that can be dreadfully slow or consume an astronomical amount of memory on large image sequences. This is where CMVS comes into play: it clusters the coarse 3D SfM output into regions. PMVS2 is then called (potentially in parallel) on each cluster, simplifying its execution. CMVS then merges the PMVS2 outputs into a unified, detailed model.
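
The divide-and-conquer idea behind CMVS can be sketched like this (purely conceptual Python; the crude clustering, the densify_cluster stand-in, and the random data are made up and have nothing to do with CMVS's actual executables or file formats):

    import numpy as np
    from multiprocessing import Pool

    def densify_cluster(cluster_points):
        # Stand-in for running PMVS2 on one cluster; the real tool is a
        # separate executable, so here we just return the input unchanged.
        return cluster_points

    # The coarse SfM reconstruction (random placeholder points).
    sparse_points = np.random.rand(10000, 3)

    # CMVS-like step: split the scene into smaller chunks so that each dense
    # reconstruction fits in memory (here, a crude split along the x axis).
    order = np.argsort(sparse_points[:, 0])
    clusters = np.array_split(sparse_points[order], 4)

    if __name__ == "__main__":
        # Densify each cluster (potentially in parallel), then merge the
        # per-cluster outputs into one unified model.
        with Pool() as pool:
            dense_parts = pool.map(densify_cluster, clusters)
        dense_model = np.vstack(dense_parts)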

Conclusion

Most of the information provided in this answer, and much more, can be found in this tutorial from Yasutaka Furukawa, author of CMVS and PMVS2: http://www.cse.wustl.edu/~furukawa/papers/fnt_mvs.pdf

In essence, both techniques emerge from two different approaches: SfM aims to perform a 3D reconstruction using a structured (but unknown) sequence of images, while MVS is a generalization of two-view stereo vision, based on human stereopsis.

Perigordian answered 31/8, 2016 at 7:26 (5 comments)
Thanks @Perigordian (+1) - I truly wish I could upvote this answer more! I do have a few followup questions for you, if you don't mind: (1) Are sparse reconstructions (outputs of SFM) useful for anything in their own right? Or are they always used as inputs to MVS (I guess I'm just wondering if they solve any interesting problems by themselves). (2) I keep hearing the term "camera parameters"...can you give me an example or two of what some of these parameters might be? – Tradespeople
And finally (3) According to the Bundler folk: "You should ALWAYS use CMVS after Bundler and before PMVS2" (so, Bundler >> CMVS >> PMVS2)...but according to your answer, it sounds like the proper pipeline is Bundler/SFM >> Make Clusters >> Run PMVS2 on each cluster >> Merge all clusters >> CMVS...any thoughts here? Thanks again so much for such a thoughtful, thorough answer! – Tradespeople
Interesting questions. 1) The key element of SfM is to get the camera parameters (see 2). Once you have them, you are in business to begin understanding your scene (using MVS, for example). If you stop after the SfM step, the advantage I would see is using less computing power (quicker output / less energy required on mobile), which may provide a good enough result in some cases, for example coarse volume estimation or rough shape identification. It depends on the task to perform and the goal to achieve, I guess. – Perigordian
2) The camera parameters are contained in the intrinsics and extrinsics camera matrices. Both matrices contain values (3x3 and 3x4 in size, respectively) describing the camera itself (intrinsics) or its pose (extrinsics). Intrinsic parameters are, for example, the focal length on both axes, the center point (center pixel) of the camera, the sensor size, the amount of shearing, radial distortion, and so forth. The extrinsics matrix describes the position and rotation of the camera (where the camera is in the world and where it is looking). Look up section 1.2 of the reference I gave for more information. – Perigordian
A little interesting detail on the camera parameters: once you have those two matrices, take any 3D point [x, y, z] in the world, multiply it by those matrices, and the result is where that 3D point will appear in the image. 3) Don't worry about it, everything I said is taken care of by CMVS. PMVS2 is now contained inside the CMVS code and executable. You have to run Bundler's exporter to CMVS/PMVS2 (same executable, I believe), then only invoke CMVS with the right flags, and everything I said will be taken care of. I just wanted to give you the insight into why it is this way. – Perigordian
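
For illustration, that projection looks like this in a small numpy sketch (the intrinsics and extrinsics values below are made up, not taken from any real camera):

    import numpy as np

    # Intrinsics K (3x3): focal lengths and principal point (illustrative values).
    K = np.array([[1000.0,    0.0, 640.0],
                  [   0.0, 1000.0, 360.0],
                  [   0.0,    0.0,   1.0]])

    # Extrinsics [R | t] (3x4): camera pose; here an identity rotation and a
    # translation of 5 units along z, so the world origin sits in front of
    # the camera.
    Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])

    # A 3D world point in homogeneous coordinates [x, y, z, 1].
    X = np.array([0.5, -0.2, 10.0, 1.0])

    # Project: apply extrinsics then intrinsics, and divide by the last
    # coordinate to get the pixel (u, v) where the point appears in the image.
    p = K @ Rt @ X
    u, v = p[:2] / p[2]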
