Future prospects for improvement of depth data on Project Tango tablet

Asked 26/12, 2014 at 4:22 Answered 31/12, 2014 at 22:12

I am interested in using the Project Tango tablet for 3D reconstruction using arbitrary point features. In the current SDK version, we seem to have access to the following data.

A 1280 x 720 RGB image.
A point cloud with anywhere from 0 to roughly 10,000 points, depending on the environment. The average seems to be between 3,000 and 6,000 in most environments.

What I really want is to be able to identify a 3D point for key points within an image. Therefore, it makes sense to project depth into the image plane. I have done this, and I get something like this:

[Image: depth points projected onto the RGB image plane]
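For readers who want to reproduce the projection step, here is a minimal sketch using a plain pinhole model. It assumes the points are already expressed in the RGB camera frame and ignores lens distortion; the function name `project_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative placeholders, not values or calls from the Tango SDK.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def project_points(
    points: List[Point3D],
    fx: float, fy: float, cx: float, cy: float,
    width: int = 1280, height: int = 720,
) -> List[List[Optional[float]]]:
    """Project camera-frame 3D points into a sparse depth image.

    Pixels that receive no point stay None. If several points land on
    the same pixel, the nearest one (smallest z) is kept.
    Assumption: pinhole model, no distortion, points in the RGB frame.
    """
    depth: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0.0:
            continue  # behind the camera or invalid
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            if current is None or z < current:
                depth[v][u] = z
    return depth


if __name__ == "__main__":
    # Hypothetical intrinsics, chosen only to make the example readable.
    img = project_points([(0.0, 0.0, 2.0), (0.1, 0.0, 1.0)],
                         fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
    filled = sum(1 for row in img for d in row if d is not None)
    print("pixels with depth:", filled)
```

With a few thousand points spread over 1280 x 720 pixels, the resulting image is almost entirely empty, which is the sparsity problem described next.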

The problem with this process is that the depth points are sparse compared to the RGB pixels. So I took it a step further and interpolated between the depth points. First, I did a Delaunay triangulation, and once I got a good triangulation, I interpolated between the 3 points on each facet and got a decent, fairly uniform depth image. Here are the zones where the interpolated depth is valid, superimposed on the RGB image.

[Image: zones with valid interpolated depth, superimposed on the RGB image]
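The per-facet interpolation can be sketched with barycentric weights. This is only an illustration of the idea, not the exact code used for the image above: it handles a single triangle whose vertices are projected depth points `(u, v, depth)`, and it assumes the triangulation itself comes from elsewhere (for example a library routine such as `scipy.spatial.Delaunay`). The name `interpolate_depth` is made up for the example.

```python
from typing import Optional, Tuple

Vertex = Tuple[float, float, float]  # (u, v, depth) of a projected point


def interpolate_depth(
    p: Tuple[float, float], a: Vertex, b: Vertex, c: Vertex,
    eps: float = 1e-9,
) -> Optional[float]:
    """Depth at pixel p inside triangle abc, or None if p lies outside.

    Uses barycentric coordinates, i.e. linear interpolation over the facet.
    """
    px, py = p
    ax, ay, az = a
    bx, by, bz = b
    cx, cy, cz = c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if abs(det) < eps:
        return None  # degenerate (zero-area) triangle
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    w_c = 1.0 - w_a - w_b
    if min(w_a, w_b, w_c) < -eps:
        return None  # pixel is outside this facet
    return w_a * az + w_b * bz + w_c * cz


if __name__ == "__main__":
    tri = ((0.0, 0.0, 1.0), (10.0, 0.0, 2.0), (0.0, 10.0, 3.0))
    print(interpolate_depth((5.0, 0.0), *tri))    # midpoint of edge a-b
    print(interpolate_depth((20.0, 20.0), *tri))  # outside the triangle
```

In practice it is probably worth rejecting facets with very long edges or large depth differences between vertices, since interpolating across a depth discontinuity invents surfaces that are not there.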

Now, given the camera model, it's possible to project depth back into Cartesian coordinates at any point on the depth image (since the depth image was made such that each pixel corresponds to a point on the original RGB image, and we have the camera parameters of the RGB camera). However, if you look at the triangulation image and compare it to the original RGB image, you can see that depth is valid for all of the uninteresting points in the image: blank, featureless planes mostly. This isn't just true for this single set of images; it's a trend I'm seeing for the sensor. If a person stands in front of the sensor, for example, there are very few depth points within their silhouette.
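The back-projection mentioned at the start of the paragraph above is the inverse of the pinhole projection. A minimal sketch, again assuming an undistorted pinhole model and placeholder intrinsics (`fx`, `fy`, `cx`, `cy` are not taken from the device):

```python
from typing import Tuple


def back_project(
    u: float, v: float, z: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Return the camera-frame point (x, y, z) seen at pixel (u, v) with depth z.

    Assumption: z is the distance along the optical axis, not the range
    along the viewing ray.
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


if __name__ == "__main__":
    # Hypothetical intrinsics for a 1280 x 720 image.
    print(back_project(740.0, 360.0, 1.0, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0))
```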

As a result of this characteristic of the sensor, if I perform visual feature extraction on the image, most of the areas with corners or interesting textures fall in areas without associated depth information. Just an example: I detected 1000 SIFT keypoints in an RGB image from an Xtion sensor, and 960 of those had valid depth values. If I do the same thing with this system, I get around 80 keypoints with valid depth. At the moment, this level of performance is unacceptable for my purposes.
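The comparison above boils down to counting how many keypoints land on a pixel with valid depth. A small sketch of that count, assuming the depth image is a list of rows in which missing depth is stored as `None` (or a non-positive value) and keypoints are `(u, v)` pixel coordinates; `depth_coverage` is a name invented for this example, and the keypoint detector itself is out of scope here:

```python
from typing import List, Optional, Sequence, Tuple


def depth_coverage(
    keypoints: Sequence[Tuple[float, float]],
    depth: List[List[Optional[float]]],
) -> Tuple[int, int]:
    """Return (keypoints with valid depth, total keypoints)."""
    height = len(depth)
    width = len(depth[0]) if height else 0
    valid = 0
    for u, v in keypoints:
        col, row = int(round(u)), int(round(v))
        if 0 <= col < width and 0 <= row < height:
            d = depth[row][col]
            if d is not None and d > 0.0:
                valid += 1
    return valid, len(keypoints)


if __name__ == "__main__":
    tiny_depth = [[None, 1.5], [2.0, 0.0]]
    kps = [(0.2, 0.1), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
    valid, total = depth_coverage(kps, tiny_depth)
    print(f"{valid} of {total} keypoints have depth")
```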

I can guess at the underlying reasons for this: it seems like some sort of plane extraction algorithm is being used to get depth points, whereas Primesense/DepthSense sensors are using something more sophisticated.

So anyway, my main question here is: can we expect any improvement in the depth data at a later point in time, through improved RGB-IR image processing algorithms? Or is this an inherent limit of the current sensor?

Copyread answered 26/12, 2014 at 4:22 Comment(1)

Very interesting thoughts on the problems of getting complete IR data, thanks. Could you explain a bit how you went about projecting the depth onto the image plane? I am trying to do the same in order to then perform depth map fusion, but the XYZ data I get from the device does not match what they explain at link. They claim to provide values in "millimeters in the coordinate frame of the depth-sensing camera", but what I get is floats with all values below 1, which makes no sense at all. – Merrymerryandrew 17/2, 2015 at 15:53

I am from the Project Tango team at Google. I am sorry you are experiencing trouble with depth on the device. Just so that we are sure your device is in good working condition, can you please test the depth performance against a flat wall? Instructions are here: https://developers.google.com/project-tango/hardware/depth-test

Even with a device in good working condition, the depth library is known to return sparse depth points on scenes with low IR reflectance objects, small sized objects, high dynamic range scenes, surfaces at certain angles and objects at distances larger than ~4m. While some of these are inherent limitations in the depth solution, we are working with the depth solution provider to bring improvements wherever possible.

Attached is an image of a typical conference room scene and the corresponding point cloud. As you can see, no depth points are returned from the laptop screen (low reflectance), the tabletop objects such as post-its, pencil holder etc. (small object sizes), large portions of the table (surface at an angle), or the room corner at the far right (distance >4m).

But as you move the device around, you will start getting depth point returns. Accumulating depth points is a must to get denser point clouds.

Please also keep us posted on your findings at [email protected]

Coper answered 31/12, 2014 at 22:12 Comment(1)

This has been massively improved in 1.4/1.5 and many more depth points are being returned, as well as a lot more on black objects. Thank you guys for your efforts. – Copyread 4/2, 2015 at 9:19

In my very basic initial experiments, you are correct with respect to depth information returned from the visual field; however, the return of surface points is anything but constant. I find that as I move the device I can get major shifts in where depth information is returned, i.e. there's a lot of transitory opacity in the image with respect to depth data, probably due to the characteristics of the surfaces. So while no single return frame is enough, the real question seems to be the construction of a larger model (a point cloud to begin with, possibly voxel spaces as one scales up) to bring successive scans into a common model. It's reminiscent of synthetic aperture algorithms in spirit, but the letters in the equations are from a whole different set of laws. In short, I think a more interesting approach is to synthesize a more complete model by successive accumulation of point cloud data. Now, for this to work, the device team has to have their dead reckoning on the money for whatever scale this is done at. This also addresses an issue that no sensor improvement can address: even if your visual sensor is perfect, it still does nothing to help you relate the sides of an object to the front of the object, or even place them in its close neighborhood.
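Bringing successive scans into a common model means transforming each cloud by the device pose at capture time. Below is a rough sketch under stated assumptions: the pose is a translation plus a unit quaternion ordered `(x, y, z, w)`, the cloud is already in the frame that the pose refers to (no camera-to-device extrinsics are applied), and the helper names `rotate` and `to_world` are invented for this example.

```python
import math
from typing import Iterable, List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # assumed order: (x, y, z, w), unit length


def rotate(q: Quat, p: Vec3) -> Vec3:
    """Rotate point p by unit quaternion q."""
    qx, qy, qz, qw = q
    px, py, pz = p
    # t = 2 * cross(q_vec, p)
    tx = 2.0 * (qy * pz - qz * py)
    ty = 2.0 * (qz * px - qx * pz)
    tz = 2.0 * (qx * py - qy * px)
    # p' = p + w * t + cross(q_vec, t)
    return (
        px + qw * tx + (qy * tz - qz * ty),
        py + qw * ty + (qz * tx - qx * tz),
        pz + qw * tz + (qx * ty - qy * tx),
    )


def to_world(points: Iterable[Vec3], q: Quat, t: Vec3) -> List[Vec3]:
    """Apply the pose (rotation q, then translation t) to every point."""
    out = []
    for p in points:
        rx, ry, rz = rotate(q, p)
        out.append((rx + t[0], ry + t[1], rz + t[2]))
    return out


if __name__ == "__main__":
    # 90 degree rotation about the z axis, then a shift of 1 along x.
    s = math.sqrt(0.5)
    print(to_world([(1.0, 0.0, 0.0)], (0.0, 0.0, s, s), (1.0, 0.0, 0.0)))
```

Any error in the pose goes straight into the merged cloud, which is why the quality of the tracking matters so much for this approach.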

Swain answered 27/12, 2014 at 0:28 Comment(5)

I want to reconstruct a 3D model of a whole space. I was hoping that the device + the SDK layers above it would be able to supply a complete model to me with "successive accumulation of point cloud data". If I have to do it by myself, that's gonna be a hard task. – Minnesinger 27/12, 2014 at 9:18

And the fact that google demos already show they are doing that makes me greedy jealous - I do feel your pain - how about starting off cheap and superficial - a simple queue of stashed point clouds, gobble up as much memory as you can, delete the oldest, add the newest, render em all ? – Swain 27/12, 2014 at 14:2
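The "simple queue of stashed point clouds" idea can be sketched with a bounded deque that drops the oldest cloud automatically. This is only an illustration; `CloudQueue` is a made-up class, and it assumes each cloud has already been transformed into a common frame before being added.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

Vec3 = Tuple[float, float, float]


class CloudQueue:
    """Keep the most recent `max_clouds` point clouds and merge them on demand."""

    def __init__(self, max_clouds: int = 100) -> None:
        # A deque with maxlen discards the oldest entry when a new one is appended.
        self._clouds: Deque[List[Vec3]] = deque(maxlen=max_clouds)

    def add(self, cloud: Iterable[Vec3]) -> None:
        self._clouds.append(list(cloud))

    def merged(self) -> List[Vec3]:
        """All stored points, oldest cloud first."""
        return [p for cloud in self._clouds for p in cloud]


if __name__ == "__main__":
    q = CloudQueue(max_clouds=2)
    q.add([(0.0, 0.0, 1.0)])
    q.add([(0.0, 0.0, 2.0)])
    q.add([(0.0, 0.0, 3.0)])  # pushes out the first cloud
    print(q.merged())
```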

I'm sure they're gonna release what they are cooking sooner or later, it's just not ready yet. The ij component of the xyzij point cloud should actually provide a lot of extra information, so you might be able to reconstruct a triangulated model without PCL or similar packages. Or I'm thinking about how to help a PCL method with the ij data. – Minnesinger 28/12, 2014 at 21:58

Unfortunately, mapping is going to be more difficult than just accumulation. I've been tackling the same task on Xtion/SoftKinetic sensors, and it requires full-scale graph-based SLAM due to accumulation of error in pose. Although the Tango's pose tracking is really good, it still shows significant drift over time, so the task is to build a topological graph, make sure that you can detect when a previously visited place is rediscovered, and then optimize the whole problem. I have this working quite well with Xtion sensors, but Tango, frustratingly, is not giving me analogous data. – Copyread 29/12, 2014 at 1:28

Accumulation ain't so bad - I've now got a version of the point cloud sample running where it works with a queue of 100 point cloud samples - sitting in my chair and spinning around, I get a reasonably decent map of my office - walking through the house, things go downhill but the environment is still recognizable - before going to SLAM complexities you probably really want to write a noise filter - there's distinct erroneous data in the form of speckles, but it should be easily filterable by a simple neighborhood distance test – Swain 29/12, 2014 at 16:33
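One possible reading of the "simple neighborhood distance test" is to drop any point that has too few other points within a small radius. The sketch below does that with a hash grid so it stays usable on accumulated clouds; the function name `remove_speckles` and the `radius` and `min_neighbors` defaults are guesses for illustration and would need tuning to the cloud's units and density.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def remove_speckles(
    points: Sequence[Vec3], radius: float = 0.05, min_neighbors: int = 2
) -> List[Vec3]:
    """Keep points that have at least `min_neighbors` other points within `radius`."""

    def cell_of(p: Vec3) -> Cell:
        return (math.floor(p[0] / radius),
                math.floor(p[1] / radius),
                math.floor(p[2] / radius))

    # Bucket point indices into cubic cells of side `radius`, so that all
    # neighbors of a point lie in its own cell or one of the 26 adjacent ones.
    grid: Dict[Cell, List[int]] = defaultdict(list)
    for i, p in enumerate(points):
        grid[cell_of(p)].append(i)

    r2 = radius * radius
    kept: List[Vec3] = []
    for i, p in enumerate(points):
        cx, cy, cz = cell_of(p)
        count = 0
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                if j == i:
                    continue
                q = points[j]
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2
                if d2 <= r2:
                    count += 1
        if count >= min_neighbors:
            kept.append(p)
    return kept


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 1.0), (0.01, 0.0, 1.0), (0.0, 0.01, 1.0), (1.0, 1.0, 3.0)]
    print(remove_speckles(cloud))  # the isolated last point is dropped
```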
