I need to do a CPU-side, read-only processing pass on live camera data (just the Y plane), followed by rendering it on the GPU. Frames shouldn't be rendered until processing completes (so I don't always want to render the latest frame from the camera, just the latest one that the CPU side has finished processing). Rendering is decoupled from the camera processing and aims for 60 FPS even if camera frames arrive at a lower rate than that.
There's a related but higher-level question over at: Lowest overhead camera to CPU to GPU approach on android
To describe the current setup in a bit more detail: we have an app-side buffer pool for camera data where buffers are either "free", "in display", or "pending display". When a new frame arrives from the camera we grab a free buffer, store the frame in it (or a reference to it, if the actual data lives in some system-provided buffer pool), do the processing and stash the results in the buffer, then mark the buffer "pending display". On the renderer thread, if any buffer is "pending display" at the start of the render loop we latch it as the one "in display" instead, render the camera frame, and render the other content using the processed information calculated from that same camera frame.
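To make the states concrete, here's a minimal sketch of that pool; the names, the three-buffer count, and the processed-result payload are just illustrative placeholders:

```java
// Hypothetical sketch of the app-side pool described above.
enum BufferState { FREE, PENDING_DISPLAY, IN_DISPLAY }

final class FrameBuffer {
    BufferState state = BufferState.FREE;
    long timestampNs;   // camera frame timestamp, useful for sync checks
    float[] results;    // whatever the CPU-side pass produces (placeholder)
}

final class BufferPool {
    private final FrameBuffer[] buffers =
            { new FrameBuffer(), new FrameBuffer(), new FrameBuffer() };

    // Camera thread: grab a free buffer to process into (null => drop the frame).
    synchronized FrameBuffer acquireFree() {
        for (FrameBuffer b : buffers) {
            if (b.state == BufferState.FREE) return b;
        }
        return null;
    }

    // Camera thread: processing done; recycle any un-latched pending buffer and
    // make this one available to the renderer.
    synchronized void markPending(FrameBuffer done) {
        for (FrameBuffer b : buffers) {
            if (b.state == BufferState.PENDING_DISPLAY) b.state = BufferState.FREE;
        }
        done.state = BufferState.PENDING_DISPLAY;
    }

    // Render thread: latch the newest processed frame, or keep the current one.
    synchronized FrameBuffer latchForDisplay(FrameBuffer current) {
        for (FrameBuffer b : buffers) {
            if (b.state == BufferState.PENDING_DISPLAY) {
                if (current != null) current.state = BufferState.FREE;
                b.state = BufferState.IN_DISPLAY;
                return b;
            }
        }
        return current;  // nothing new; keep rendering the last latched frame
    }
}
```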
Thanks to @fadden's response on the question linked above I now understand that the "parallel output" feature of the Android camera2 API shares the buffers between the various output queues, so it shouldn't involve any copies of the data, at least on modern Android.
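For context, by "parallel output" I mean the standard camera2 pattern of adding both surfaces as targets of one repeating request; a rough sketch, with illustrative names and error handling elided:

```java
// Sketch: one capture session feeding both a SurfaceTexture (for GL) and an
// ImageReader (for the CPU-side read of the Y plane). The caller keeps the
// returned ImageReader alive and installs its OnImageAvailableListener.
ImageReader startParallelOutput(CameraDevice cameraDevice, SurfaceTexture surfaceTexture,
                                int width, int height, Handler cameraHandler)
        throws CameraAccessException {
    surfaceTexture.setDefaultBufferSize(width, height);
    Surface previewSurface = new Surface(surfaceTexture);

    ImageReader imageReader =
            ImageReader.newInstance(width, height, ImageFormat.YUV_420_888, /*maxImages=*/3);
    Surface readerSurface = imageReader.getSurface();

    cameraDevice.createCaptureSession(
            Arrays.asList(previewSurface, readerSurface),
            new CameraCaptureSession.StateCallback() {
                @Override public void onConfigured(CameraCaptureSession session) {
                    try {
                        CaptureRequest.Builder builder =
                                cameraDevice.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW);
                        builder.addTarget(previewSurface);
                        builder.addTarget(readerSurface);
                        session.setRepeatingRequest(builder.build(), null, cameraHandler);
                    } catch (CameraAccessException e) {
                        // handle failure
                    }
                }
                @Override public void onConfigureFailed(CameraCaptureSession session) { }
            },
            cameraHandler);
    return imageReader;
}
```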
In a comment there was a suggestion that I could latch the SurfaceTexture and ImageReader outputs at the same time and just "sit on the buffer" until the processing is complete. Unfortunately I don't think that works in my case, because of the decoupled rendering: it still needs to run at 60 FPS, and it still needs access to the previous frame whilst the new one is being processed so that things don't get out of sync.
One solution that has come to mind is having multiple SurfaceTextures, one in each of our app-side buffers (we currently use three). With that scheme, when a new camera frame arrives we would obtain a free buffer from our app-side pool, call acquireLatestImage() on an ImageReader to get the data for processing, and call updateTexImage() on the SurfaceTexture in that free buffer. At render time we just need to make sure the SurfaceTexture from the "in display" buffer is the one bound to GL, and everything should be in sync most of the time (as @fadden commented, there is a race between the updateTexImage() and acquireLatestImage() calls, but that window should be small enough to make mismatches rare, and they are perhaps detectable and fixable anyway using the timestamps in the buffers).
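A sketch of what I have in mind for the camera-thread side, assuming each app-side buffer (FrameBuffer above) also owns its own SurfaceTexture attached as a camera output, and that this thread has its own EGL context current (see below); processLuma() is a stand-in for our actual CPU pass:

```java
// Illustrative camera-thread handler; pool, imageReader and cameraHandler are
// the objects described above.
imageReader.setOnImageAvailableListener(reader -> {
    FrameBuffer buf = pool.acquireFree();
    if (buf == null) {                       // no free buffer: drop this frame
        Image dropped = reader.acquireLatestImage();
        if (dropped != null) dropped.close();
        return;
    }

    Image image = reader.acquireLatestImage();
    if (image == null) return;
    try {
        // CPU-side, read-only pass over the Y plane.
        Image.Plane yPlane = image.getPlanes()[0];
        buf.results = processLuma(yPlane.getBuffer(), image.getWidth(),
                                  image.getHeight(), yPlane.getRowStride());
        buf.timestampNs = image.getTimestamp();
    } finally {
        image.close();
    }

    // Latch the corresponding camera frame into this buffer's SurfaceTexture.
    buf.surfaceTexture.updateTexImage();
    if (buf.surfaceTexture.getTimestamp() != buf.timestampNs) {
        // The race @fadden mentioned: the texture and the Image came from
        // different frames. Detect it via timestamps, then drop or re-latch.
    }

    pool.markPending(buf);
}, cameraHandler);
```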
I note in the docs that updateTexImage() can only be called when the SurfaceTexture is bound to a GL context, which suggests I'll need a GL context on the camera processing thread too, so the camera thread can call updateTexImage() on the SurfaceTexture in the "free" buffer whilst the render thread is still able to render from the SurfaceTexture in the "in display" buffer.
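What I'm assuming that would look like (and it is only an assumption, not something I've validated): a second EGL context on the camera thread, created with the render thread's context as the share context so both threads can see the same texture names, and made current against a tiny pbuffer surface just so updateTexImage() has a context to work with.

```java
import android.opengl.EGL14;
import android.opengl.EGLConfig;
import android.opengl.EGLContext;
import android.opengl.EGLDisplay;
import android.opengl.EGLSurface;

// Hypothetical helper: give the camera thread its own EGL context, shared with
// the render thread's context, so updateTexImage() can be called there.
final class CameraThreadEgl {
    private EGLDisplay display;
    private EGLContext context;
    private EGLSurface pbuffer;

    // Call on the camera thread; shareContext is the render thread's EGLContext.
    void makeCurrent(EGLContext shareContext) {
        display = EGL14.eglGetDisplay(EGL14.EGL_DEFAULT_DISPLAY);
        int[] version = new int[2];
        EGL14.eglInitialize(display, version, 0, version, 1);

        int[] configAttribs = {
                EGL14.EGL_RENDERABLE_TYPE, EGL14.EGL_OPENGL_ES2_BIT,
                EGL14.EGL_SURFACE_TYPE, EGL14.EGL_PBUFFER_BIT,
                EGL14.EGL_NONE
        };
        EGLConfig[] configs = new EGLConfig[1];
        int[] numConfigs = new int[1];
        EGL14.eglChooseConfig(display, configAttribs, 0, configs, 0, 1, numConfigs, 0);

        int[] contextAttribs = { EGL14.EGL_CONTEXT_CLIENT_VERSION, 2, EGL14.EGL_NONE };
        context = EGL14.eglCreateContext(display, configs[0], shareContext, contextAttribs, 0);

        // A 1x1 pbuffer is enough; we never actually draw on this thread.
        int[] pbufferAttribs = { EGL14.EGL_WIDTH, 1, EGL14.EGL_HEIGHT, 1, EGL14.EGL_NONE };
        pbuffer = EGL14.eglCreatePbufferSurface(display, configs[0], pbufferAttribs, 0);

        EGL14.eglMakeCurrent(display, pbuffer, pbuffer, context);
    }
}
```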
So, to the questions:
- Does this seem like a sensible approach?
- Are SurfaceTextures basically a light wrapper around the shared buffer pool, or do they consume some limited hardware resource and so should be used sparingly?
- Are the SurfaceTexture calls all cheap enough that using multiple ones will still be a big win over just copying the data?
- Is the plan to have two threads with distinct GL contexts with a different SurfaceTexture bound in each likely to work or am I asking for a world of pain and buggy drivers?
It sounds promising enough that I'm going to give it a go, but I thought it worth asking here in case anyone (basically @fadden!) knows of any internal details I've overlooked that would make this a bad idea.