Deferred Rendering with Tile-Based Culling: Concept Problems

EDIT: I'm still looking for some help on the use of OpenCL or compute shaders. I would prefer to keep using OGL 3.3 and not have to deal with the bad driver support for OGL 4.3 and OpenCL 1.2, but I can't think of any way to do this type of shading without using one of the two (to match lights to tiles). Is it possible to implement tile-based culling without using GPGPU?

I wrote a deferred renderer in OpenGL 3.3. Right now I don't do any culling for the light pass (I just render a full-screen quad for every light). This (obviously) produces a ton of overdraw (sometimes ~100%). Because of this I've been looking into ways to improve performance during the light pass. It seems like the best way, in (almost) everyone's opinion, is to cull the scene using screen-space tiles. This was the method used in Frostbite 2. I read the presentation by Andrew Lauritzen from SIGGRAPH 2010 (http://download-software.intel.com/sites/default/files/m/d/4/1/d/8/lauritzen_deferred_shading_siggraph_2010.pdf), and I'm not sure I fully understand the concept (and, for that matter, why it's better than anything else, and whether it's better for me).

In the presentation Lauritzen goes over deferred shading with light volumes, quads, and tiles for culling the scene. According to his data, the tile-based deferred renderer was the fastest (by far). I don't understand why it is, though. I'm guessing it has something to do with the fact that for each tile, all the lights are batched together. In the presentation it says to read the G-Buffer once and then compute the lighting, but this doesn't make sense to me. In my mind, I would implement it like this:

for each tile {
  for each light affecting the tile {
    render quad (the tile) and compute lighting
    blend with previous tiles (GL_ONE, GL_ONE)
  }
}
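
A per-light pass for that first approach might look something like the GLSL 330 sketch below; the G-buffer layout, texture names, attenuation model, and position reconstruction are just assumptions for illustration:

#version 330 core
// Hypothetical per-light pass (approach 1 above): one tile-sized quad drawn
// per light per tile, blended additively (GL_ONE, GL_ONE). All names and the
// attenuation model are illustrative, not taken from the presentation.

uniform sampler2D gAlbedo;   // G-buffer: diffuse colour
uniform sampler2D gNormal;   // G-buffer: view-space normal
uniform sampler2D gDepth;    // G-buffer: depth, used to reconstruct position
uniform mat4 invProj;        // inverse projection, for position reconstruction

uniform vec3  lightPosVS;    // this light's position in view space
uniform vec3  lightColor;
uniform float lightRadius;

in vec2 vUV;
out vec4 fragColor;

vec3 reconstructPositionVS(vec2 uv, float depth) {
    vec4 clip = vec4(uv * 2.0 - 1.0, depth * 2.0 - 1.0, 1.0);
    vec4 view = invProj * clip;
    return view.xyz / view.w;
}

void main() {
    // Every light that touches this pixel re-reads the whole G-buffer.
    vec3 albedo = texture(gAlbedo, vUV).rgb;
    vec3 N      = normalize(texture(gNormal, vUV).xyz);
    vec3 P      = reconstructPositionVS(vUV, texture(gDepth, vUV).r);

    vec3  L     = lightPosVS - P;
    float dist  = length(L);
    L          /= dist;
    float atten = max(0.0, 1.0 - dist / lightRadius);

    fragColor = vec4(albedo * lightColor * max(dot(N, L), 0.0) * atten, 0.0);
}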

This would still involve sampling the G-Buffer a lot. I would think that doing so would give the same (if not worse) performance as rendering a screen-aligned quad for every light. From how it's worded, though, it seems like this is what's happening:

for each tile {
  render quad (the tile) and compute all lights affecting it
}
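
A GLSL 330 sketch of that single-pass-per-tile idea might look like the following. The uniform-array light list, the 32-light cap, and the shading model are assumptions made for clarity; the per-tile light list would more realistically be fed through a UBO or texture buffer filled by whatever does the binning:

#version 330 core
// Hypothetical single pass per tile (approach 2 above): the G-buffer is read
// once, then every light binned into this tile is accumulated in one loop.
// The uniform-array light list and the 32-light cap are illustrative choices.

#define MAX_LIGHTS_PER_TILE 32

uniform sampler2D gAlbedo;
uniform sampler2D gNormal;
uniform sampler2D gDepth;
uniform mat4 invProj;

uniform int   lightCount;                         // lights binned to this tile
uniform vec3  lightPosVS[MAX_LIGHTS_PER_TILE];
uniform vec3  lightColor[MAX_LIGHTS_PER_TILE];
uniform float lightRadius[MAX_LIGHTS_PER_TILE];

in vec2 vUV;
out vec4 fragColor;

vec3 reconstructPositionVS(vec2 uv, float depth) {
    vec4 clip = vec4(uv * 2.0 - 1.0, depth * 2.0 - 1.0, 1.0);
    vec4 view = invProj * clip;
    return view.xyz / view.w;
}

void main() {
    // One G-buffer read, shared by the whole light list for this tile.
    vec3 albedo = texture(gAlbedo, vUV).rgb;
    vec3 N      = normalize(texture(gNormal, vUV).xyz);
    vec3 P      = reconstructPositionVS(vUV, texture(gDepth, vUV).r);

    vec3 result = vec3(0.0);
    for (int i = 0; i < lightCount; ++i) {
        vec3  L     = lightPosVS[i] - P;
        float dist  = length(L);
        L          /= dist;
        float atten = max(0.0, 1.0 - dist / lightRadius[i]);
        result     += albedo * lightColor[i] * max(dot(N, L), 0.0) * atten;
    }
    fragColor = vec4(result, 1.0);
}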

But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs. Can anyone help me with this? It also seems like almost every tile-based deferred renderer uses compute shaders or OpenCL (to batch the lights). Why is this, and what would happen if I didn't use them?

Moonset answered 14/4, 2013 at 0:14

But I don't see how one would do this without exceeding the instruction limit for the fragment shader on some GPUs.

It rather depends on how many lights you have. The "instruction limits" are pretty high; it's generally not something you need to worry about outside of degenerate cases. Even if 100+ lights affect a tile, odds are fairly good that your lighting computations aren't going to exceed instruction limits.

Modern GL 3.3 hardware can run at least 65536 dynamic instructions in a fragment shader, and likely more. For 100 lights, that's still 655 instructions per light. Even if you take 2000 instructions to compute the camera-space position, that still leaves 635 instructions per light. Even if you were doing Cook-Torrance directly on the GPU, that's probably still sufficient.
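
For a sense of scale (the helper below and its instruction estimate are rough assumptions, not measurements): a per-light term with a Blinn-Phong specular lobe, of the kind a tile loop would call once per light, comes to only a few dozen ALU operations, far under a ~635-instruction-per-light budget:

// Back-of-the-envelope illustration of the budget above (all figures assumed):
//   65536 instructions / 100 lights                    ~ 655 per light
//   (65536 - 2000 for position reconstruction) / 100   ~ 635 per light
// Hypothetical per-light helper, meant to be called from a per-tile loop:
vec3 shadeLight(vec3 P, vec3 N, vec3 V, vec3 albedo, float shininess,
                vec3 lightPos, vec3 lightColor, float radius) {
    vec3  L       = lightPos - P;
    float dist    = length(L);
    L            /= dist;
    float atten   = max(0.0, 1.0 - dist / radius);
    float diffuse = max(dot(N, L), 0.0);
    vec3  H       = normalize(L + V);
    float spec    = pow(max(dot(N, H), 0.0), shininess);
    return lightColor * atten * (albedo * diffuse + vec3(spec));
}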

Depth answered 14/4, 2013 at 0:44
Interesting. At one point I did compute more than one light per shader (7), and on some graphics cards (particularly mid- and low-end laptop GPUs like the GT 540) it seemed like I exceeded the instruction limit. I came to this conclusion through a lot of trial and error, and I asked about it in one of my earlier posts. Was I wrong? – Moonset
In fact, I experienced the same issue using an NVidia 8600GT. I hit the limit with 20 lights or so. What kind of hardware are you talking about, @NicolBolas? – Smyrna
Hopefully I'll have some time to implement it this weekend. I'll see what happens. – Moonset
