When are VBOs faster than "simple" OpenGL primitives (glBegin())?
After many years of hearing about Vertex Buffer Objects (VBOs), I finally decided to experiment with them (my stuff isn't normally performance critical, obviously...)

I'll describe my experiment below, but to make a long story short, I'm seeing indistinguishable performance between "simple" immediate mode (glBegin()/glEnd()), vertex array (CPU side) and VBO (GPU side) rendering. I'm trying to understand why this is, and under what conditions I can expect to see VBOs significantly outshine their primitive (pun intended) ancestors.

Experiment Details

For the experiment, I generated a (static) 3D Gaussian cloud of a large number of points. Each point has vertex & color information associated with it. Then I rotated the camera around the cloud in successive frames in sort of an "orbiting" behavior. Again, the points are static, only the eye moves (via gluLookAt()). The data are generated once prior to any rendering & stored in two arrays for use in the rendering loop.
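The asker's code was not posted, so the following is only a rough sketch of how such a static data set could be built: two flat float arrays (positions and colors, 3 floats per point), which is the layout glVertexPointer()/glColorPointer() and glDrawArrays() can consume. The names `PointCloud` and `makeGaussianCloud`, the `sigma` and `seed` parameters, and the random colors are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical container: two flat arrays with 3 floats per point.
struct PointCloud {
    std::vector<float> positions; // x, y, z per point
    std::vector<float> colors;    // r, g, b per point
};

// Generate `count` points from an isotropic 3D Gaussian centered at the origin.
// sigma and seed are illustrative parameters, not values from the question.
PointCloud makeGaussianCloud(std::size_t count, float sigma = 1.0f,
                             unsigned seed = 42u) {
    assert(sigma > 0.0f);
    std::mt19937 rng(seed);
    std::normal_distribution<float> gauss(0.0f, sigma);
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);

    PointCloud cloud;
    cloud.positions.reserve(count * 3);
    cloud.colors.reserve(count * 3);
    for (std::size_t i = 0; i < count; ++i) {
        for (int axis = 0; axis < 3; ++axis) {
            cloud.positions.push_back(gauss(rng));
        }
        for (int channel = 0; channel < 3; ++channel) {
            cloud.colors.push_back(unit(rng));
        }
    }
    return cloud;
}
```

Because the cloud is generated once and never modified, the same arrays can feed all three rendering paths, which keeps the comparison about submission cost rather than data generation.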

For immediate-mode rendering, the entire data set is rendered in a single glBegin()/glEnd() block with a loop containing a single call each to glColor3fv() and glVertex3fv().

For vertex array and VBO rendering, the entire data set is rendered with a single glDrawArrays() call.

Then, I simply run it for a minute or so in a tight loop and measure average FPS with the high performance timer.
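As a hedged illustration of that measurement (not the asker's actual timing code), average FPS is just frames divided by elapsed seconds. The sketch below uses std::chrono::steady_clock as the high-resolution timer; `drawFrame` is a hypothetical callback standing in for "render one frame and swap buffers", and `runSeconds` is assumed to be positive.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average frames per second over a run: frames / elapsed seconds.
double averageFps(long frames, std::chrono::duration<double> elapsed) {
    assert(elapsed.count() > 0.0);
    return static_cast<double>(frames) / elapsed.count();
}

// Calls drawFrame() in a tight loop for about runSeconds (must be > 0)
// and returns the average FPS. drawFrame is a hypothetical stand-in for
// the real render + buffer swap.
double measureFps(const std::function<void()>& drawFrame, double runSeconds) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    long frames = 0;
    std::chrono::duration<double> elapsed{0.0};
    do {
        drawFrame();
        ++frames;
        elapsed = Clock::now() - start;
    } while (elapsed.count() < runSeconds);
    return averageFps(frames, elapsed);
}
```

Since OpenGL commands are queued and executed asynchronously, a real loop may want to call glFinish() before the final clock read so that pending work is included in the elapsed time.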

Performance Results

As mentioned above, performance was indistinguishable on both my desktop machine (XP x64, 8GB RAM, 512 MB Quadro 1700) and my laptop (XP32, 4GB RAM, 256 MB Quadro NVS 110). It did scale as expected with the number of points, however. I also disabled vsync.

Specific results from laptop runs (rendering w/GL_POINTS):

glBegin()/glEnd():

  • 1K pts --> 603 FPS
  • 10K pts --> 401 FPS
  • 100K pts --> 97 FPS
  • 1M pts --> 14 FPS

Vertex Arrays (CPU side):

  • 1K pts --> 603 FPS
  • 10K pts --> 402 FPS
  • 100K pts --> 97 FPS
  • 1M pts --> 14 FPS

Vertex Buffer Objects (GPU side):

  • 1K pts --> 604 FPS
  • 10K pts --> 399 FPS
  • 100K pts --> 95 FPS
  • 1M pts --> 14 FPS

I rendered the same data with GL_TRIANGLE_STRIP and got similarly indistinguishable results (though slower, as expected, due to the extra rasterization). I can post those numbers too if anybody wants them.

Question(s)

  • What gives?
  • What do I have to do to realize the promised performance gain of VBOs?
  • What am I missing?
Antecedence answered 10/1, 2009 at 4:29 Comment(8)
Maybe nVidia's drivers are optimising things behind your back? The problem there is it's impossible to test that in isolation...Cupulate
That thought occurred to me as well. I also thought that perhaps Quadro vs. GeForce (i.e. professional vs. gamer card) might have something to do with it.Antecedence
How can they optimise the glBegin()/glEnd() version? @Shy: Out of curiosity, why did you edit the question? @Drew Hall: Could you post the program or the source somewhere? I'd like to give it a try (and maybe watch the cloud for a minute or two.)Inconsequent
@aib: Sorry, I'd like to but I can't post the code. I'll see if I can post an executable or at least a screenshot somewhere.Antecedence
@Drew Hall: Never mind, ended up building one myself :)Inconsequent
Paste the code that does the draw - the glBegin() and everything up until the glEnd() - and the code that creates the VBOs. It would be nice to get to the bottom of this, no matter how old it is.Coronach
Is the VBO generated once per frame or once at the beginning of the program? I would expect to see these results if you were reloading the VBO (via glBufferData()) every frame.Grandmotherly
@OldPeculier: Once, at the beginning of the program.Antecedence
A
28

Many factors affect 3D rendering performance. Usually there are four bottlenecks:

  • CPU (creating vertices, API calls, everything else)
  • Bus (CPU<->GPU transfer)
  • Vertex (vertex shader or fixed-function pipeline execution)
  • Pixel (fill, fragment shader execution and ROPs)

Your test is giving skewed results because you have plenty of CPU (and bus) headroom while maxing out vertex or pixel throughput. VBOs are used to lower CPU load (fewer API calls, DMA transfers that run in parallel with the CPU). Since you are not CPU bound, they don't give you any gain. This is optimization 101. In a game, for example, CPU time becomes precious because it is needed for other things like AI and physics, not just for issuing tons of API calls. It is easy to see that writing vertex data (3 floats, for example) directly to a memory pointer is much faster than calling a function that writes 3 floats to memory: at the very least you save the cycles for the call.
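To make the last point concrete, here is a CPU-only sketch that models the two submission styles. It is not OpenGL code: `CommandBuffer`, `emitVertex`, `submitPerVertex` and `submitBulk` are hypothetical names, with `emitVertex` playing the role of a per-vertex call such as glVertex3fv() and `submitBulk` playing the role of a single array submission.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a driver-side buffer; not an OpenGL type.
struct CommandBuffer {
    std::vector<float> data;
};

// Immediate-mode style: one call per vertex, each appending 3 floats.
void emitVertex(CommandBuffer& cb, const float* xyz) {
    cb.data.insert(cb.data.end(), xyz, xyz + 3);
}

// Issues `count` separate calls, one for every vertex.
void submitPerVertex(CommandBuffer& cb, const float* vertices,
                     std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        emitVertex(cb, vertices + 3 * i);
    }
}

// Array style: a single call that copies the whole block at once.
void submitBulk(CommandBuffer& cb, const float* vertices, std::size_t count) {
    if (count == 0) {
        return;
    }
    const std::size_t oldSize = cb.data.size();
    cb.data.resize(oldSize + 3 * count);
    std::memcpy(cb.data.data() + oldSize, vertices,
                3 * count * sizeof(float));
}
```

Both functions end up with identical buffer contents; the difference is only in how many calls the CPU makes to get there. In this toy version a compiler may inline `emitVertex`, whereas a real GL entry point lives in the driver and typically cannot be inlined, so the per-call cost in practice is higher than this model suggests.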

Acea answered 10/1, 2009 at 5:46 Comment(7)
My understanding was that Vertex Arrays (GL 1.1) were used to lower CPU (minimize function calls), while VBOs built on that to also knock down bus activity. I thought that my experiment would be bus bound (or CPU bound for simple glBegin() drawing) but I guess I was wrong. Can you comment? Thks!Antecedence
VBOs will only lower bus activity if your geometry is static, that is, reused across frames. Also make sure that you flag them write-only in that case. Further reading: developer.nvidia.com/object/using_VBOs.htmlAcea
@starmole: Great link. I had the VBOs flagged as STATIC_DRAW, but hadn't marked them as WRITE_ONLY (I wasn't using glMapBuffer()). I'll make the change & see what happens. Thanks for the tip!Antecedence
@starmole: I tried mapping as WRITE_ONLY (actually, tried all three modes) with no effect. Did a quick map/unmap immediately after glBufferData--did I do that right? Still confused... :(Antecedence
@Drew Hall: and the results were? (on seeing this question just now, the immediate thought was "that's classic VBO-not-marked-as-static-draw" bug)Coronach
@Will: No difference, no matter how I set the flags. I still haven't resolved this one, hence still haven't accepted an answer.Antecedence
@Drew Hall: strange. Here's some other empirical measures in a game engine that is evaluating moving from vertex arrays to VBOs: glest.org/glest_board/index.php?topic=6260.msg65644#msg65644Coronach
W
10

There might be a few things missing:

  1. It's a wild guess, but your laptop's card might not support this kind of operation in hardware at all (i.e. the driver emulates it).

  2. Are you copying the data to the GPU's memory (via glBufferData() on GL_ARRAY_BUFFER with either the GL_STATIC_DRAW or GL_DYNAMIC_DRAW usage hint), or are you using a pointer to a main-memory (non-GPU) array? (The latter requires copying the data every frame, so performance is slow.)

  3. Are you passing indices as another buffer, uploaded via glBufferData() with the GL_ELEMENT_ARRAY_BUFFER target?

If those three things are done, the performance gain is big. For Python (w/ PyOpenGL) it's about 1000 times faster on arrays bigger than a couple hundred elements; for C++ it's up to 5 times faster, but only on arrays of 50k-10m vertices.

Here are my test results for C++ (Core2Duo/8600GTS). The vbo and glb/e (glBegin()/glEnd()) columns are FPS, and ratio is vbo divided by glb/e:

 pts   vbo glb/e  ratio
 100  3900  3900   1.00
  1k  3800  3200   1.18
 10k  3600  2700   1.33
100k  1500   400   3.75
  1m   213    49   4.34
 10m    24     5   4.80

So even with 10m vertices the framerate was normal, while with glBegin()/glEnd() it was sluggish.

Womanish answered 14/2, 2009 at 15:14 Comment(0)
B
2

From reading the Red Book, I remember a passage that stated that VBOs are possibly faster depending on the hardware. Some hardware optimizes those, while others don't. It's possible that your hardware doesn't.

Blamed answered 13/1, 2009 at 19:21 Comment(2)
Thanks. It's hard to see how keeping the data resident on the card wouldn't always be faster (even without significant extra optimization in the driver), but I guess I'm having trouble getting my code to be "bus bound".Antecedence
@Coronach Mc (again): Also hard to imagine that Nvidia wouldn't be somewhere near the cutting edge in terms of implementing VBO optimizations. Seems more likely that they've (also) found a way to optimize the direct path to me.Antecedence
S
1

14M points/s is not a whole lot. It's suspect. Can we see the complete code doing the drawing, as well as the initialisation? (Compare that 14M/s to the 240M/s (!) that Slava Vishnyakov gets.) It's even more suspicious that it drops to about 600K/s at 1K points (compared with his 3.8M/s, which looks capped by the ~3800 SwapBuffers calls per second anyway).

I'd bet the test does not measure what you think it measures.
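The throughput figures above come from multiplying points per frame by frames per second. A minimal sketch of that arithmetic, using the FPS numbers posted in this thread (the function name `pointsPerSecond` is just an illustrative choice):

```cpp
#include <cassert>

// Points submitted per second = points per frame * frames per second.
double pointsPerSecond(double pointsPerFrame, double fps) {
    assert(pointsPerFrame >= 0.0 && fps >= 0.0);
    return pointsPerFrame * fps;
}

// Figures taken from the numbers posted in this thread:
//   question, laptop:    1K points  @  603 FPS ->  603,000 points/s
//   question, laptop:    1M points  @   14 FPS ->  14,000,000 points/s
//   Core2Duo/8600GTS:    1k points  @ 3800 FPS ->  3,800,000 points/s
//   Core2Duo/8600GTS:    10m points @   24 FPS ->  240,000,000 points/s
```

For example, `pointsPerSecond(1e6, 14.0)` gives 14 million points per second, against 240 million for `pointsPerSecond(1e7, 24.0)`, which is the gap the answer is pointing at.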

Smear answered 19/11, 2009 at 21:10 Comment(0)
S
-2

Assuming I remember this right, my OpenGL teacher, who is well known in the OpenGL community, said they are faster on static geometry that is going to be rendered many times. In a typical game this would be tables, chairs, and other small static entities.

Socha answered 13/1, 2009 at 19:24 Comment(0)
