I've been investigating an issue in my DirectX 11 C++ application for over a week now, and so I'm turning to the good people on StackOverflow for any insight that may help me track this one down.
My application will run mostly at 60-90 frames per second, but every few seconds I'll get a frame that takes around a third of a second to finish. After much investigation, debugging, and using various code profilers, I have narrowed it down to calls to the DirectX API. However, from one slow frame to the next, it is not always the same API call that causes the slowdown. In my latest run, the calls that stall (always for about a fifth of a second) are
- ID3D11DeviceContext::UpdateSubresource
- ID3D11DeviceContext::DrawIndexed
- IDXGISwapChain::Present
Not only is it not always the same function that stalls, but for each of these functions (mainly the first two) the slow call may come from different places in my code from one occurrence to the next.
According to multiple profiling tools and the high-resolution timers I placed in my code to help measure things, this "hiccup" occurs at consistent intervals of just under 3 seconds (~2.95 s).
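For reference, this is roughly the shape of the timing I wrap around the suspect calls (a simplified sketch; the helper names are illustrative and the swap chain is assumed to be created elsewhere):

```cpp
// Minimal sketch of the high-resolution timing wrapped around a suspect call.
// Simplified for illustration; the real code records results rather than printing.
#include <windows.h>
#include <d3d11.h>
#include <cstdio>

// Milliseconds elapsed between two QueryPerformanceCounter samples.
static double ElapsedMs(LARGE_INTEGER start, LARGE_INTEGER end)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return 1000.0 * double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
}

// The same pattern is used around UpdateSubresource and DrawIndexed.
static void TimedPresent(IDXGISwapChain* swapChain)
{
    LARGE_INTEGER t0, t1;
    QueryPerformanceCounter(&t0);
    swapChain->Present(0, 0);
    QueryPerformanceCounter(&t1);

    const double ms = ElapsedMs(t0, t1);
    if (ms > 100.0)   // the hiccups are ~200-300 ms, so this threshold catches them
        printf("Slow Present: %.1f ms\n", ms);
}
```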
This application collects data from external hardware and uses DirectX to visualize that data in real time. While the application is running, the hardware may be idle or running at various speeds. The faster the hardware goes, the more data is collected and must be visualized. I point this out because it may be useful when considering some of the characteristics of this bug:
- The long frames don't occur while the hardware is idle. This makes sense to me because the software just has to redraw data it already has and doesn't have to transfer new data over to the GPU.
- However, the long frames occur at these consistent ~3-second intervals regardless of the speed the hardware is running at. So even if my application is collecting twice the amount of data per second, the frequency of the long frames doesn't change.
- The duration of these long frames is very consistent: always between 0.25 and 0.3 seconds. (I believe it is the slow DirectX API call itself that is consistent, so any variation in the overall frame duration comes from outside that call.)
- While field testing last week (when I first discovered the issue), I noticed on a couple of runs that after a long stretch of continuous testing (probably 20 minutes or more) without interacting with the program much beyond watching it, the hiccup would go away. It would come back if we interacted with certain features of the application or restarted the program. This doesn't make sense to me; it's almost as if the GPU "figured out" and fixed the issue, then reverted when we changed the pattern of work it had been doing. Unfortunately, the nature of our hardware makes it difficult for me to replicate this in a lab environment.
This bug occurs consistently on two different machines with very similar hardware (dual GTX 580 cards). However, in earlier versions of the application this issue did not occur. Unfortunately, the code has undergone many changes since then, so it would be difficult to pinpoint which specific change is causing the issue.
I considered the graphics driver and updated to the latest version, but that didn't make a difference. I also considered the possibility that some other change made to both computers, or an update to software running on both of them, could be causing issues with the GPU. But I can't think of anything other than Microsoft Security Essentials that runs on both machines while the application is running, and I've already tried disabling its Real-Time Protection feature to no avail.
While I would love for the cause to be an external program that I can just turn off, ultimately I worry that I must be doing something incorrect with the DirectX API that is causing the GPU to have to make adjustments every few seconds. Maybe I am doing something wrong in the way I update data on the GPU (since the lag only happens when I'm collecting data to display), the GPU then stalls every few seconds, and whatever API function happens to get called during a stall can't return as quickly as it normally would?
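To make "updating data on the GPU" concrete, here is a simplified sketch of the kind of per-frame update path I'm describing. The Vertex layout, buffer usage, and function names are illustrative rather than my actual code, and Path B is only an alternative I've been wondering about, not something I've verified makes a difference:

```cpp
// Illustrative sketch of two common ways to push new vertex data each frame.
// The Vertex layout here is a placeholder, not my real structure.
#include <d3d11.h>
#include <cstring>

struct Vertex { float pos[3]; float color[4]; };  // placeholder layout

// Path A: roughly what I do now -- UpdateSubresource into a DEFAULT-usage buffer.
void UpdateWithUpdateSubresource(ID3D11DeviceContext* ctx, ID3D11Buffer* buffer,
                                 const Vertex* newData, UINT vertexCount)
{
    D3D11_BOX box = {};                        // byte range of the buffer to update
    box.right  = vertexCount * sizeof(Vertex);
    box.bottom = 1;
    box.back   = 1;
    ctx->UpdateSubresource(buffer, 0, &box, newData, 0, 0);
}

// Path B: an alternative -- a DYNAMIC buffer mapped with WRITE_DISCARD.
void UpdateWithMapDiscard(ID3D11DeviceContext* ctx, ID3D11Buffer* dynamicBuffer,
                          const Vertex* newData, UINT vertexCount)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(dynamicBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        memcpy(mapped.pData, newData, vertexCount * sizeof(Vertex));
        ctx->Unmap(dynamicBuffer, 0);
    }
}
```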
Any suggestions would be greatly appreciated!
Thanks, Tim
UPDATE (2013.01.21):
I finally gave in and searched back through previous revisions of my application until I found a point where this bug wasn't occurring, then went revision by revision until I found exactly when it started happening and managed to pinpoint the source of the issue. The problem started occurring after I added an unsigned integer field to a vertex type that I allocate a large vertex buffer of. Because of the size of that buffer, the change increased its size by 184.65 MB (from 1107.87 MB to 1292.52 MB). Because I do in fact need this extra field in my vertex structure, I found other ways to cut back on the overall vertex buffer size, and got it down to 704.26 MB.
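For anyone curious about the scale involved, the numbers work out as follows, assuming the added unsigned integer is 4 bytes (just back-of-the-envelope arithmetic from the sizes above):

```cpp
// Back-of-the-envelope check on the buffer sizes quoted above.
// Assumes the added field is a 4-byte unsigned int; the strides are derived, not measured.
#include <cstdio>

int main()
{
    const double oldMB = 1107.87;            // vertex buffer size before the change
    const double newMB = 1292.52;            // size after adding the unsigned int field
    const double addedBytesPerVertex = 4.0;  // sizeof(unsigned int)

    const double vertexCount = (newMB - oldMB) * 1024.0 * 1024.0 / addedBytesPerVertex;
    const double oldStride   = oldMB * 1024.0 * 1024.0 / vertexCount;

    // Prints roughly 48.4 million vertices at a 24-byte stride (28 bytes after the change).
    printf("vertices: %.1fM, old stride: %.1f bytes, new stride: %.1f bytes\n",
           vertexCount / 1e6, oldStride, oldStride + addedBytesPerVertex);
    return 0;
}
```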
My best guess is that the addition of that field and the extra memory it required caused me to exceed some threshold/limit on the GPU. I'm not sure whether it was an excess of total memory allocation or of some limit on a single vertex buffer. Either way, it seems this excess caused the GPU to do some extra work every few seconds (maybe communicating with the CPU), and my calls to the API had to wait on it. If anyone has any information that would clarify the implications of large vertex buffers, I'd love to hear it!
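In case it helps anyone reason about the memory side of this, here is the kind of standard DXGI check one can use to compare total allocations against the card's dedicated video memory (nothing here is specific to my application):

```cpp
// Query the first adapter's dedicated video memory via DXGI, to compare against
// the total size of the buffers being allocated.
#include <dxgi.h>
#include <cstdio>
#pragma comment(lib, "dxgi.lib")

void PrintAdapterMemory()
{
    IDXGIFactory* factory = nullptr;
    if (FAILED(CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory)))
        return;

    IDXGIAdapter* adapter = nullptr;
    if (SUCCEEDED(factory->EnumAdapters(0, &adapter)))
    {
        DXGI_ADAPTER_DESC desc = {};
        adapter->GetDesc(&desc);
        printf("Dedicated video memory: %llu MB\n",
               (unsigned long long)(desc.DedicatedVideoMemory / (1024 * 1024)));
        adapter->Release();
    }
    factory->Release();
}
```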
Thanks to everyone who gave me their time and suggestions.