Will moving code into kernel space give more precise timing?

Background information:

I presently have a hardware device that connects to the USB port. The hardware device is responsible for sending out precise periodic messages onto the various networks that it, in turn, connects to. Inside the hardware device are a couple of Microchip dsPICs. There are two modes of operation.

In the first mode, we send simple "jobs" down to the dsPICs, which, in turn, send out the precise messages with 0.001 ms accuracy. This architecture is not ideal for more complex messaging, where we need to send a periodic packet that changes based on events going on within the PC application. So we have a second mode of operation in which our PC application sends the periodic messages and the dsPICs simply convert and transmit them in response. All of this, by the way, is transparent to the end user of our software. Our hardware device is a test tool used in the automotive field.

Currently, we use a USB to serial chip from FTDI and the FTDI Windows drivers to interface the hardware to our PC software.

The problem is that in mode two, where we send messages from the PC, the best accuracy we are able to achieve is around 1 ms on average, because we are subject to Windows kernel preemption. I've tried a number of "tricks" to improve things, such as the following (a sketch of these Win32 calls appears after the list):

  1. Making sure our reader and writer threads are pinned to separate CPUs (via affinity masks) when possible.
  2. Increasing the thread priority of the writer while reducing that of the reader.
  3. Informing the user to turn off the screen saver and other applications when using our software.
  4. Replacing CreateThread calls with CreateTimerQueueTimer calls.
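
Here is a minimal sketch of the Win32 calls behind tricks 1, 2, and 4 (the thread handles, affinity masks, and the 1 ms period are placeholders; error handling omitted):

    #include <windows.h>

    /* Timer-queue callback: in our design this would build and write the
       next periodic message; the body here is a placeholder. */
    VOID CALLBACK PeriodicSend(PVOID param, BOOLEAN timerFired)
    {
        (void)param; (void)timerFired;
        /* write the next message to the FTDI port */
    }

    void ConfigureWriter(HANDLE writerThread, HANDLE readerThread)
    {
        HANDLE timer = NULL;

        /* Trick 1: pin the reader and writer to different CPUs. */
        SetThreadAffinityMask(writerThread, 0x1);   /* CPU 0 */
        SetThreadAffinityMask(readerThread, 0x2);   /* CPU 1 */

        /* Trick 2: raise the writer, lower the reader. */
        SetThreadPriority(writerThread, THREAD_PRIORITY_TIME_CRITICAL);
        SetThreadPriority(readerThread, THREAD_PRIORITY_BELOW_NORMAL);

        /* Trick 4: a timer-queue timer with a 1 ms period. */
        CreateTimerQueueTimer(&timer, NULL /* default queue */, PeriodicSend,
                              NULL, 0 /* due immediately */, 1 /* 1 ms */,
                              WT_EXECUTEINTIMERTHREAD);
    }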

All our software is written in C/C++. I'm very familiar and comfortable with advanced Windows programming, such as I/O completion ports, overlapped I/O, lockless thread queues (really a design strategy), sockets, threads, semaphores, etc.

However, I know nothing about Windows driver development. I've read through a few papers on KMDF vs. UMDF vs. WDM.

I'm hoping a seasoned Windows kernel mode driver developer will respond here...

The next rev. of our hardware has the option to replace the FTDI chip and either use the dsPIC's USB interface or, possibly, port the open-source Linux FTDI code to Windows and continue to use the FTDI chip within our own custom driver. I think that by going to a kernel-mode driver on the PC side, I can send out periodic messages at more precise intervals without preemption, and possibly take advantage of DMA.

We have a competitor in our business who, I think, does something very similar with their tools. As far as I know, user-space applications cannot schedule a thread with better than 1 ms precision. We currently use timeGetTime in a thread; I've experimented with timer queues (via CreateTimerQueueTimer) with no real improvement. A sketch of our current approach follows.
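
For context, here is a minimal sketch of the timeGetTime-based loop we use today (timeBeginPeriod(1) raises the multimedia timer resolution to roughly 1 ms, which is exactly the floor we keep hitting):

    #include <windows.h>
    #include <mmsystem.h>
    #pragma comment(lib, "winmm.lib")

    volatile LONG g_running = 1;

    void SendLoop(void)
    {
        DWORD next;

        timeBeginPeriod(1);              /* request ~1 ms timer granularity */
        next = timeGetTime();
        while (g_running) {
            next += 1;                   /* 1 ms period target */
            /* wrap-safe wait; still subject to kernel preemption */
            while ((LONG)(next - timeGetTime()) > 0 && g_running)
                Sleep(0);                /* yield the rest of our quantum */
            /* write the next periodic message to the FTDI port here */
        }
        timeEndPeriod(1);
    }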

Is a WDM driver the correct approach to achieve more precise timing?

Our competitor is somehow achieving very precise timing from Windows-driven signals to their hardware, and they do load a kernel driver (.sys); their device runs over USB 2.0, as does ours.

If WDM is the way to go, can I get some advice on which kernel functions I should be studying for setting up the timings? Thanks for reading.

Tanaka answered 25/12, 2011 at 0:13 Comment(6)
You're still not going to get real-time timing on Windows. How precise does this need to be?Lummox
see original post, we're looking to get 0.001 ms accuracy. A typical DMA setup runs independently of the host CPU and can clock out data at a configured rate. What I don't know is whether I need a DMA setup or whether I can use something like the RTC to schedule timed interrupts. I think I read that you can use the Windows equivalent of Linux's spin_lock_irqsave, in which you disable IRQs while you process data. It's complicated, I know, but our current user-space app is preempted in and out of the thread context. And bottom line, our competitor is pulling it off!Tanaka
Eric: The equivalent here is to raise the IRQL to dispatch level via KeRaiseIrqlToDpcLevelTriform
Isn't USB polled only at a certain rate? Like 125 Hz? Did you try increasing this system setting? It may be more sensible to do timestamping in the actual hardware, and batch-transmit timestamped data to the PC, since you will not get realtime out of Windows anyway.Prostrate
what about DMA? Could I set up a DMA transfer at a specific rate and just clock out the data using the current FTDI Windows driver? I also do embedded work on the Freescale i.MX, and I wrote a Linux driver that uses the i.MX processor's SDMA to clock out data over the SSI port. I originally tried the bit-bang approach and was preempted; going the DMA route solved the preemption issue. Windows on x86 hardware has no such SSI port, but I'm curious to see whether DMA can be used to clock out data similarly to a USB device?Tanaka
Is this USB 2.0? If so, I think you will run into latency issues due to the polling architecture of USB. You may want to consider redesigning your device for USB 3.0 or PCIe.Roulade

In kernel mode, you have the luxury of getting a DPC triggered at multiples of 100-nanosecond intervals without dealing with interrupts. A DPC cannot be preempted by the thread scheduler, because the thread scheduler itself runs as a DPC; an interrupt can still preempt a DPC, though. So an interval value of 10 should do the trick for you to get a callback with utmost precision.

However, you don't have access to many features at DPC level, such as paged memory or a specific thread's memory space, because DPCs run in arbitrary thread context. It could be useful to defer processing to your own user-mode process's context using an APC, which has access to more features.
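
Here's a minimal WDM-style sketch of the timer-plus-DPC idea (the names and the 10 × 100 ns re-arm value are illustrative only; note that on a stock system a KTIMER's expiry granularity is still bounded by the system clock interval, so genuinely sub-millisecond behavior needs extra care):

    #include <ntddk.h>

    static KTIMER g_Timer;
    static KDPC   g_Dpc;

    /* Runs at DISPATCH_LEVEL in arbitrary thread context: no paged
       memory, no waits; keep the work short. */
    VOID PeriodicDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        LARGE_INTEGER due;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Context);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        /* ... queue the next outgoing message here (non-paged data only) ... */

        /* Re-arm: a negative due time is relative, in 100 ns units. */
        due.QuadPart = -10;  /* 10 * 100 ns = 1 us */
        KeSetTimer(&g_Timer, due, &g_Dpc);
    }

    VOID StartPeriodicTimer(VOID)
    {
        LARGE_INTEGER due;
        due.QuadPart = -10;

        KeInitializeDpc(&g_Dpc, PeriodicDpc, NULL);
        KeInitializeTimer(&g_Timer);
        KeSetTimer(&g_Timer, due, &g_Dpc);
    }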

Kernel threads don't get any special treatment in terms of priority; they are the same as user threads from the scheduler's perspective. There are a couple more higher-priority levels that kernel threads can use, but usually no kernel thread uses any of them. I don't think your bottleneck is thread priority. It doesn't matter how big your priority number is; having just one level above everyone else is enough for you to become the "god thread" that receives top priority. And having the highest priority doesn't mean you'll get continuous attention: the OS will still pause your thread to run others so that quantum starvation does not occur.

Another note on Windows preemption behavior: the Balance Set Manager temporarily boosts a thread's priority when the thread is signaled by an asynchronous event (GUI click, timer trigger, I/O completion) to allow the completion code to finish its processing with less preemption. Using an async timer handler should give enough of a boost to prevent preemption for at least a quantum. I wonder why your code does not fall into that window. However, it seems you are not the only one having problems with timer precision: http://www.virtualdub.org/blog/pivot/entry.php?id=272

I agree with Paul on the complexity of driver development, but as long as you have a good justification, it's not rocket science, just more effort.

Helga answered 23/1, 2012 at 20:18 Comment(0)

This is one of the fundamental design aspects of the Windows kernel: code running at passive level (which includes all user-mode code) is subject to DPCs and interrupts taking up time, and if you want 1 µs accuracy, you're probably not going to get it with either a UMDF driver or ordinary user-mode code.

However, writing a kernel driver is not a light or cheap undertaking. It is difficult both to write and to ensure that it works on your customers' machines (a lot of testing is required). Getting it right will cost you significant engineering resources.

As a stopgap, I'd look into MMCSS for Vista and later (http://msdn.microsoft.com/en-us/library/windows/desktop/ms684247(v=vs.85).aspx); it may give your thread enough priority that you can be satisfied.
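
A sketch of what opting into MMCSS looks like ("Pro Audio" is one of the standard task profiles registered under the SystemProfile\Tasks registry key; pick whichever profile fits your workload):

    #include <windows.h>
    #include <avrt.h>
    #pragma comment(lib, "avrt.lib")

    DWORD WINAPI WriterThread(LPVOID param)
    {
        DWORD taskIndex = 0;
        /* Register this thread with MMCSS so the scheduler favors it
           over ordinary threads; returns NULL on failure. */
        HANDLE mmcss = AvSetMmThreadCharacteristicsW(L"Pro Audio", &taskIndex);
        (void)param;

        /* ... periodic send loop ... */

        if (mmcss)
            AvRevertMmThreadCharacteristics(mmcss);
        return 0;
    }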

If you really want to go down the rabbit hole, KMDF is what you should be using. KMDF is a framework on top of WDM that codifies a lot of best practices for drivers. Unless you're absolutely forced to do otherwise, KMDF is the way to go for drivers. And to be honest, you're almost certainly going to want either to contract with OSR (http://www.osr.com) or to hire someone (several people?) experienced in writing Windows drivers.

Triform answered 26/12, 2011 at 22:3 Comment(5)
gave -1 because of the comment about writing a kernel driver not being light or cheap. Not sure what value these sorts of comments add; it seems to suggest I don't know this already. I'm actually an embedded engineer and know quite a bit about low-level I/O and timing, but I just don't have the Win32 knowledge. It would be nice if people could just answer technical questions with straightforward technical answers without giving career advice. I know the x86 platform has a plethora of DMA; what I don't know is what I can do with it under the Win32/Win64 kernels.Tanaka
I gave you advice on how to solve your business problem in the most efficient way. What you are saying clearly demonstrates you don't appreciate the magnitude of writing a Windows kernel driver, and I'm trying to do you a favor by giving you adequate warning.Triform
I didn't ask about a business problem. I asked a technical question, and you went on about the complexity, then suggested, and continue to suggest, that I don't "appreciate the magnitude." So you pretend to know my business and my capabilities. It's at best counterproductive and at worst rude and insulting. You should learn to restrain yourself and, if you really knew what you were talking about, provide an answer worth discussing that others could learn from as well. We all have to learn or start someplace, and asking a question on a public forum should not imply what you said to me.Tanaka
It's not just you, I hope that everyone learns that writing Windows kernel drivers is extremely expensive. And I assure you, having worked in kernel and drivers at Microsoft for four years, that I do know what I'm talking about.Triform
ok, my last response & suggestion to you is to start a blog. You lead by showing, not by telling. If it's that complicated, show the complicated decisions and maybe some code. Anyone old enough to read and understand these posts already knows and appreciates the more complicated nature, and they're here to learn, not to be preached to. Hope you understand, but if not, then so be it. Maybe start your own blog on the topic and try to make complicated material easy, or easier, for newcomers.Tanaka

Your focus on drivers and kernel performance misses the forest for the trees. The elephant in the room is the fact that full-speed USB 2 bus frames happen with a 1 ms period, and high-speed USB 2 microframes happen every 1/8 ms (125 µs).

When you send data over full-speed USB (as with most FTDI chips), the best your application can hope for is that the data will get to the device sometime during the very next frame. With an unloaded USB bus, the transfer will happen very close to the start-of-frame. You'll observe it as 1 ms granularity with a small random deviation. This is precisely what you're seeing, and it is not bad. For example, since all USB devices attached to the same host see the frames at the same time, this is a simple way to synchronize multiple device clocks with better-than-microsecond precision. What your application can do is send a message that carries not only the data but also a time in the near future when it should be transmitted (see the sketch below). Another issue with USB is that there are no guarantees as to when your requests for data transmission will be serviced; you're sharing a bus with other devices, after all.
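
A sketch of what such a timestamped message could look like on the wire (field names and sizes are invented for illustration; the dsPIC would hold the packet until its frame-synchronized clock reaches tx_time_us):

    #include <stdint.h>

    #pragma pack(push, 1)
    typedef struct {
        uint32_t tx_time_us;   /* device-clock time at which to transmit */
        uint16_t length;       /* number of valid payload bytes */
        uint8_t  payload[64];  /* message for the automotive network */
    } TimedMessage;
    #pragma pack(pop)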

I think you need to re-engineer your system and not depend on any sort of timing from the PC end. The application that runs on the PC should be assumed to be, timing-wise, limited to the performance of the human that interacts with it. Anything that requires guaranteed real-time performance must live on your dsPIC devices. Even the USB bus doesn't cut it, as you have no guarantees at all as to how soon your request will be scheduled on the bus.

Basically, if you want guaranteed real-time performance on Windows, there must be no user mode involved: it must all run in kernel mode, and you must use communication channels that are for your exclusive use (or you must make them act that way, e.g. by filtering right on top of the USB host).

Locker answered 17/2, 2013 at 4:44 Comment(4)
You're dead wrong. SSG's post was key to understanding what was needed. Since I made that post, I was able to pull this off with a Windows driver. Further, our competition has been doing this for some time (years), and because I'd seen another company doing it, it was proof it could be done. Some in this thread even suggested I pay a company to write the driver for me. It was much easier than was suggested, in my opinion. Yes, there were some things to figure out, but it wasn't difficult. Some folks tend to exaggerate the complexity of things when they are in the know.Tanaka
Then I don't understand what you are after, and what the point is. The PC application is subject to typical userland latencies and lack of guarantees. No matter what you do on the kernel end, the application is driving the latency. If the application is not driving the latency, then there's no need for anything special kernel-side: just tell your dsPIC what it needs to do and when, and it'll do it precisely on time. Your original post basically doesn't say what you wanted to do and is misleading.Rabia
Do note that the FTDI driver has two threads that don't do anything special but simply keep the host busy with transfers. I don't see what it is that your driver can do that the FTDI driver can't.Rabia
USB does run on a fixed schedule, and you're never guaranteed anything as to whether you can actually do a transfer at a particular time. If it happens to work, you're depending on a happy coincidence. Put some load on the bus and see what happens.Rabia
