Why threads starve even on a preemptive multitasking OS (Windows 7)
I wrote a Win32 application (in Delphi 7, which is 32-bit, using the TThread class) to create 100 threads. Each thread, when resumed, continuously increments (in a loop) a 64-bit counter associated with the thread object (so there is no locking or sharing of data).

If you let the system run for 10 to 15 seconds and then stop it, you should see roughly the same count in each of the threads. But what I observed was that 81 threads looped fewer than 400 million times while the remaining ones looped more than 950 million times. The slowest thread got only 230 million iterations, compared to 2111 million for the fastest.

According to MSDN, preemptive multitasking is at the thread level (not the process level), so each of my threads should have gotten its time slice in a round-robin fashion. What am I missing here, and why the discrepancy?

Edit1: Machine configuration: Intel i7 quad-core 3.4 GHz with hyper-threading turned on (8 active threads at a time), running Windows 7 Professional 64-bit (the test application is 32-bit).

Edit2 (thread code): The test application is built with optimization turned on and without any debug info, and is run outside of the IDE.

type
  TMyThread = class(TThread)
  protected
    FCount: Int64;
  public
    constructor Create;
    procedure Execute; override;
    property Count: Int64 read FCount;
  end;


{ TMyThread }

constructor TMyThread.Create;
begin
  inherited Create(True);
  FCount := 0;
end;  

procedure TMyThread.Execute;
begin
  inherited;
  while not Terminated do
  begin
    Inc(FCount);
  end;
end;
Trihydric answered 13/8, 2012 at 21:47 Comment(9)
You did not say how your counters are placed in memory; I guess that if the counters form a contiguous array, that may cause thread interdependence when they increment the counters.Heron
Could you post your TThread-derived class declaration + 'Execute' method? I would like to give it a go. If you haven't tried it already, what happens if you run the app outside the IDE? Also, what OS?Viscountess
OK - W7 - it says so in the question title. 64-bit or 32?Viscountess
@Serg - you're probably right, some nasty false-sharing with sets of threads overrunning cache-line boundaries :(Viscountess
I'd be shocked if the threaded increment-counters somehow all magically agreed within 10% on a consistent basis. Chaos is the nature of combining threading with CPU micro-architecture features like caching, branch prediction, virtual memory and swapping, and a dynamic CPU load of both I/O-bound and CPU-bound threads. This ain't no 8-bit cache-less 1 MHz micro CPU from 1980.Cuirbouilli
Looked at the code - yep, that's about as simple as you can get :) Not sure what TThread.Execute does, (ie 'inherited' call), but presumably it's an empty method. Should be abstract, really.Viscountess
@Martin James: TThread.Create calls BeginThread with CREATE_SUSPENDED. In my test program I "resume" the suspended threads in a loop -- to have the start times as close together as possible.Trihydric
@Trihydric - strange. Every time I raise a new thread class, code-completion puts in that 'inherited' call and I keep deleting it. I have never called the inherited Execute! I wonder what's going on?Viscountess
@Martin James: You are right - inherited Execute; is completely meaningless as it is an abstract method. I checked my project code and I don't call it either; I did not give enough attention to the sample code, I guess :-) Removing it didn't change the behavior of the sample code in any way. I just ran my sample program with 30 threads after removing the inherited Execute and am still getting an order-of-magnitude difference between the slowest and fastest threads. I guess there is no answer to this and that is just the way Windows works.Trihydric
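Following up on the false-sharing guess in the comments: if the counters did live next to each other in memory, padding each one out to its own cache line would rule that effect out. A minimal sketch, assuming a 64-byte cache line (the FPadding field is invented for illustration; in the actual code each TThread instance is separately heap-allocated, so the counters may already sit on different cache lines):

```delphi
type
  TMyThread = class(TThread)
  protected
    FCount: Int64;
    // Pad so FCount occupies a full 64-byte cache line by itself;
    // without this, counters packed next to each other can make the
    // cores bounce the shared cache line back and forth (false sharing).
    FPadding: array[0..55] of Byte;
  public
    procedure Execute; override;
    property Count: Int64 read FCount;
  end;

procedure TMyThread.Execute;
begin
  while not Terminated do
    Inc(FCount);
end;
```

If the skew persists with padded counters, false sharing can be crossed off the list of suspects.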
U
11

Round-robin scheduling is an obvious strategy for a kernel. That's however not the way the Windows scheduler works. It used to be, back in the Windows 9x days: that scheduler was quite capable of giving the various VMs equal time. But not in the NT branch, started by Dave Cutler's group, where scheduling is purely based on priority.

Whatever thread has the highest priority gets the CPU. There's another chunk of code in Windows that tinkers with a thread's priority, modifying it from the default it was given when the thread was created. That code is aware of things like a thread owning the window that's in the foreground, or a thread waiting on a synchronization object that just got signaled. Or the more bizarre scheduling tweak that tries to solve priority-inversion problems by randomly giving a thread a chance to run even though it isn't its turn.

Focus on writing sane code first. Starting a hundred threads isn't a very sane thing to do. You are trying to consume resources that the machine doesn't actually have; nobody has a machine with a hundred cores. Yet. Powers of two: get a machine with 128 cores first.
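The dynamic priority tinkering described above can be switched off per thread with the Win32 call SetThreadPriorityBoost, which is one way to test how much the boost logic contributes to the uneven counts. A sketch, with error handling kept minimal (the DisableBoost helper name is invented):

```delphi
uses
  Windows, Classes;

// Ask the scheduler not to apply dynamic priority boosts to this
// thread, so it runs at its base priority only.
procedure DisableBoost(AThread: TThread);
begin
  // Second parameter True means "disable priority boosting".
  if not SetThreadPriorityBoost(AThread.Handle, True) then
    RaiseLastOSError;
end;
```

As another answer below notes, disabling the boost does not appear to change the distribution in this particular test, so the boost logic is probably not the whole story.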

Unaccomplished answered 13/8, 2012 at 22:14 Comment(8)
Thanks for the reply and for the advice. With none of my programs running, I have 1105 threads on my machine; I'm not sure who is not writing sane code. How many threads would there be on a vanilla Windows 7 Professional right after startup?Trihydric
Here is the breakdown of thread counts on my machine (the top 5): System (215), Kaspersky AV (106), SQL Server Express (49), SVCHost (38+35), Outlook (33). Two of my programs run as services; one has 38 threads and the other 27. The program in question was a test program, written when I noticed some of my threads were starving for CPU.Trihydric
So you see: none of your threads is doing real work. That's a problem that no scheduling algorithm will ever fix; it can only give time to threads that want to execute code. Adding more threads doesn't fix that - it makes it worse, actually. High odds that your process is I/O bound, waiting for the dbase server.Unaccomplished
Well.. each of the 100 threads in my test program does increment a counter in a loop (local to each thread), so it actually is very busy whenever it gets a chance to execute. All it does is increment the counter, so no I/O is involved. Those 100 threads are instances of the very same class, so there is no difference in what they do either. Even with all the odds of scheduling, I would not expect some threads to be 10 times luckier.Trihydric
@Trihydric see How many threads is too many?Arthralgia
Sure, 100 ready/running threads is, err... 'unusual' <g> but, like the OP says, I would expect a more even sharing-out of the available CPU over a long, long period like 10 seconds. If all the threads are running the same code, I would expect a more level score no matter how many players and balls are on the field.Viscountess
Also, like Hans says, 1000 threads waiting on I/O or each other is normal; 100 ready/running threads is a waste of stack space at best and, if the threads frequently access [(L1 cache size/thread count) or more] amounts of data (yes, not the case in the OP's code), then a lot of CPU/memory bandwidth can get wasted on cache swaps.Viscountess
Good sage words here. That said, the powers-of-two bit isn't quite right: AMD has been pushing out funky 6-core, 12-core and 24-core machines.Dorathydorca
Y
3

I have reproduced and confirm your results. Additionally, disabling thread priority boost doesn't change the distribution. GetThreadTimes reports that threads with higher Values took more UserTime and vice versa, while KernelTime seems to have no correlation with Values.

Thread 97: 1081,5928 Ke:0 Us:25116161
Thread 98: 1153,8029 Ke:0 Us:26988173
Thread 99: 704,6996  Ke:0 Us:16848108
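For reference, numbers like these can be gathered with GetThreadTimes, which reports per-thread kernel-mode and user-mode CPU time as FILETIME values in 100-nanosecond units. A sketch of how such a report might be produced (the helper names are invented):

```delphi
uses
  Windows, SysUtils, Classes;

// Convert a FILETIME (100 ns units) into a single 64-bit value.
function FileTimeToInt64(const FT: TFileTime): Int64;
begin
  Result := (Int64(FT.dwHighDateTime) shl 32) or FT.dwLowDateTime;
end;

// Print how much kernel and user CPU time a thread has consumed so far.
procedure ReportThreadTimes(AThread: TThread);
var
  CreationTime, ExitTime, KernelTime, UserTime: TFileTime;
begin
  if GetThreadTimes(AThread.Handle, CreationTime, ExitTime,
                    KernelTime, UserTime) then
    Writeln(Format('Ke:%d Us:%d',
      [FileTimeToInt64(KernelTime), FileTimeToInt64(UserTime)]))
  else
    RaiseLastOSError;
end;
```

Note that ExitTime is undefined while the thread is still running; only KernelTime and UserTime are meaningful here.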

Clearly, some threads really get to run more often than others.

I haven't graphed the results, but I suppose what we're seeing is a normal distribution, which means the results depend on a number of factors, some of them random.

I tried disabling hyper-threading (this smoothed the results somewhat), then assigning each thread to a single physical processor (using SetThreadAffinityMask). In the second case the Values were much closer to each other.

SetThreadAffinityMask(Self.Handle, 1 shl (FIndex mod 4));

I can sort of understand how running on a hyper-threaded system can make some threads "unlucky": they are scheduled to compete with other threads on the same physical processor, and because of "soft affinity" to that virtual core they get to run on it again and again, thus scoring lower than the others.

But as to why binding each thread to a fixed core helps on a non-hyperthreaded system, I don't know.

There are probably other random factors involved, such as activity on the cores from other processes. A thread can get "unlucky" if some other process's thread associated with the same core suddenly wakes up and starts doing some (relatively) heavy work.

All of this is guessing though.

Ynes answered 23/8, 2012 at 14:2 Comment(1)
I did some more experiments with 32-bit as well as 64-bit executables (from the same code), but the results were similar. I also tried to yield from within the thread by calling Sleep(0) after each loop iteration, and that produced even more bizarre results. The behavior is the same even with a smaller number of threads, like 20 or 30.Trihydric
S
1

Windows 7 is designed for user land. When your first thread wants to do work, the OS gives it a time slice; you, the user, just started it, after all. By the time the 50th thread in succession (from the same process!) wants to do work, higher-priority threads (background processes controlled by Windows 7 itself) step in. This happens in such a fashion as to make some threads luckier.

You and I don't really want a personal OS that hands out CPU time based on the whims of user-land processes. I would be curious to see how 2008 R2 Server handled this. You might also play around with the Advanced tab setting: "Choose how to allocate processor resources".

Soto answered 14/8, 2012 at 11:36 Comment(0)
M
-1

Some good reasoning here, but there are a few things to take into consideration. Windows does its multitasking in software; your hardware isn't multitasking, it is spending power to do what a parallel-processing system would do in hardware. Windows hands out priority in many ways, and it's confusing.

Let me explain it this way. I have a small program that watches my cores and their utilization. When Windows loads, you would think that ALL the cores would get used. Nope: as Windows loads, the other cores only gradually start to get used. Then you would think that, as more cores become available, loading would accelerate. It doesn't: the cores aren't used at FULL speed to load faster. Even if Windows shoved each program onto its own core as they were loading and running, it WAITS for them to finish; and if it used ALL the cores to process each program, it would have to use software (about 100 times slower than hardware) to assemble the parts at the other end. Long ago, Intel wanted to change the hardware to parallel processing, and MS said 'no' because their software wasn't designed for it; now they are pushing serial-based hardware design to the Nth degree. Even after MS bought the NT software, they have forgotten to use much of its design. In my view there need to be hardware changes, programming-language changes, and the core of Windows needs to be designed again - not changed, but restarted from scratch. Good luck with that. To tell you how old this idea is... VIVA la Amiga.

Mitch answered 27/6, 2013 at 15:22 Comment(0)