Discussion:
HT and idle = poll
Andrew Theurer
2003-03-06 05:18:04 UTC
The test: kernbench (average of 5 kernel compiles) with -j2 on a 2 physical/4
logical P4 system. This is on 2.5.64-HTschedB3:

idle != poll: Elapsed: 136.692s User: 249.846s System: 30.596s CPU: 204.8%
idle = poll: Elapsed: 161.868s User: 295.738s System: 32.966s CPU: 202.6%

An 18.4% increase in compile times (idle != poll takes 15.5% less time).
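
As a quick check of that percentage, here is a minimal sketch that computes the relative change in elapsed time in both directions. The function names are made up for illustration; only the two elapsed times come from the kernbench runs above.

```c
#include <assert.h>

/* Percentage by which `slow` exceeds `fast`, relative to `fast`. */
double percent_increase(double fast, double slow)
{
    assert(fast > 0.0);
    return (slow - fast) / fast * 100.0;
}

/* Percentage by which `fast` undercuts `slow`, relative to `slow`. */
double percent_decrease(double fast, double slow)
{
    assert(slow > 0.0);
    return (slow - fast) / slow * 100.0;
}
```

With the elapsed times above, percent_increase(136.692, 161.868) is about 18.4 (idle=poll measured against idle != poll), and percent_decrease(136.692, 161.868) is about 15.55 (idle != poll measured against idle=poll).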

So, don't use idle=poll with HT when you know your workload has idle time! I
have not tried oprofile, but it stands to reason that this would be a
problem. There's no point in using idle=poll with oprofile and HT anyway, as
the cpu utilization is totally wrong with HT to begin with (more on that
later).

Presumably a logical cpu polling while idle uses too many cpu resources
unnecessarily and significantly affects the performance of its sibling.

-Andrew Theurer
Linus Torvalds
2003-03-06 19:30:42 UTC
Post by Andrew Theurer
The test: kernbench (average of 5 kernel compiles) with -j2 on a 2 physical/4
idle != poll: Elapsed: 136.692s User: 249.846s System: 30.596s CPU: 204.8%
idle = poll: Elapsed: 161.868s User: 295.738s System: 32.966s CPU: 202.6%
An 18.4% increase in compile times (idle != poll takes 15.5% less time).
So, don't use idle=poll with HT when you know your workload has idle time! I
have not tried oprofile, but it stands to reason that this would be a
problem. There's no point in using idle=poll with oprofile and HT anyway, as
the cpu utilization is totally wrong with HT to begin with (more on that
later).
Presumably a logical cpu polling while idle uses too many cpu resources
unnecessarily and significantly affects the performance of its sibling.
Btw, I think this is exactly what the new HT prescott instructions are
for: instead of having busy loops polling for a change in memory (be it
a spinlock or a "need_resched" flag), new HT CPU's will support a
"mwait" instruction.

But yes, at least for now, I really don't think you should really _ever_
use "idle=poll" on HT-enabled hardware. The idle CPU's will just suck
cycles from the real work.

Linus
Davide Libenzi
2003-03-06 19:52:42 UTC
Post by Linus Torvalds
But yes, at least for now, I really don't think you should really _ever_
use "idle=poll" on HT-enabled hardware. The idle CPU's will just suck
cycles from the real work.
Not only. The polling CPU will also shoot a storm of memory requests,
clobbering the CPU's memory I/O stages.



- Davide
Linus Torvalds
2003-03-06 20:05:48 UTC
Post by Davide Libenzi
Not only. The polling CPU will also shoot a storm of memory requests,
clobbering the CPU's memory I/O stages.
Well, that would only be true with a really crappy CPU with no caches.

Polling the same location (as long as it's a pure poll, not trying to do
some locked read-modify-write cycle) should be fine. At least for
something like idle-polling, where the one location it _is_ polling should
not actually be touched by anybody else until the wakeup actually happens.

Linus
Davide Libenzi
2003-03-06 20:52:59 UTC
Post by Linus Torvalds
Post by Davide Libenzi
Not only. The polling CPU will also shoot a storm of memory requests,
clobbering the CPU's memory I/O stages.
Well, that would only be true with a really crappy CPU with no caches.
Polling the same location (as long as it's a pure poll, not trying to do
some locked read-modify-write cycle) should be fine. At least for
something like idle-polling, where the one location it _is_ polling should
not actually be touched by anybody else until the wakeup actually happens.
We are talking about HT, aren't we? Cores share execution units, and memory
requests are issued to the memory I/O units of the CPU before the cache
circuitry gets involved. Something like "while (!run);" will generate an
enormous number of memory I/O requests on the CPU's memory units, which are
shared by the cores. Even on a non-HT CPU, the above loop causes problems with
the latency of exiting the loop once the condition becomes true. This is
because of the huge number of alloc requests issued, which on loop exit must
be 1) discarded and 2) checked against reordering. But I don't think the exit
latency matters a lot here.



- Davide
Alan Cox
2003-03-06 21:09:29 UTC
Post by Andrew Theurer
So, don't use idle=poll with HT when you know your workload has idle time! I
have not tried oprofile, but it stands to reason that this would be a
idle=poll probably needs to be doing "rep nop" in a tight loop. That
ironically also saves more power than "hlt" on PIV, last time someone
investigated.
Martin J. Bligh
2003-03-06 22:22:48 UTC
Post by Linus Torvalds
Post by Andrew Theurer
The test: kernbench (average of 5 kernel compiles) with -j2 on a 2 physical/4
idle != poll: Elapsed: 136.692s User: 249.846s System: 30.596s CPU: 204.8%
idle = poll: Elapsed: 161.868s User: 295.738s System: 32.966s CPU: 202.6%
An 18.4% increase in compile times (idle != poll takes 15.5% less time).
So, don't use idle=poll with HT when you know your workload has idle time! I
have not tried oprofile, but it stands to reason that this would be a
problem. There's no point in using idle=poll with oprofile and HT anyway, as
the cpu utilization is totally wrong with HT to begin with (more on that
later).
Presumably a logical cpu polling while idle uses too many cpu resources
unnecessarily and significantly affects the performance of its sibling.
Btw, I think this is exactly what the new HT prescott instructions are
for: instead of having busy loops polling for a change in memory (be it
a spinlock or a "need_resched" flag), new HT CPU's will support a
"mwait" instruction.
But yes, at least for now, I really don't think you should really _ever_
use "idle=poll" on HT-enabled hardware. The idle CPU's will just suck
cycles from the real work.
BTW, could someone give a brief summary of why idle=poll is needed for
oprofile? I'd love to add it to the "documentation for dummies" file I
was writing.

M.
John Levon
2003-03-06 23:59:04 UTC
Post by Martin J. Bligh
BTW, could someone give a brief summary of why idle=poll is needed for
oprofile? I'd love to add it to the "documentation for dummies" file I
was writing.
Because events like CPU_CLK_UNHALTED don't tick when the cpu is halted,
so the idle time doesn't show up properly in the kernel profile.
idle=poll doesn't hlt so the profile for poll_idle() reflects the actual
idle percentage.

Something like that anyway.
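
To make the effect concrete, here is a toy model (not oprofile code) of sampling on a counter that only advances while the CPU is not halted. The struct, the function names and the sampling period are all hypothetical; the only thing taken from the explanation above is that halted cycles do not tick the counter.

```c
#include <assert.h>

/* Samples attributed to real work and to the idle loop. */
struct profile {
    unsigned long busy_samples;
    unsigned long idle_samples;
};

/*
 * busy_cycles / idle_cycles: cycles spent working / idling.
 * period: cycles between samples (hypothetical sampling interval).
 * idle_halts: nonzero if the idle loop executes hlt, zero for idle=poll.
 */
struct profile sample_profile(unsigned long busy_cycles,
                              unsigned long idle_cycles,
                              unsigned long period,
                              int idle_halts)
{
    struct profile p;

    assert(period > 0);
    p.busy_samples = busy_cycles / period;
    /* Halted cycles never tick the counter, so they yield no samples. */
    p.idle_samples = idle_halts ? 0 : idle_cycles / period;
    return p;
}

/* Idle share of the profile, in percent of all samples taken. */
double idle_percent(struct profile p)
{
    unsigned long total = p.busy_samples + p.idle_samples;

    return total ? 100.0 * (double)p.idle_samples / (double)total : 0.0;
}
```

In this model a CPU that is 40% idle shows 0% idle in the profile when the idle loop halts, and 40% when it polls.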

john

Linus Torvalds
2003-03-06 20:08:43 UTC
Post by Andrew Theurer
So, don't use idle=poll with HT when you know your workload has idle time! I
have not tried oprofile, but it stands to reason that this would be a
idle=poll probably needs to be doing "rep nop" in a tight loop.
We already do that. It's not enough. The HT thing will still steal cycles
continually, since the "rep nop" is really only equivalent to a
"sched_yield()".

Think of "rep nop" as yielding, and "mwait" as a true wait.

(I don't actually have any real information on "mwait", so I may be wrong
about the details on the new instructions. They looked obvious enough,
though).
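
For reference, a user-space sketch of the kind of "rep nop" polling loop being discussed. This is not the kernel's idle code: relax() and poll_flag() are hypothetical names, the spin limit exists only so the sketch terminates, and GCC/Clang inline assembly is assumed.

```c
#include <assert.h>
#include <stdatomic.h>

/* "rep; nop" is the encoding of the x86 PAUSE instruction. */
static inline void relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#else
    /* Assumption: no spin hint on other architectures, compiler barrier only. */
    __asm__ __volatile__("" ::: "memory");
#endif
}

/*
 * Poll *flag until it becomes nonzero or max_spins polls have been made.
 * Returns the number of polls that saw the flag still clear.
 */
unsigned long poll_flag(atomic_int *flag, unsigned long max_spins)
{
    unsigned long spins = 0;

    assert(flag);
    while (spins < max_spins &&
           !atomic_load_explicit(flag, memory_order_acquire)) {
        relax();    /* a hint only: the polling thread keeps executing */
        spins++;
    }
    return spins;
}
```

The loop never stops issuing instructions while it waits, which is the "yield" behaviour described above; a true wait would need something like mwait, which this sketch does not attempt.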

Linus
Eric Northup
2003-03-06 22:36:17 UTC
Post by Linus Torvalds
idle=poll probably needs to be doing "rep nop" in a tight loop.
We already do that. It's not enough. The HT thing will still steal cycles
continually, since the "rep nop" is really only equivalent to a
"sched_yield()".
(Perhaps a naive idea) Right now, there is a single "rep nop" per poll. What
happens if you unroll the loop a few times:

    while (!condition) {
        cpu_relax();
        cpu_relax();
        cpu_relax();
    }

? I have no HT hardware so can't test this.

-Eric
Nakajima, Jun
2003-03-06 21:15:43 UTC
Linus,

That's correct. Basically, mwait is similar to hlt, but it avoids the IPI otherwise needed to wake up the waiting processor: unlike with hlt, a write to the address specified by monitor wakes the processor up.

So our plan is to use monitor/mwait in the kernel, for example in the idle loop, to lower the wakeup latency.

Jun
-----Original Message-----
Sent: Thursday, March 06, 2003 12:09 PM
To: Alan Cox
Cc: Linux Kernel Mailing List
Subject: Re: HT and idle = poll
Post by Andrew Theurer
So, don't use idle=poll with HT when you know your workload has idle time! I
have not tried oprofile, but it stands to reason that this would be a
idle=poll probably needs to be doing "rep nop" in a tight loop.
We already do that. It's not enough. The HT thing will still steal cycles
continually, since the "rep nop" is really only equivalent to a
"sched_yield()".
Think of "rep nop" as yielding, and "mwait" as a true wait.
(I don't actually have any real information on "mwait", so I may be wrong
about the details on the new instructions. They looked obvious enough,
though).
Linus
Alan Cox
2003-03-06 22:42:30 UTC
Post by Nakajima, Jun
Linus,
That's correct. Basically, mwait is similar to hlt, but it avoids the IPI otherwise needed to wake up the waiting processor: unlike with hlt, a write to the address specified by monitor wakes the processor up.
So our plan is to use monitor/mwait in the kernel, for example in the idle loop, to lower the wakeup latency.
That's nice. It means you've got the basis of the instructions (although not
quite the exact same functionality) that Brian Grayson proposed four years ago
with Armadillo.