Discussion:
Can context switches be faster?
John Richard Moser
2006-10-12 15:44:56 UTC
Permalink
Can context switches be made faster? This is a simple question, mainly
because I don't really understand what happens during a context switch
that the kernel has control over (besides storing registers).

Linux ported onto the L4-Iguana microkernel is reported to be faster
than the monolith[1]; it's not like microkernels are faster, but the
L4-Iguana apparently just has super awesome context switching code:

Wombat's context-switching overheads as measured by lmbench on an
XScale processor are up to thirty times less than those of native
Linux, thanks to Wombat profiting from the implementation of fast
context switches in L4-embedded.

The first question that comes to my mind is, obviously: is this some
special "fast context switch" code only for embedded systems, or is it
possible to make this work on normal systems?

The second is, if it IS possible to get faster context switches in
general use, can the L4 context switch methods be used in Linux? I
believe L4 is BSD licensed-- at least the files in their CVS repo that I
looked at have "BSD" stamped on them. Maybe some of the code can be
examined, adopted, adapted, etc.

If it's not possible to speed up context switches, the classical
question would probably be.. why not? ;) No use knowing a thing and
not understanding it; you get situations like this right here... :)

[1]http://l4hq.org/
--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
John Richard Moser
2006-10-12 15:53:08 UTC
Permalink
John Richard Moser wrote:
...
Post by John Richard Moser
The second is, if it IS possible to get faster context switches in
general use, can the L4 context switch methods be used in Linux? I
believe L4 is BSD licensed-- at least the files in their CVS repo that I
looked at have "BSD" stamped on them. Maybe some of the code can be
examined, adopted, adapted, etc.
Looking a bit deeper, Iguana is OzPLB licensed, oops. :(

The question still remains, though, as to what must happen during
context switches that takes so long and if any of it can be sped up.
Wikipedia has some light detail...


--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
Russell King
2006-10-12 17:19:30 UTC
Permalink
Post by John Richard Moser
Can context switches be made faster? This is a simple question, mainly
because I don't really understand what happens during a context switch
that the kernel has control over (besides storing registers).
They can be, but there's a big penalty that you pay for it. You must
limit the virtual memory space to 32MB for _every_ process in the
system, and if you have many processes running (I forget how many)
you end up running into the same latency problems.

The latency problem comes from the requirement to keep the cache
coherent with the VM mappings, and to this extent on Linux we need to
flush the cache each time we change the VM mapping.

There have been projects in the past which have come and gone to
support the "Fast Context Switch" approach found on these CPUs, but
patches have _never_ been submitted, so I can only conclude that the
projects never got off the ground.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
John Richard Moser
2006-10-12 18:25:17 UTC
Permalink
Post by Russell King
Post by John Richard Moser
Can context switches be made faster? This is a simple question, mainly
because I don't really understand what happens during a context switch
that the kernel has control over (besides storing registers).
They can be, but there's a big penalty that you pay for it. You must
limit the virtual memory space to 32MB for _every_ process in the
system, and if you have many processes running (I forget how many)
you end up running into the same latency problems.
Interesting information; for the rest of this discussion let's assume
that we want the system to remain functional. :)
Post by Russell King
The latency problem comes from the requirement to keep the cache
coherent with the VM mappings, and to this extent on Linux we need to
flush the cache each time we change the VM mapping.
*OUCH*

Flushing cache takes time, doesn't it? Worse, you can't have happy
accidents where cache remains the same for various areas (e.g. I1's
caching of libc and gtk) between processes.

I guess on x86 and x86-64 at least (popular CPUs) the cache is not tied
to physical memory, but rather to virtual memory? Wikipedia:

Multiple virtual addresses can map to a single physical address. Most
processors guarantee that all updates to that single physical address
will happen in program order. To deliver on that guarantee, the
processor must ensure that only one copy of a physical address resides
in the cache at any given time.

...

But virtual indexing is not always the best choice. It introduces the
problem of virtual aliases -- the cache may have multiple locations
which can store the value of a single physical address. The cost of
dealing with virtual aliases grows with cache size, and as a result
most level-2 and larger caches are physically indexed.

-- http://en.wikipedia.org/wiki/CPU_cache

So apparently most CPUs virtually address L1 cache and physically
address L2; but sometimes physically addressing L1 is better.. hur.

I'd need more information on this one.

- Is L1 on <CPU of choice> physically aliased or physically tagged
such that leaving it in place between switches will cause the CPU to
recognize it's wrong?

- Is L2 on <CPU of choice> built in such a manner?

- Can L1 be flushed without flushing L2?

- Does the current code act on these behaviors, or just flush all
cache regardless?
Post by Russell King
There have been projects in the past which have come and gone to
support the "Fast Context Switch" approach found on these CPUs, but
patches have _never_ been submitted, so I can only conclude that the
projects never got off the ground.
A shame.

--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
Arjan van de Ven
2006-10-12 19:02:45 UTC
Permalink
Hi,
Post by John Richard Moser
So apparently most CPUs virtually address L1 cache and physically
address L2; but sometimes physically addressing L1 is better.. hur.
if you are interested in this I would strongly urge you to read Curt
Schimmel's book (UNIX(R) Systems for Modern Architectures: Symmetric
Multiprocessing and Caching for Kernel Programmers); it explains this
and related materials really really well.
Post by John Richard Moser
That will likely be more useful when I've got more background knowledge;
the book is quite explanatory, so maybe your uni has it in the library
so that you can try to see if it makes sense?
the cache flushing is a per architecture property. On x86, the cache
flushing isn't needed; but a TLB flush is. Depending on your hardware
that can be expensive as well.
Post by John Richard Moser
Mm. TLB flush is expensive, and pretty unavoidable unless you do stupid
things like map libc into the same address everywhere.
TLB flush is needed if ANY part of the address space is pointing at
different pages. Since that is always the case for different processes,
where exactly glibc is doesn't matter... as long as there is ONE page of
memory different you have to flush. (And the flush is not expensive
enough to do partial ones based on lots of smarts; the smarts will be
more expensive than just doing a full flush)
Post by John Richard Moser
TLB flush between threads in the same process is probably avoidable;
aren't threads basically processes with the same PID in something called
a thread group? (Hmm... creative scheduling possibilities...)
the scheduler is smart enough to avoid the flush then afaik
(threads share the "mm" in Linux, and "different mm" is used as trigger
for the flush. Linux is even smarter; kernel threads don't have any mm
at all so those count as wildcards in this respect, so that if you have
"thread A" -> "kernel thread" -> "thread B" you'll have no flush)
John Richard Moser
2006-10-12 18:56:33 UTC
Permalink
Hi,
Post by John Richard Moser
So apparently most CPUs virtually address L1 cache and physically
address L2; but sometimes physically addressing L1 is better.. hur.
Post by Arjan van de Ven
if you are interested in this I would strongly urge you to read Curt
Schimmel's book (UNIX(R) Systems for Modern Architectures: Symmetric
Multiprocessing and Caching for Kernel Programmers); it explains this
and related materials really really well.
That will likely be more useful when I've got more background knowledge;
right now my biggest problem is I'm inexperienced and haven't yet gotten
my 2-year compsci degree (more importantly, the associated knowledge
that goes with it), so things like compiler design are like "There's a
concept of turning stuff into trees and associating actions with virtual
registers and then turning that into real register accounting and
instructions, but I really don't know WTF it's doing."

I imagine once I get into the 4-year stuff I'll be able to digest
something like that; so I'll keep that in mind. The '1st year compsci
student' crutch is annoying and I need to get rid of it so I don't feel
so damn crippled trying to do anything.
Post by John Richard Moser
- Does the current code act on these behaviors, or just flush all
cache regardless?
Post by Arjan van de Ven
the cache flushing is a per architecture property. On x86, the cache
flushing isn't needed; but a TLB flush is. Depending on your hardware
that can be expensive as well.
Mm. TLB flush is expensive, and pretty unavoidable unless you do stupid
things like map libc into the same address everywhere. TLB flush
between threads in the same process is probably avoidable; aren't
threads basically processes with the same PID in something called a
thread group? (Hmm... creative scheduling possibilities...)
--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
Arjan van de Ven
2006-10-12 18:37:11 UTC
Permalink
On Thu, 2006-10-12 at 14:25 -0400, John Richard Moser wrote:

Hi,
Post by John Richard Moser
So apparently most CPUs virtually address L1 cache and physically
address L2; but sometimes physically addressing L1 is better.. hur.
if you are interested in this I would strongly urge you to read Curt
Schimmel's book (UNIX(R) Systems for Modern Architectures: Symmetric
Multiprocessing and Caching for Kernel Programmers); it explains this
and related materials really really well.
Post by John Richard Moser
- Does the current code act on these behaviors, or just flush all
cache regardless?
the cache flushing is a per architecture property. On x86, the cache
flushing isn't needed; but a TLB flush is. Depending on your hardware
that can be expensive as well.

Greetings,
Arjan van de Ven
James Courtier-Dutton
2006-10-13 11:05:39 UTC
Permalink
Post by John Richard Moser
- Does the current code act on these behaviors, or just flush all
cache regardless?
Post by Arjan van de Ven
the cache flushing is a per architecture property. On x86, the cache
flushing isn't needed; but a TLB flush is. Depending on your hardware
that can be expensive as well.
So, that is needed for a full process context switch to another process.
Is the context switch between threads quicker as it should not need to
flush the TLB?

James
Chase Venters
2006-10-13 14:51:51 UTC
Permalink
Post by James Courtier-Dutton
Post by John Richard Moser
- Does the current code act on these behaviors, or just flush all
cache regardless?
the cache flushing is a per architecture property. On x86, the cache
flushing isn't needed; but a TLB flush is. Depending on your hardware
that can be expensive as well.
So, that is needed for a full process context switch to another process.
Is the context switch between threads quicker as it should not need to
flush the TLB?
Indeed. This is also true for switching from a process to a kernel thread
and back, because kernel threads don't have their own user-space virtual
memory; they just live inside the kernel virtual memory mapped into every
process.
Post by James Courtier-Dutton
James
Thanks,
Chase
Phillip Susi
2006-10-12 18:20:31 UTC
Permalink
Post by John Richard Moser
Can context switches be made faster? This is a simple question, mainly
because I don't really understand what happens during a context switch
that the kernel has control over (besides storing registers).
Besides saving the registers, the expensive operation in a context
switch involves flushing caches and switching page tables. This can be
avoided if the new and old processes both share the same address space.
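To make "switching page tables" concrete on x86: the switch itself is a
single privileged register write, and it's the side effect of that write --
throwing away the (untagged) TLB -- that costs.  A minimal, purely
illustrative sketch (kernel mode only):

/* Illustrative only: on x86 the page-table switch is one move into CR3,
 * the physical address of the top-level page directory.  Writing CR3
 * implicitly invalidates the non-global TLB entries of the old address
 * space, which is the expensive part of the operation. */
static inline void load_cr3(unsigned long pgd_phys)
{
        asm volatile("mov %0, %%cr3" : : "r" (pgd_phys) : "memory");
}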
John Richard Moser
2006-10-12 18:29:20 UTC
Permalink
Post by Phillip Susi
Post by John Richard Moser
Can context switches be made faster? This is a simple question, mainly
because I don't really understand what happens during a context switch
that the kernel has control over (besides storing registers).
Besides saving the registers, the expensive operation in a context
switch involves flushing caches and switching page tables. This can be
avoided if the new and old processes both share the same address space.
i.e. this can be avoided when switching to threads in the same process.

How does a page table switch work? As I understand there are PTE chains
which are pretty much linked lists the MMU follows; I can't imagine this
being a harder problem than replacing the head. I'd imagine the head
PTE would be something like a no-access page that's not really mapped to
anything so exchanging it is pretty much exchanging its "next" pointer?
--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
Andrew James Wade
2006-10-13 02:53:01 UTC
Permalink
Post by John Richard Moser
How does a page table switch work? As I understand there are PTE chains
which are pretty much linked lists the MMU follows; I can't imagine this
being a harder problem than replacing the head.
Generally, the virtual memory mappings are stored as high-fanout trees
rather than linked lists. (ia64 supports a hash table based scheme,
but I don't know if Linux uses it.) But the bulk of the mapping
lookups will actually occur in a cache of the virtual memory mappings
called the translation lookaside buffer (TLB). It is from the TLB and
not the memory mapping trees that some of the performance problems
with address space switches originate.

The kernel can tolerate some small inconsistencies between the TLB
and the mapping tree (it can fix them in the page fault handler). But
for the most part the TLB must be kept consistent with the current
address space mappings for correct operation. Unfortunately, on some
architectures the only practical way of doing this is to flush the TLB
on address space switches. I do not know if the flush itself takes any
appreciable time, but each of the subsequent TLB cache misses will
necessitate walking the current mapping tree. Whether done by the MMU
or by the kernel (implementations vary), these walks in the aggregate
can be a performance issue.
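As a rough illustration of what one of those walks costs (a made-up
four-level layout in the spirit of x86-64, not Linux's actual
pgd/pud/pmd/pte code): every TLB miss turns into a chain of dependent
memory reads down the tree before the translation can be cached again.

/* Toy four-level page-table walk: 9 index bits per level, 4 KiB pages.
 * Types and helpers are invented for the example, and table pointers are
 * treated as directly dereferenceable for simplicity (real walkers deal in
 * physical addresses).  The point is the shape: a high-fanout tree indexed
 * by slices of the virtual address, costing up to four dependent loads per
 * TLB miss. */
#include <stdint.h>

#define LEVELS     4
#define IDX_BITS   9
#define IDX_MASK   ((1u << IDX_BITS) - 1)
#define PAGE_SHIFT 12
#define PRESENT    0x1ull
#define ADDR_MASK  (~0xfffull)

typedef uint64_t pt_entry;          /* next-level table address | flags */

static unsigned idx(uint64_t vaddr, int level)     /* level 3 = top */
{
        return (vaddr >> (PAGE_SHIFT + IDX_BITS * level)) & IDX_MASK;
}

/* Returns the page frame base for vaddr, or 0 if it isn't mapped. */
uint64_t walk(pt_entry *top_table, uint64_t vaddr)
{
        pt_entry *table = top_table;
        for (int level = LEVELS - 1; level >= 0; level--) {
                pt_entry e = table[idx(vaddr, level)];
                if (!(e & PRESENT))
                        return 0;                     /* fault: not mapped */
                if (level == 0)
                        return e & ADDR_MASK;         /* leaf entry */
                table = (pt_entry *)(uintptr_t)(e & ADDR_MASK);
        }
        return 0;                                     /* not reached */
}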

On some architectures the L1 cache can also require attention from the
kernel on address space switches for correct operation. Even when the
L1 cache doesn't need flushing a change in address space will generally
be accompanied by a change of working set, leading to a period of high
cache misses for the L1/L2 caches.

Microbenchmarks can miss the cache miss costs associated with context
switches. But I believe the costs of cache thrashing and flushing are
the reason that the time-sharing granularity is so coarse in Linux,
rather than the time it takes the kernel to actually perform a context
switch. (The default time-slice is 100 ms.) Still, the cache miss costs
are workload-dependent, and the actual time the kernel takes to context
switch can be important as well.

Andrew Wade
John Richard Moser
2006-10-13 05:29:23 UTC
Permalink
Post by Andrew James Wade
Post by John Richard Moser
How does a page table switch work? As I understand there are PTE chains
which are pretty much linked lists the MMU follows; I can't imagine this
being a harder problem than replacing the head.
Generally, the virtual memory mappings are stored as high-fanout trees
rather than linked lists. (ia64 supports a hash table based scheme,
but I don't know if Linux uses it.) But the bulk of the mapping
lookups will actually occur in a cache of the virtual memory mappings
called the translation lookaside buffer (TLB). It is from the TLB and
not the memory mapping trees that some of the performance problems
with address space switches originate.
The kernel can tolerate some small inconsistencies between the TLB
and the mapping tree (it can fix them in the page fault handler). But
for the most part the TLB must be kept consistent with the current
address space mappings for correct operation. Unfortunately, on some
architectures the only practical way of doing this is to flush the TLB
on address space switches. I do not know if the flush itself takes any
appreciable time, but each of the subsequent TLB cache misses will
necessitate walking the current mapping tree. Whether done by the MMU
or by the kernel (implementations vary), these walks in the aggregate
can be a performance issue.
True. You can trick the MMU into faulting into the kernel (PaX does
this to apply non-executable pages-- pages, not halves of VM-- on x86),
but it's orders of magnitude slower as I understand it, and the petty
gains you can get over the hardware MMU doing it are not going to
outweigh it.
Post by Andrew James Wade
On some architectures the L1 cache can also require attention from the
kernel on address space switches for correct operation. Even when the
L1 cache doesn't need flushing a change in address space will generally
be accompanied by a change of working set, leading to a period of high
cache misses for the L1/L2 caches.
Yeah, the only exception being if L1 and L2 are both physically
addressed, and things like libc's .text are shared, leading to shared
working sets in I1 and L2.
Post by Andrew James Wade
Microbenchmarks can miss the cache miss costs associated with context
switches. But I believe the costs of cache thrashing and flushing are
cachegrind is probably guilty but I haven't examined it.
Post by Andrew James Wade
the reason that the time-sharing granularity is so coarse in Linux,
rather than the time it takes the kernel to actually perform a context
switch. (The default time-slice is 100 ms.) Still, the cache miss costs
I thought it was a minimum of 5 ms... I don't know what the default is. Heh.
Post by Andrew James Wade
are workload-dependent, and the actual time the kernel takes to context
switch can be important as well.
Andrew Wade
--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
Andrew James Wade
2006-10-13 16:56:42 UTC
Permalink
Post by John Richard Moser
True. You can trick the MMU into faulting into the kernel (PaX does
this to apply non-executable pages-- pages, not halves of VM-- on x86),
Oooh, that is a neat hack!
Post by John Richard Moser
but it's orders of magnitude slower as I understand and the petty gains
you can get over the hardware MMU doing it are not going to outweigh it.
It's architecture-dependent; not all architectures are even capable of
walking the page table trees in hardware. They compensate with
lightweight traps for TLB cache misses.

Andrew Wade
John Richard Moser
2006-10-13 17:24:11 UTC
Permalink
Post by Andrew James Wade
Post by John Richard Moser
True. You can trick the MMU into faulting into the kernel (PaX does
this to apply non-executable pages-- pages, not halves of VM-- on x86),
Oooh, that is a neat hack!
Yes, very neat ;)

The way it works is pretty much that he has NX pages marked SUPERVISOR
so the userspace code can't put them in the TLB. What happens on access is:

- MMU cries, faults into kernel
- Kernel examines nature of fault
- If attempting to read/write, put the mapping in the DTLB for the
process
- If attempting to execute, KILL.
- Application continues running if it was a data access.

As I understand it, this is about 7000 times slower than having a
hardware NX bit and an MMU that handles the event; but once the page is
in the DTLB the MMU doesn't complain any more until it's pushed out, so
the hit is minimal. Wide memory access patterns that can push data out
of the TLB quickly and come back to it just as fast get a pretty
noticeable performance hit; but they're rare.
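Roughly, the decision in the fault handler looks like the sketch below
(this is NOT PaX's actual code -- every name here is invented, and the real
implementation has to handle many details this glosses over):

/* Illustrative sketch only, not PaX.  The idea: pages that must not be
 * executable are kept out of the user-visible TLB (e.g. marked
 * supervisor-only), so any user access traps into the kernel.  The handler
 * then distinguishes data accesses, which get the translation inserted and
 * continue, from instruction fetches, which are policy violations. */
enum fault_kind { FAULT_READ, FAULT_WRITE, FAULT_EXEC };

struct fault_info {
        unsigned long addr;       /* faulting virtual address */
        enum fault_kind kind;     /* e.g. FAULT_EXEC when addr == fault PC */
};

/* Hypothetical helpers, assumed to exist for the sketch: */
extern int  page_is_nx_emulated(unsigned long addr);
extern void insert_data_tlb_entry(unsigned long addr);
extern void kill_current_task(void);
extern int  handle_normal_fault(struct fault_info *f);

int nx_emulation_fault(struct fault_info *f)
{
        if (!page_is_nx_emulated(f->addr))
                return handle_normal_fault(f);    /* ordinary page fault */

        if (f->kind == FAULT_EXEC) {
                kill_current_task();              /* executing "NX" memory */
                return -1;
        }

        insert_data_tlb_entry(f->addr);           /* data access: allow it;
                                                     cheap until the entry is
                                                     evicted from the TLB */
        return 0;
}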

These days he's got the OpenBSD W^X/RedHat Exec Shield method in there
as well, so when possible the stack is made non-executable using
hardware and is completely exempt from these performance considerations.

From what I hear, Linus isn't interested in either PTE hacks or segment
limit hacks, so old x86 will never have an NX bit (full like PaX or best
effort like Exec Shield) in the kernel. Too bad; I'd really like to see
some enforcement of non-executable pages on good old x86 chips in mainline.
Post by Andrew James Wade
Post by John Richard Moser
but it's orders of magnitude slower as I understand and the petty gains
you can get over the hardware MMU doing it are not going to outweigh it.
It's architecture-dependent; not all architectures are even capable of
walking the page table trees in hardware. They compensate with
lightweight traps for TLB cache misses.
The ones that can do it probably have hardware tuned enough that a
software implementation isn't going to outrun them, least of all far
enough to overtake the weight of the traps that have to be set up. It's
a lovely area of theoretical hacks though, if you like coming up with
impractical "what kinds of nasty things can we do to the hardware"
ideas. :)
Post by Andrew James Wade
Andrew Wade
--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
Chris Friesen
2006-10-12 19:57:59 UTC
Permalink
Post by John Richard Moser
Linux ported onto the L4-Iguana microkernel is reported to be faster
than the monolith[1]; it's not like microkernels are faster, but the
Wombat's context-switching overheads as measured by lmbench on an
XScale processor are up to thirty times less than those of native
Linux, thanks to Wombat profiting from the implementation of fast
context switches in L4-embedded.
The Xscale is a fairly special beast, and its context-switch times are
pretty slow by default.

Here are some context-switch times from lmbench on a modified 2.6.10
kernel. Times are in microseconds:

cpu          clock speed    context switch (us)
pentium-M    1.8GHz         0.890
dual-Xeon    2GHz           7.430
Xscale       700MHz         108.2
dual 970FX   1.8GHz         5.850
ppc 7447     1GHz           1.720

Reducing the Xscale time by a factor of 30 would basically bring it into
line with the other uniprocessor machines.

Chris
Arjan van de Ven
2006-10-12 20:29:14 UTC
Permalink
Post by John Richard Moser
Post by Chris Friesen
Reducing the Xscale time by a factor of 30 would basically bring it into
line with the other uniprocessor machines.
That's a load more descriptive :D
0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
PPC performs similarly, 1 cycle should be about 1uS.
you have your units off; 1 cycle is 1 nS not 1 uS (or 0.556 nS for the
pM)
John Richard Moser
2006-10-12 20:36:54 UTC
Permalink
Post by Arjan van de Ven
Post by John Richard Moser
Post by Chris Friesen
Reducing the Xscale time by a factor of 30 would basically bring it into
line with the other uniprocessor machines.
That's a load more descriptive :D
0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
PPC performs similarly, 1 cycle should be about 1uS.
you have your units off; 1 cycle is 1 nS not 1 uS (or 0.556 nS for the
pM)
Ah, right right. Nano is a billionth, micro is a millionth. How did I
get that mixed up.

So that's barely two THOUSAND cycles. Which is closer to what I
expected (i.e., not instant)
--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
Jeremy Fitzhardinge
2006-10-12 20:35:12 UTC
Permalink
Post by John Richard Moser
That's a load more descriptive :D
0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
PPC performs similarly, 1 cycle should be about 1uS.
No, you're a factor of 1000 off - these numbers show the context switch
is around 1600-75000 cycles. And that doesn't really tell the whole
story: if caches/TLB get flushed on context switch, then the newly
switched-to task will bear the cost of having cold caches, which isn't
visible in the raw context switch time.
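(Working the numbers from that table: 0.890 us at 1.8 GHz is about 1,600
cycles, and 108.2 us at 700 MHz is about 75,000 -- hence the range above.)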

But modern x86 processors have a very quick context switch time, and I
don't think there's much room for improvement aside from
micro-optimisations (though that might change if the architecture grows
a way to avoid flushing the TLB on switch).

J
Andreas Mohr
2006-10-13 23:32:38 UTC
Permalink
Post by Jeremy Fitzhardinge
Post by John Richard Moser
That's a load more descriptive :D
0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
PPC performs similarly, 1 cycle should be about 1uS.
No, you're a factor of 1000 off - these numbers show the context switch
is around 1600-75000 cycles. And that doesn't really tell the whole
story: if caches/TLB get flushed on context switch, then the newly
switched-to task will bear the cost of having cold caches, which isn't
visible in the raw context switch time.
But modern x86 processors have a very quick context switch time, and I
don't think there's much room for improvement aside from
micro-optimisations (though that might change if the architecture grows
a way to avoid flushing the TLB on switch).
OK, so since we've now amply worked out in this thread that TLB/cache flushing
is a real problem for context switching management, would it be possible to
smartly reorder processes on the runqueue (probably works best with many active
processes with the same/similar priority on the runqueue!) to minimize
TLB flushing needs due to less mm context differences of adjacently scheduled
processes?
(i.e. don't immediately switch from user process 1 to user process 2 and
back to 1 again, but always try to sort some kernel threads in between
to avoid excessive TLB flushing)

See also my new posting about this at
http://bhhdoa.org.au/pipermail/ck/2006-October/006442.html

Andreas Mohr
David Lang
2006-10-13 23:47:45 UTC
Permalink
Post by Andreas Mohr
OK, so since we've now amply worked out in this thread that TLB/cache flushing
is a real problem for context switching management, would it be possible to
smartly reorder processes on the runqueue (probably works best with many active
processes with the same/similar priority on the runqueue!) to minimize
TLB flushing needs due to less mm context differences of adjacently scheduled
processes?
(i.e. don't immediately switch from user process 1 to user process 2 and
back to 1 again, but always try to sort some kernel threads in between
to avoid excessive TLB flushing)
Since kernel threads don't cause flushing, it shouldn't matter where they
appear in the scheduling.

Other than kernel threads, only threaded programs share the mm context (in
normal situations), and it would be a fair bit of work to sort the list of
potential things to be scheduled to group these together (not to mention the
potential fairness issues that would arise from this).

I suspect that the overhead of doing this sorting (and looking up the mm context
to do the sorting) would overwhelm the relatively small number of TLB flushes that
would be saved.

I could see this being a potential advantage for servers with massive numbers of
threads for one program, but someone would have to look at how much overhead the
sorting would be (not to mention the fact that the kernel devs tend to frown on
special-case optimizations that have a noticeable impact on the general case).

David Lang
Alan Cox
2006-10-14 00:30:12 UTC
Permalink
Post by Andreas Mohr
OK, so since we've now amply worked out in this thread that TLB/cache flushing
is a real problem for context switching management, would it be possible to
smartly reorder processes on the runqueue (probably works best with many active
processes with the same/similar priority on the runqueue!) to minimize
TLB flushing needs due to less mm context differences of adjacently scheduled
processes?
We already do. The newer x86 processors also have TLB tagging so they
can (if you find one without errata!) avoid the actual flush and instead
track the tag.
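For illustration (an invented structure, not any particular CPU's TLB): a
tagged TLB keys each entry by an address-space identifier as well as the
virtual page, so entries from different processes can coexist and a context
switch can just change the current tag instead of flushing everything.

/* Toy model of a tagged TLB lookup.  Fields and sizes are made up; the
 * point is that a hit requires the ASID (address-space id) to match as well
 * as the virtual page, so switching processes becomes "change current_asid"
 * rather than "flush everything". */
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

struct tlb_entry {
        bool     valid;
        uint16_t asid;      /* which address space this entry belongs to */
        uint64_t vpage;     /* virtual page number */
        uint64_t pframe;    /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint16_t current_asid;

bool tlb_lookup(uint64_t vpage, uint64_t *pframe)
{
        for (int i = 0; i < TLB_ENTRIES; i++) {
                if (tlb[i].valid && tlb[i].asid == current_asid &&
                    tlb[i].vpage == vpage) {
                        *pframe = tlb[i].pframe;
                        return true;
                }
        }
        return false;          /* miss: walk the page tables, then refill */
}

void context_switch_to(uint16_t asid)
{
        current_asid = asid;   /* no flush; stale entries of other address
                                  spaces simply won't match */
}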

Alan
Jeremy Fitzhardinge
2006-10-14 00:14:57 UTC
Permalink
Post by Alan Cox
We already do. The newer x86 processors also have TLB tagging so they
can (if you find one without errata!) avoid the actual flush and instead
track the tag.
Are there any? I think AMD dropped it rather than spend the effort to
make it work.

J

Jeremy Fitzhardinge
2006-10-14 00:14:24 UTC
Permalink
Post by Andreas Mohr
OK, so since we've now amply worked out in this thread that TLB/cache flushing
is a real problem for context switching management, would it be possible to
smartly reorder processes on the runqueue (probably works best with many active
processes with the same/similar priority on the runqueue!) to minimize
TLB flushing needs due to less mm context differences of adjacently scheduled
processes?
(i.e. don't immediately switch from user process 1 to user process 2 and
back to 1 again, but always try to sort some kernel threads in between
to avoid excessive TLB flushing)
It does. The kernel will (slightly) prefer to switch between two
threads sharing an address space over switching to a different address
space. (Hm, at least it used to, but I can't see where that happens now.)

J
John Richard Moser
2006-10-12 20:23:29 UTC
Permalink
Post by Chris Friesen
Post by John Richard Moser
Linux ported onto the L4-Iguana microkernel is reported to be faster
than the monolith[1]; it's not like microkernels are faster, but the
Wombat's context-switching overheads as measured by lmbench on an
XScale processor are up to thirty times less than those of native
Linux, thanks to Wombat profiting from the implementation of fast
context switches in L4-embedded.
The Xscale is a fairly special beast, and its context-switch times are
pretty slow by default.
Here are some context-switch times from lmbench on a modified 2.6.10
cpu          clock speed    context switch (us)
pentium-M    1.8GHz         0.890
dual-Xeon    2GHz           7.430
Xscale       700MHz         108.2
dual 970FX   1.8GHz         5.850
ppc 7447     1GHz           1.720
Reducing the Xscale time by a factor of 30 would basically bring it into
line with the other uniprocessor machines.
That's a load more descriptive :D

0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
PPC performs similarly, 1 cycle should be about 1uS.
Post by Chris Friesen
Chris
--
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension