Discussion:
2.4.19pre*: IO statistics in /proc/partitions corrupt
Jochen Suckfuell
2002-05-22 06:51:11 UTC
Hi!

The statistics patch included in the kernel since 2.4.19pre still has a
bug leading to negative values for the "running io's" value, called
ios_in_flight internally.
This leads to completely wrong results for many other values computed
from this one and renders the statistics utterly unusable.

The problem appears on IDE and SCSI drives, affecting values for
partitions and also for whole disks. It seems to be most significant
when using RAID (which is often the case on servers with heavy disk
access, where statistics matter!):

8 16 35842048 sdb 12637435 51727 101513266 103991890 19600590
14721219 274592008 988438640 **-100** 250563400 315019978
8 32 35842048 sdc 8438773 75872 68117130 62271950 13147577
9950844 184838544 550059270 **-32** 247111750 1119563006

(The columns after the device name are rio rmerge rsect ruse wio
wmerge wsect wuse running use aveq; the highlighted negative values
sit in the "running" column, i.e. ios_in_flight.)

Here sdb and sdc are each a RAID1 pair, on a Dual-CPU running
2.4.19-pre8-ac4.

Does anyone have an idea where a starting disk io might not be counted
correctly?
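
For illustration, here is a minimal userspace sketch (not the actual
kernel code) of how an unlocked counter incremented and decremented
from two CPUs can drift and go negative, even though every start has a
matching completion - a plausible shape for this bug on a dual-CPU box.
Compile with gcc -pthread:

#include <pthread.h>
#include <stdio.h>

static int ios_in_flight;		/* stands in for the per-disk counter */

static void *worker(void *arg)
{
	int i;

	for (i = 0; i < 1000000; i++) {
		ios_in_flight++;	/* "request started" - not atomic */
		ios_in_flight--;	/* "request completed" - not atomic */
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, worker, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* lost updates leave the count nonzero, often negative */
	printf("ios_in_flight = %d (should be 0)\n", ios_in_flight);
	return 0;
}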

Bye
Jochen
--
Jochen Suckfuell --- http://www.suckfuell.net/jochen/ ---

M. Edward Borasky
2002-05-22 14:00:11 UTC
A few months ago, there was a flurry of reports from people having
difficulties with memory management on large machines (ia32 over 4 GB). I've
seen a lot of 2.4.x-yy kernels go by and much VM discussion, but what I'm
*not* seeing is reports of either catastrophic behavior or its absence on
large machines. I haven't had a chance to run my own test cases on the
2.4.18 kernel from Red Hat 7.3 yet, so I can't make any personal
contribution to this discussion.
--
M. Edward Borasky
http://www.borasky-research.net/HarryIannis.htm
***@aracnet.com



bert hubert
2002-05-22 14:08:39 UTC
Post by M. Edward Borasky
A few months ago, there was a flurry of reports from people having
difficulties with memory management on large machines (ia32 over 4 GB). I've
seen a lot of 2.4.x-yy kernels go by and much VM discussion, but what I'm
*not* seeing is reports of either catastrophic behavior or its absence on
large machines. I haven't had a chance to run my own test cases on the
2.4.18 kernel from Red Hat 7.3 yet, so I can't make any personal
contribution to this discussion.
RedHat has fixed the problem in its kernels. There are fixes out there, but
Linus is not applying them. I would venture that this is because they would
fix the problems *for the moment* and take away interest in revamping VM for
real.

It might help if Linus would actually state his intentions. So far the
problem has been that the AA VM was badly documented and came as one
big chunk of patches. Andrew Morton split them up nicely and documented
each patch, so that is resolved.

Regards,

bert
--
http://www.PowerDNS.com Versatile DNS Software & Services
http://www.tk the dot in .tk
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

Alan Cox
2002-05-22 14:55:56 UTC
Post by bert hubert
Post by M. Edward Borasky
large machines. I haven't had a chance to run my own test cases on the
2.4.18 kernel from Red Hat 7.3 yet, so I can't make any personal
contribution to this discussion.
RedHat has fixed the problem in its kernels. There are fixes out there, but
Linus is not applying them. I would venture that this is because they would
fix the problems *for the moment* and take away interest in revamping VM for
real.
7.3 has some of what is needed but not all. To go past 16Gb you need
highmem-mapped page tables, which I'm pretty sure did not make it in.


Martin J. Bligh
2002-05-22 15:56:28 UTC
Post by Alan Cox
7.3 has some of what is needed but not all.
Can you outline the changes in this area? I want to make sure we're
not all fighting the same problems separately ;-) I know bounce
buffers are one large element of that, though I believe you still
only go up to 4Gb, unless I'm mistaken?
Post by Alan Cox
To go past 16Gb you need highmem mapped page tables which I'm
pretty sure did not make it in.
You need it earlier than that if you have many large tasks (4Gb
or so).

M.

Alan Cox
2002-05-22 16:23:32 UTC
Post by Martin J. Bligh
Can you outline the changes in this area? I want to make sure we're
not all fighting the same problems separately ;-) I know bounce
buffers are one large element of that, though I believe you still
only go up to 4Gb, unless I'm mistaken?
Bounce buffer handling
Post by Martin J. Bligh
Post by Alan Cox
To go past 16Gb you need highmem mapped page tables which I'm
pretty sure did not make it in.
You need it earlier than that if you have many large tasks (4Gb
or so).
That I can believe

Doug Ledford
2002-05-22 21:46:14 UTC
Post by Martin J. Bligh
Post by Alan Cox
7.3 has some of what is needed but not all.
Can you outline the changes in this area? I want to make sure we're
not all fighting the same problems separately ;-) I know bounce
buffers are one large element of that, though I believe you still
only go up to 4Gb, unless I'm mistaken?
Yes, it only goes up to 4Gb. That's because of the error-handling code
in the SCSI mid-layer and above: it fails to properly handle >4Gb sg
entries on error conditions. I'm working on that now and should have it
fixed soon.
--
Doug Ledford <***@redhat.com> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606


William Lee Irwin III
2002-05-22 14:36:51 UTC
Post by M. Edward Borasky
A few months ago, there was a flurry of reports from people having
difficulties with memory management on large machines (ia32 over 4 GB). I've
seen a lot of 2.4.x-yy kernels go by and much VM discussion, but what I'm
*not* seeing is reports of either catastrophic behavior or its absence on
large machines. I haven't had a chance to run my own test cases on the
2.4.18 kernel from Red Hat 7.3 yet, so I can't make any personal
contribution to this discussion.
The catastrophic failures are still happening; in fact, the last
lse-tech conference call a week or two ago was dedicated at least in
part to them. The number of different ways in which these failures
occur is large, so it's taking a while for the iterations of the
whack-a-mole game to converge to kernel stability. Andrea has probably
been doing the most visible stuff on this front with the recent
bh/inode exhaustion patches, with due credit to akpm as well for the
unconditional bh stripping patch.


Cheers,
Bill

Martin J. Bligh
2002-05-22 15:44:25 UTC
Post by William Lee Irwin III
Post by M. Edward Borasky
A few months ago, there was a flurry of reports from people having
difficulties with memory management on large machines (ia32 over 4 GB). I've
seen a lot of 2.4.x-yy kernels go by and much VM discussion, but what I'm
*not* seeing is reports of either catastrophic behavior or its absence on
large machines. I haven't had a chance to run my own test cases on the
2.4.18 kernel from Red Hat 7.3 yet, so I can't make any personal
contribution to this discussion.
I wouldn't bother using RedHat's kernel for this at the moment,
Andrea's tree is where the development work for this area has all
been happening recently. He's working on integrating O(1) sched
right now, which will get rid of the biggest issue I have with -aa
at the moment (the issue being that I'm too idle^H^H^H^Hbusy to
merge it ;-)).
Post by William Lee Irwin III
The catastrophic failures are still happening, in fact, the last
lse-tech conference call a week or two ago was dedicated at least in
part to them. The number of different ways in which these failures
occur is large, so it's taking a while for the iterations of whack-a-mole
game to converge to kernel stability. Andrea has probably been doing the
most visible stuff on this front with the recent bh/inode exhaustion
patches, with due credit to akpm as well for the unconditional bh
stripping patch.
The problems we're seeing are mainly KVA exhaustion. The top hitlist
for me at the moment are:

1. PTEs (5000 tasks sharing a 2Gb SGA = 10GB of PTEs; see the
arithmetic sketch after this list).
We have two different implementations of highpte, Andrea's
latest seems to work fairly well, and is much more scalable
than earlier versions. We need to have shared PTEs as well.
I'd encourage people to benchmark the hell out of each
solution, and help us come down to one, or a hybrid of both.
2. rmap pte_chains.
As far as I can see, these consume twice as much space as
the PTEs (ie 20Gb in the case above).
3. buffer_heads
I have over 1Gb of bufferheads in (an enlarged) ZONE_NORMAL
right now. akpm has given me a patch to prune them pretty
viciously on an ongoing basis, Andrea has a patch to prune
them under memory pressure. I have slight concerns about
fragmentation under Andrea's approach, but both patches seem
to work fine - performance still needs to be worked out.
4. struct page
Bill Irwin has already done great things in shrinking this
somewhat, but I think we need to be even more drastic at
some point, and only map the PTEs we need for each process,
into a task (well address-space) specific KVA area, which
I call user-kernel address space or UKVA (search back for
my proposal to do this a couple of months ago).
5. kmap
Persistent kmap sucks, and the global systemwide TLB flushes
scale as O(1/N^2) with the number of CPUs. Enlarging the kmap
area helps a little, but really we need to stop doing this to
ourselves. I will have a patch (hopefully within a week) to do
per-task kmap, based on the UKVA patch that Dave McCracken has
already implemented.
6. vmalloc
Vmalloc space gets exhausted quickly; I think a large part of
that is threads allocating 64K LDTs ... and 2.5 has a recent
fix for that which we need to backport.
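
For item 1, the 10GB figure checks out as follows (a sketch assuming
non-PAE 4-byte PTEs and 4KB pages):

\[ \frac{2\,\mathrm{GB}}{4\,\mathrm{KB}} \times 4\,\mathrm{B} = 2\,\mathrm{MB}\ \text{of PTEs per task}, \qquad 5000 \times 2\,\mathrm{MB} = 10\,\mathrm{GB} \]

Item 2's factor of two then follows from rmap keeping roughly two
pointers (pte pointer plus chain link) per mapped page, against one
4-byte PTE.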

There are various other general scalability problems (eg. I'd like to
see Ingo's scheduler put into mainline 2.4 sometime soon, both 2.5 and
our benchmarking teams have kicked the hell out of it, and it stands
up well), but the above list is the things I can think of at the moment
that are specific to 32-bit machines (though some of those would also
help 64 bit).

M.

Martin J. Bligh
2002-05-22 15:53:28 UTC
Post by Martin J. Bligh
I wouldn't bother using RedHat's kernel for this at the moment,
Andrea's tree is where the development work for this area has all
been happening recently. He's working on integrating O(1) sched
right now, which will get rid of the biggest issue I have with -aa
at the moment (the issue being that I'm too idle^H^H^H^Hbusy to
merge it ;-)).
Oh, of course, I left off the bounce buffer issue, which RedHat *have*
fixed in their tree, I believe. Not sure what the status of this work
is for -aa at the moment.

M.


William Lee Irwin III
2002-05-22 16:07:31 UTC
Post by Martin J. Bligh
1. PTEs (5000 tasks sharing a 2Gb SGA = 10GB of PTEs).
We have two different implementations of highpte, Andrea's
latest seems to work fairly well, and is much more scalable
than earlier versions. We need to have shared PTEs as well.
I'd encourage people to benchmark the hell out of each
solution, and help us come down to one, or a hybrid of both.
pte-highmem isn't enough. On an 8GB machine it's already dead. Sharing
is required just to avoid running out of space period. IIRC Dave
McCracken has been working on daniel's original pte sharing patch.
Post by Martin J. Bligh
2. rmap pte_chains.
As far as I can see, these consume twice as much space as
the PTEs (ie 20Gb in the case above).
If the rmap is to fly at all under those conditions the reverse
translation structures and algorithm will need to be heavily revised
for space consumption. If they're not also shared in some way then they
incur the same or greater space cost as the original pagetables.
Post by Martin J. Bligh
4. struct page
Bill Irwin has already done great things in shrinking this
somewhat, but I think we need to be even more drastic at
some point, and only map the PTEs we need for each process,
into a task (well address-space) specific KVA area, which
I call user-kernel address space or UKVA (search back for
my proposal to do this a couple of months ago).
People really don't like the idea of kmapping struct page; but it is
straightforward. I've been stalling on this because people hate the
idea so badly. I'm also plotting to shrink struct page further still.

OTOH 64GB with a 32B struct page gives us a whopping 512MB of KVA
eaten alive at boot, which effectively castrates the machine.
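
That figure is easy to verify:

\[ \frac{64\,\mathrm{GB}}{4\,\mathrm{KB}} = 2^{24}\ \text{pages}, \qquad 2^{24} \times 32\,\mathrm{B} = 512\,\mathrm{MB} \]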

Please, don't reply with "Get a Hammer" or "Get a 64-bit machine",
I've heard enough of that refrain already and it's also quite useless
to say, as we can neither dictate the replacement of others' hardware
nor ignore the fact their hardware isn't working.
Post by Martin J. Bligh
5. kmap
Persistent kmap sucks, and the global systemwide TLB flushes
scale as O(1/N^2) with the number of CPUs. Enlarging the kmap
area helps a little, but really we need to stop doing this to
ourselves. I will have a patch (hopefully within a week) to do
per-task kmap, based on the UKVA patch that Dave McCracken has
already implemented.
O(1/N^2)? wouldn't that get progressively better as the number of cpu's
grows without bound?


Cheers,
Bill

Martin J. Bligh
2002-05-22 16:36:35 UTC
Post by William Lee Irwin III
pte-highmem isn't enough. On an 8GB machine it's already dead. Sharing
is required just to avoid running out of space period. IIRC Dave
McCracken has been working on daniel's original pte sharing patch.
Depends on the workload, but yes.
Post by William Lee Irwin III
Post by Martin J. Bligh
5. kmap
Persistent kmap sucks, and the global systemwide TLB flushes
scale as O(1/N^2) with the number of CPUs. Enlarging the kmap
area helps a little, but really we need to stop doing this to
ourselves. I will have a patch (hopefully within a week) to do
per-task kmap, based on the UKVA patch that Dave McCracken has
already implemented.
O(1/N^2)? wouldn't that get progressively better as the number of cpu's
grows without bound?
Cost of TLB flush on 1 cpu = 1. Number of CPUs = N. Cost of systemwide
TLB flush = N. Assuming we actually use those CPUs in a comparable way,
we do N times as many global tlbflushes per second with N cpus. That's
N^2. Or that's my reckoning, anyway.
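
In symbols (a sketch of that reckoning): if each CPU forces f global
flushes per second, and every global flush interrupts all N CPUs, then

\[ \underbrace{f N}_{\text{flushes/sec}} \times \underbrace{N}_{\text{CPUs hit per flush}} = f N^2 = O(N^2) \]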

M.


Andrea Arcangeli
2002-05-22 17:21:57 UTC
Post by Martin J. Bligh
Post by William Lee Irwin III
pte-highmem isn't enough. On an 8GB machine it's already dead. Sharing
is required just to avoid running out of space period. IIRC Dave
McCracken has been working on daniel's original pte sharing patch.
Depends on the workload, but yes.
well, at least that problem can be contained to a certain degree (say
32G of ram) by throwing more money at the hardware, so it's somewhat a
secondary problem compared to the lack of the pte-highmem feature.
Post by Martin J. Bligh
Post by William Lee Irwin III
Post by Martin J. Bligh
5. kmap
Persistent kmap sucks, and the global systemwide TLB flushes
scale as O(1/N^2) with the number of CPUs. Enlarging the kmap
area helps a little, but really we need to stop doing this to
ourselves. I will have a patch (hopefully within a week) to do
per-task kmap, based on the UKVA patch that Dave McCracken has
already implemented.
O(1/N^2)? wouldn't that get progressively better as the number of cpu's
1/N^2 is less than O(1), no way.
Post by Martin J. Bligh
Post by William Lee Irwin III
grows without bound?
Cost of TLB flush on 1 cpu = 1. Number of CPUs = N. Cost of systemwide
TLB flush = N. Assuming we actually use those CPUs in a comparable way,
we do N times as many global tlbflushes per second with N cpus. This N^2.
Or that's my reckoning, anyway.
I think you're right: if you go to some billion CPUs it becomes
quadratic complexity in the tlb flush. OTOH in current kernels NR_CPUS
is #defined to 32, so it's O(1) in pure math terms (but pure math
terms don't matter here; they completely overlook the different cost
of the tlb flushes with 1 CPU versus 32 CPUs).

Anyway, this is only a matter of implementing the
persistent-and-atomic kmap; I'm pretty sure that's the right solution
for this problem, then the whole pool in highmem.c will go away and even
the pagecache will stop blocking on the kmaps.

Note though that this decision is SMP oriented; on UP the pool that
caches page->virtual may be more efficient, but the bottleneck of
the pool is a showstopper for SMP, and the overhead of the atomic kmap
compared to the cached page->virtual pool is barely noticeable if
anything, so I've no doubt this is the right direction to take.
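
As a sketch of the difference (2.4-era kernel API; the function and the
copy itself are only illustrative):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

static void copy_into_page(struct page *page, const char *buf, int len)
{
	char *p;

	/* persistent kmap: takes a slot from the global pool, may
	 * sleep, and recycling pool slots forces a system-wide
	 * TLB flush */
	p = kmap(page);
	memcpy(p, buf, len);
	kunmap(page);

	/* atomic kmap: fixed per-CPU slot, no pool and no global
	 * flush, but you must not sleep while the mapping is held */
	p = kmap_atomic(page, KM_USER0);
	memcpy(p, buf, len);
	kunmap_atomic(p, KM_USER0);
}

(Copying twice is pointless, of course; the point is the two calling
conventions.)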

I look forward to seeing the patch (just the kmap-atomic-and-persistent
part, not the constantly mapped pte, which is more likely to be a
regression than the current linux way IMHO), so we can possibly clean it
up and then integrate it in 2.5 :).

Other things, like managing 63G of highmem with only 850M of direct
mapping, are almost unsolvable in a generic manner; however,
configuration options and arch-ifdefs can be used here. If the
computation always stays in kernel or always in userspace then 4G KVA is
a solution (as slow as 2.0, the first bigmem for 2.2 and PTX I guess).
But if you run syscalls all the time it will be as bad as 2.0, and in
such a case the current 2.4 way will be much better. But if you need
lots of normal memory you may prefer CONFIG_2G, to avoid running out of
inodes/dcache/files/buffercache [limited to the normal zone]/vmas/bh;
the vmas and bh for rawio/O_DIRECT are probably among the most
problematic because they cannot shrink dynamically, and those big
machines will do lots of mappings. But even CONFIG_2G may not be ok if
you want 1.7G of shm constantly mapped in all tasks. So in the end the
closest thing to a generic solution may be to rewrite the whole kernel
MM API to use pfns instead of page structures and to kmap the mem_map to
get the struct page; that way you don't shrink the user address space,
you move the huge mem_map to highmem, and the slowdown ""should"" be
smaller than with the 4G KVA probably (not obvious though). But that's a
huge amount of work just before we finally switch to 64bit computing,
and with an uncertain performance result... so it's hard for me not to
think 64bit when the other option is to rewrite the whole kernel MM
internal API, which affects 99% of the .c files in the tree :).

Andrea

Martin J. Bligh
2002-05-22 18:18:58 UTC
Post by Andrea Arcangeli
Post by William Lee Irwin III
Post by Martin J. Bligh
Persistent kmap sucks, and the global systemwide TLB flushes
scale as O(1/N^2) with the number of CPUs. Enlarging the kmap
area helps a little, but really we need to stop doing this to
ourselves. I will have a patch (hopefully within a week) to do
per-task kmap, based on the UKVA patch that Dave McCracken has
already implemented.
O(1/N^2)? wouldn't that get progressively better as the number of cpu's
1/N^2 is less than O(1), no-way.
Sorry, typo - O(N^2). Cost of each systemwide flush is N times as much,
and we do them N times more often (fixed-size kmap pool, due to
fixed-size KVA). In a quick test, Keith found that increasing the size
of the kmap pool from 1024 to 4096 entries (4Mb to 16Mb of KVA consumed)
reduced the number of flushes by a factor of 10 (due to the static
overhead).
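
(Those KVA figures are just entries times page size:

\[ 1024 \times 4\,\mathrm{KB} = 4\,\mathrm{MB}, \qquad 4096 \times 4\,\mathrm{KB} = 16\,\mathrm{MB} \]

so quadrupling the pool costs an extra 12Mb of KVA.)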
Post by Andrea Arcangeli
Anyways this is only a matter of implementing the
persistent-and-atomic-kmap, I'm pretty sure they're the right solution
for this problem, then the whole pool in highmem.c will go away and even
the pagecache will stop blocking on the kmaps.
Working on the first stage of it as we speak ...
Post by Andrea Arcangeli
I look forward to see the patch (just the kmap-atomic-and-persistent,
not the constnatly mapped pte that is more likely to be a regression
than current linux way IMHO), so we can possibly cleanup and then
integrate it in 2.5 :).
We have a breakoff of the UKVA infrastructure now (thanks to Dave McCracken),
and once we've kicked its tires a little, we'll pass it across for inspection.
Post by Andrea Arcangeli
Other things, like managing 63G of highmem with only 850M of direct
mapping, are almost unsolvable in a generic manner; however,
configuration options and arch-ifdefs can be used here. If the
computation always stays in kernel or always in userspace then 4G KVA is
a solution (as slow as 2.0, the first bigmem for 2.2 and PTX I guess).
I'm more worried about 32Gb than 64Gb for the moment, I don't know
of any machines anyone is actually selling that will take 64Gb - the
NUMA-Q will if we want to work on it, but 16Gb and 32Gb are the
real points right now.
Post by Andrea Arcangeli
But even CONFIG_2G may not be ok if you want 1.7G of
shm constantly mapped in all tasks.
Exactly. Sometimes I hate databases ;-)
Post by Andrea Arcangeli
So in the end the closest thing to a
generic solution may be to rewrite the whole kernel MM API to use pfns
instead of page structures and to kmap the mem_map to get the struct
page; that way you don't shrink the user address space, you move the
huge mem_map to highmem, and the slowdown ""should"" be smaller than
with the 4G KVA probably (not obvious though),
Page clustering may be an easier solution for now, and you're right that
this is only a "bridge" to the new world ... that'd give us an effective
16Kb page size, with probably much less pain than the kmap'ed mem_map,
and might even *improve* performance ;-) Taming the beast into something
workable, rather than killing it totally, is good enough ...

M.

Alan Cox
2002-05-22 18:02:02 UTC
Post by William Lee Irwin III
Please, don't reply with "Get a Hammer" or "Get a 64-bit machine",
I've heard enough of that refrain already and it's also quite useless
to say, as we can neither dictate the replacement of others' hardware
nor ignore the fact their hardware isn't working.
There are a whole pile of valid answers. Those happen to be good ones.
Fixing the application to use clone() rather than 4000 individual sets
of page tables might not be a bad plan either.
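
(A minimal userspace sketch of that plan, using glibc's clone(2); the
worker body is a placeholder. Children created with CLONE_VM share one
mm, and therefore one set of page tables, instead of each carrying its
own copy.)

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)

static int worker(void *arg)
{
	/* runs in the same address space as the parent, so any big
	 * shared region is visible with no extra page tables */
	return 0;
}

int main(void)
{
	char *stack = malloc(STACK_SIZE);
	int pid;

	if (!stack)
		return 1;
	/* the stack grows down on x86: pass the top of the block */
	pid = clone(worker, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);
	free(stack);
	return 0;
}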

On a more practical note, it's been true for years (though nobody needed
the memory enough to implement it) that you can discard page tables for
non-anonymous objects when you need space (arguably they belong on the
lru like everything else), and you can page anonymous maps to disk,
although you must use software- rather than hardware-assisted handling
for this due to x86 cpu bugs, and a complete lack of pageable page table
walking hardware in many processors.

Do each of your tasks map the stuff at the same address? If you are
assuming this, how do you plan to handle the person who doesn't? You
won't be able to share page tables then.

For the shared case then yes, sharing pte pages potentially works, but
you have to handle a lot of nasty corner cases buried in vm code, like
mremap and mprotect, which badly need rewriting before anyone tackles
hacking more crap into them. The rmap would need to know the vma in this
case rather than the pages, since the pages would be mapped identically
for each user of the vma, and a page cannot be present/absent in
different tasks when the pte is shared.

So you need to clean up the uglies in the mm operations, implement reference
counting structures for ptes, hashes to find them, code to page them and
to do accounting on them, rebalance the vm after you do this, fix all of
the unsharing cases to be correct, verify them, make them lock safe against
truncate, disk writes, asynchronous I/O, sendfile and each other.

Can you even make that work -before- the customers have all upgraded
anyway?

Alan

Linus Torvalds
2002-05-22 18:08:27 UTC
Post by Alan Cox
On a more practical note its been true for years but nobody needed the
memory to implement it that you can discard page tables for non anonymous
objects when you need space (arguably they belong on the lru like
everything else)
You don't strictly even need to LRU it - you could just keep a pte count
around, and when it goes to zero you zap the pmd. You can use the normal
page_count() thing for it.

HOWEVER, I'm rather certain that this won't actually help in real life,
and it does add complexity.

The solution really is a "don't do it then" kind of thing. If you have
5000 processes that all want to map a big shared memory area, and you
don't want to upgrade your CPU's, it's a _whole_ lot easier to just have a
magic "map_large_page()" system call, and start using the 2MB page support
of the x86.

And no, this should NOT be a mmap.

It's a magic x86-only system call, for the express purpose of adding
something well-contained (but really ugly) for Oracle or other similar
users. I don't mind "really ugly" as long as it doesn't have any impact
on the rest of the system.

It should be less than a few hundred lines of code. Suggested starting
point appended.

Making the _generic_ code jump through hoops because some stupid special
case that nobody else is interested in is bad.

Linus

--- don't look too closely or you'll go blind! ---

static unsigned long get_magic_bigpage(int idx)
{
	unsigned long bigpage;

	if (idx >= MAXBIGPAGES)
		return 0;

	down(&bigpage_sem);
	bigpage = bigpage_array[idx];
	if (bigpage)
		goto out;
	bigpage = alloc_bigpage_from_magic_zone();
	if (bigpage) {
		bigpage_array[idx] = bigpage;	/* remember it for sharing */
		bigpage_users[idx]++;
	}
out:
	up(&bigpage_sem);
	return bigpage;
}


asmlinkage unsigned long sys_map_ugly_big_page(
	unsigned long address,
	unsigned long size,
	unsigned long idx)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long bigpage, retval;

	/*
	 * Only root can do this, because the
	 * allocation will be non-pageable.
	 */
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	/*
	 * We require the user to give us the exact
	 * address, and it has to be PMD_SIZE-aligned
	 */
	if ((address|size) & (PMD_SIZE-1))
		return -EINVAL;
	if (size > TASK_SIZE || TASK_SIZE - size < address)
		return -EINVAL;
	if (!size)
		return 0;

	down_write(&mm->mmap_sem);
	vma = find_vma(mm, address);
	retval = -ENOMEM;

	/* We won't unmap any existing pages */
	if (vma && vma->vm_start < address + size)
		goto out;

	/* (linking the vma into the mm is left out of this sketch) */
	vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
	if (!vma)
		goto out;

	vma->vm_flags = VM_MAGIC;
	retval = 0;
	do {
		bigpage = get_magic_bigpage(idx);
		if (!bigpage)
			break;
		set_pmd(pgd_offset(mm, address), pmd_bigpage(bigpage));
		idx++;
		retval += PMD_SIZE;
		address += PMD_SIZE;
		size -= PMD_SIZE;
	} while (size);
out:
	up_write(&mm->mmap_sem);
	return retval;
}


Alan Cox
2002-05-22 18:34:55 UTC
Post by Linus Torvalds
You don't strictly even need to LRU it - you could just keep a pte count
around, and when it goes to zero you zap the pmd. You can use the normal
page_count() thing for it.
That assumes you want to page out the page table only after the pages it
references are paged out. There is no reason I can see for not flushing
it first. It's very cheap to regenerate for non-anonymous pages - much
cheaper than the pages it references. Also, the locality of most apps
means that there are zillions of glibc pages they reference only once
(for init, and for linker fixups/names).


Linus Torvalds
2002-05-22 18:27:21 UTC
Post by Alan Cox
That assumes you want to page out the page table only after the pages it
references are paged out. There is no reason I can see for not flushing it
first.
Dirty state and mixed vma's on the same pmd would make this more complex
than I really like, but sure..

However, the pmd almost certainly gets re-created very quickly anyway, so
I seriously doubt you get any real wins.

Remember: the point of swapping stuff out (or just dropping them) is that
we don't need them in the near future. With a 4MB (or 2MB) granularity,
that's not very likely.

Linus


Rik van Riel
2002-05-22 18:30:00 UTC
Post by Linus Torvalds
don't want to upgrade your CPU's, it's a _whole_ lot easier to just have a
magic "map_large_page()" system call, and start using the 2MB page support
of the x86.
And no, this should NOT be a mmap.
It's a magic x86-only system call,
Making the _generic_ code jump through hoops because some stupid special
case that nobody else is interested in is bad.
Actually, I suspect that MIPS, x86-64 and other
architectures are also interested ...

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/


Linus Torvalds
2002-05-22 18:40:44 UTC
Post by Rik van Riel
Post by Linus Torvalds
It's a magic x86-only system call,
Making the _generic_ code jump through hoops because some stupid special
case that nobody else is interested in is bad.
Actually, I suspect that MIPS, x86-64 and other
architectures are also interested ...
Oh, it can certainly have similar semantics on other architectures, the
point really being not so much the x86'ness, but the fact that this is a
separate subsystem with very limited scope.

Limiting the scope means, for example, that:

- no issues about memory coherency of shared mappings with "read/write"

mmap _has_ to be coherent for good behaviour (yeah, yeah, I know there
are systems out there that aren't, but they are clearly inferior and
cannot run innd with mappings etc).

But doing some kind of "file coherent big page" support is just too
horrible for words.

- no mixups with "get_unmapped_page()" and friends having to be able to
find aligned mappings, and more magic paths on mmap/unmap. As far as
the rest of the VM, the big pages are basically just not there. Make
that explicit by actually making "pmd_present()" return 0 for big
pages.

- you can later, if you want, _extend_ the semantics without breaking
stuff, if some future VM actually wants to be natively aware of big
pages. I consider that unlikely, but hey..

Is it fairly ugly? Yes. But it gets the job done, and doing it in some
special C file with little impact on the rest of the system means that we
can tweak it for the hardware instead of trying to make it a "good
design".

Linus


Martin J. Bligh
2002-05-22 18:48:00 UTC
Post by Rik van Riel
Post by Linus Torvalds
don't want to upgrade your CPU's, it's a _whole_ lot easier to just have a
magic "map_large_page()" system call, and start using the 2MB page support
of the x86.
And no, this should NOT be a mmap.
It's a magic x86-only system call,
Making the _generic_ code jump through hoops because some stupid special
case that nobody else is interested in is bad.
Actually, I suspect that MIPS, x86-64 and other
architectures are also interested ...
Indeed. Even if you happen to have a spare 10Gb of RAM, and can address it
efficiently, that's still no reason to blow it on mindless copies of data ;-)

M.


William Lee Irwin III
2002-05-22 20:30:24 UTC
Post by Linus Torvalds
The solution really is a "don't do it then" kind of thing. If you have
5000 processes that all want to map a big shared memory area, and you
don't want to upgrade your CPU's, it's a _whole_ lot easier to just have a
magic "map_large_page()" system call, and start using the 2MB page support
of the x86.
map_large_page() sounds decent, though I don't really know how easy
it'll be to get apps to cooperate. I suspect it's easier when the
answer is "the app crashed" as opposed to "the kernel crashed".

Okay, someone told me "upgrade my cpu", and even though it was you,
here it is:

It still requires the cooperation of userspace to prevent the system from
being taken down. At the very least resource accounting and prevention
of the deadlock based on it are additionally required. Furthermore,
64-bit address spaces do not reduce the sizes of pagetables, and they
face the same memory exhaustion issues as 32-bit machines in the presence
of shared mappings performed by large numbers of tasks. A larger pagesize
just increases the number of processes and/or the size of the mapping
required to trigger this, where the size of the mapping is likely the
compensator on 64-bit machines. Last, but not least, "upgrade your cpu"
is a very strange (and perplexing) answer to a software problem, and
one with several well-known techniques for solving it, and it's
certainly not solved by upgrading the cpu. mmap a 16TB space in 4MB
pages with 8-byte pte's in 4K tasks on a 64GB 64-bit machine, with your
favorite number of cpu's (the cpu count is irrelevant; assume 1 cpu per
task if you wish, I certainly can't afford that), and it happens again.
As far as I can tell, it's only worse on 64-bit machines, and that has
motivated many of the 64-bit cpu vendors to use either pagetables which
are not radix trees or software-fill TLB's.
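
Checking the arithmetic of that example ("4K tasks" read as 4096):

\[ \frac{16\,\mathrm{TB}}{4\,\mathrm{MB}} \times 8\,\mathrm{B} = 32\,\mathrm{MB}\ \text{of PTEs per task}, \qquad 4096 \times 32\,\mathrm{MB} = 128\,\mathrm{GB} > 64\,\mathrm{GB} \]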


Cheers,
Bill

Martin J. Bligh
2002-05-22 21:18:54 UTC
Post by William Lee Irwin III
Post by Linus Torvalds
The solution really is a "don't do it then" kind of thing. If you have
5000 processes that all want to map a big shared memory area, and you
don't want to upgrade your CPU's, it's a _whole_ lot easier to just have a
magic "map_large_page()" system call, and start using the 2MB page support
of the x86.
map_large_page() sounds decent, though I don't really know how easy
it'll be to get apps to cooperate. I suspect it's easier when the
answer is "the app crashed" as opposed to "the kernel crashed".
If we could get the apps (well, Oracle) to co-operate, we could just use
clone ;-) Having this transparent for shmem segments would be really nice.

M.


Linus Torvalds
2002-05-22 21:23:39 UTC
Post by Martin J. Bligh
If we could get the apps (well, Oracle) to co-operate, we could just use
clone ;-) Having this transparent for shmem segments would be really nice.
The thing is, we won't get Oracle to rewrite a lot for a completely
threaded system. And clone does _not_ come with a way to share only parts
of the VM, and never will - that's fundamentally against the way "struct
mm_struct" works.

Oracle is apparently already used to magic shmem-like things, so doing
that is probably acceptable to them.

Linus


Andrea Arcangeli
2002-05-22 22:35:12 UTC
Post by Linus Torvalds
Post by Martin J. Bligh
If we could get the apps (well, Oracle) to co-operate, we could just use
clone ;-) Having this transparent for shmem segments would be really nice.
The thing is, we won't get Oracle to rewrite a lot for a completely
threaded system. And clone does _not_ come with a way to share only parts
actually, not using threads also provides a bit more protection across
the different "threads", but OTOH the shm part could be corrupted
anyway if there's a bug.
Post by Linus Torvalds
of the VM, and never will - that's fundamentally against the way "struct
mm_struct" works.
Oracle is apparently already used to magic shmem-like things, so doing
that is probably acceptable to them.
For x86, using largepages is the first priority; the relevance of
sharing pagetables is near to nothing compared to 4M pages. As HPA said
at the last kernel summit during the commentary on either the VM or the
Oracle talk (and I'm not sure everybody understood what he said),
without PAE, 4M pages give you shared pagetables for free, because
there's nothing left to share; and no, the pgd cannot be shared - the
fact that we're not using clone() has been our problem from the start.

With PAE there's the pmd to share, but if you actually do the math, the
pmd for a worth of 3G of address space is 12k per task, which
doesn't really matter at all: with 4000 tasks all mapping the same
1.5G shm segment, the amount of sharable pmd memory is reduced to
24Mbytes, and who cares about 24Mbytes with 4000 tasks working on 1.5G
of ram each (in particular with much more powerful tlb caching on such
ram virtual addresses)? At the very least it is a very secondary
interest, and it cannot make any difference at all without PAE.
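
The arithmetic behind those figures (a sketch, assuming PAE's 8-byte
pmd entries, each mapping 2MB):

\[ \frac{3\,\mathrm{GB}}{2\,\mathrm{MB}} \times 8\,\mathrm{B} = 12\,\mathrm{KB\ per\ task}, \qquad \frac{1.5\,\mathrm{GB}}{2\,\mathrm{MB}} \times 8\,\mathrm{B} \times 4000 = 24\,\mathrm{MB} \]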

On x86-64 as well we use a single top-level pagetable for all the tasks
(only the first top-level entry changes in the mm switches), so again
using 2M pages would be more than enough there too, just as in x86 PAE
(as if x86-64 had 3-level pagetables like PAE x86). Of course there we
can go way above 1.5G of shm mapped per task, so at some point, as the
shm size increases, pmd sharing may gain more relevance, but I don't see
it as a short-term issue at least.

As a last thing, we'll also need to enforce a large enough MAX_ORDER to
be able to allocate 2/4M pages. I had hoped we could turn it down to
save cpu cycles as soon as all the hashtable allocations were finally
rewritten to use the bootmem allocator, and now that won't be possible
anymore; but it's not a showstopper, desktops are idle all the time
anyway.

Andrea

Martin J. Bligh
2002-05-22 22:44:01 UTC
Post by Linus Torvalds
Post by Martin J. Bligh
If we could get the apps (well, Oracle) to co-operate, we could just use
clone ;-) Having this transparent for shmem segments would be really nice.
The thing is, we won't get Oracle to rewrite a lot for a completely
threaded system. And clone does _not_ come with a way to share only parts
of the VM, and never will - that's fundamentally against the way "struct
mm_struct" works.
We're actually playing with Oracle apps - I'm told they already run in
threaded mode on NT ... I personally get the feeling that Oracle's
commitment to Linux is distinctly half-hearted. The whole support matrix
debacle was pretty indicative, IMHO. All personal opinion; I speaketh
not for IBM.
Post by Linus Torvalds
Oracle is apparently already used to magic shmem-like things, so doing
that is probably acceptable to them.
We can but try, but I still think some transparent magic would be implementable.

M.


Wim Coekaerts
2002-05-28 02:08:33 UTC
Post by Martin J. Bligh
If we could get the apps (well, Oracle) to co-operate, we could just use
clone ;-) Having this transparent for shmem segments would be really nice.
Except that we fork() from different areas, e.g. at startup, or from the
listener process once things are up, or directly through a locally
running client etc., so it's not just a matter of using clone() and
being done...

Although we probably should have a look at it - maybe someone already
did; I will check it out.

Wim


Andrea Arcangeli
2002-05-31 20:39:11 UTC
Post by Wim Coekaerts
running client etc. so it's not just using clone() and done...
that seems a minor problem: you could still use shm, and the "corner
cases" could attach to the shm as usual, but the "core" threads wouldn't
need replication of pagetables to use the shm.

Andrea

Bill Davidsen
2002-05-23 14:16:41 UTC
Post by Linus Torvalds
Making the _generic_ code jump through hoops because some stupid special
case that nobody else is interested in is bad.
Thoughts in no particular order:
- set large page size based on file size or mapped section size
- set LPS based on a capability on the program
- set LPS based on a flag of some nature on the file
- set LPS based on the number of processes mapping the file

I mention these because it would be nice to get better behaviour from
programs which aren't optimized for Linux and may never be.
--
bill davidsen <***@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


Linus Torvalds
2002-05-23 17:18:17 UTC
Post by Bill Davidsen
Post by Linus Torvalds
Making the _generic_ code jump through hoops because some stupid special
case that nobody else is interested in is bad.
- set large page size based on file size or mapped section size
No can do. Most hardware out there (about 99.9%) only has two page
sizes, and the second page size actually depends on a kernel compile
option (2M or 4M).

And the problem with _big_ pages (as opposed to just "slightly bigger"
pages like some architectures have, ranging from 8kB - 64kB) is that they
are _way_ too easy to misuse for VM DoS attacks.

Basically, they aren't just a "gradual improvement" on the VM subsystem.
They have _totally_ different characteristics, and simply do not fit with
the VM.
Post by Bill Davidsen
- set LPS based on a capability on the program
- set LPS based on a flag of some nature on the file
- set LPS based on the number of processes mapping the file
I mention these because it would be nice to get better behaviour from
programs which aren't optimized for Linux and may never be.
One of the problems with LPS is that it simply _will_not_ be coherent with
read/write and the regular page cache.

If you make the LPS decision on some process capability or other flag, you
have to accept the mixing of small pages and large pages - because other
processes that do _not_ have the capability will not get the LPS.

And once you go there, you have not just started using different pages,
you've changed the _semantics_ of the pages: they are no longer coherent
with other processes accessing the same file.

And that's ignoring the fact that the regular interfaces change in other
ways, ie the alignment restrictions on mmap() etc change _radically_.

In other words, don't do it. It changes semantics, and because it changes
semantics the program _has_ to be aware of it. In other words, the program
has to be compiled for the behaviour.

Now, we can make those semantics _easier_ to use, so that the changes to
existing programs are minimal. But changes there will be.

Linus


Bill Davidsen
2002-05-23 19:34:06 UTC
Post by Linus Torvalds
Post by Bill Davidsen
Post by Linus Torvalds
Making the _generic_ code jump through hoops because some stupid special
case that nobody else is interested in is bad.
- set LPS based on a capability on the program
- set LPS based on a flag of some nature on the file
- set LPS based on the number of processes mapping the file
I mention these because it would be nice to get better behaviour from
programs which aren't optimized for Linux and may never be.
One of the problems with LPS is that it simply _will_not_ be coherent with
read/write and the regular page cache.
If you make the LPS decision on some process capability or other flag, you
have to accept the mixing of small pages and large pages - because other
processes that do _not_ have the capability will not get the LPS.
And once you go there, you have not just started using different pages,
you've changed the _semantics_ of the pages: they are no longer coherent
with other processes accessing the same file.
That's the case with a capability, obviously; it's less clear that it
could happen with a flag on the file, since any process doing an
mmap() on the file would get LPS.
Post by Linus Torvalds
And that's ignoring the fact that the regular interfaces change in other
ways, ie the alignment restrictions on mmap() etc change _radically_.
mmap() returns result codes; I personally would have no problem with a
program having to cope with getting one if that's what it takes. This
wasn't (and isn't) my issue, but it seems that Linux has been so clever
in other ways that it would be desirable to address the problems caused
by crappy application software.
Post by Linus Torvalds
In other words, don't do it. It changes semantics, and because it changes
semantics the program _has_ to be aware of it. In other words, the program
has to be compiled for the behaviour.
Now, we can make those semantics _easier_ to use, so that the changes to
existing programs are minimal. But changes there will be.
Then I hope someone gets the shared, paged, or otherwise smaller entries
working; I suspect some of my machines have more memory in page tables
than user memory :-(
--
bill davidsen <***@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


Linus Torvalds
2002-05-23 19:46:57 UTC
Post by Bill Davidsen
Post by Linus Torvalds
And once you go there, you have not just started using different pages,
you've changed the _semantics_ of the pages: they are no longer coherent
with other processes accessing the same file.
That's the case with a capability, obviously, it's less clear that if you
had a flag on the file that could happen, since any processes doing a
mmap() on the file would get LPS.
Note that in that case mmap() would be coherent with itself, but still
probably not coherent with read/write without _major_ surgery in the VM.

So even the per-file flag doesn't really help.

It also gets really hard to do a sane replacement policy for _any_ backing
store (in order to cover the whole file). The problem with the replacement
policy is that we simply cannot unmap the big page sanely, because we
don't want to force a huge 4MB IO of dirty data, _and_ because the VM
doesn't know about big pages in the first place.

My suggested interface only does anonymous pages exactly for this reason:
you get a fixed number of big pages, and no more. No replacement policies
to worry about, no fundamental VM changes, and you get what people
actually want to have (then you can teach some well-defined places like
the direct-IO page walking about the big page, so that the database can do
IO directly into a _part_ of the page).

Linus


Martin J. Bligh
2002-05-22 18:38:23 UTC
Post by Alan Cox
Fixing the application to use clone() not 4000 individual sets of page
tables might not be a bad plan either.
Do each of your tasks map the stuff at the same address. If you are
assuming this how do you plan to handle the person who doesn't. You won't
be able to share page tables then ?
I think so. They're also hardlocked in memory which makes life easier.
Post by Alan Cox
Can you even make that work -before- the customers have all upgraded
anyway ?
Given that we're selling a new line of machines based on this now, I'd
guess it'll be 5 years before they're all upgraded. On the other hand, I
think they'll lynch us if Linux doesn't work properly on this type of
machine within the next year ;-) But, yes, I still think it's worth it.
Hammer holds great promise, but it's just not here right now, and I
don't think we'll have production-level 8-way and 16-way machines for at
least a year ...

M.

Alan Cox
2002-05-22 17:50:53 UTC
Post by Martin J. Bligh
I wouldn't bother using RedHat's kernel for this at the moment,
Andrea's tree is where the development work for this area has all
been happening recently. He's working on integrating O(1) sched
right now, which will get rid of the biggest issue I have with -aa
Still? It's been in the Red Hat 7.3 tree since we released it. It's also
in the -ac tree, all nicely merged. I guess your definition of happening
is my definition of "happened" 8)


J Sloan
2002-05-22 17:54:52 UTC
Post by Alan Cox
Post by Martin J. Bligh
I wouldn't bother using RedHat's kernel for this at the moment,
Andrea's tree is where the development work for this area has all
been happening recently. He's working on integrating O(1) sched
right now, which will get rid of the biggest issue I have with -aa
Still? It's been in the Red Hat 7.3 tree since we released it. It's also
in the -ac tree, all nicely merged. I guess your definition of happening
is my definition of "happened" 8)
Huh? RH 7.3 kernel has the O(1) scheduler merged?

If the RH kernel is anything like the 2.4.19-pre8-ac5
I'm currently running, that is sweet indeed.

Joe


Alan Cox
2002-05-22 18:22:00 UTC
Post by J Sloan
Huh? RH 7.3 kernel has the O(1) scheduler merged?
If the RH kernel is anything like the 2.4.19-pre8-ac5
I'm currently running, that is sweet indeed.
rpm -q --changelog kernel | grep "O(1)"


J Sloan
2002-05-22 22:14:00 UTC
Post by Alan Cox
Post by J Sloan
Huh? RH 7.3 kernel has the O(1) scheduler merged?
If the RH kernel is anything like the 2.4.19-pre8-ac5
I'm currently running, that is sweet indeed.
rpm -q --changelog kernel | grep "O(1)"
Yes, I was being lazy - but now that finals
are done for now, I'll grab that kernel
source rpm and have a proper look at it.

Thanks,

Joe


Martin J. Bligh
2002-05-22 18:24:57 UTC
Post by Alan Cox
Post by Martin J. Bligh
I wouldn't bother using RedHat's kernel for this at the moment,
Andrea's tree is where the development work for this area has all
been happening recently. He's working on integrating O(1) sched
right now, which will get rid of the biggest issue I have with -aa
Still? It's been in the Red Hat 7.3 tree since we released it. It's also
in the -ac tree, all nicely merged. I guess your definition of happening
is my definition of "happened" 8)
There are definitely good things in both trees for this problem area at
the moment. If you're interested in fixing this, Alan and Andrea, let's
start a mergefest. I'm sure I can volunteer some IBM resources to help
port patches and test the hell out of it ... if you're willing to
consider taking things, I'll draw up a list of what the issues are, what
patches are available, and which trees they reside in (often none ;-( )

If my spies are correct, the 7.3AS kernel is still based on the old
2.4.9 VM, with no rmap at present ... correct? I presume 7.3 is the
2.4.18 or so VM with rmap?

M.


Alan Cox
2002-05-22 22:05:17 UTC
Post by Martin J. Bligh
If my spies are correct, 7.3AS kernel is still based off the old 2.4.9 VM, with
no rmap at present ... correct? I presume 7.3 is 2.4.18 or so VM with rmap?
<Red Hat Marketing>There is no such product as 7.3AS</Red Hat Marketing> ;)

The AS 2.1 kernel is 2.4.9-based for enterprise stability, with the
pre-"Linus hit by cosmic rays" VM, fixed and tuned for enterprise
workloads.

7.3 is the rmap VM

Alan

Alastair Stevens
2002-05-22 14:29:37 UTC
... Linus is not applying them
Shouldn't that be "Marcelo is not applying them"? Linus has devolved
all responsibility for 2.4 now, and is concentrating on the 2.5 series
and all its radical changes.

Marcelo objected to Andrea's mega-patch, but if I recall, he hinted that
he might start merging the split-up patches for 2.4.20 - in the
meantime, you can always apply the latest -aa patch yourself to a
2.4.19-pre kernel. Otherwise, the Red Hat patched kernel (which I
believe still doesn't use Andrea's VM at all) ought to work well, with
all their spiffy regression testing etc....

Cheers
Alastair

o o o o o o o o o o o o o o o o o o o o o o o o o o o o
Alastair Stevens \ \
MRC Biostatistics Unit \ \___________ 01223 330383
Cambridge UK \___ www.mrc-bsu.cam.ac.uk


Alan Cox
2002-05-22 14:52:20 UTC
Post by Alastair Stevens
2.4.19-pre kernel. Otherwise, the Red Hat patched kernel (which I
believe still doesn't use Andrea's VM at all) ought to work well, with
all their spiffy regression testing etc....
The Red Hat 7.3 kernel uses Rik van Riel's rmap and Andre Hedrick's IDE
updates. It did indeed pass our stress testing and seems to perform very
well under memory contention and high shared page counts - the classic
desktop/developer setup.

Alan
