Discussion:
large page patch (fwd) (fwd)
Linus Torvalds
2002-08-02 19:34:08 UTC
Permalink
[ linux-kernel cc'd, simply because I don't want to write the same thing
over and over again ]

[ Executive summary: the current large-page-patch is nothing but a magic
interface to the TLB. Don't view it as anything else, or you'll just be
confused by all the smoke and mirrors. ]
Because _none_ of the large-page codepaths are shared with _any_ of the
normal cases.
Isn't that currently an implementation detail?
Yes and no.

We may well expand the FS layer to bigger pages, but "bigger" is almost
certainly not going to include things like 256MB pages - if for no other
reason than the fact that memory fragmentation really means that the limit
on page sizes in practice is somewhere around 128kB for any reasonable
usage patterns even with gigabytes of RAM.

And _maybe_ we might get to the single-digit megabytes. I doubt it, simply
because even with a good buddy allocator and a memory manager that
actively frees pages to get large contiguous chunks of RAM, it's basically
impossible to have something that can reliably give you chunks that big
without making normal performance go totally down the toilet.

(Yeah, once you have terabytes of memory, that worry probably ends up
largely going away. I don't think that is going to be a common enough
platform for Linux to care about in the next ten years, though).

So there are implementation issues, yes. In particular, there _is_ a push
for larger pages in the FS and generic MM layers too, but the issues there
are very different and have basically nothing in common with the TLB and
page table mapping issues of the current push.

What this VM/VFS push means is that we may actually have a _different_
"large page" support on that level, where the most likely implementation
is that the "struct address_space" will at some point have a new member
that specifies the "page allocation order" for that address space. This
will allow us to do per-file allocations, so that some files (or some
filesystems) might want to do all IO in 64kB chunks, and they'd just make
the address_space specify a page allocation order that matches that.
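A minimal sketch of that idea, using a toy structure rather than the real
struct address_space from <linux/fs.h>; the field name and helper below are
invented purely to illustrate how a per-mapping allocation order would
translate into an IO chunk size:

#include <stdio.h>

/* Toy stand-in for struct address_space; not the real kernel definition. */
struct toy_address_space {
    unsigned int alloc_order;   /* 0 = base 4kB pages, 4 = 64kB chunks, ... */
};

/* IO chunk size implied by the mapping's allocation order. */
static unsigned long chunk_bytes(const struct toy_address_space *mapping,
                                 unsigned long base_page_size)
{
    return base_page_size << mapping->alloc_order;
}

int main(void)
{
    /* A filesystem wanting 64kB IO on 4kB base pages would pick order 4. */
    struct toy_address_space file_mapping = { .alloc_order = 4 };

    printf("IO chunk: %lu bytes\n", chunk_bytes(&file_mapping, 4096));
    return 0;
}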

This is in fact one of the reasons I explicitly _want_ to keep the
interfaces separate - because there are two totally different issues at
play, and I suspect that we'll end up implementing _both_ of them, but
that they will _still_ have no commonalities.

The current largepage patch is really nothing but an interface to the TLB.
Please view it as that - a direct TLB interface that has zero impact on
the VFS or VM layers, and that is meant _purely_ as a way to expose hw
capabilities to the few applications that really really want them.

The important thing to take away from this is that _even_ if we could
change the FS and VM layers to know about a per-address_space variable-
sized PAGE_CACHE_SIZE (which I think is the long-term goal), that doesn't
impact the fact that we _also_ want to have the TLB interface.

Maybe the largepage patch could be improved upon by just renaming it, and
making clear that it's a "TLB_hugepage" thing. That's what a CPU designer
thinks of when you say "largepage" to him. Some of the confusion is
probably because a VM/FS person in an OS group does _not_ necessarily
think the same way, but thinks about doing big-granularity IO.

Linus
David Mosberger
2002-08-03 03:19:59 UTC
Permalink
Linus> We may well expand the FS layer to bigger pages, but "bigger"
Linus> is almost certainly not going to include things like 256MB
Linus> pages - if for no other reason than the fact that memory
Linus> fragmentation really means that the limit on page sizes in
Linus> practice is somewhere around 128kB for any reasonable usage
Linus> patterns even with gigabytes of RAM.

Linus> And _maybe_ we might get to the single-digit megabytes. I
Linus> doubt it, simply because even with a good buddy allocator and
Linus> a memory manager that actively frees pages to get large
Linus> contiguous chunks of RAM, it's basically impossible to have
Linus> something that can reliably give you chunks that big without
Linus> making normal performance go totally down the toilet.

The Rice people avoided some of the fragmentation problems by
pro-actively allocating a max-order physical page, even when only a
(small) virtual page was being mapped. This should work very well as
long as the total memory usage (including memory lost due to internal
fragmentation of max-order physical pages) doesn't exceed available
memory. That's not a condition which will hold for every system in
the world, but I suspect it is true for lots of systems for large
periods of time. And since superpages quickly become
counter-productive in tight-memory situations anyhow, this seems like
a very reasonable approach.

--david
Linus Torvalds
2002-08-03 03:32:10 UTC
Permalink
Post by David Mosberger
The Rice people avoided some of the fragmentation problems by
pro-actively allocating a max-order physical page, even when only a
(small) virtual page was being mapped.
This probably works ok if
- the superpages are only slightly bigger than the small pages
- superpages are a nice optimization.
Post by David Mosberger
And since superpages quickly become
counter-productive in tight-memory situations anyhow, this seems like
a very reasonable approach.
Ehh.. The only people who are _really_ asking for the superpages want
almost nothing _but_ superpages. They are willing to use 80% of all memory
for just superpages.

Yes, it's Oracle etc, and the whole point for these users is to avoid
having any OS memory allocation for these areas.

Linus
David Mosberger
2002-08-03 04:17:45 UTC
Permalink
And since superpages quickly become counter-productive in
tight-memory situations anyhow, this seems like a very reasonable
approach.
Linus> Ehh.. The only people who are _really_ asking for the
Linus> superpages want almost nothing _but_ superpages. They are
Linus> willing to use 80% of all memory for just superpages.

Linus> Yes, it's Oracle etc, and the whole point for these users is
Linus> to avoid having any OS memory allocation for these areas.

My terminology is perhaps a bit too subtle: I use "superpage"
exclusively for the case where multiple pages get coalesced into a
larger page. The "large page" ("huge page") case that you were
talking about is different, since pages never get demoted or promoted.

I wasn't disagreeing with your case for separate large page syscalls.
Those syscalls certainly simplify implementation and, as you point
out, it well may be the case that a transparent superpage scheme never
will be able to replace the former.

--david
Linus Torvalds
2002-08-03 04:26:52 UTC
Permalink
Post by David Mosberger
My terminology is perhaps a bit too subtle: I use "superpage"
exclusively for the case where multiple pages get coalesced into a
larger page. The "large page" ("huge page") case that you were
talking about is different, since pages never get demoted or promoted.
Ahh, ok.
Post by David Mosberger
I wasn't disagreeing with your case for separate large page syscalls.
Those syscalls certainly simplify implementation and, as you point
out, it well may be the case that a transparent superpage scheme never
will be able to replace the former.
Somebody already had patches for the transparent superpage thing for
alpha, which supports it. I remember seeing numbers implying that helped
noticeably.

But yes, that definitely doesn't work for humongous pages (or whatever we
should call the multi-megabyte-special-case-thing ;).

Linus
David Mosberger
2002-08-03 04:39:36 UTC
Permalink
Post by David Mosberger
I wasn't disagreeing with your case for separate large page
syscalls. Those syscalls certainly simplify implementation and,
as you point out, it well may be the case that a transparent
superpage scheme never will be able to replace the former.
Linus> Somebody already had patches for the transparent superpage
Linus> thing for alpha, which supports it. I remember seeing numbers
Linus> implying that helped noticeably.

Yes, I saw those. I still like the Rice work a _lot_ better. It's
just a thing of beauty, from a design point of view (disclaimer: I
haven't seen the implementation, so there may be ugly things
lurking...).

Linus> But yes, that definitely doesn't work for humongous pages (or
Linus> whatever we should call the multi-megabyte-special-case-thing
Linus> ;).

Yes, you're probably right. 2MB was reported to be fine in the Rice
experiments, but I doubt 256MB (and much less 4GB, as supported by
some CPUs) would fly.

--david
David S. Miller
2002-08-03 05:20:24 UTC
Permalink
From: David Mosberger <***@napali.hpl.hp.com>
Date: Fri, 2 Aug 2002 21:39:36 -0700
Post by David Mosberger
I wasn't disagreeing with your case for separate large page
syscalls. Those syscalls certainly simplify implementation and,
as you point out, it well may be the case that a transparent
superpage scheme never will be able to replace the former.
Linus> Somebody already had patches for the transparent superpage
Linus> thing for alpha, which supports it. I remember seeing numbers
Linus> implying that helped noticeably.

Yes, I saw those. I still like the Rice work a _lot_ better.

Now here's the thing. To me, we should be adding these superpage
syscalls to things like the implementation of malloc() :-) If you
allocate enough anonymous pages together, you should get a superpage
in the TLB if that is easy to do. Once any hint of memory pressure
occurs, you just break up the large page clusters as you hit such
ptes. This is what one of the Linux large-page implementations did
and I personally find it the most elegant way to handle the so called
"paging complexity" of transparent superpages.

At that point it's like "why the system call". If it would rather be
more of a large-page reservation system than an "optimization hint"
then these syscalls would sit better with me. Currently I think they
are superfluous. To me the hint to use large-pages is a given :-)

Stated another way, if these syscalls said "gimme large pages for this
area and lock them into memory", this would be fine. If the syscalls
say "use large pages if you can", that's crap. And in fact we could
use mmap() attribute flags if we really thought that stating this was
necessary.
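For what it's worth, the "gimme large pages for this area and lock them into
memory" semantics described here ended up looking roughly like the sketch
below in much later kernels; MAP_HUGETLB and MAP_LOCKED did not exist when
this was written, so this is only an illustration of the reservation-style
semantics, not of any interface from that era:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2UL * 1024 * 1024;   /* one 2MB huge page */

    /* Fail outright if a pinned huge-page mapping cannot be granted. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_LOCKED,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    munmap(p, len);
    return 0;
}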
Linus Torvalds
2002-08-03 17:35:00 UTC
Permalink
Post by David S. Miller
Now here's the thing. To me, we should be adding these superpage
syscalls to things like the implementation of malloc() :-) If you
allocate enough anonymous pages together, you should get a superpage
in the TLB if that is easy to do.
For architectures that have these "small" superpages, we can just do it
transparently. That's what the alpha patches did.

The problem space is roughly the same as just page coloring.
Post by David S. Miller
At that point it's like "why the system call". If it would rather be
more of a large-page reservation system than an "optimization hint"
then these syscalls would sit better with me. Currently I think they
are superfluous. To me the hint to use large-pages is a given :-)
Yup.

David, you did page coloring once.

I bet your patches worked reasonably well to color into 4 or 8 colors.

How well do you think something like your old patches would work if

- you _require_ 1024 colors in order to get the TLB speedup on some
hypothetical machine (the same hypothetical machine that might
hypothetically run on 95% of all hardware ;)

- the machine is under heavy load, and heavy load is exactly when you
want this optimization to trigger.

Can you explain this difficulty to people?
Post by David S. Miller
Stated another way, if these syscalls said "gimme large pages for this
area and lock them into memory", this would be fine. If the syscalls
say "use large pages if you can", that's crap. And in fact we could
use mmap() attribute flags if we really thought that stating this was
necessary.
I agree 100%.

I think we can at some point do the small cases completely transparently,
with no need for a new system call, and not even any new hint flags. We'll
just silently do 4/8-page superpages and be done with it. Programs don't
need to know about it to take advantage of better TLB usage.

Linus
David Mosberger
2002-08-03 19:30:05 UTC
Permalink
Linus> How well do you think something like your old patches would
Linus> work if

Linus> - you _require_ 1024 colors in order to get the TLB speedup
Linus> on some hypothetical machine (the same hypothetical machine
Linus> that might hypothetically run on 95% of all hardware ;)

Linus> - the machine is under heavy load, and heavy load is exactly
Linus> when you want this optimization to trigger.

Your point about wanting databases to have access to giant pages even
under memory pressure is a good one. I had not considered that
before. However, what we really are talking about then is a security
or resource policy as to who gets to allocate from a reserved and
pinned pool of giant physical pages. You don't need separate system
calls for that: with a transparent superpage framework and a
privileged & reserved giant-page pool, it's trivial to set up things
such that your favorite data base will always be able to get the giant
pages (and hence the giant TLB mappings) it wants. The only thing you
lose in the transparent case is control over _which_ pages need to use
the pinned giant pages. I can certainly imagine cases where this
would be an issue, but I kind of doubt it would be an issue for
databases.

As Dave Miller justly pointed out, it's stupid for a task not to ask
for giant pages for anonymous memory. The only reason this is not a
smart thing overall is that globally it's not optimal (it is optimal
only locally, from the task's point of view). So if the only barrier
to getting the giant pinned pages is needing to know about the new
system calls, I'll predict that very soon we'll have EVERY task in the
system allocating such pages (and LD_PRELOAD tricks make that pretty
much trivial). Then we're back to square one, because the favorite
database may not even be able to start up, because all the "reserved"
memory is already used up by the other tasks.

Clearly there need to be some additional policies in effect, no
matter what the implementation is (the normal VM policies don't work,
because, by definition, the pinned giant pages are not pageable).

In my opinion, the primary benefit of the separate syscalls is still
ease-of-implementation (which isn't unimportant, of course).

--david
Linus Torvalds
2002-08-03 19:43:47 UTC
Permalink
Post by David Mosberger
Your point about wanting databases to have access to giant pages even
under memory pressure is a good one. I had not considered that
before. However, what we really are talking about then is a security
or resource policy as to who gets to allocate from a reserved and
pinned pool of giant physical pages.
Absolutely. We can't allow just anybody to allocate giant pages, since
they are a scarce resource (set up at boot time in both Ingo's and Intel's
patches - with the potential to move things around later with additional
interfaces).
Post by David Mosberger
You don't need separate system
calls for that: with a transparent superpage framework and a
privileged & reserved giant-page pool, it's trivial to set up things
such that your favorite data base will always be able to get the giant
pages (and hence the giant TLB mappings) it wants. The only thing you
lose in the transparent case is control over _which_ pages need to use
the pinned giant pages. I can certainly imagine cases where this
would be an issue, but I kind of doubt it would be an issue for
databases.
That's _probably_ true. There aren't that many allocations that ask for
megabytes of consecutive memory that wouldn't want to do it. However,
there might certainly be non-critical maintenance programs (with the same
privileges as the database program proper) that _do_ do large allocations,
and that we don't want to give large pages to.

Guessing is always bad, especially since the application certainly does
know what it wants.

Linus
David Mosberger
2002-08-03 21:18:15 UTC
Permalink
You don't need separate system calls for that: with a transparent
superpage framework and a privileged & reserved giant-page pool,
it's trivial to set up things such that your favorite data base
will always be able to get the giant pages (and hence the giant
TLB mappings) it wants. The only thing you lose in the
transparent case is control over _which_ pages need to use the
pinned giant pages. I can certainly imagine cases where this
would be an issue, but I kind of doubt it would be an issue for
databases.
Linus> That's _probably_ true. There aren't that many allocations
Linus> that ask for megabytes of consecutive memory that wouldn't
Linus> want to do it. However, there might certainly be non-critical
Linus> maintenance programs (with the same privileges as the
Linus> database program proper) that _do_ do large allocations, and
Linus> that we don't want to give large pages to.

Linus> Guessing is always bad, especially since the application
Linus> certainly does know what it wants.

Yes, but that applies even to a transparent superpage scheme: in those
instances where an application knows what page size is optimal, it's
better if the application can express that (saves time
promoting/demoting pages needlessly). It's not unlike madvise() or
the readahead() syscall: use reasonable policies for the ordinary
apps, and provide the means to let the smart apps tell the kernel
exactly what they need.
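A small sketch of that madvise()-style division of labor; MADV_WILLNEED
already existed at the time, while the MADV_HUGEPAGE hint shown under the
#ifdef only appeared years later with transparent hugepages and is included
purely to illustrate the kind of "smart app tells the kernel" hint being
argued for here:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 8UL * 1024 * 1024;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Ordinary hint: we will touch this range soon, read it ahead. */
    madvise(buf, len, MADV_WILLNEED);

#ifdef MADV_HUGEPAGE
    /* Later-era hint: back this range with large TLB mappings if possible. */
    madvise(buf, len, MADV_HUGEPAGE);
#endif

    munmap(buf, len);
    return 0;
}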

--david
Hubertus Franke
2002-08-03 21:54:30 UTC
Permalink
Post by David Mosberger
On Sat, 3 Aug 2002 12:43:47 -0700 (PDT), Linus Torvalds
You don't need separate system calls for that: with a transparent
superpage framework and a privileged & reserved giant-page pool,
it's trivial to set up things such that your favorite data base
will always be able to get the giant pages (and hence the giant
TLB mappings) it wants. The only thing you lose in the
transparent case is control over _which_ pages need to use the
pinned giant pages. I can certainly imagine cases where this
would be an issue, but I kind of doubt it would be an issue for
databases.
Linus> That's _probably_ true. There aren't that many allocations
Linus> that ask for megabytes of consecutive memory that wouldn't
Linus> want to do it. However, there might certainly be non-critical
Linus> maintenance programs (with the same privileges as the
Linus> database program proper) that _do_ do large allocations, and
Linus> that we don't want to give large pages to.
Linus> Guessing is always bad, especially since the application
Linus> certainly does know what it wants.
Yes, but that applies even to a transparent superpage scheme: in those
instances where an application knows what page size is optimal, it's
better if the application can express that (saves time
promoting/demoting pages needlessly). It's not unlike madvise() or
the readahead() syscall: use reasonable policies for the ordinary
apps, and provide the means to let the smart apps tell the kernel
exactly what they need.
--david
So that's what is/can be done through the madvise() call or a flag on mmap():
force a specific size and policy. Why do you need a new system call?

The Rice paper solved this reasonably elegantly: reservation, and a check
after a while. If you didn't use the reserved memory, you lose it; this is
the auto promotion/demotion.

For special apps one provides the interface using madvise().
--
-- Hubertus Franke (***@watson.ibm.com)
David S. Miller
2002-08-04 00:35:30 UTC
Permalink
From: Hubertus Franke <***@watson.ibm.com>
Date: Sat, 3 Aug 2002 17:54:30 -0400

The Rice paper solved this reasonably elegantly: reservation, and a check
after a while. If you didn't use the reserved memory, you lose it; this is
the auto promotion/demotion.

I keep seeing this Rice stuff being mentioned over and over,
can someone post a URL pointer to this work?
David Mosberger
2002-08-04 02:25:30 UTC
Permalink
DaveM> From: Hubertus Franke <***@watson.ibm.com> Date: Sat,
DaveM> 3 Aug 2002 17:54:30 -0400

DaveM> The Rice paper solved this reasonably elegantly: reservation,
DaveM> and a check after a while. If you didn't use the reserved memory,
DaveM> you lose it; this is the auto promotion/demotion.

DaveM> I keep seeing this Rice stuff being mentioned over and over,
DaveM> can someone post a URL pointer to this work?

Sure thing. It's the first link under "Publications" at this URL:

http://www.cs.rice.edu/~jnavarro/

--david
Hubertus Franke
2002-08-04 17:19:05 UTC
Permalink
Post by David Mosberger
On Sat, 03 Aug 2002 17:35:30 -0700 (PDT), "David S. Miller"
DaveM> 3 Aug 2002 17:54:30 -0400
DaveM> The Rice paper solved this reasonably elegantly: reservation,
DaveM> and a check after a while. If you didn't use the reserved memory,
DaveM> you lose it; this is the auto promotion/demotion.
DaveM> I keep seeing this Rice stuff being mentioned over and over,
DaveM> can someone post a URL pointer to this work?
http://www.cs.rice.edu/~jnavarro/
--david
Also in this context:

"Implemenation of Multiple Pagesize Support in HP-UX"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/subramanian/subramanian.pdf

"General Purpose Operating System Support for Multiple Page Sizes"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
--
-- Hubertus Franke (***@watson.ibm.com)
Daniel Phillips
2002-08-09 15:20:52 UTC
Permalink
Post by Hubertus Franke
"General Purpose Operating System Support for Multiple Page Sizes"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping. The main additional
wrinkle I'd contemplated is introducing a new ZONE_LARGE, and GPF_LARGE,
which means the caller promises not to pin the allocation unit for long
periods and does not mind if the underlying physical page changes
spontaneously. Defragmenting in this zone is straightforward.
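As a toy model of that contract (all names below are invented for
illustration): allocations that pass the flag are marked movable, and only
those may be migrated by the defragmenter, with the reverse map left to fix
up any ptes pointing at the old frame:

#include <stdbool.h>
#include <stdio.h>

#define TOY_GFP_DEFRAG  0x1   /* "the physical page under me may change" */

struct toy_page {
    unsigned long pfn;        /* current physical frame number */
    bool movable;             /* set when allocated with TOY_GFP_DEFRAG */
};

static void toy_alloc(struct toy_page *page, unsigned long pfn, int gfp_flags)
{
    page->pfn = pfn;
    page->movable = gfp_flags & TOY_GFP_DEFRAG;
}

/* Active defragmentation: relocate only pages that opted in. */
static bool toy_migrate(struct toy_page *page, unsigned long new_pfn)
{
    if (!page->movable)
        return false;         /* pinned or unmovable: leave it alone */
    page->pfn = new_pfn;      /* rmap would rewrite the pte(s) here */
    return true;
}

int main(void)
{
    struct toy_page p;

    toy_alloc(&p, 100, TOY_GFP_DEFRAG);
    printf("migrated: %d, pfn now %lu\n", toy_migrate(&p, 900), p.pfn);
    return 0;
}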
--
Daniel
Linus Torvalds
2002-08-09 15:56:09 UTC
Permalink
Post by Daniel Phillips
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping.
Note that even active defrag won't be able to handle the case where you
want to have lots of big pages, constituting a large percentage of available
memory.

Not unless you think I am crazy enough to do garbage collection on kernel
data structures (repeat after me: "garbage collection is stupid, slow, bad
for caches, and only for people who cannot count").

Also, I think the jury (ie Andrew) is still out on whether rmap is worth
it.

Linus
Daniel Phillips
2002-08-09 16:15:51 UTC
Permalink
Post by Linus Torvalds
Post by Daniel Phillips
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping.
Note that even active defrag won't be able to handle the case where you
want to have lots of big pages, constituting a large percentage of available
memory.
Perhaps I'm missing something, but I don't see why.
Post by Linus Torvalds
Not unless you think I am crazy enough to do garbage collection on kernel
data structures (repeat after me: "garbage collection is stupid, slow, bad
for caches, and only for people who cannot count").
Slab allocations would not have GFP_DEFRAG (I mistakenly wrote GFP_LARGE
earlier) and so would be allocated outside ZONE_LARGE.
Post by Linus Torvalds
Also, I think the jury (ie Andrew) is still out on whether rmap is worth
it.
Tell me about it. Well, I feel strongly enough about it to spend the next
week coding yet another pte chain optimization.
--
Daniel
Rik van Riel
2002-08-09 16:31:52 UTC
Permalink
Post by Daniel Phillips
Post by Linus Torvalds
Also, I think the jury (ie Andrew) is still out on whether rmap is worth
it.
Tell me about it. Well, I feel strongly enough about it to spend the
next week coding yet another pte chain optimization.
Well yes, we've _seen_ that 2.4 -rmap improves system behaviour,
but we don't have any tools to _quantify_ that improvement.

As long as the only measurable thing is the overhead (which may
get close to zero, but will never become zero) the numbers will
continue being against rmap. Not because of rmap, but just
because the overhead is the only thing being measured ;)

Personally I'll spend some more time just improving the behaviour
of the VM, even if we don't have tools to quantify the improvement.

Somehow there seems to be a lack of meaningful "macrobenchmarks" ;)

(as opposed to microbenchmarks, which don't always have a
relation to how the performance of the system as a whole will
be influenced by some code change)

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
Daniel Phillips
2002-08-09 18:08:23 UTC
Permalink
Post by Rik van Riel
Post by Daniel Phillips
Post by Linus Torvalds
Also, I think the jury (ie Andrew) is still out on whether rmap is worth
it.
Tell me about it. Well, I feel strongly enough about it to spend the
next week coding yet another pte chain optimization.
Well yes, we've _seen_ that 2.4 -rmap improves system behaviour,
but we don't have any tools to _quantify_ that improvement.
As long as the only measurable thing is the overhead (which may
get close to zero, but will never become zero) the numbers will
continue being against rmap. Not because of rmap, but just
because the overhead is the only thing being measured ;)
You know what to do, instead of moaning about it. Just code up a test load
that blatantly favors rmap and post the results. In effect, that's what
Andrew's 'doitlots' benchmark does, in the other direction.
--
Daniel
Linus Torvalds
2002-08-09 16:51:30 UTC
Permalink
Post by Daniel Phillips
Post by Linus Torvalds
Note that even active defrag won't be able to handle the case where you
want to have lots of big pages, constituting a large percentage of available
memory.
Perhaps I'm missing something, but I don't see why.
The statistics are against you. rmap won't help at all with all the other
kernel allocations, and the dcache/icache is often large, and on big
machines while there may be tens of thousands of idle entries, there will
also be hundreds of _non_idle entries that you can't just remove.
Post by Daniel Phillips
Slab allocations would not have GFP_DEFRAG (I mistakenly wrote GFP_LARGE
earlier) and so would be allocated outside ZONE_LARGE.
.. at which point you then get zone balancing problems.

Or we end up with the same kind of special zone that we have _anyway_ in
the current large-page patch, in which case the point of doing this is
what?

Linus
Daniel Phillips
2002-08-09 17:11:56 UTC
Permalink
Post by Linus Torvalds
Post by Daniel Phillips
Slab allocations would not have GFP_DEFRAG (I mistakenly wrote GFP_LARGE
earlier) and so would be allocated outside ZONE_LARGE.
.. at which poin tyou then get zone balancing problems.
Or we end up with the same kind of special zone that we have _anyway_ in
the current large-page patch, in which case the point of doing this is
what?
The current large-page patch doesn't have any kind of defragmentation in the
special zone and that memory is just not available for other uses. The thing
is, when demand for large pages is low the zone should be allowed to fragment.

All of highmem also qualifies as defraggable memory, so certainly on these
big memory machines we can easily get a majority of memory in large pages.

I don't see a fundamental reason for new zone balancing problems. The fact
that balancing has sucked by tradition is not a fundamental reason ;-)
--
Daniel
Rik van Riel
2002-08-09 16:27:49 UTC
Permalink
Post by Linus Torvalds
Post by Daniel Phillips
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping.
Note that even active defrag won't be able to handle the case where you
want to have lots of big pages, constituting a large percentage of available
memory.
Not unless you think I am crazy enough to do garbage collection on kernel
data structures (repeat after me: "garbage collection is stupid, slow, bad
for caches, and only for people who cannot count").
It's also necessary if you want to prevent death by physical
memory exhaustion since it's pretty easy to construct workloads
where the page table memory requirement is larger than physical
memory.

OTOH, I also think that it's (probably, almost certainly) not
worth doing active defragmenting for really huge superpages.
This category of garbage collection just gets into the 'ridiculous'
class ;)
Post by Linus Torvalds
Also, I think the jury (ie Andrew) is still out on whether rmap is worth
it.
One problem we're running into here is that there are absolutely
no tools to measure some of the things rmap is supposed to fix,
like page replacement.

Sure, Craig Kulesa's tests all went faster on rmap than on the
virtual scanning VM, but that's just one application. There doesn't
seem to exist any kind of tool to quantify things like "quality
of page replacement" or even "efficiency of page replacement" ...

I suspect this is true for many pieces of the kernel, no tools
available to measure the benefits of the code, but only tools
to microbenchmark the _overhead_ of the code...

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
Linus Torvalds
2002-08-09 16:52:53 UTC
Permalink
Post by Rik van Riel
Post by Linus Torvalds
Also, I think the jury (ie Andrew) is still out on whether rmap is worth
it.
One problem we're running into here is that there are absolutely
no tools to measure some of the things rmap is supposed to fix,
like page replacement.
Read up on positivism.

"If it can't be measured, it doesn't exist".

The point being that there are things we can measure, and until anything
else comes around, those are the things that will have to guide us.

Linus
Bill Rugolsky Jr.
2002-08-09 17:46:47 UTC
Permalink
Post by Linus Torvalds
Read up on positivism.
Please don't. Read Karl Popper instead.
Post by Linus Torvalds
"If it can't be measured, it doesn't exist".
The positivist Copenhagen interpretation stifled important areas of
physics for half a century. There is a distinction to be made between
an explanatory construct (whereby I mean to imply nothing fancy, no
quarks, just a brick), and the evidence that supports that construct
in the form of observable quantities. It's all there in Popper's work.
Post by Linus Torvalds
The point being that there are things we can measure, and until anything
else comes around, those are the things that will have to guide us.
True, as far as it goes. Measurement=good, idle-speculation=bad.

But it pays to keep in mind that progress is nonlinear. In 1988, Van
Jacobson noted (http://www.kohala.com/start/vanj.88jul20.txt):

(I had one test case that went like

Basic system: 600 KB/s
add feature A: 520 KB/s
drop A, add B: 530 KB/s
add both A & B: 700 KB/s

Obviously, any statement of the form "feature A/B is good/bad"
is bogus.) But, in spite of the ambiguity, some of the network
design folklore I've heard seems to be clearly wrong.

Such anomalies abound.

Regards,

Bill Rugolsky
y***@fsmlabs.com
2002-08-09 17:40:50 UTC
Permalink
Post by Rik van Riel
One problem we're running into here is that there are absolutely
no tools to measure some of the things rmap is supposed to fix,
like page replacement.
But page replacement is a means to an end. One thing that would be
very interesting to know is how well the basic VM assumptions about
locality work in a Linux server, desktop, and embedded environment.

You have a LRU approximation that is supposed to approximate working
sets that were originally understood and measured on < 1Meg machines
with static libraries, tiny cache, no GUI and no mmap.
Post by Rik van Riel
Read up on positivism.
It's been discredited as recursively unsound reasoning.

---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com www.rtlinux.com
Rik van Riel
2002-08-09 19:15:26 UTC
Permalink
Post by y***@fsmlabs.com
Post by Rik van Riel
One problem we're running into here is that there are absolutely
no tools to measure some of the things rmap is supposed to fix,
like page replacement.
But page replacement is a means to an end. One thing that would be
very interesting to know is how well the basic VM assumptions about
locality work in a Linux server, desktop, and embedded environment.
You have a LRU approximation that is supposed to approximate working
sets that were originally understood and measured on < 1Meg machines
with static libraries, tiny cache, no GUI and no mmap.
Absolutely, it would be interesting to know this.
However, up to now I haven't seen any programs that
measure this.

In this case we know what we want to measure, know we
want to measure it for all workloads, but don't know
how to do this in a quantifiable way.
Post by y***@fsmlabs.com
Post by Rik van Riel
Read up on positivism.
It's been discredited as recursively unsound reasoning.
To further this point, by how much has the security number
of Linux improved as a result of the inclusion of the Linux
Security Module framework ? ;)

I'm sure even Linus will agree that the security potential
has increased, even though he can't measure or quantify it.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
Linus Torvalds
2002-08-09 21:20:11 UTC
Permalink
Post by Rik van Riel
To further this point, by how much has the security number
of Linux improved as a result of the inclusion of the Linux
Security Module framework ? ;)
I'm sure even Linus will agree that the security potential
has increased, even though he can't measure or quantify it.
Actually, the security number is irrelevant to me - the "noise index" from
people who think security protocols are interesting is what drove that
patch (and that one is definitely measurable).

This way, the security noise is now in somebody else's court ;)

Linus
Marcin Dalecki
2002-08-09 21:19:41 UTC
Permalink
Post by y***@fsmlabs.com
Post by Rik van Riel
One problem we're running into here is that there are absolutely
no tools to measure some of the things rmap is supposed to fix,
like page replacement.
But page replacement is a means to an end. One thing that would be
very interesting to know is how well the basic VM assumptions about
locality work in a Linux server, desktop, and embedded environment.
You have a LRU approximation that is supposed to approximate working
sets that were originally understood and measured on < 1Meg machines
with static libraries, tiny cache, no GUI and no mmap.
Post by Rik van Riel
Read up on positivism.
It's been discredited as recursively unsound reasoning.
Well not taking the "axiom of choice" for granted is really
really narrowing what can be reasoned about in a really really not
funny way. It makes it for example very "difficult" to invent real
numbers. Well apparently recently some guy published a book which is
basically proposing that the world is just a FSA, so we can see again
that this inconvenience appears to be still very compelling to people
who never had to deal with complicated stuff like for example fluid
dynamics and the associated differential equations :-).

But if talking about actual computers, and since those are in esp.
finite, it may very well be possible to get around without it. ;-)
Helge Hafting
2002-08-12 09:23:50 UTC
Permalink
Post by Rik van Riel
One problem we're running into here is that there are absolutely
no tools to measure some of the things rmap is supposed to fix,
like page replacement.
There are things like running vmstat while running tests or production.

My office desktop machine (256M RAM) rarely swaps more than 10M
during work with 2.5.30. It used to go some 70M into swap
after a few days of writing, browsing, and those updatedb runs.

Helge Hafting
Bill Davidsen
2002-08-13 03:15:24 UTC
Permalink
Post by Helge Hafting
Post by Rik van Riel
One problem we're running into here is that there are absolutely
no tools to measure some of the things rmap is supposed to fix,
like page replacement.
There are things like running vmstat while running tests or production.
My office desktop machine (256M RAM) rarely swaps more than 10M
during work with 2.5.30. It used to go some 70M into swap
after a few days of writing, browsing, and those updatedb runs.
Now tell us how someone who isn't a VM developer can tell if that's bad or
good. Is it good because it didn't swap more than it needed to, or bad
because there were more things it could have swapped to make more buffer
room?

Serious question, tuning the -aa VM sometimes makes the swap use higher,
even as the response to starting small jobs while doing kernel compiles or
mkisofs gets better. I don't normally tune -ac kernels much, so I can't
comment there.
--
bill davidsen <***@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Rik van Riel
2002-08-13 03:31:42 UTC
Permalink
Post by Bill Davidsen
Now tell us how someone who isn't a VM developer can tell if that's bad
or good. Is it good because it didn't swap more than it needed to, or
bad because there were more things it could have swapped to make more
buffer room?
Good point, just looking at the swap usage doesn't mean
much because we're interested in the _consequences_ of
that number and not in the number itself.
Post by Bill Davidsen
Serious question, tuning the -aa VM sometimes makes the swap use higher,
even as the response to starting small jobs while doing kernel compiles
or mkisofs gets better. I don't normally tune -ac kernels much, so I
can't comment there.
The key word here is "response", benchmarks really need
to be able to measure responsiveness.

Some benchmarks (eg. irman by Bob Matthews) do this
already, but we're still focussing too much on throughput.


In 1990 Margo Seltzer wrote an excellent paper on disk IO
sorting and its effects on throughput and latency. The
end result was that in order to get decent throughput by
doing just disk IO sorting you would need queues so deep
that IO latency would grow to about 30 seconds. ;)

Of course, if databases or online shops would increase
their throughput by going to deep queueing and every
request would get 30 second latencies ... they would
immediately lose their users (or customers) !!!

I'm pretty convinced that sysadmins aren't interested
in throughput, at least not until throughput is so low
that it starts affecting system response latency.


regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
Helge Hafting
2002-08-13 07:28:24 UTC
Permalink
Post by Bill Davidsen
Post by Helge Hafting
My office desktop machine (256M RAM) rarely swaps more than 10M
during work with 2.5.30. It used to go some 70M into swap
after a few days of writing, browsing, and those updatedb runs.
Now tell us how someone who isn't a VM developer can tell if that's bad or
good. Is it good because it didn't swap more than it needed to, or bad
because there were more things it could have swapped to make more buffer
room?
It feels more responsive too - which is no surprise. Like most users,
I don't _expect_ to wait for swapin when pressing a key or something.
Waiting for file io seems to be less of a problem, that stuff
_is_ on disk after all. I guess many people who know a little about
computers feel this way. People that don't know what a "disk" is
may be different and more interested in total waiting.

On the serious side: vmstat provides more than swap info. It also
lists block io, where one might see if the block io goes up or down.
I suggest finding some repeatable workload with lots of file & swap
io, and seeing how much we get of each. My guess is that rmap
results in less io to do the same job. Not only swap io, but
swap+file io too. The design is more capable of selecting
the _right_ page to evict. (Assuming that page usage may
tell us something useful.) So the only questions left are
if the current implementation is good, and if the
improved efficiency makes up for the memory overhead.
Post by Bill Davidsen
Serious question, tuning the -aa VM sometimes makes the swap use higher,
even as the response to starting small jobs while doing kernel compiles or
mkisofs gets better. I don't normally tune -ac kernels much, so I can't
comment there.
Swap is good if there's lots of file io and
lots of unused apps sitting around. And bad if there's a large working
set and little _repeating_ file io. Such as the user switching between
a bunch of big apps working on few files. And perhaps some
non-repeating io like updatedb or mail processing...

Helge Hafting
Andrew Morton
2002-08-09 21:38:13 UTC
Permalink
Post by Linus Torvalds
...
Also, I think the jury (ie Andrew) is still out on whether rmap is worth
it.
The most glaring problem has been the fork/exec/exit overhead.

Anton had a program which did 10,000 forks and we were looking at
the time it took for them all to exit. Initial rmap slowed the exiting
by 400%, and we now have that down to 70%.

I've been treating a gcc configure script as the most forky workload
which we're likely to care about. rmap slowed configure down by 7%
and the work Daniel and I have done has reduced that to 2.8%.

(Not that rmap is the biggest problem for configure:

c013c07c 176 1.93046 __page_add_rmap
c013c194 225 2.46792 __page_remove_rmap
c012a274 236 2.58857 free_one_pgd
c012a7f8 405 4.44225 __constant_c_and_count_memset
c01055fc 917 10.0581 poll_idle
c012a6cc 1253 13.7436 __constant_memcpy

It's that i387 struct copy.)

There don't seem to be any catastrophic failure modes here, and
I expect tests could be concocted against the virtual scan which
_do_ have gross performance problems.

So. Not great, but OK if the reverse map gives us something back.
And I don't agree that the quality of page replacement is all that
hard to measure. It's just that nobody has got off their butt
and tried to measure it.

The other worry is the ZONE_NORMAL space consumption of pte_chains.
We've halved that, but it will still make high sharing levels
unfeasible on the big ia32 machines. We are dependent upon large
pages to solve that problem. (Resurrection of pte_highmem is in
progress, but it doesn't work yet).

I don't see a sufficient case for reverting rmap at present, and
it's time to move on with other work. There is nothing in the
queue at present which _requires_ rmap, so if we do hit a
showstopper then going back to a virtual scan will be feasible
for at least the next month.

Two points:

1) It would be most useful to have *some* damn test on the table
which works better with 2.4-rmap, along with a believable
description of why it's better.

2) It would be most irritating to reach 2.6.5 before discovering
that there is some terrible resource consumption problem
arising from the reverse map. Now is a good time for people
with large machines to be testing 2.5, please. This is
happening, and I expect we'll be in better shape in a month
or so.
Eric W. Biederman
2002-08-10 18:20:06 UTC
Permalink
Post by Andrew Morton
The other worry is the ZONE_NORMAL space consumption of pte_chains.
We've halved that, but it will still make high sharing levels
unfeasible on the big ia32 machines. We are dependent upon large
pages to solve that problem. (Resurrection of pte_highmem is in
progress, but it doesn't work yet).
There is a second method to address this. Pages can be swapped out
of the page tables and still remain in the page cache, the virtual
scan does this all of the time. This should allow for arbitrary
amounts of sharing. There is some overhead, in faulting the pages
back in but it is much better than cases that do not work. A simple
implementation would have a maximum pte_chain length.
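A toy sketch of that "cap the pte_chain length" idea (names invented here,
not the 2.5 rmap code): once a page is mapped from more than a fixed number
of places, one existing pte is unmapped before adding the new one; the page
itself stays in the page cache and can simply be faulted back in:

#include <stdio.h>

#define MAX_CHAIN 4   /* arbitrary cap for illustration */

struct toy_page {
    int nr_ptes;      /* current pte_chain length */
};

static void toy_unmap_one(struct toy_page *page)
{
    /* real code would pick a victim pte, clear it and flush the TLB;
       the page itself stays in the page cache */
    page->nr_ptes--;
}

static void toy_add_rmap(struct toy_page *page)
{
    if (page->nr_ptes >= MAX_CHAIN)
        toy_unmap_one(page);  /* keep the chain bounded */
    page->nr_ptes++;
}

int main(void)
{
    struct toy_page page = { .nr_ptes = 0 };

    for (int i = 0; i < 10; i++)
        toy_add_rmap(&page);
    printf("chain length capped at %d (now %d)\n", MAX_CHAIN, page.nr_ptes);
    return 0;
}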

For any page that is not backed by anonymous memory we do not need to
keep the pte entries after the page has been swapped out of the page
table. Which should show a reduction in page table size. In a highly
shared setting with anonymous pages it is likely worth it to promote
those pages to being posix shared memory.

All of the above should allow us to keep a limit on the amount of
resources that go towards sharing, reducing the need for something
like pte_highmem, and keeping memory pressure down in general.

For the cases you describe I have trouble seeing pte_highmem as
anything other than a performance optimization. Only placing shmem
direct and indirect entries in high memory or in swap can I see as
a limit to feasibility.

Eric
Daniel Phillips
2002-08-10 18:59:36 UTC
Permalink
Post by Eric W. Biederman
Post by Andrew Morton
The other worry is the ZONE_NORMAL space consumption of pte_chains.
We've halved that, but it will still make high sharing levels
unfeasible on the big ia32 machines. We are dependent upon large
pages to solve that problem. (Resurrection of pte_highmem is in
progress, but it doesn't work yet).
There is a second method to address this. Pages can be swapped out
of the page tables and still remain in the page cache, the virtual
scan does this all of the time. This should allow for arbitrary
amounts of sharing. There is some overhead, in faulting the pages
back in but it is much better than cases that do not work. A simple
implementation would have a maximum pte_chain length.
Oh gosh, nice point. We could put together a lovely cooked benchmark where
copy_page_range just fails to copy all the mmap pages, which are most of them
in the bash test.
--
Daniel
Rik van Riel
2002-08-10 19:55:45 UTC
Permalink
Post by Eric W. Biederman
Post by Andrew Morton
The other worry is the ZONE_NORMAL space consumption of pte_chains.
We've halved that, but it will still make high sharing levels
unfeasible on the big ia32 machines.
There is a second method to address this. Pages can be swapped out
of the page tables and still remain in the page cache, the virtual
scan does this all of the time. This should allow for arbitrary
amounts of sharing. There is some overhead, in faulting the pages
back in but it is much better than cases that do not work. A simple
implementation would have a maximum pte_chain length.
Indeed. We need this same thing for page tables too, otherwise
a high sharing situation can easily "require" more page table
memory than the total amount of physical memory in the system ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
Eric W. Biederman
2002-08-10 19:54:47 UTC
Permalink
Post by Rik van Riel
Post by Eric W. Biederman
Post by Andrew Morton
The other worry is the ZONE_NORMAL space consumption of pte_chains.
We've halved that, but it will still make high sharing levels
unfeasible on the big ia32 machines.
There is a second method to address this. Pages can be swapped out
of the page tables and still remain in the page cache, the virtual
scan does this all of the time. This should allow for arbitrary
amounts of sharing. There is some overhead, in faulting the pages
back in but it is much better than cases that do not work. A simple
implementation would have a maximum pte_chain length.
Indeed. We need this same thing for page tables too, otherwise
a high sharing situation can easily "require" more page table
memory than the total amount of physical memory in the system ;)
It's exactly the same situation. To remove a pte from the chain you must
remove it from the page table as well. Then we just need to free
pages with no interesting pte entries.
Eric
Hubertus Franke
2002-08-09 18:32:38 UTC
Permalink
Post by Daniel Phillips
Post by Hubertus Franke
"General Purpose Operating System Support for Multiple Page Sizes"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping. The main additional
wrinkle I'd contemplated is introducing a new ZONE_LARGE, and GFP_LARGE,
which means the caller promises not to pin the allocation unit for long
periods and does not mind if the underlying physical page changes
spontaneously. Defragmenting in this zone is straightforward.
I think the objection to that is that in many cases the cost of
defragmentation is too heavy to be recouped through TLB miss handling
alone.
What the above paper does is a reservation protocol with timeouts
which decide that either (a) the reserved mem was used in time and hence
the page is upgraded to a large page OR (b) the reserved mem is not used and
hence unused parts are released.
It relies on the fact that within the given timeout, most/many pages are
typically referenced.
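A toy model of that reservation-with-timeout decision (purely illustrative,
not the Rice or IBM code): a large frame is reserved up front, and when the
timeout fires the reservation is either promoted to a superpage, if all of
it was touched, or the untouched base pages are handed back:

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_SUPER 16    /* base pages making up one superpage */

struct toy_reservation {
    bool touched[PAGES_PER_SUPER];
};

/* Called when the reservation's timeout expires; returns true if promoted. */
static bool toy_decide(struct toy_reservation *r)
{
    int used = 0;

    for (int i = 0; i < PAGES_PER_SUPER; i++)
        used += r->touched[i];

    if (used == PAGES_PER_SUPER)
        return true;          /* promote: map the range as one superpage */

    /* otherwise release the untouched base pages back to the allocator */
    return false;
}

int main(void)
{
    struct toy_reservation r = { { false } };

    for (int i = 0; i < 8; i++)   /* only half the reservation was used */
        r.touched[i] = true;
    printf("promoted: %d\n", toy_decide(&r));
    return 0;
}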

In our patch we have the ZONE_LARGE into which we allocate the
large pages. Currently they are effectively pinned down, but in 2.4.18
we had them backed by the page cache.

My gut feeling right now would be to follow the reservation-based scheme,
but as said, it's a gut feeling.
Defragmenting to me seems a matter of last resort; copying pages is expensive.
If you however simply target the superpages for smaller clusters, then it's an
option. But at the same time one might contemplate simply making
the base page 16K or 32K and at page fault time simply map / swap / read /
writeback the whole cluster.
What studies have been done on this wrt the benefits of such an approach?
I talked to Ted Ts'o who would really like small superpages for better I/O
performance...
--
-- Hubertus Franke (***@watson.ibm.com)
Daniel Phillips
2002-08-09 18:43:42 UTC
Permalink
Post by Hubertus Franke
Post by Daniel Phillips
Post by Hubertus Franke
"General Purpose Operating System Support for Multiple Page Sizes"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping. The main additional
wrinkle I'd contemplated is introducing a new ZONE_LARGE, and GFP_LARGE,
which means the caller promises not to pin the allocation unit for long
periods and does not mind if the underlying physical page changes
spontaneously. Defragmenting in this zone is straightforward.
I think the objection to that is that in many cases the cost of
defragmentation is too heavy to be recouped through TLB miss handling
alone.
You pay the cost only on transition from a load that doesn't use many large
pages to one that does, it is not an ongoing cost.
Post by Hubertus Franke
[...]
Defragmenting to me seems a matter of last resort; copying pages is expensive.
It is the only way to ever have a seamless implementation. Really, I don't
understand this fear of active defragmentation. Oh well, like davem said,
code talks.
--
Daniel
Hubertus Franke
2002-08-09 19:17:55 UTC
Permalink
Post by Daniel Phillips
Post by Hubertus Franke
Post by Daniel Phillips
Post by Hubertus Franke
"General Purpose Operating System Support for Multiple Page Sizes"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping. The main additional
wrinkle I'd contemplated is introducing a new ZONE_LARGE, and
GFP_LARGE, which means the caller promises not to pin the allocation
unit for long periods and does not mind if the underlying physical page
changes spontaneously. Defragmenting in this zone is straightforward.
I think the objection to that is that in many cases the cost of
defragmentation is too heavy to be recouped through TLB miss handling
alone.
You pay the cost only on transition from a load that doesn't use many large
pages to one that does, it is not an ongoing cost.
Correct. Maybe I misunderstood: when are you doing the co-allocation of
adjacent pages (page clusters, superpages)?
Our intent was to do it at page fault time and break up only during
memory pressure.
Post by Daniel Phillips
Post by Hubertus Franke
[...]
Defragmenting to me seems a matter of last resort; copying pages is expensive.
It is the only way to ever have a seamless implementation. Really, I don't
understand this fear of active defragmentation. Oh well, like davem said,
code talks.
--
-- Hubertus Franke (***@watson.ibm.com)
Alan Cox
2002-08-11 20:30:17 UTC
Permalink
Post by Daniel Phillips
Post by Hubertus Franke
"General Purpose Operating System Support for Multiple Page Sizes"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping. The main additional
wrinkle I'd contemplated is introducing a new ZONE_LARGE, and GFP_LARGE,
which means the caller promises not to pin the allocation unit for long
periods and does not mind if the underlying physical page changes
spontaneously. Defragmenting in this zone is straightforward.
Slight problem. This paper is about a patented SGI method for handling
defragmentation into large pages (6,182,089). They patented it before
the presentation.

They also hold patents on the other stuff that you've recently been
discussing about not keeping separate rmap structures until there are
more than some value 'n' when they switch from direct to indirect lists
of reverse mappings (6,112,286)

If you are going read and propose things you find on Usenix at least
check what the authors policies on patents are.

Perhaps someone should first of all ask SGI to give the Linux community
permission to use it in a GPL'd operating system ?
Daniel Phillips
2002-08-11 22:33:29 UTC
Permalink
Post by Alan Cox
Post by Daniel Phillips
Post by Hubertus Franke
"General Purpose Operating System Support for Multiple Page Sizes"
http://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
This reference describes roughly what I had in mind for active
defragmentation, which depends on reverse mapping. The main additional
wrinkle I'd contemplated is introducing a new ZONE_LARGE, and GFP_LARGE,
which means the caller promises not to pin the allocation unit for long
periods and does not mind if the underlying physical page changes
spontaneously. Defragmenting in this zone is straightforward.
Slight problem. This paper is about a patented SGI method for handling
defragmentation into large pages (6,182,089). They patented it before
the presentation.
See 'straightforward' above, i.e., obvious to a practitioner of the art.
This is another one-click patent.

Look at claim 16, it covers our buddy allocator quite nicely:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1='6182089'.WKU.&OS=PN/6182089&RS=PN/6182089

Claim 1 covers the idea of per-size freelist thresholds, below which no
coalescing is done.

Claim 13 covers the idea of having a buddy system on each node of a numa
system. Bill is going to be somewhat disappointed to find out he can't do
that any more.

It goes on in this vein. I suggest all vm hackers have a close look at
this. Yes, it's stupid, but we can't just ignore it.
Post by Alan Cox
They also hold patents on the other stuff that you've recently been
discussing about not keeping separate rmap structures until there are
more than some value 'n' when they switch from direct to indirect lists
of reverse mappings (6,112,286)
This is interesting. By setting their 'm' to 1, you get essentially the
scheme implemented by Dave a few weeks ago, and by setting 'm' to 0, the
patent covers pretty much every imaginable reverse mapping scheme. Gee,
so SGI thought of reverse mapping in 1997 or thereabouts, and nobody ever
did before?

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1='6112286'.WKU.&OS=PN/6112286&RS=PN/6112286

Claim 2 covers use of their reverse mapping scheme, which as we have seen,
includes all reverse mapping schemes, for migrating the data content of
pages, and updating the page table pointers.

Claim 4 goes on to cover migration of data pages between nodes of a numa
system. (Got that wli?)

This patent goes on to claim just about everything you can do with a
reverse map. It's sure lucky for SGI that they were the first to think
of the idea of reverse mapping.
Post by Alan Cox
If you are going read and propose things you find on Usenix at least
check what the authors policies on patents are.
As always, I developed my ideas from first principles. I never saw or
heard of the paper until a few days ago. I don't need their self-serving
paper to figure this stuff out, and if they are going to do blatantly
commercial stuff like that, I'd rather the paper were not published at
all. Perhaps Usenix needs to establish a policy about that.
Post by Alan Cox
Perhaps someone should first of all ask SGI to give the Linux community
permission to use it in a GPL'd operating system ?
Yes, we should ask nicely, if we run into something that matters. Asking
nicely isn't the only option though.

And yes, I'm trying to be polite. It's just so stupid.

--
Daniel
Linus Torvalds
2002-08-11 22:55:08 UTC
Permalink
Post by Daniel Phillips
It goes on in this vein. I suggest all vm hackers have a close look at
this. Yes, it's stupid, but we can't just ignore it.
Actually, we can, and I will.

I do not look up any patents on _principle_, because (a) it's a horrible
waste of time and (b) I don't want to know.

The fact is, technical people are better off not looking at patents. If
you don't know what they cover and where they are, you won't be knowingly
infringing on them. If somebody sues you, you change the algorithm or you
just hire a hit-man to whack the stupid git.

Linus
Larry McVoy
2002-08-11 23:15:01 UTC
Permalink
Post by Linus Torvalds
Post by Daniel Phillips
It goes on in this vein. I suggest all vm hackers have a close look at
this. Yes, it's stupid, but we can't just ignore it.
Actually, we can, and I will.
I do not look up any patents on _principle_, because (a) it's a horrible
waste of time and (b) I don't want to know.
The fact is, technical people are better off not looking at patents. If
you don't know what they cover and where they are, you won't be knowingly
infringing on them. If somebody sues you, you change the algorithm or you
just hire a hit-man to whack the stupid git.
This issue is more complicated than you might think. Big companies with
big pockets are very nervous about being too closely associated with
Linux because of this problem. Imagine that IBM, for example, starts
shipping IBM Linux. Somewhere in the code there is something that
infringes on a patent. Given that it is IBM Linux, people can make
the case that IBM should have known and should have fixed it and
since they didn't, they get sued. Notice that IBM doesn't ship
their own version of Linux, they ship / support Red Hat or Suse
(maybe others, doesn't matter). So if they ever get hassled, they'll
vector the problem to those little guys and the issue will likely
get dropped because the little guys have no money to speak of.

Maybe this is all good, I dunno, but be aware that the patents
have long arms and effects.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Linus Torvalds
2002-08-12 01:26:00 UTC
Permalink
Post by Larry McVoy
This issue is more complicated than you might think.
No, it's not. You miss the point.
Post by Larry McVoy
Big companies with
big pockets are very nervous about being too closely associated with
Linux because of this problem.
The point being that that is _their_ problem, and at a level that has
nothing to do with technology.

I'm saying that technical people shouldn't care. I certainly don't. The
people who _should_ care are patent attorneys etc, since they actually
get paid for it, and can better judge the matter anyway.

Everybody in the whole software industry knows that any non-trivial
program (and probably most trivial programs too, for that matter) will
infringe on _some_ patent. Ask anybody. It's apparently an accepted fact,
or at least a saying that I've heard too many times.

I just don't care. Clearly, if all significant programs infringe on
something, the issue is no longer "do we infringe", but "is it an issue"?

And that's _exactly_ why technical people shouldn't care. The "is it an
issue" is not something a technical guy can answer, since the answer
depends on totally non-technical things.

Ask your legal counsel, and I strongly suspect that if he is any good, he
will tell you the same thing. Namely that it's _his_ problem, and that
your engineers should not waste their time trying to find existing
patents.

Linus
Larry McVoy
2002-08-12 05:05:45 UTC
Permalink
Post by Linus Torvalds
Ask your legal counsel, and I strongly suspect that if he is any good, he
will tell you the same thing. Namely that it's _his_ problem, and that
your engineers should not waste their time trying to find existing
patents.
Partially true for us. We do do patent searches to make sure we aren't
doing anything blatantly stupid.

I do agree with you 100% that it is impossible to ship any software that
does not infringe on some patent. It's a big point of contention in
contract negotiations because everyone wants you to warrant that your
software doesn't infringe and indemnify them if it does.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Alan Cox
2002-08-12 10:31:48 UTC
Permalink
Post by Linus Torvalds
Ask your legal counsel, and I strongly suspect that if he is any good, he
will tell you the same thing. Namely that it's _his_ problem, and that
your engineers should not waste their time trying to find existing
patents.
Wasn't a case of wasting time. That one is extremely well known because
there were upset people when SGI patented it and then submitted a usenix
paper on it.
David S. Miller
2002-08-04 00:28:36 UTC
Permalink
From: Linus Torvalds <***@transmeta.com>
Date: Sat, 3 Aug 2002 10:35:00 -0700 (PDT)

David, you did page coloring once.

I bet your patches worked reasonably well to color into 4 or 8 colors.

How well do you think something like your old patches would work if

- you _require_ 1024 colors in order to get the TLB speedup on some
hypothetical machine (the same hypothetical machine that might
hypothetically run on 95% of all hardware ;)

- the machine is under heavy load, and heavy load is exactly when you
want this optimization to trigger.

Can you explain this difficulty to people?

Actually, we need some clarification here. I tried coloring several
times, the problem with my diffs is that I tried to do the coloring
all the time no matter what.

I wanted strict coloring on the 2-color level for broken L1 caches
that have aliasing problems. If I could make this work, all of the
dumb cache flushing I have to do on Sparcs could be deleted. Because
of this, I couldn't legitimately change the cache flushing rules
unless I had absolutely strict coloring done on all pages where it
mattered (basically anything that could end up in the user's address
space).

So I kept track of color existence precisely in the page lists. The
implementation was fast, but things got really bad fragmentation wise.

No matter how I tweaked things, just running a kernel build 40 or 50
times would fragment the free page lists to shreds such that 2-order
and up pages simply did not exist.

Another person did an implementation of coloring which basically
worked by allocating a big-order chunk and slicing that up. It's not
strictly done and that is why his version works better. In fact I
like that patch a lot and it worked quite well for L2 coloring on
sparc64. Any time there is page pressure, he tosses away all of the
color carving big-order pages.

I think we can at some point do the small cases completely transparently,
with no need for a new system call, and not even any new hint flags. We'll
just silently do 4/8-page superpages and be done with it. Programs don't
need to know about it to take advantage of better TLB usage.

Ok. I think even 64-page ones are viable to attempt but we'll see.
Most TLBs that do superpages seem to have a range from the base
page size to the largest supported superpage, with each supported size
a fixed power-of-two multiple of the previous one.

For example on Sparc64 this is:

8K PAGE_SIZE
64K PAGE_SIZE * 8
512K PAGE_SIZE * 64
4M PAGE_SIZE * 512

One of the transparent large page implementations just defined a
small array that the core code used to try and see "hey how big
a superpage can we try" and if the largest for the area failed
(because page orders that large weren't available) it would simply
fall back to the next smallest superpage size.
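
A minimal sketch of that size-table fallback in C; the order values mirror the sparc64 sizes listed above (8K base pages), and try_alloc_superpage() is a hypothetical name rather than an interface from any of the patches discussed here.

#include <linux/mm.h>

/*
 * Hypothetical sketch of "try the largest superpage, fall back to the
 * next smaller size".  The order table mirrors the sparc64 sizes in
 * the message; try_alloc_superpage() is illustrative, not real code.
 */
static const unsigned int superpage_orders[] = { 9, 6, 3, 0 }; /* 4M, 512K, 64K, 8K */
#define NR_SP_ORDERS (sizeof(superpage_orders) / sizeof(superpage_orders[0]))

static struct page *try_alloc_superpage(unsigned int max_order,
					unsigned int *order_out)
{
	unsigned int i;

	for (i = 0; i < NR_SP_ORDERS; i++) {
		unsigned int order = superpage_orders[i];
		struct page *page;

		if (order > max_order)		/* area too small for this size */
			continue;

		page = alloc_pages(GFP_USER, order);
		if (page) {
			*order_out = order;
			return page;
		}
		/* Pages of this order not available: try the next smaller one. */
	}
	return NULL;
}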
Hubertus Franke
2002-08-04 17:31:24 UTC
Permalink
Post by David S. Miller
Date: Sat, 3 Aug 2002 10:35:00 -0700 (PDT)
David, you did page coloring once.
I bet your patches worked reasonably well to color into 4 or 8 colors.
How well do you think something like your old patches would work if
- you _require_ 1024 colors in order to get the TLB speedup on some
hypothetical machine (the same hypothetical machine that might
hypothetically run on 95% of all hardware ;)
- the machine is under heavy load, and heavy load is exactly when you
want this optimization to trigger.
Can you explain this difficulty to people?
Actually, we need some clarification here. I tried coloring several
times, the problem with my diffs is that I tried to do the coloring
all the time no matter what.
I wanted strict coloring on the 2-color level for broken L1 caches
that have aliasing problems. If I could make this work, all of the
dumb cache flushing I have to do on Sparcs could be deleted. Because
of this, I couldn't legitimately change the cache flushing rules
unless I had absolutely strict coloring done on all pages where it
mattered (basically anything that could end up in the user's address
space).
So I kept track of color existence precisely in the page lists. The
implementation was fast, but things got really bad fragmentation wise.
No matter how I tweaked things, just running a kernel build 40 or 50
times would fragment the free page lists to shreds such that 2-order
and up pages simply did not exist.
Another person did an implementation of coloring which basically
worked by allocating a big-order chunk and slicing that up. It's not
strictly done and that is why his version works better. In fact I
like that patch a lot and it worked quite well for L2 coloring on
sparc64. Any time there is page pressure, he tosses away all of the
color carving big-order pages.
I think we can at some point do the small cases completely
transparently, with no need for a new system call, and not even any new
hint flags. We'll just silently do 4/8-page superpages and be done with it.
Programs don't need to know about it to take advantage of better TLB usage.
Ok. I think even 64-page ones are viable to attempt but we'll see.
Most TLBs that do superpages seem to have a range from the base
page size to the largest supported superpage, with each supported size
a fixed power-of-two multiple of the previous one.
8K PAGE_SIZE
64K PAGE_SIZE * 8
512K PAGE_SIZE * 64
4M PAGE_SIZE * 512
One of the transparent large page implementations just defined a
small array that the core code used to try and see "hey how big
a superpage can we try" and if the largest for the area failed
(because page orders that large weren't available) it would simply
fall back to the next smallest superpage size.
Well, that's exactly what we do !!!!

We also ensure that if one process opens with basic page size and
the next one opens with super page size that we appropriately map
the second one to smaller pages to avoid conflict in case of shared
memory or memory mapped files.

As for the page coloring!
Can we tweak the buddy allocator to give us this additional functionality?
Seems like we can have a free list per color, and if that's empty we go back to
the buddy system. There we should be able to do some magic based on the bitmaps
to figure out which page to use that fits the right color?

Fragmentation is an issue.
--
-- Hubertus Franke (***@watson.ibm.com)
Linus Torvalds
2002-08-04 18:38:12 UTC
Permalink
Post by Hubertus Franke
As for the page coloring!
Can we tweak the buddy allocator to give us this additional functionality?
I would really prefer to avoid this, and get "95% coloring" by just doing
read-ahead with higher-order allocations instead of the current "loop
allocation of one block".

I bet that you will get _practically_ perfect coloring with just two small
changes:

- do_anonymous_page() looks to see if the page tables are empty around
the faulting address (and check vma ranges too, of course), and
optimistically does a non-blocking order-X allocation.

If the order-X allocation fails, we're likely low on memory (this is
_especially_ true since the very fact that we do lots of order-X
allocations will probably actually help keep fragmentation down
normally), and we just allocate one page (with a regular GFP_USER this
time).

Map in all pages.

- do the same for page_cache_readahead() (this, btw, is where radix trees
will kick some serious ass - we'd have had a hard time doing the "is
this range of order-X pages populated" efficiently with the old hashes).

I bet just those fairly small changes will give you effective coloring,
_and_ they are also what you want for doing small superpages.

And no, I do not want separate coloring support in the allocator. I think
coloring without superpage support is stupid and worthless (and
complicates the code for no good reason).

Linus
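
A rough sketch of the opportunistic allocation described in that message, assuming a hypothetical pte_range_empty() helper and an arbitrary FAULTAHEAD_ORDER; the non-blocking higher-order attempt and the plain GFP_USER single-page fallback follow the text above.

#include <linux/mm.h>

#define FAULTAHEAD_ORDER 3	/* try 8 pages at a time; value is illustrative */

/* Hypothetical helper, assumed to report whether the aligned window of
 * page table entries around the fault is still completely empty. */
extern int pte_range_empty(struct vm_area_struct *vma, unsigned long start,
			   unsigned int order);

static struct page *anon_fault_alloc(struct vm_area_struct *vma,
				     unsigned long addr, unsigned int *order)
{
	unsigned long start = addr & ~((PAGE_SIZE << FAULTAHEAD_ORDER) - 1);
	struct page *page = NULL;

	/* Only try the big allocation when the aligned window fits the vma
	 * and its ptes are empty. */
	if (start >= vma->vm_start &&
	    start + (PAGE_SIZE << FAULTAHEAD_ORDER) <= vma->vm_end &&
	    pte_range_empty(vma, start, FAULTAHEAD_ORDER))
		/* Non-blocking attempt; failure likely means low memory. */
		page = alloc_pages(GFP_USER & ~__GFP_WAIT, FAULTAHEAD_ORDER);

	if (page) {
		*order = FAULTAHEAD_ORDER;
		return page;
	}

	/* Fall back to the regular one-page allocation. */
	*order = 0;
	return alloc_page(GFP_USER);
}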
Andrew Morton
2002-08-04 19:23:16 UTC
Permalink
Post by Linus Torvalds
Post by Hubertus Franke
As for the page coloring!
Can we tweak the buddy allocator to give us this additional functionality?
I would really prefer to avoid this, and get "95% coloring" by just doing
read-ahead with higher-order allocations instead of the current "loop
allocation of one block".
I bet that you will get _practically_ perfect coloring with just two small
- do_anonymous_page() looks to see if the page tables are empty around
the faulting address (and check vma ranges too, of course), and
optimistically does a non-blocking order-X allocation.
If the order-X allocation fails, we're likely low on memory (this is
_especially_ true since the very fact that we do lots of order-X
allocations will probably actually help keep fragmentation down
normally), and we just allocate one page (with a regular GFP_USER this
time).
Map in all pages.
This would be a problem for short-lived processes. Because "map in
all pages" also means "zero them out". And I think that performing
a 4k clear_user_highpage() immediately before returning to userspace
is optimal. It's essentially a cache preload for userspace.

If we instead clear out 4 or 8 pages, we trash a ton of cache and
the chances of userspace _using_ pages 1-7 in the short-term are
lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
but the cache implications of faultahead are still there.

Could we establish the eight pte's but still arrange for pages 1-7
to trap, so the kernel can zero them out at the latest possible time?
Post by Linus Torvalds
- do the same for page_cache_readahead() (this, btw, is where radix trees
will kick some serious ass - we'd have had a hard time doing the "is
this range of order-X pages populated" efficiently with the old hashes).
On the nopage path, yes. That memory is cache-cold anyway.
Linus Torvalds
2002-08-04 19:28:54 UTC
Permalink
Post by Andrew Morton
Could we establish the eight pte's but still arrange for pages 1-7
to trap, so the kernel can zero them out at the latest possible time?
You could do that by marking the pages as being there, but PROT_NONE.

On the other hand, cutting down the number of initial pagefaults (by _not_
doing what you suggest) might be a bigger speedup for process startup than
the slowdown from occasionally doing unnecessary work.

I suspect that there is some non-zero order-X (probably 2 or 3), where you
just win more than you lose. Even for small programs.

Linus
David S. Miller
2002-08-05 05:42:20 UTC
Permalink
From: Linus Torvalds <***@transmeta.com>
Date: Sun, 4 Aug 2002 12:28:54 -0700 (PDT)

I suspect that there is some non-zero order-X (probably 2 or 3), where you
just win more than you lose. Even for small programs.

Furthermore it would obviously help to enhance the clear_user_page()
interface to handle multiple pages because that would nullify the
startup/finish overhead of the copy loop. (read as: things like TLB
loads and FPU save/restore on some platforms)
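
A sketch of what such an interface might look like; clear_user_pages() is an assumed name, and this naive version just loops, whereas the point above is that an arch could hoist the per-page setup (temporary mapping, FPU save/restore) outside the loop.

#include <linux/highmem.h>

/*
 * Hypothetical multi-page variant of clear_user_page().  A real arch
 * implementation would do the expensive setup once for all 1 << order
 * pages; this generic fallback just loops over the existing helper.
 */
static void clear_user_pages(struct page *page, unsigned long vaddr,
			     unsigned int order)
{
	unsigned int i;

	for (i = 0; i < (1U << order); i++)
		clear_user_highpage(page + i, vaddr + i * PAGE_SIZE);
}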
Hubertus Franke
2002-08-04 19:30:24 UTC
Permalink
Post by Linus Torvalds
Post by Hubertus Franke
As for the page coloring!
Can we tweak the buddy allocator to give us this additional
functionality?
I would really prefer to avoid this, and get "95% coloring" by just doing
read-ahead with higher-order allocations instead of the current "loop
allocation of one block".
Yes, if we (correctly) assume that page coloring only buys you significant
benefits for low-associativity caches (e.g. <4 or <= 8).
Post by Linus Torvalds
I bet that you will get _practically_ perfect coloring with just two small
- do_anonymous_page() looks to see if the page tables are empty around
the faulting address (and check vma ranges too, of course), and
optimistically does a non-blocking order-X allocation.
As long as the alignments are observed, which I guess you imply by the range.
Post by Linus Torvalds
If the order-X allocation fails, we're likely low on memory (this is
_especially_ true since the very fact that we do lots of order-X
allocations will probably actually help keep fragmentation down
normally), and we just allocate one page (with a regular GFP_USER this
time).
Correct.
Post by Linus Torvalds
Map in all pages.
- do the same for page_cache_readahead() (this, btw, is where radix trees
will kick some serious ass - we'd have had a hard time doing the "is
this range of order-X pages populated" efficiently with the old hashes).
Hey, we use the radix tree to track page cache mappings for large pages
particularly for this reason...
Post by Linus Torvalds
I bet just those fairly small changes will give you effective coloring,
_and_ they are also what you want for doing small superpages.
Well, in what you described above there is no concept of superpages
the way it is defined for the purpose of <tracking> and <TLB overhead
reduction>.
If you don't know about super pages at the VM level, then you need to
deal with them at TLB fault level to actually create the <large TLB>
entry. That's what the INTC patch will do, namely throwing all the
complexity over the fence to the page fault handler.
In your case, not keeping track of the super pages in the
VM layer and PT layer requires discovering the large page at soft TLB
time by scanning the PT proximity for contiguous pages, if we are talking now
about the read_ahead ....
In our case, we store the same physical address of the super page
in the PTEs spanning the superpage together with the page order.
At software TLB time we simply extract the single PTE from the PT based
on the faulting address and move it into the TLB. This of course works only
for software TLBs (PowerPC, MIPS, IA64). For HW TLB (x86) the PT structure
by definition overlaps the large page size support.
The HW TLB case can be extended to not store the same PA in all the PTEs,
but conceptually carry the superpage concept for the purpose described above.
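
A sketch of that PTE layout: every base PTE covering the superpage holds the superpage's base physical address plus its order in software bits, so the software TLB handler can lift any one of them into a single large TLB entry. The bit positions below are assumptions for illustration, not any architecture's real layout.

#include <linux/mm.h>

#define SP_ORDER_SHIFT	9			/* assumed software bits */
#define SP_ORDER_MASK	(0xfUL << SP_ORDER_SHIFT)

/* Fill all base-page slots of a superpage with the same translation. */
static void set_superpage_ptes(pte_t *ptep, unsigned long base_paddr,
			       unsigned int order, pgprot_t prot)
{
	unsigned long i;

	for (i = 0; i < (1UL << order); i++)
		ptep[i] = __pte(base_paddr | pgprot_val(prot) |
				((unsigned long)order << SP_ORDER_SHIFT));
}

/* Software TLB miss: any PTE in the range describes the whole mapping. */
static unsigned int pte_superpage_order(pte_t pte)
{
	return (pte_val(pte) & SP_ORDER_MASK) >> SP_ORDER_SHIFT;
}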

We have that concept exactly the way you want it, but the dress code
seems to be wrong. That can be worked on.
Our goal in the long run (2.7) was to explore the Rice approach to see
whether it yields benefits or whether we're going down the road of
fragmentation-reduction overhead that will kill all the benefits we get
from reduced TLB overhead. Time will tell.

But to go down this route we need the concept of a superpage in the VM,
not just at TLB time or a hack that throws these things over the fence.
Post by Linus Torvalds
And no, I do not want separate coloring support in the allocator. I think
coloring without superpage support is stupid and worthless (and
complicates the code for no good reason).
Linus
That <stupid> seems premature. You are mixing the concept of
superpage from a TLB miss reduction perspective
with the concept of superpage for page coloring.

In a low-associativity cache (<=4) you have a larger number of colors (~100s).
To be reasonably effective you need to provide this large
number of colors, which could be quite a waste of memory if you do it
only through super pages.
On the other hand, if you simply try to get a page from a targeted class X
you can solve this problem one page at a time. This still makes sense.
Lastly, you can bring these two approaches together by providing small
conceptual super pages (nothing, or not necessarily anything, to do with your
TLB at this point) and provide a smaller number of classes from which
superpages will be allocated. I hope you meant the latter one when
referring to <stupid>.
Either way, you need the concept of a superpage IMHO in the VM to
support all this stuff.

And we got just the right stuff for you :-).
Again the final dress code and capabilities are still up for discussion.

Bill Irwin and I are working on moving Simon's 2.4.18 patch up to 2.5.30.
We are cleaning up some of the stuff and making sure that the integration with
the latest radix tree and writeback functionality is proper.
There aren't that many major changes. We hope to have something for
review soon.

Cheers.
--
-- Hubertus Franke (***@watson.ibm.com)
William Lee Irwin III
2002-08-04 20:23:22 UTC
Permalink
Post by Hubertus Franke
As long as the alignments are observed, which I guess you imply by the range.
Post by Linus Torvalds
If the order-X allocation fails, we're likely low on memory (this is
_especially_ true since the very fact that we do lots of order-X
allocations will probably actually help keep fragmentation down
normally), and we just allocate one page (with a regular GFP_USER this
time).
Later on I can redo one of the various online defragmentation things
that went around last October or so if it would help with this.
Post by Hubertus Franke
Post by Linus Torvalds
Map in all pages.
- do the same for page_cache_readahead() (this, btw, is where radix trees
will kick some serious ass - we'd have had a hard time doing the "is
this range of order-X pages populated" efficiently with the old hashes).
Hey, we use the radix tree to track page cache mappings for large pages
particularly for this reason...
Proportion of radix tree populated beneath a given node can be computed
by means of traversals adding up ->count or by incrementally maintaining
a secondary counter for ancestors within the radix tree node. I can look
into this when I go over the path compression heuristics, which would
help the space consumption for access patterns fooling the current one.
Getting physical contiguity out of that is another matter, but the code
can be used for other things (e.g. exec()-time prefaulting) until that's
worked out, and it's not a focus or requirement of this code anyway.
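
A rough sketch of the first variant (summing ->count during a traversal); the node layout is copied here as an assumption about the 2.5-era radix tree, whose real definition is private to lib/radix-tree.c, so treat this purely as illustration.

#define RADIX_TREE_MAP_SHIFT	6
#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)

/* Assumed node layout, mirroring the 2.5-era lib/radix-tree.c internals. */
struct radix_tree_node {
	unsigned int	count;
	void		*slots[RADIX_TREE_MAP_SIZE];
};

/*
 * Estimate how many pages are populated beneath a node by summing child
 * counts; at height 1 the slots hold data pages rather than child nodes.
 */
static unsigned long radix_subtree_population(struct radix_tree_node *node,
					      unsigned int height)
{
	unsigned long total = 0;
	unsigned long i;

	if (height == 1)
		return node->count;

	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++)
		if (node->slots[i])
			total += radix_subtree_population(node->slots[i],
							  height - 1);
	return total;
}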
Post by Hubertus Franke
Post by Linus Torvalds
I bet just those fairly small changes will give you effective coloring,
_and_ they are also what you want for doing small superpages.
The HW TLB case can be extended to not store the same PA in all the PTEs,
but conceptually carry the superpage concept for the purpose described above.
Pagetable walking gets a tiny hook, not much interesting goes on there.
A specialized wrapper for extracting physical pfn's from the pmd's like
the one for testing whether they're terminal nodes might look more
polished, but that's mostly cosmetic.

Hmm, from looking at the "small" vs. "large" page bits, I have an
inkling this may be relative to the machine size. 256GB boxen will
probably think of 4MB pages as small.
Post by Hubertus Franke
But to go down this route we need the concept of a superpage in the VM,
not just at TLB time or a hack that throws these things over the fence.
The bit throwing it over the fence is probably still useful, as Oracle
knows what it's doing and I suspect it's largely to dodge pagetable
space consumption OOM'ing machines as opposed to optimizing anything.
It pretty much wants the kernel out of the way aside from as a big bag
of device drivers, so I'm not surprised they're more than happy to have
the MMU in their hands too. The more I think about it, the less related
to superpages it seems. The motive for superpages is 100% TLB, not a
workaround for pagetable OOM.


Cheers,
Bill
David Mosberger
2002-08-05 16:59:19 UTC
Permalink
Hubertus> Yes, if we (correctly) assume that page coloring only buys
Hubertus> you significant benefits for small associative caches
Hubertus> (e.g. <4 or <= 8).

This seems to be a popular misconception. Yes, page-coloring
obviously plays no role as long as your cache is no bigger than
PAGE_SIZE*ASSOCIATIVITY. IIRC, Xeon can have up to 1MB of cache and I
bet that it doesn't have a 1MB/4KB=256-way associative cache. Thus,
I'm quite confident that it's possible to observe significant
page-coloring effects even on a Xeon.

--david
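
A small worked example of the arithmetic behind this point: the number of page colors in a physically indexed cache is cache_size / (associativity * page_size), and a page's color is the corresponding low bits of its page frame number. The 1MB / 8-way numbers below are purely illustrative, not a claim about any particular Xeon.

#include <stdio.h>

int main(void)
{
	unsigned long cache_size = 1UL << 20;	/* 1MB cache */
	unsigned long assoc      = 8;		/* 8-way set associative */
	unsigned long page_size  = 4096;	/* 4KB base pages */

	/* colors = cache_size / (assoc * page_size) = 32 for these numbers */
	unsigned long colors = cache_size / (assoc * page_size);
	unsigned long pfn = 0x12345;		/* arbitrary page frame number */

	printf("colors = %lu\n", colors);
	printf("color of pfn 0x%lx = %lu\n", pfn, pfn & (colors - 1));
	return 0;
}

So even after dividing out the associativity, a large cache keeps dozens of colors, which is the point being made about the Xeon.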
Hubertus Franke
2002-08-05 17:21:15 UTC
Permalink
Post by David Mosberger
On Sun, 4 Aug 2002 15:30:24 -0400, Hubertus Franke
Hubertus> Yes, if we (correctly) assume that page coloring only buys
Hubertus> you significant benefits for small associative caches
Hubertus> (e.g. <4 or <= 8).
This seems to be a popular misconception. Yes, page-coloring
obviously plays no role as long as your cache no bigger than
PAGE_SIZE*ASSOCIATIVITY. IIRC, Xeon can have up to 1MB of cache and I
bet that it doesn't have a 1MB/4KB=256-way associative cache. Thus,
I'm quite confident that it's possible to observe significant
page-coloring effects even on a Xeon.
--david
The wording was "significant" benefits.
The point is/was that as your associativity goes up, the likelihood of
full cache occupancy increases, with cache thrashing in each class decreasing.
Would have to dig through the literature to figure out at what point
the benefits are insignificant (<1 %) wrt page coloring.

I am probably missing something in your argument?
How is the Xeon cache indexed (bits), what's the cache line size ?
My assumptions are as follows.

Consider the bits of an address under two decompositions:

< PG, PGOFS > with PG = <V, X> and PGOFS = <Y, Z>, i.e. < <V, X>, Y, Z >,
where Z is the offset within a cache line, and
<X, Y> is used to index the cache (that is not strictly required to be
contiguous, but apparently many archs do it that way).
Page coloring should guarantee that X remains the same in the virtual and the
physical address assigned to it.
As your associativity goes up, the number of rows (colors) in the cache comes
down!

We can take this offline as to not bother the rest, your call. Just interested
in flushing out the arguments.
--
-- Hubertus Franke (***@watson.ibm.com)
Jamie Lokier
2002-08-05 21:10:39 UTC
Permalink
The wording was "significant" benefits. The point is/was that as your
associativity goes up, the likelihood of full cache occupancy
increases, with cache thrashing in each class decreasing.
Would have to dig through the literature to figure out at what point
the benefits are insignificant (<1 %) wrt page coloring.
One of the benefits of page colouring may be that a program's run time
may be expected to vary less from run to run?

In the old days (6 years ago), I found that a video game I was working
on would vary in its peak frame rate by about 3-5% (I don't recall
exactly). Once the program was started, it would remain operating at
the peak frame rate it had selected, and killing and restarting the
program didn't often make a difference either. In DOS, the same program
always ran at a consistent frame rate (higher than Linux as it happens).
The actual number of objects executing in the program, and the amount of
memory allocated, were deterministic in these tests.

This is pointing at a cache colouring issue to me -- although quite
which cache I am not sure. I suppose it could have been something to do
with Linux' VM page scanner access patterns into the page array instead.

-- Jamie
Rik van Riel
2002-08-04 19:41:51 UTC
Permalink
Post by Linus Torvalds
Post by Hubertus Franke
As for the page coloring!
Can we tweak the buddy allocator to give us this additional functionality?
I would really prefer to avoid this, and get "95% coloring" by just doing
read-ahead with higher-order allocations instead of the current "loop
allocation of one block".
OK, now I'm really going to start on some code to try and free
physically contiguous pages when a higher-order allocation comes
in ;)

(well, after this hamradio rpm I started)

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
David S. Miller
2002-08-05 05:40:43 UTC
Permalink
From: Hubertus Franke <***@watson.ibm.com>
Date: Sun, 4 Aug 2002 13:31:24 -0400

Can we tweak the buddy allocator to give us this additional functionality?

Absolutely not, it's a total lose.

I have tried at least 5 times to make it work without fragmenting the
buddy lists to shit. I challenge you to code one up that works without
fragmenting things to shreds. Just run an endless kernel build over
and over in a loop for a few hours to a day. If the buddy lists are
not fragmented after these runs, then you have succeeded in my
challenge.

Do not even reply to this email without meeting the challenge as it
will fall on deaf ears. I've been there and I've done that, and at
this point code talks bullshit walks when it comes to trying to
colorize the buddy allocator in a way that actually works and isn't
disgusting.
Hubertus Franke
2002-08-03 18:41:29 UTC
Permalink
Post by David Mosberger
Post by David Mosberger
On Fri, 2 Aug 2002 21:26:52 -0700 (PDT), Linus Torvalds
I wasn't disagreeing with your case for separate large page
syscalls. Those syscalls certainly simplify implementation and,
as you point out, it well may be the case that a transparent
superpage scheme never will be able to replace the former.
Linus> Somebody already had patches for the transparent superpage
Linus> thing for alpha, which supports it. I remember seeing numbers
Linus> implying that helped noticeably.
Yes, I saw those. I still like the Rice work a _lot_ better. It's
just a thing of beauty, from a design point of view (disclaimer: I
haven't seen the implementation, so there may be ugly things
lurking...).
I agree, the Rice solution is elegant in the promotion and demotion.
Post by David Mosberger
Linus> But yes, that definitely doesn't work for humongous pages (or
Linus> whatever we should call the multi-megabyte-special-case-thing
Linus> ;).
Yes, you're probably right. 2MB was reported to be fine in the Rice
experiments, but I doubt 256MB (and much less 4GB, as supported by
some CPUs) would fly.
--david
As for the page coloring, it certainly helps.
But I'd like to point out that superpages are there to reduce the number of
TLB misses by providing larger coverage. Simply providing page coloring
will not get you there.
--
-- Hubertus Franke (***@watson.ibm.com)
David Mosberger
2002-08-03 19:41:33 UTC
Permalink
Hubertus> But I'd like to point out that superpages are there to
Hubertus> reduce the number of TLB misses by providing larger
Hubertus> coverage. Simply providing page coloring will not get you
Hubertus> there.

Yes, I agree.

It appears that Juan Navarro, the primary author behind the Rice
project, is working on breaking down the superpage benefits they
observed. That would tell us how much benefit is due to page-coloring
and how much is due to TLB effects. Here in our lab, we do have some
(weak) empirical evidence that some of the SPECint benchmarks benefit
primarily from page-coloring, but clearly there are others that are
TLB limited.

--david
Hubertus Franke
2002-08-03 20:53:39 UTC
Permalink
Post by David Mosberger
On Sat, 3 Aug 2002 14:41:29 -0400, Hubertus Franke
Hubertus> But I'd like to point out that superpages are there to
Hubertus> reduce the number of TLB misses by providing larger
Hubertus> coverage. Simply providing page coloring will not get you
Hubertus> there.
Yes, I agree.
It appears that Juan Navarro, the primary author behind the Rice
project, is working on breaking down the superpage benefits they
observed. That would tell us how much benefit is due to page-coloring
and how much is due to TLB effects. Here in our lab, we do have some
(weak) empirical evidence that some of the SPECint benchmarks benefit
primarily from page-coloring, but clearly there are others that are
TLB limited.
--david
Cool.
Does that mean that BSD already has page coloring implemented ?

The agony is:
Page Coloring helps to reduce cache conflicts in low associative caches
while large pages may reduce TLB overhead.

One shouldn't rule out one for the other, there is a place for both.

How did you arrive at the (weak) empirical evidence?
You checked TLB misses and cache misses and turned
page coloring on and off and large pages on and off?
--
-- Hubertus Franke (***@watson.ibm.com)
David Mosberger
2002-08-03 21:26:27 UTC
Permalink
Hubertus> Cool. Does that mean that BSD already has page coloring
Hubertus> implemented ?

FreeBSD (at least on Alpha) makes some attempts at page-coloring, but
it's said to be far from perfect.

Hubertus> The agony is: Page Coloring helps to reduce cache
Hubertus> conflicts in low associative caches while large pages may
Hubertus> reduce TLB overhead.

Why agony? The latter helps the TLB _and_ solves the page coloring
problem (assuming the largest page size is bigger than the largest
cache; yeah, I see that could be a problem on some Power 4
machines... ;-)

Hubertus> One shouldn't rule out one for the other, there is a place
Hubertus> for both.

Hubertus> How did you arrive to the (weak) empirical evidence? You
Hubertus> checked TLB misses and cache misses and turned page
Hubertus> coloring on and off and large pages on and off?

Yes, that's basically what we did (there is a patch implementing a
page coloring kernel module floating around).

--david
Hubertus Franke
2002-08-03 21:50:03 UTC
Permalink
Post by David Mosberger
On Sat, 3 Aug 2002 16:53:39 -0400, Hubertus Franke
Hubertus> Cool. Does that mean that BSD already has page coloring
Hubertus> implemented ?
FreeBSD (at least on Alpha) makes some attempts at page-coloring, but
it's said to be far from perfect.
Hubertus> The agony is: Page Coloring helps to reduce cache
Hubertus> conflicts in low associative caches while large pages may
Hubertus> reduce TLB overhead.
Why agony? The latter helps the TLB _and_ solves the page coloring
problem (assuming the largest page size is bigger than the largest
cache; yeah, I see that could be a problem on some Power 4
machines... ;-)
In essence, remember that page coloring preserves the same bits used
for cache indexing from virtual to physical. If these bits are covered
by the large page, then of course you will get page coloring for free;
otherwise you won't.
Also, page coloring is mainly helpful in low-associativity caches.
From my recollection of the literature, for 4-way or higher it's not
worth the trouble.

Just to rephrase:
- Large pages almost always solve your page coloring problem.
- Page coloring never solves your TLB coverage problem.
Post by David Mosberger
Hubertus> One shouldn't rule out one for the other, there is a place
Hubertus> for both.
Hubertus> How did you arrive to the (weak) empirical evidence? You
Hubertus> checked TLB misses and cache misses and turned page
Hubertus> coloring on and off and large pages on and off?
Yes, that's basically what we did (there is a patch implementing a
page coloring kernel module floating around).
--david
--
-- Hubertus Franke (***@watson.ibm.com)
David S. Miller
2002-08-04 00:34:02 UTC
Permalink
From: Hubertus Franke <***@watson.ibm.com>
Date: Sat, 3 Aug 2002 16:53:39 -0400

Does that mean that BSD already has page coloring implemented ?

FreeBSD has had page coloring for quite some time.

Because they don't use buddy lists and don't allow higher-order
allocations fundamentally in the page allocator, they don't have
to deal with all the buddy fragmentation issues we do.

On the other hand, since higher-order page allocations are not
a fundamental operation it might be more difficult for FreeBSD
to implement superpage support efficiently like we can with
the buddy lists.
David S. Miller
2002-08-04 00:31:11 UTC
Permalink
From: David Mosberger <***@napali.hpl.hp.com>
Date: Sat, 3 Aug 2002 12:41:33 -0700

It appears that Juan Navarro, the primary author behind the Rice
project, is working on breaking down the superpage benefits they
observed. That would tell us how much benefit is due to page-coloring
and how much is due to TLB effects. Here in our lab, we do have some
(weak) empirical evidence that some of the SPECint benchmarks benefit
primarily from page-coloring, but clearly there are others that are
TLB limited.

There was some comparison done between large-page vs. plain
page coloring for a bunch of scientific number crunchers.

Only one benefitted from page coloring and not from TLB
superpage use.

The ones that benefitted from both coloring and superpages, the
superpage gain was about equal to the coloring gain. Basically,
superpages ended up giving the necessary coloring :-)

Search for the topic "Areas for superpage discussion" in the
***@vger.kernel.org list archives, it has pointers to
all the patches and test programs involved.
Hubertus Franke
2002-08-04 17:25:42 UTC
Permalink
Post by David S. Miller
Date: Sat, 3 Aug 2002 12:41:33 -0700
It appears that Juan Navarro, the primary author behind the Rice
project, is working on breaking down the superpage benefits they
observed. That would tell us how much benefit is due to page-coloring
and how much is due to TLB effects. Here in our lab, we do have some
(weak) empirical evidence that some of the SPECint benchmarks benefit
primarily from page-coloring, but clearly there are others that are
TLB limited.
There was some comparison done between large-page vs. plain
page coloring for a bunch of scientific number crunchers.
Only one benefitted from page coloring and not from TLB
superpage use.
I would expect that from scientific apps, which often go through their
dataset in a fairly regular pattern. If sequential, then page coloring
is at its best, because your cache can become the limiting factor if
you can't squeeze data into the cache due to false sharing in the same
cache class.

The way I see page coloring is that any hard work done in virtual space
(either by the compiler or by the app writer [the latter holds for numerical apps])
to be cache friendly is not circumvented by a <stupid> physical page
assignment by the OS that leads to less than complete cache utilization.
That's why the cache index bits from the address are carried over or
are kept the same in virtual and physical address. That's the purpose of
page coloring.....

This regular access pattern is not necessarily true in apps like a JVM or other
object-oriented code where data accesses can be less predictable. There page
coloring might not help you at all.
Post by David S. Miller
The ones that benefitted from both coloring and superpages, the
superpage gain was about equal to the coloring gain. Basically,
superpages ended up giving the necessary coloring :-)
Search for the topic "Areas for superpage discussion" in the
all the patches and test programs involved.
--
-- Hubertus Franke (***@watson.ibm.com)
Linus Torvalds
2002-08-03 19:39:40 UTC
Permalink
Post by Hubertus Franke
But I'd like to point out that superpages are there to reduce the number of
TLB misses by providing larger coverage. Simply providing page coloring
will not get you there.
Superpages can from a memory allocation angle be seen as a very strict
form of page coloring - the problems are fairly closely related, I think
(superpages are just a lot stricter, in that it's not enough to get "any
page of color X", you have to get just the _right_ page).

Doing superpages will automatically do coloring (while the reverse is
obviously not true). And the way David did coloring a long time ago (if
I remember his implementation correctly) was the same way you'd do
superpages: just do higher order allocations.

Linus
David S. Miller
2002-08-04 00:32:04 UTC
Permalink
From: Linus Torvalds <***@transmeta.com>
Date: Sat, 3 Aug 2002 12:39:40 -0700 (PDT)

And the way David did coloring a long time ago (if
I remember his implementation correctly) was the same way you'd do
superpages: just do higher order allocations.

Although it wasn't my implementation which did this,
one of them did do it this way. I agree that it is
the nicest way to do coloring.
Andi Kleen
2002-08-04 20:20:16 UTC
Permalink
Post by Andrew Morton
If we instead clear out 4 or 8 pages, we trash a ton of cache and
the chances of userspace _using_ pages 1-7 in the short-term are
lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
but the cache implications of faultahead are still there.
What you could do on modern x86 and probably most other architectures as
well is to clear the faulted page in cache and clear the other pages
with a non temporal write. The non temporal write will go straight
to main memory and not pollute any caches.

When the process accesses it later it has to fetch the zeroes from
main memory. This is probably still faster than a page fault at least
for the first few accesses. It could be more costly when walking the full
page (then the added up cache miss costs could exceed the page fault cost),
but then hopefully the CPU will help by doing hardware prefetch.

It could help or not help, may be worth a try at least :-)

-Andi
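
A userspace-style sketch of the idea using SSE2 non-temporal stores; clear_page_nocache() is a hypothetical name, and a kernel version would use inline assembly (movnti/movntdq) plus the usual FPU/SSE save conventions instead of intrinsics. The page is assumed 16-byte aligned, which page-aligned memory always is.

#include <emmintrin.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Clear one page with stores that bypass the cache entirely. */
static void clear_page_nocache(void *page)
{
	__m128i zero = _mm_setzero_si128();
	__m128i *p = (__m128i *)page;
	size_t i;

	for (i = 0; i < PAGE_SIZE / sizeof(__m128i); i++)
		_mm_stream_si128(&p[i], zero);	/* non-temporal store */

	_mm_sfence();	/* order the streamed stores before the page is used */
}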
Eric W. Biederman
2002-08-04 23:51:51 UTC
Permalink
Post by Andi Kleen
Post by Andrew Morton
If we instead clear out 4 or 8 pages, we trash a ton of cache and
the chances of userspace _using_ pages 1-7 in the short-term are
lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
but the cache implications of faultahead are still there.
What you could do on modern x86 and probably most other architectures as
well is to clear the faulted page in cache and clear the other pages
with a non temporal write. The non temporal write will go straight
to main memory and not pollute any caches.
Plus a non temporal write is 3x faster than a write that lands in
the cache on x86 (tested on Athlons, P4, & P3).
Post by Andi Kleen
When the process accesses it later it has to fetch the zeroes from
main memory. This is probably still faster than a page fault at least
for the first few accesses. It could be more costly when walking the full
page (then the added up cache miss costs could exceed the page fault cost),
but then hopefully the CPU will help by doing hardware prefetch.
It could help or not help, may be worth a try at least :-)
Certainly.

Eric
Seth, Rohit
2002-08-05 23:30:54 UTC
Permalink
-----Original Message-----
Sent: Sunday, August 04, 2002 12:30 PM
To: Linus Torvalds
Subject: Re: large page patch (fwd) (fwd)
Well, in what you described above there is no concept of superpages
the way it is defined for the purpose of <tracking> and <TLB overhead
reduction>.
If you don't know about super pages at the VM level, then you need to
deal with them at TLB fault level to actually create the <large TLB>
entry. That's what the INTC patch will do, namely throwing all the
complexity over the fence to the page fault handler.
Our patch does the preallocation of large pages at the time of request.
There is really nothing special, like replicating PTEs (which you mention
below in your design), happening there. In any case, even for IA-64, where the
TLBs are also sw controlled (we also have a Hardware Page Walker that can walk
any 3rd-level pt and insert the PTE in the TLB), there are almost no changes (to
be precise, one additional asm instruction in the beginning of the handler for
shifting extra bits) in our implementation that pollute the low-level TLB
fault handlers to have the knowledge of large page size in traversing the
3-level page table. (Though there are a couple of other asm instructions that
are added in this low-level routine to set a helping register with the proper
page_size while inserting bigger TLBs.) On IA-32 obviously things fall into
place automagically as the page tables are set up as per the arch.
In your case, not keeping track of the super pages in the
VM layer and PT layer requires discovering the large page at soft TLB
time by scanning the PT proximity for contiguous pages if we are
talking now
about the read_ahead ....
In our case, we store the same physical address of the super page
in the PTEs spanning the superpage together with the page order.
At software TLB time we simply extract the single PTE from the PT based
on the faulting address and move it into the TLB. This
of course works only
for software TLBs (PowerPC, MIPS, IA64). For HW TLB (x86) the
PT structure
by definition overlaps the large page size support.
The HW TLB case can be extended to not store the same PA in
all the PTEs,
but conceptually carry the superpage concept for the purpose
described above.
I'm afraid you may be wasting a lot of extra memory by replicating these
PTEs (take the example of one 4G large TLB size entry and assume there are a few
hundred processes using that same physical page).
We have that concept exactly the way you want it, but the dress code
seems to be wrong. That can be worked on.
Our goal was in the long run 2.7 to explore the Rice approach to see
whether it yields benefits or whether we getting down the road of
fragmentation reduction overhead that will kill all the
benefits we get
from reduced TLB overhead. Time would tell.
But to go down this route we need the concept of a superpage
in the VM,
not just at TLB time or a hack that throws these things over
the fence.
As others have already said, you may want to have the support of smaller
superpages done this way, where the VM is embedded with some knowledge of the
different page sizes that it can support. Demoting and promoting pages from
one size to another (efficiently) will be very critical in the design. In my
opinion, supporting the largest TLB sizes on archs (like 256M or 4G) will need a
more direct approach, and less intrusion from the kernel VM will be preferred.
Of course, the kernel will need to put in extra checks etc. to maintain some sanity
for allowed users.

There has already been a lot of discussion on this mailing list about what is
the right approach: whether new APIs are needed or something like
madvise would do it, whether the kernel needs to allocate large_pages
transparently to the user or we should expose the underlying HW feature to
user land. There are issues that favor one approach over another. But the
bottom line is: 1) we should not break anything semantically for regular
system calls that happen to be using large TLBs, and 2) the performance
advantage of this HW feature (on most of the archs, I hope) is just too much
to let go without notice. I hope we get to a consensus for getting this
support into the kernel ASAP. This will benefit a lot of Linux users. (And yes, I
understand that we need to do things right in the kernel so that we don't make
unforeseen errors.)
Post by Linus Torvalds
And no, I do not want separate coloring support in the
allocator. I think
Post by Linus Torvalds
coloring without superpage support is stupid and worthless (and
complicates the code for no good reason).
Linus
That <stupid> seems premature. You are mixing the concept of
superpage from a TLB miss reduction perspective
with the concept of superpage for page coloring.
I have seen a couple of HPC apps that try to fit (configure) their data
sets to the L3 cache size (like 4M on IA-64). I think these are the apps
that really get hit hardest by the lack of proper page coloring support in the Linux
kernel. The performance variation of these workloads from run to run could
be as much as 60%. And with the page coloring patch, these apps seem to be
giving consistently higher throughput. (The really bad part is that once the
throughput of these workloads drops, it stays down thereafter :( ) But it seems
like DavidM has enough real-world data that prohibits the use of this
approach in the kernel for real-world scenarios. The good part of large TLBs is
that TLBs larger than the CPU cache size will automatically get you perfect
page coloring ......... for free.

rohit
David Mosberger
2002-08-06 05:01:16 UTC
Permalink
Rohit> I'm afraid you may be wasting a lot of extra memory by
Rohit> replicating these PTEs (take the example of one 4G large TLB
Rohit> size entry and assume there are a few hundred processes using
Rohit> that same physical page.)

In my opinion, this is perhaps the strongest argument *for* a separate
"giant page" syscall interface. It will be very hard (perhaps
impossible) to optimize superpages to work efficiently when the ratio
of superpage/basepage grows huge (as, by definition, the kernel would
manage them as a set of basepages). For example, even if we used a
base page-size of 64KB, a 4GB giant page (as supported by Itanium 2)
would correspond to 65536 base pages. A superpage of this size would
almost certainly still do a lot better than 65536 base pages, but
compared to a single giant page, it probably stands no chance
performance-wise.

--david
David S. Miller
2002-08-06 04:58:17 UTC
Permalink
From: David Mosberger <***@napali.hpl.hp.com>
Date: Mon, 5 Aug 2002 22:01:16 -0700

In my opinion, this is perhaps the strongest argument *for* a separate
"giant page" syscall interface. It will be very hard (perhaps
impossible) to optimize superpages to work efficiently when the ratio
of superpage/basepage grows huge (as, by definition, the kernel would
manage them as a set of basepages).

Actually, this is one of the reasons there was a lot of research into
using sub-page clustering for large mappings in the TLB. Basically
how this worked is that for a superpage, you could stick multiple
sub-mappings into the entry such that you didn't need a fully
physically contiguous superpage.

It's talked about in one of the Talluri papers.
David Mosberger
2002-08-06 05:19:24 UTC
Permalink
Post by David S. Miller
Mon, 5 Aug 2002 22:01:16 -0700
In my opinion, this is perhaps the strongest argument
*for* a separate "giant page" syscall interface. It will be
very hard (perhaps impossible) to optimize superpages to work
efficiently when the ratio of superpage/basepage grows huge
(as, by definition, the kernel would manage them as a set of
basepages).
DaveM> Actually, this is one of the reasons there was a lot of
DaveM> research into using sub-page clustering for large mappings in
DaveM> the TLB. Basically how this worked is that for a superpage,
DaveM> you could stick multiple sub-mappings into the entry such
DaveM> that you didn't need a fully physically contiguous superpage.

Sounds great if you have the hardware that can do it. Not too many
CPUs I know of support it.

--david
David S. Miller
2002-08-06 05:08:36 UTC
Permalink
From: David Mosberger <***@napali.hpl.hp.com>
Date: Mon, 5 Aug 2002 22:19:24 -0700

Sounds great if you have the hardware that can do it. Not too many
CPUs I know of support it.

Of course, and the fact that nobody has put it into silicon may
be a suggestion of how useful the feature really is :-)
David Mosberger
2002-08-06 05:32:09 UTC
Permalink
Post by David S. Miller
Mon, 5 Aug 2002 22:19:24 -0700
Sounds great if you have the hardware that can do it. Not
too many CPUs I know of support it.
DaveM> Of course, and the fact that nobody has put it into silicon
DaveM> may be a suggestion of how useful the feature really is :-)

My thought exactly! ;-)

--david
Hubertus Franke
2002-08-06 19:11:33 UTC
Permalink
Post by Seth, Rohit
-----Original Message-----
Sent: Sunday, August 04, 2002 12:30 PM
To: Linus Torvalds
Subject: Re: large page patch (fwd) (fwd)
Well, in what you described above there is no concept of superpages
the way it is defined for the purpose of <tracking> and <TLB overhead
reduction>.
If you don't know about super pages at the VM level, then you need to
deal with them at TLB fault level to actually create the <large TLB>
entry. That's what the INTC patch will do, namely throwing all the
complexity over the fence to the page fault handler.
Our patch does the preallocation of large pages at the time of request.
There is really nothing special, like replicating PTEs (which you mention
below in your design), happening there. In any case, even for IA-64, where the
TLBs are also sw controlled (we also have a Hardware Page Walker that can
walk any 3rd-level pt and insert the PTE in the TLB), there are almost no
changes (to be precise, one additional asm instruction in the beginning of the
handler for shifting extra bits) in our implementation that pollute the
low-level TLB fault handlers to have the knowledge of large page size in
traversing the 3-level page table. (Though there are a couple of other asm
instructions that are added in this low-level routine to set a helping
register with the proper page_size while inserting bigger TLBs.) On IA-32
obviously things fall into place automagically as the page tables are set up
as per the arch.
Hi, it's quite apparent from your answer that I didn't make our approach and
intent clear.
I don't mean to discredit your approach in any way; it's quick, it's special
purpose and does what it does, namely anonymous memory for large pages
supported by a particular architecture on specific request, in an efficient
manner with absolutely no overhead on the base kernel. Agreed that
the interface is up for negotiation, but that has nothing to do with the
essence; that is aesthetics and other API arguments.

Our intent has been, if you followed our presentation at OLS or on the web,
to build multiple page size support that spans (i) mmap'ed files, (ii) shm
segments and (iii) anonymous files and memory, hence covers the page cache
and eventually the swap system as well. The target would really be something like
the Rice BSD paper, with automatic promotion/demotion and reservation.
Of course it's up for discussion whether that is any good, or just overkill and
useless given that important apps could simply use a special-purpose
interface and force the issue.

We are nowhere close to there. More analysis is required to even establish
that fragmentation, a large-page-aware defragmenter, etc. won't kill any TLB
overhead performance gains seen. The Rice paper is one reference point.
But it's important to point out that this is what the OS research community
wants to see.
This can NOT be accomplished in one big patch; several intermediate steps are
required.
(a) The first one is to demonstrate that large pages for anonymous memory
(shared through fork and non-shared) can be integrated effortlessly into the
current VM
code with no overhead and almost no major kludges or code mess-ups.....
Doing so would give the benefit to every architecture. In essence what needs
to be provided is a few low-level macros to force the page order into the
PTE/PMD entry.
(b) The second one is to extend the concept to the I/O side to be able to back
regions with files.

We are close to being done with (a), having pulled the stuff out of Simon's
patch for 2.4.18 and moved it up to 2.5.30. We will retrofit (b) later.

The next confusion is that of the definition of a large page and a super page.
I am guilty too of mixing these up every now and then.
SuperPage: a cluster of contiguous physical pages that is treated by the OS
as a single entity. Operations are on superpages, including
tracking of Dirty, referenced, active .... although
they might be broken down into smaller operations on their
base pages.
Large Page: A superpage that coincides with a page size supported by the HW.

In your case you are clearly dealing with <LargePages> as a special memory
area that is intercepted at fault time and specially dealt with.
Our case essentially supports the super page concept as defined above.
Not all is implemented and the x86 prototype was focused on the
SuperPage=LargePage issue.
Continuation below ....
Post by Seth, Rohit
In your case, not keeping track of the super pages in the
VM layer and PT layer requires discovering the large page at soft TLB
time by scanning the PT proximity for contiguous pages if we are
talking now
about the read_ahead ....
In our case, we store the same physical address of the super page
in the PTEs spanning the superpage together with the page order.
At software TLB time we simply extract the single PTE from the PT based
on the faulting address and move it into the TLB. This
of course works only
for software TLBs (PowerPC, MIPS, IA64). For HW TLB (x86) the
PT structure
by definition overlaps the large page size support.
The HW TLB case can be extended to not store the same PA in
all the PTEs,
but conceptually carry the superpage concept for the purpose
described above.
I'm afraid you may be wasting a lot of extra memory by replicaitng these
PTEs(Take an example of one 4G large TLB size entry and assume there are
few hunderd processes using that same physical page.)
4GB TLB entry size ???
I assume you mean 4MB TLB entry size or did I fall into a coma for 10 years
:-)

Well, the way it is architected is that all translations (including
superpages) are stored in the PT. Should a superpage coincide with
a PMD, we do not allocate the lowest-level table of PTEs (covering
1<<PMD_SHIFT bytes) and record that fact in the PMD.
In case the superpage is smaller than a PMD (e.g. 4MB), entries need
to be created in the PTEs. One can dream up optimizations so that only
a single entry needs to be created, but that only works for a SW-TLB.
For a HW-TLB (x86) there is no choice but to create them all.
For a SW-TLB, this probably works the same as your stuff (I have only seen
the HW-TLB x86 port). You store this one entry in the PT at the base-page
translation of the superpage and walk it at page fault time. This can be done
equally well in architecture-independent code as far as I can tell, with the
same tricks you are mentioning above dressed up in architecture-dependent code.
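
To make that concrete, here is a toy, self-contained sketch of the scheme just
described (structure names, the PTE encoding and the sizes are invented for
illustration, not the actual code):

/* Toy sketch: a superpage of 2^order base pages at physical address 'pa'
 * is entered either as one large PMD entry (when it covers exactly a PMD)
 * or by replicating the same physical address plus the order into every
 * base PTE it spans.  The caller is assumed to have allocated the PTE page.
 */
#include <stddef.h>

#define PAGE_SHIFT   12
#define PMD_SHIFT    22                           /* 4MB per PMD, x86 no-PAE */
#define PTRS_PER_PTE (1 << (PMD_SHIFT - PAGE_SHIFT))

typedef unsigned long pte_t;
typedef struct { pte_t *pte_page; pte_t huge; } pmd_t;

/* toy encoding: aligned physical address with the order in the low bits */
static pte_t make_pte(unsigned long pa, unsigned int order)
{
        return pa | order;
}

static void map_superpage(pmd_t *pmd, size_t first,
                          unsigned long pa, unsigned int order)
{
        if (order == PMD_SHIFT - PAGE_SHIFT) {
                /* Coincides with a PMD: no lowest-level table is allocated,
                 * the fact is recorded in the PMD itself. */
                pmd->huge = make_pte(pa, order);
                pmd->pte_page = NULL;
                return;
        }
        /* Smaller than a PMD: replicate the same physical address plus the
         * order into every base PTE the superpage spans. */
        for (size_t i = 0; i < (1UL << order); i++)
                pmd->pte_page[first + i] = make_pte(pa, order);
}

/* Software-TLB style refill: any single PTE inside the superpage carries
 * enough information (address + order) to load one wide translation. */
static pte_t lookup(pmd_t *pmd, unsigned long vaddr)
{
        if (!pmd->pte_page)
                return pmd->huge;
        return pmd->pte_page[(vaddr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)];
}

A hardware walker (x86) consumes the replicated entries directly; a software
refill handler only needs the one PTE at the faulting address.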
Post by Seth, Rohit
We have that concept exactly the way you want it, but the dress code
seems to be wrong. That can be worked on.
Our goal in the long run (2.7) was to explore the Rice approach, to see
whether it yields benefits or whether we go down the road of
fragmentation-reduction overhead that kills all the benefits we get
from reduced TLB overhead. Time will tell.
But to go down this route we need the concept of a superpage in the VM,
not just at TLB time, or a hack that throws these things over the fence.
As others have already said, you may want to have support for smaller
superpages in this way, where the VM is embedded with some knowledge of the
different page sizes that it can support. Demoting and promoting pages
from one size to another (efficiently) will be very critical in the design.
In my opinion, supporting the largest TLB sizes on some archs (like 256M or
4G) will need a more direct approach, and less intrusion from the kernel VM
will be preferred. Of course, the kernel will need to put in extra checks etc.
to maintain some sanity for allowed users.
Yipp, that's the goal. I just think that, in general, coverage of up to
decently large TLB entry sizes (4MB) can be handled with our approach, and it
would essentially cover the file mappings and general shm segments.
Post by Seth, Rohit
There has already been a lot of discussion on this mailing list about what is
the right approach: whether new APIs are needed or something like madvise
would do it, whether the kernel needs to allocate large pages transparently
for the user or we should expose the underlying HW feature to user land.
There are issues that favor one approach over another. But the bottom line
is: 1) we should not break anything semantically for regular system calls
that happen to be using large TLBs, and 2) the performance advantage of this
HW feature (on most of the archs, I hope) is just too big to let go
unnoticed. I hope we reach a consensus for getting this support into the
kernel ASAP. This will benefit a lot of Linux users. (And yes, I understand
that we need to do things right in the kernel so that we don't make
unforeseen errors.)
Absolutely, well phrased. The "+"s and "-"s of each approach have been
pointed out by many folks and papers. Your stuff certainly provides a
short-term solution that works. After OLS I was hoping to work from it toward
a long-term solution along the lines of the Rice BSD approach.
Post by Seth, Rohit
Post by Linus Torvalds
And no, I do not want separate coloring support in the
allocator. I think
Post by Linus Torvalds
coloring without superpage support is stupid and worthless (and
complicates the code for no good reason).
Linus
That <stupid> seems premature. You are mixing the concept of a superpage
from a TLB-miss-reduction perspective
with the concept of a superpage for page coloring.
I have seen a couple of HPC apps that try to fit (configure) their data sets
to the L3 cache size (like 4M on IA-64). I think these are the apps that get
hit hardest by the lack of proper page-coloring support in the Linux kernel.
The performance variation of these workloads from run to run could be as much
as 60%. And with the page-coloring patch, these apps seem to give consistently
higher throughput. (The really bad part is that once the throughput of these
workloads drops, it stays down thereafter :( ) But it seems DavidM has enough
real-world data arguing against the use of this approach in the kernel for
real-world scenarios. The good part of large TLB pages is that pages larger
than the CPU cache size will automatically get you perfect page coloring
... for free.
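A back-of-the-envelope illustration of that last point (cache size,
associativity and page sizes are made-up example numbers):

/* "Color" = which group of cache sets a page can map to.  With
 * colors = cache_size / (associativity * page_size), a page at least as
 * large as cache_size/associativity has exactly one color, i.e. perfect
 * coloring with no allocator effort.  All numbers here are made up.
 */
#include <stdio.h>

int main(void)
{
        unsigned long cache_size   = 4UL << 20;      /* 4MB L3 (assumed) */
        unsigned long assoc        = 4;              /* 4-way (assumed)  */
        unsigned long page_size[3] = { 4UL << 10, 64UL << 10, 4UL << 20 };

        for (int i = 0; i < 3; i++) {
                unsigned long colors = cache_size / (assoc * page_size[i]);
                if (colors == 0)
                        colors = 1;                  /* page covers every set */
                printf("%7lu KB pages -> %3lu color(s)\n",
                       page_size[i] >> 10, colors);
        }
        return 0;                    /* 4K: 256 colors, 64K: 16, 4M: 1 */
}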
rohit
Yipp....
let's see how it evolves.
As DavidM so elegantly stated: talk=BS walk=code :-)

Cheers....
--
-- Hubertus Franke (***@watson.ibm.com)
Luck, Tony
2002-08-06 20:38:09 UTC
Permalink
Post by Hubertus Franke
Post by Hubertus Franke
4GB TLB entry size ???
I assume you mean 4MB TLB entry size or did I fall
into a coma for 10 years
That wasn't a typo ... Itanium2 supports page sizes up
to 4 Gigabytes. Databases (well, Oracle for sure) want
to use those huge TLB entries to map their multi-gigabyte
shared memory areas.

-Tony
Hubertus Franke
2002-08-06 21:03:24 UTC
Permalink
Post by Luck, Tony
Post by Hubertus Franke
Post by Hubertus Franke
4GB TLB entry size ???
I assume you mean 4MB TLB entry size or did I fall
into a coma for 10 years
That wasn't a typo ... Itanium2 supports page sizes up
to 4 Gigabytes. Databases (well, Oracle for sure) want
to use those huge TLB entries to map their multi-gigabyte
shared memory areas.
-Tony
Whooooowww... Power4 I believe opted out at 16MB.
So the story about sleeping beauty is true :-).

Wouldn't want to manage the 8-32 physical pages of that size through the VM.
Paging is not an option, and file access is irrelevant.

In that case I agree that it should be handled by a special-purpose
extension like Seth's patch to cover a 4GB page.

Up to 4MB or so I still believe going the other way is proper.
More later... thanks for the info.
--
-- Hubertus Franke (***@watson.ibm.com)
Seth, Rohit
2002-08-09 17:51:55 UTC
Permalink
-----Original Message-----
Sent: Friday, August 09, 2002 10:12 AM
To: Linus Torvalds
Mosberger; David S.
Subject: Re: large page patch (fwd) (fwd)
Post by Linus Torvalds
Post by Daniel Phillips
Slab allocations would not have GFP_DEFRAG (I mistakenly wrote GFP_LARGE
earlier) and so would be allocated outside ZONE_LARGE.
.. at which point you then get zone balancing problems.
Or we end up with the same kind of special zone that we have _anyway_ in
the current large-page patch, in which case the point of doing this is what?
The current large-page patch doesn't have any kind of defragmentation in the
special zone, and that memory is just not available for other uses. The thing
is, when demand for large pages is low the zone should be allowed to fragment.
You are right that as long as the pages are in the large page pool they are
not available for other regular purposes. However, the current implementation
basically allows on-demand moving of pages between the large_page and regular
pools using a sysctl interface. The issue is really not forced (in the sense
that large pages are freed only if they are available, and vice versa), and
it will not be an issue where demand for large pages is low.
Theoretically you could extend this support to the pageout daemon, to find
out whether it can retrieve some free large pages (for environments where the
expectation is that most of the memory will be used for large pages but
actual usage does not match the expectation. I doubt those environments will
occur, but bad configurations are always there). The current approach really
allows the large page/regular page movement without doing too much
housecleaning. It is likely that once a large page goes back to the general
pool, it will not be easy to replenish the large_page pool because of
fragmentation in the regular memory pool (on memory-starved machines; for
scenarios where the machine is sometimes running low on regular memory and
sometimes on large pages, it would probably be a good idea to just add more
RAM in these cases).
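Roughly, the resizing behind that sysctl amounts to something like the sketch
below (names and helpers are hypothetical stand-ins, not the actual patch
code; the point is only that the pool grows or shrinks opportunistically and
nothing is forced when the allocator is fragmented):

/* Toy sketch of the on-demand pool resizing described above.  All names
 * are hypothetical stand-ins (malloc plays the buddy allocator); in the
 * kernel this would sit behind the sysctl handler.  Growth stops quietly
 * when no contiguous chunk can be found -- nothing is forced.
 */
#include <stdlib.h>

#define HUGETLB_PAGE_ORDER 10            /* 4MB with 4K base pages (assumed) */
#define BASE_PAGE_SIZE     4096UL
#define POOL_MAX           1024

static void *pool[POOL_MAX];             /* free list of reserved huge pages */
static unsigned long pool_count;

static void *alloc_contiguous(unsigned int order)
{
        return malloc(BASE_PAGE_SIZE << order);   /* may fail, like the buddy */
}

static void free_contiguous(void *chunk)
{
        free(chunk);                     /* memory goes back to regular use */
}

/* Called when the admin writes a new target pool size through the sysctl. */
static void set_huge_pool_size(unsigned long target)
{
        while (pool_count < target) {
                void *chunk = alloc_contiguous(HUGETLB_PAGE_ORDER);
                if (!chunk || pool_count >= POOL_MAX) {
                        free(chunk);     /* fragmentation: grow only as far as we can */
                        break;
                }
                pool[pool_count++] = chunk;
        }
        while (pool_count > target)      /* shrink: release unused huge pages */
                free_contiguous(pool[--pool_count]);
}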
Linus Torvalds
2002-08-11 22:56:10 UTC
Permalink
If somebody sues you, you change the algorithm or you just hire a
hit-man to whack the stupid git.
Btw, I'm not a lawyer, and I suspect this may not be legally tenable
advice. Whatever. I refuse to bother with the crap.

Linus
Alan Cox
2002-08-12 00:46:19 UTC
Permalink
Post by Linus Torvalds
If somebody sues you, you change the algorithm or you just hire a
hit-man to whack the stupid git.
Btw, I'm not a lawyer, and I suspect this may not be legally tenable
advice. Whatever. I refuse to bother with the crap.
In which case you might as well do the rest of the world a favour and
restrict US usage of Linux in the license file while you are at it.
Unfortunately the USA forces people to deal with this crap. I'd hope SGI
would be decent enough to explicitly state they will license this stuff
freely for GPL use (although having shipped Linux themselves the
question is partly moot as the GPL says they can't impose additional
restrictions)

Alan
Daniel Phillips
2002-08-11 23:44:07 UTC
Permalink
Post by Alan Cox
Post by Linus Torvalds
If somebody sues you, you change the algorithm or you just hire a
hit-man to whack the stupid git.
Btw, I'm not a lawyer, and I suspect this may not be legally tenable
advice. Whatever. I refuse to bother with the crap.
In which case you might as well do the rest of the world a favour and
restrict US usage of Linux in the license file while you are at it.
Unfortunately the USA forces people to deal with this crap. I'd hope SGI
would be decent enough to explicitly state they will license this stuff
freely for GPL use (although having shipped Linux themselves the
question is partly moot as the GPL says they can't impose additional
restrictions)
I do not agree that it is enough to license it for 'GPL' use. If there is
a license, it should impose no restrictions that the GPL does not. There
is a big distinction. Anything else, and the licensor is sending the message
that they reserve the right to enforce against Linux users.

In other words, a license grant has to cover *all* uses of Linux and not just
GPL uses.

In my opinion, RedHat has set a bad example by stopping short of promising
free use of Ingo's patents for all Linux users. We are entering a difficult
time, and such a wrong-footed move simply makes it more difficult.
--
Daniel
Rob Landley
2002-08-13 08:51:50 UTC
Permalink
Post by Daniel Phillips
In other words, a license grant has to cover *all* uses of Linux and not
just GPL uses.
Including a BSD license where source code is never released? Or dot-net
application servers hosted on a Linux system under lock and key in a vault
somewhere? And no termination clause, so this jerk can still sue you over
other frivolous patents?

So you would object to microsoft granting rights to its patents saying "you
can use this patent in software that runs on windows, but use it on any other
platform and we'll sue you", but you don't mind going the other way?

Either way BSD gets the shaft, of course. But then BSDI was doing that to them
a decade ago, and Sun hired away Bill Joy and forked off SunOS years before
that, so they should be used to it by now... :) (And BSD runs plenty of GPL
application code...)
Post by Daniel Phillips
In my opinion, RedHat has set a bad example by stopping short of promising
free use of Ingo's patents for all Linux users. We are entering a
difficult time, and such a wrong-footed move simply makes it more
difficult.
Imagine a slimeball company that puts out proprietary software, gets a patent
on turning a computer on, and sues everybody in the northern hemisphere ala
rambus. They run a Linux system in the corner in their office, therefore
they are "a linux user". How do you stop somebody with that mindset from
finding a similarly trivial loophole in your language? (Think Spamford
Wallace. Think the CEO of Rambus. Think Unisys and the gif patent. Think
the people who recently got a patent on JPEG. Think the british telecom
idiots trying to patent hyperlinking a decade after Tim Berners-Lee's first
code drop to usenet...)

Today, all these people do NOT sue IBM, unless they're really stupid. (And
if they do, they will have cross-licensed their patent portfolio with IBM in
a year or two. Pretty much guaranteed.)

Rob
Daniel Phillips
2002-08-13 16:47:01 UTC
Permalink
Post by Rob Landley
So you would object to microsoft granting rights to its patents saying "you
can use this patent in software that runs on windows, but use it on any other
platform and we'll sue you", but you don't mind going the other way?
You missed the point. I was talking about using copyright against patents,
and specifically in the case where patents are held by people who also want
to use the copyrighted code. The intention is to help keep our friends
honest.

Dealing with Microsoft, or anyone else whose only motivation is to obstruct,
is an entirely separate issue.
--
Daniel
Rik van Riel
2002-08-11 23:42:16 UTC
Permalink
Post by Alan Cox
Unfortunately the USA forces people to deal with this crap. I'd hope SGI
would be decent enough to explicitly state they will license this stuff
freely for GPL use
I seem to remember Apple having a clause for this in
their Darwin sources, forbidding people who contribute
code from suing them about patent violations due to
the code they themselves contributed.

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
Larry McVoy
2002-08-11 23:50:03 UTC
Permalink
Post by Rik van Riel
Post by Alan Cox
Unfortunately the USA forces people to deal with this crap. I'd hope SGI
would be decent enough to explicitly state they will license this stuff
freely for GPL use
I seem to remember Apple having a clause for this in
their Darwin sources, forbidding people who contribute
code from suing them about patent violations due to
the code they themselves contributed.
IBM has a fantastic clause in their open source license. The license grants
you various rights to use, etc., and then goes on to say something in
the termination section (I think) along the lines of

In the event that You or your affiliates instigate patent, trademark,
and/or any other intellectual property suits, this license terminates
as of the filing date of said suit[s].

You get the idea. It's basically "screw me, OK, then screw you too" language.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
Daniel Phillips
2002-08-12 08:22:26 UTC
Permalink
Post by Larry McVoy
Post by Rik van Riel
Post by Alan Cox
Unfortunately the USA forces people to deal with this crap. I'd hope SGI
would be decent enough to explicitly state they will license this stuff
freely for GPL use
I seem to remember Apple having a clause for this in
their Darwin sources, forbidding people who contribute
code from suing them about patent violations due to
the code they themselves contributed.
IBM has a fantastic clause in their open source license. The license grants
you various rights to use, etc., and then goes on to say something in
the termination section (I think) along the lines of
In the event that You or your affiliates instigate patent, trademark,
and/or any other intellectual property suits, this license terminates
as of the filing date of said suit[s].
You get the idea. It's basically "screw me, OK, then screw you too" language.
Yes. I would like to add my current rmap optimization work, if it is worthy
for the usual reasons, to the kernel under a DPL license which is in every
respect the GPL, except that it adds one additional restriction along these
lines:

"If you enforce a patent against a user of this code, or you have a
beneficial relationship with someone who does, then your licence to
use or distribute this code is automatically terminated"

with more language to extend the protection to the aggregate work, and to
specify that we are talking about enforcement of patents concerned with any
part of the aggregate work. Would something like that fly?

In other words, use copyright law as a lever against patent law.

This would tend to provide protection against 'our friends', who on the one
hand, depend on Linux in their businesses, and on the other hand, do seem to
be holding large portfolios of equivalently stupid patents.

As far as protection against those who would have no intention or need to use
the aggregate work anyway, that's an entirely separate question. Frankly, I
enjoy the sport of undermining a patent much more when it is held by someone
who is not a friend.
--
Daniel
Rob Landley
2002-08-13 08:40:08 UTC
Permalink
Post by Daniel Phillips
Yes. I would like to add my current rmap optimization work, if it is
worthy for the usual reasons, to the kernel under a DPL license which is in
every respect the GPL, except that it adds one additional restriction along
"If you enforce a patent against a user of this code, or you have a
beneficial relationship with someone who does, then your licence to
use or distribute this code is automatically terminated"
with more language to extend the protection to the aggregate work, and to
specify that we are talking about enforcement of patents concerned with any
part of the aggregate work. Would something like that fly?
In other words, use copyright law as a lever against patent law.
More than that, the GPL could easily be used to form a "patent pool". Just
say "This patent is licensed for use in GPL code. If you want to use it
outside of GPL code, you need a separate license."

The purpose of modern patents is Mutually Assured Destruction: If you sue me,
I have 800 random patents you're bound to have infringed just by breathing,
and even though they won't actually hold up to scrutiny I can keep you tied
up in court for years and force you to spend millions on legal fees. So why
don't you just cross-license your entire patent portfolio with us, and that
way we can make the whole #*%(&#% patent issue just go away. (Notice: when
anybody DOES sue, the result is usually a cross-licensing agreement of the
entire patent portfolio. Even in those rare cases when the patent
infringement is LEGITIMATE, the patent system is too screwed up to function
against large corporations due to the zillions of frivolous patents and the
tendency for corporations to have lawyers on staff so defending a lawsuit
doesn't really cost them anything.)

This is how companies like IBM and even Microsoft think. They get as many
patents as possible to prevent anybody ELSE from suing them, because the
patent office is stupid enough to give out a patent on scrollbars a decade
after the fact and they don't want to be on the receiving end of this
nonsense. And then they blanket cross-license with EVERYBODY, so nobody can
sue them.

People do NOT want to give a blanket license to everybody for any use on
these patents because it gives up the one thing they're good for: mutually
assured destruction. Licensing for "open source licenses" could mean "BSD
license but we never gave anybody any source code, so ha ha."

But if people with patents were to license all their patents FOR USE IN GPL
CODE, then any proprietary infringement (or attempt to sue) still gives them
leverage for a counter-suit. (IBM retained counter-suit ability in a
different way: you sue, the license terminates. That's not bad, but I think
sucking the patent system into the GPL the same way copyright gets inverted
would be more useful.)

This is more or less what Red Hat's done with its patents, by the way.
Blanket license for use under GPL-type licenses, but not BSD because that
would disarm mutually assured destruction. Now if we got somebody like IBM
on board a GPL patent pool (with more patents than anybody else, as far as I
know), that would really mean something...

Unfortunately, the maintainer of the GPL is Stallman, so he's the logical guy
to spearhead a "GPL patent pool" project, but any time anybody mentions the
phrase "intellectual property" to him he goes off on a tangent about how you
shouldn't call anything "intellectual property", so how can you have a
discussion about it, and nothing ever gets done. It's FRUSTRATING to see
somebody with such brilliant ideas hamstrung not just by idealism, but
PEDANTIC idealism.

Sigh...

Rob
Alan Cox
2002-08-13 15:06:26 UTC
Permalink
Post by Rob Landley
Unfortunately, the maintainer of the GPL is Stallman, so he's the logical guy
to spearhead a "GPL patent pool" project, but any time anybody mentions the
phrase "intellectual property" to him he goes off on a tangent about how you
shouldn't call anything "intellectual property", so how can you have a
discussion about it, and nothing ever gets done. It's FRUSTRATING to see
somebody with such brilliant ideas hamstrung not just by idealism, but
PEDANTIC idealism.
Richard isn't daft on this one. The FSF does not have the 30 million
dollars needed to fight a *single* US patent lawsuit. The problem also
reflects back on things like Debian, because Debian certainly cannot
afford to play the patent game either.
Rob Landley
2002-08-13 11:36:24 UTC
Permalink
Post by Alan Cox
Post by Rob Landley
Unfortunately, the maintainer of the GPL is Stallman, so he's the logical
guy to spearhead a "GPL patent pool" project, but any time anybody
mentions the phrase "intellectual property" to him he goes off on a
tangent about how you shouldn't call anything "intellectual property", so
how can you have a discussion about it, and nothing ever gets done. It's
FRUSTRATING to see somebody with such brilliant ideas hamstrung not just
by idealism, but PEDANTIC idealism.
Richard isnt daft on this one. The FSF does not have the 30 million
dollars needed to fight a *single* US patent lawsuit. The problem also
reflects back on things like Debian, because Debian certainly cannot
afford to play the patent game either.
Agreed, but they can try to give standing to companies that have either the
resources or the need to do it themselves, and also to placate people who see
patent applications by SGI and Red Hat as evil proprietary encroachment
rather than an attempt to scrape together some kind of defense against the
insanity of the patent system.

Like politics: it's a game you can't win by ignoring, you can only try to use
it against itself. The GPL did a great job of this with copyright law: it
doesn't abandon stuff into the public domain for other people to copyright
and claim, but keeps it copyrighted and uses that copyright against the
copyright system. But at the time software patents weren't enforceable yet
and I'm guessing the wording of the license didn't want to lend credibility
to the concept. This situation has changed since: now software patents are
themselves an IP threat to free software that needs a copyleft solution.

Releasing a GPL 2.1 with an extra clause about a patent pool wouldn't cost
$30 million. (I.E. patents used in GPL code are copyleft style licensed and
not BSD style licensed: they can be used in GPL code but use outside it
requires a separate license. Right now it says something like "free for use
by all" which makes the mutually assured destruction people cringe.)

By the way, the average figure I've heard to defend against a patent suit is
about $2 1/2 million. That's defend and not pursue, and admittedly that's
not near the upper limit, but it CAN be done for less. And what you're
looking for in a patent pool is something to countersue with in a defense,
not something to initiate action with. (Obviously, I'm not a professional
intellectual property lawyer. I know who to ask, but to get more than an off
the cuff remark I'd have to sponsor some research...)

Last time I really looked into all this, Stallman was trying to do an
enormous new GPL 3.0, addressing application service providers. That seems
to have fallen through (as has the ASP business model), but the patent issue
remains unresolved.

Red Hat would certainly be willing to play in a GPL patent pool. The
statement on their website already gives blanket permission to use patents in
GPL code (and a couple similar licenses; this would be a subset of the
permission they've already given). Red Hat's participation might convince
other distributors to do a "me too" thing (there's certainly precedent for
it). SGI could probably be talked into it as well, since they need the
goodwill of the Linux community unless they want to try to resurrect Irix.
IBM would take some convincing, it took them a couple years to get over their
distaste for the GPL in the first place, and they hate to be first on
anything, but if they weren't first... HP I haven't got a CLUE about with
Fiorina at the helm. Dell is being weird too...

Dunno. But ANY patent pool is better than none. If suing somebody for the
use of a patent in GPL code terminates your right to participate in a GPL
patent pool and makes you vulnerable to a suit over violating any patent in
the pool, then the larger the pool is the more incentive there is NOT to
sue...

Rob
Linus Torvalds
2002-08-13 16:51:45 UTC
Permalink
Post by Rob Landley
Last time I really looked into all this, Stallman was trying to do an
enormous new GPL 3.0, addressing application service providers. That seems
to have fallen through (as has the ASP business model), but the patent issue
remains unresolved.
At least one problem is exactly the politics played by the FSF, which
means that a lot of people (not just me), do not trust such new versions
of the GPL. Especially since the last time this happened, it all happened
in dark back-rooms, and I got to hear about it not off any of the lists,
but because I had an insider snitch on it.

I lost all respect I had for the FSF due to its sneakiness.

The kernel explicitly states that it is under the _one_ particular version
of the "GPL v2" that is included with the kernel. Exactly because I do not
want to have politics dragged into the picture by an external party (and
I'm anal enough that I made sure that "version 2" cannot be misconstrued
to include "version 2.1").

Also, a license is a two-way street. I do not think it is morally right to
change an _existing_ license for any other reason than the fact that it
has some technical legal problem. I intensely dislike the fact that many
people seem to want to extend the current GPL as a way to take advantage
of people who used the old GPL and agreed with _that_ - but not
necessarily the new one.

As a result, every time this comes up, I ask for any potential new
"patent-GPL" to be a _new_ license, and not try to feed off existing
works. Please don't make it "GPL". Make it the GPPL for "General Public
Patent License" or something. And let people buy into it on its own
merits, not on some "the FSF decided unilaterally to make this decision
for us".

I don't like patents. But I absolutely _hate_ people who play politics
with other peoples code. Be up-front, not sneaky after-the-fact.

Linus
Ruth Ivimey-Cook
2002-08-13 17:14:04 UTC
Permalink
Post by Linus Torvalds
I don't like patents. But I absolutely _hate_ people who play politics
with other peoples code. Be up-front, not sneaky after-the-fact.
Well said :-)

Ruth
--
Ruth Ivimey-Cook
Software engineer and technical writer.
Rik van Riel
2002-08-13 17:29:14 UTC
Permalink
Post by Linus Torvalds
Also, a license is a two-way street. I do not think it is morally right
to change an _existing_ license for any other reason than the fact that
it has some technical legal problem.
Agreed, but we might be running into one of these.
Post by Linus Torvalds
I don't like patents. But I absolutely _hate_ people who play politics
with other peoples code. Be up-front, not sneaky after-the-fact.
Suppose somebody sends you a patch which implements a nice
algorithm that just happens to be patented by that same
somebody. You don't know about the patent.

You integrate the patch into the kernel and distribute it,
one year later you get sued by the original contributor of
that patch because you distribute code that is patented by
that person.

Not having some protection in the license could open you
up to sneaky after-the-fact problems.

Having a license that explicitly states that people who
contribute and use Linux shouldn't sue you over it might
prevent some problems.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/
Alexander Viro
2002-08-13 17:45:40 UTC
Permalink
Post by Rik van Riel
Suppose somebody sends you a patch which implements a nice
algorithm that just happens to be patented by that same
somebody. You don't know about the patent.
You integrate the patch into the kernel and distribute it,
one year later you get sued by the original contributor of
that patch because you distribute code that is patented by
that person.
Not having some protection in the license could open you
up to sneaky after-the-fact problems.
Accepting non-trivial patches from malicious source means running code
from malicious source on your boxen. In kernel mode. And in that case
patents are the least of your troubles...
William Lee Irwin III
2002-08-11 23:36:10 UTC
Permalink
Post by Linus Torvalds
If somebody sues you, you change the algorithm or you just hire a
hit-man to whack the stupid git.
Btw, I'm not a lawyer, and I suspect this may not be legally tenable
advice. Whatever. I refuse to bother with the crap.
I'm not really sure what to think of all this patent stuff myself, but
I may need to get some directions from lawyerish types before moving on
here. OTOH I certainly like the suggested approach more than my
conservative one, even though I'm still too chicken to follow it. =)

On a more practical note, though, someone left out an essential 'h'
from my email address. Please adjust the cc: list. =)


Thanks,
Bill
Rik van Riel
2002-08-13 17:59:30 UTC
Permalink
Post by Rik van Riel
Having a license that explicitly states that people who
contribute and use Linux shouldn't sue you over it might
prevent some problems.
The thing is, if you own the patent, and you sneaked the code into the
kernel, you will almost certainly be laughed out of court for trying to
enforce it.
Apparently not everybody agrees on this:

http://zdnet.com.com/2100-1106-884681.html

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/