Discussion:
[RFC] NUMA futex hashing
(too old to reply)
Ravikiran G Thirumalai
2006-08-08 07:07:08 UTC
Permalink
Current futex hash scheme is not the best for NUMA. The futex hash table is
an array of struct futex_hash_bucket, which is just a spinlock and a
list_head -- this means multiple spinlocks on the same cacheline and on NUMA
machines, on the same internode cacheline. If futexes of two unrelated
threads running on two different nodes happen to hash onto adjacent hash
buckets, or buckets on the same internode cacheline, then we have the
internode cacheline bouncing between nodes.

Here is a simple scheme which maintains per-node hash tables for futexes.

In this scheme, a private futex is assigned to the node id of the futex's KVA.
The reasoning is, the futex KVA is allocated from the node as indicated
by memory policy set by the process, and that should be a good 'home node'
for that futex. Of course this helps workloads where all the threads of a
process are bound to the same node, but it seems reasonable to run all
threads of a process on the same node.

A shared futex is assigned a home node based on jhash2 itself. Since inode
and offset are used as the key, the same inode offset is used to arrive at
the home node of a shared futex. This distributes private futexes across
all nodes.

Comments? Suggestions? Particularly regarding shared futexes. Any policy
suggestions?

Thanks,
Kiran

Note: This patch needs to have kvaddr_to_nid() reintroduced. This was taken
out in git commit 9f3fd602aef96c2a490e3bfd669d06475aeba8d8

Index: linux-2.6.18-rc3/kernel/futex.c
===================================================================
--- linux-2.6.18-rc3.orig/kernel/futex.c 2006-08-02 12:11:34.000000000 -0700
+++ linux-2.6.18-rc3/kernel/futex.c 2006-08-02 16:48:47.000000000 -0700
@@ -137,20 +137,35 @@ struct futex_hash_bucket {
struct list_head chain;
};

-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static struct futex_hash_bucket *futex_queues[MAX_NUMNODES] __read_mostly;

/* Futex-fs vfsmount entry: */
static struct vfsmount *futex_mnt;

/*
* We hash on the keys returned from get_futex_key (see below).
+ * With NUMA aware futex hashing, we have per-node hash tables.
+ * We determine the home node of a futex based on the KVA -- if the futex
+ * is a private futex. For shared futexes, we use jhash2 itself on the
+ * futex_key to arrive at a home node.
*/
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
+ int nodeid;
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ if (key->both.offset & 0x1) {
+ /*
+ * Shared futex: Use any of the 'possible' nodes as home node.
+ */
+ nodeid = hash & (MAX_NUMNODES -1);
+ BUG_ON(!node_possible(nodeid));
+ } else
+ /* Private futex */
+ nodeid = kvaddr_to_nid(key->both.ptr);
+
+ return &futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)];
}

/*
@@ -1909,13 +1924,25 @@ static int __init init(void)
{
unsigned int i;

+ int nid;
+
+ for_each_node(nid)
+ {
+ futex_queues[nid] = kmalloc_node(
+ (sizeof(struct futex_hash_bucket) *
+ (1 << FUTEX_HASHBITS)),
+ GFP_KERNEL, nid);
+ if (!futex_queues[nid])
+ panic("futex_init: Allocation of multi-node futex_queues failed");
+ for (i = 0; i < (1 << FUTEX_HASHBITS); i++) {
+ INIT_LIST_HEAD(&futex_queues[nid][i].chain);
+ spin_lock_init(&futex_queues[nid][i].lock);
+ }
+ }
+
register_filesystem(&futex_fs_type);
futex_mnt = kern_mount(&futex_fs_type);

- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
- INIT_LIST_HEAD(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
return 0;
}
__initcall(init);
Eric Dumazet
2006-08-08 09:14:49 UTC
Permalink
Post by Ravikiran G Thirumalai
Current futex hash scheme is not the best for NUMA. The futex hash table
is an array of struct futex_hash_bucket, which is just a spinlock and a
list_head -- this means multiple spinlocks on the same cacheline and on
NUMA machines, on the same internode cacheline. If futexes of two
unrelated threads running on two different nodes happen to hash onto
adjacent hash buckets, or buckets on the same internode cacheline, then we
have the internode cacheline bouncing between nodes.
Here is a simple scheme which maintains per-node hash tables for futexes.
In this scheme, a private futex is assigned to the node id of the futex's
KVA. The reasoning is, the futex KVA is allocated from the node as
indicated by memory policy set by the process, and that should be a good
'home node' for that futex. Of course this helps workloads where all the
threads of a process are bound to the same node, but it seems reasonable to
run all threads of a process on the same node.
A shared futex is assigned a home node based on jhash2 itself. Since inode
and offset are used as the key, the same inode offset is used to arrive at
the home node of a shared futex. This distributes private futexes across
all nodes.
Comments? Suggestions? Particularly regarding shared futexes. Any policy
suggestions?
Your patch seems fine, but I have one comment.

For non NUMA machine, we would have one useless indirection to get the
futex_queues pointer.

static struct futex_hash_bucket *futex_queues[1];

I think it is worth to redesign your patch so that this extra-indirection is
needed only for NUMA machines.

#if defined(CONFIG_NUMA)
static struct futex_hash_bucket *futex_queues[MAX_NUMNODES];
#define FUTEX_QUEUES(nodeid, hash) \
&futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)];
#else
static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
# define FUTEX_QUEUES(nodeid, hash) \
&futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
#endif

Thank you
Post by Ravikiran G Thirumalai
Thanks,
Kiran
Note: This patch needs to have kvaddr_to_nid() reintroduced. This was
taken out in git commit 9f3fd602aef96c2a490e3bfd669d06475aeba8d8
Index: linux-2.6.18-rc3/kernel/futex.c
===================================================================
--- linux-2.6.18-rc3.orig/kernel/futex.c 2006-08-02 12:11:34.000000000
-0700 +++ linux-2.6.18-rc3/kernel/futex.c 2006-08-02 16:48:47.000000000
struct list_head chain;
};
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static struct futex_hash_bucket *futex_queues[MAX_NUMNODES] __read_mostly;
/* Futex-fs vfsmount entry: */
static struct vfsmount *futex_mnt;
/*
* We hash on the keys returned from get_futex_key (see below).
+ * With NUMA aware futex hashing, we have per-node hash tables.
+ * We determine the home node of a futex based on the KVA -- if the futex
+ * is a private futex. For shared futexes, we use jhash2 itself on the
+ * futex_key to arrive at a home node.
*/
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
+ int nodeid;
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ if (key->both.offset & 0x1) {
+ /*
+ * Shared futex: Use any of the 'possible' nodes as home node.
+ */
+ nodeid = hash & (MAX_NUMNODES -1);
+ BUG_ON(!node_possible(nodeid));
+ } else
+ /* Private futex */
+ nodeid = kvaddr_to_nid(key->both.ptr);
+
+ return &futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)];
}
/*
@@ -1909,13 +1924,25 @@ static int __init init(void)
{
unsigned int i;
+ int nid;
+
+ for_each_node(nid)
+ {
+ futex_queues[nid] = kmalloc_node(
+ (sizeof(struct futex_hash_bucket) *
+ (1 << FUTEX_HASHBITS)),
+ GFP_KERNEL, nid);
+ if (!futex_queues[nid])
+ panic("futex_init: Allocation of multi-node futex_queues failed");
+ for (i = 0; i < (1 << FUTEX_HASHBITS); i++) {
+ INIT_LIST_HEAD(&futex_queues[nid][i].chain);
+ spin_lock_init(&futex_queues[nid][i].lock);
+ }
+ }
+
register_filesystem(&futex_fs_type);
futex_mnt = kern_mount(&futex_fs_type);
- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
- INIT_LIST_HEAD(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
return 0;
}
__initcall(init);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Ravikiran G Thirumalai
2006-08-08 20:31:20 UTC
Permalink
Post by Eric Dumazet
Your patch seems fine, but I have one comment.
For non NUMA machine, we would have one useless indirection to get the
futex_queues pointer.
static struct futex_hash_bucket *futex_queues[1];
I think it is worth to redesign your patch so that this extra-indirection is
needed only for NUMA machines.
Yes. Will do in the next iteration.

Thanks,
Kiran
Jes Sorensen
2006-08-08 09:37:23 UTC
Permalink
Ravikiran> Current futex hash scheme is not the best for NUMA. The
Ravikiran> futex hash table is an array of struct futex_hash_bucket,
Ravikiran> which is just a spinlock and a list_head -- this means
Ravikiran> multiple spinlocks on the same cacheline and on NUMA
Ravikiran> machines, on the same internode cacheline. If futexes of
Ravikiran> two unrelated threads running on two different nodes happen
Ravikiran> to hash onto adjacent hash buckets, or buckets on the same
Ravikiran> internode cacheline, then we have the internode cacheline
Ravikiran> bouncing between nodes.

Ravikiran,

Using that argument, all you need to do is to add the alignment
____cacheline_aligned_in_smp to the definition of
struct futex_hash_bucket and the problem is solved, given that the
internode cacheline in a NUMA system is defined to be the same as the
SMP cacheline size.

Ravikiran> Here is a simple scheme which maintains per-node hash
Ravikiran> tables for futexes.

Ravikiran> In this scheme, a private futex is assigned to the node id
Ravikiran> of the futex's KVA. The reasoning is, the futex KVA is
Ravikiran> allocated from the node as indicated by memory policy set
Ravikiran> by the process, and that should be a good 'home node' for
Ravikiran> that futex. Of course this helps workloads where all the
Ravikiran> threads of a process are bound to the same node, but it
Ravikiran> seems reasonable to run all threads of a process on the
Ravikiran> same node.

You can't make that assumption at all. In many NUMA workloads it is
not common to have all threads of a process run on the same node. You
often see a case where one thread spawns a number of threads that are
then grouped onto the various nodes.

If we want to make the futexes really NUMA aware, having them
explicitly allocated on a given node would be more useful or
alternatively have them allocated on a first touch basis.

But to be honest, I doubt it matters too much since the futex
cacheline is most likely to end up in cache on the node where it's
being used and as long as the other nodes don't try and touch the same
futex this becomes a non-issue with the proper alignment.

I don't think your patch is harmful, but it looks awfully complex for
something that could be solved just as well by a simple alignment
statement.

Cheers,
Jes
Andi Kleen
2006-08-08 09:58:39 UTC
Permalink
Post by Jes Sorensen
Using that argument, all you need to do is to add the alignment
____cacheline_aligned_in_smp to the definition of
struct futex_hash_bucket and the problem is solved, given that the
internode cacheline in a NUMA system is defined to be the same as the
SMP cacheline size.
Yes but it would waste quite a lot of memory and cache.
Wasted cache = slow.

-Andi
Jes Sorensen
2006-08-08 10:07:47 UTC
Permalink
Post by Jes Sorensen
Using that argument, all you need to do is to add the alignment
____cacheline_aligned_in_smp to the definition of struct
futex_hash_bucket and the problem is solved, given that the
internode cacheline in a NUMA system is defined to be the same as
the SMP cacheline size.
Andi> Yes but it would waste quite a lot of memory and cache. Wasted
Andi> cache = slow.

Compared to the extra level of indirection, I doubt it would be
measurable. The cache space is barely wasted, we're talking
approximately half a cacheline per futex hash bucket in use.

Jes
Andi Kleen
2006-08-08 09:57:59 UTC
Permalink
Post by Ravikiran G Thirumalai
Current futex hash scheme is not the best for NUMA. The futex hash table is
an array of struct futex_hash_bucket, which is just a spinlock and a
list_head -- this means multiple spinlocks on the same cacheline and on NUMA
machines, on the same internode cacheline. If futexes of two unrelated
threads running on two different nodes happen to hash onto adjacent hash
buckets, or buckets on the same internode cacheline, then we have the
internode cacheline bouncing between nodes.
When I did some testing with a (arguably far too lock intensive) benchmark
on a bigger box I got most bouncing cycles not in the futex locks itself,
but in the down_read on the mm semaphore.

I guess that is not addressed?

-Andi
Eric Dumazet
2006-08-08 10:10:39 UTC
Permalink
Post by Andi Kleen
Post by Ravikiran G Thirumalai
Current futex hash scheme is not the best for NUMA. The futex hash
table is an array of struct futex_hash_bucket, which is just a spinlock
and a list_head -- this means multiple spinlocks on the same cacheline
and on NUMA machines, on the same internode cacheline. If futexes of two
unrelated threads running on two different nodes happen to hash onto
adjacent hash buckets, or buckets on the same internode cacheline, then
we have the internode cacheline bouncing between nodes.
When I did some testing with a (arguably far too lock intensive) benchmark
on a bigger box I got most bouncing cycles not in the futex locks itself,
but in the down_read on the mm semaphore.
This is true, even with a normal application (not a biased benchmark) and
using oprofile. mmap_sem is the killer.

We may have special case for PRIVATE futexes (they dont need to be chained in
a global table, but a process private table)

POSIX thread api already can let the application tell glibc/kernel a
mutex/futex ahe a process scope.

For this private futexes, I think we would not need to down_read(mmap_sem) at
all. (only a/some lock/s protecting the process private table)

Eric
Andi Kleen
2006-08-08 10:36:15 UTC
Permalink
Post by Eric Dumazet
We may have special case for PRIVATE futexes (they dont need to be chained in
a global table, but a process private table)
What do you mean with PRIVATE futex?

Even if the futex mapping is only visible by a single MM mmap_sem is still needed
to protect against other threads doing mmap.

-Andi
Eric Dumazet
2006-08-08 12:29:44 UTC
Permalink
Post by Andi Kleen
Post by Eric Dumazet
We may have special case for PRIVATE futexes (they dont need to be
chained in a global table, but a process private table)
What do you mean with PRIVATE futex?
Even if the futex mapping is only visible by a single MM mmap_sem is still
needed to protect against other threads doing mmap.
Hum... I would call that a user error.

If a thread is munmap()ing the vma that contains active futexes, result is
undefined. Same as today I think (a thread blocked in a FUTEX_WAIT should
stay blocked)

The point is that private futexes could be managed using virtual addresses,
and no call to find_extend_vma(), hence no mmap_sem contention.

There could be problem if the same futex (32 bits integer) could be mapped at
different virtual addresses in the same process.

Eric
Andi Kleen
2006-08-08 12:47:42 UTC
Permalink
Post by Eric Dumazet
Post by Andi Kleen
Post by Eric Dumazet
We may have special case for PRIVATE futexes (they dont need to be
chained in a global table, but a process private table)
What do you mean with PRIVATE futex?
Even if the futex mapping is only visible by a single MM mmap_sem is still
needed to protect against other threads doing mmap.
Hum... I would call that a user error.
If a thread is munmap()ing the vma that contains active futexes, result is
undefined.
We can't allow anything that could crash the kernel, corrupt a kernel,
data structure, allow writing to freed memory etc. No matter how
defined it is or not. Working with a vma that doesn't have
an existence guarantee would be just that.

-Andi
Eric Dumazet
2006-08-08 12:57:11 UTC
Permalink
Post by Andi Kleen
Post by Eric Dumazet
Post by Andi Kleen
Post by Eric Dumazet
We may have special case for PRIVATE futexes (they dont need to be
chained in a global table, but a process private table)
What do you mean with PRIVATE futex?
Even if the futex mapping is only visible by a single MM mmap_sem is
still needed to protect against other threads doing mmap.
Hum... I would call that a user error.
If a thread is munmap()ing the vma that contains active futexes, result
is undefined.
We can't allow anything that could crash the kernel, corrupt a kernel,
data structure, allow writing to freed memory etc. No matter how
defined it is or not. Working with a vma that doesn't have
an existence guarantee would be just that.
As I said, we do not walk the vmas anymore, no crashes are ever possible.

Just keep a process private list of 'private futexes' , indexed by their
virtual address. This list can be of course stored in a efficient data
structure, an AVL or RB tree, or hash table.

The validity of the virtual address is still tested by normal get_user()
call.. If the memory was freed by a thread, then a normal EFAULT error will
be reported... eventually.

Eric
Ulrich Drepper
2006-08-08 14:39:57 UTC
Permalink
Post by Eric Dumazet
The validity of the virtual address is still tested by normal get_user()
call.. If the memory was freed by a thread, then a normal EFAULT error will
be reported... eventually.
This is indeed what should be done. Private futexes are the by far
more frequent case and I bet you'd see improvements when avoiding the
mm mutex even for normal machines since futexes really are everywhere.
For shared mutexes you end up doing two lookups and that's fine IMO
as long as the first lookup is fast.

As for the NUMA case, I would oppose any change which has the
slightest impact on non-NUMA machines. It cannot be allowed that the
majority of systems is slowed down significantly just because of NUMA.
Especially since the effects of NUMA beside cache line transfer
penalties IMO probably are neglect able. The in-kernel futex
representation only exists when there are waiters and so the memory
needed is only allocated when we a waiting. In this case it just be
easy enough to use local memory. But this unlikely will help much
since the waker thread is ideally not on the same processor, maybe not
even on the same node. So there will be cacheline transfers in most
cases and everything possible improvement will be minimal and maybe
even not generally measurable.

If you want to do anything in this area, first remove the global
mutex. Then really measure with real world application. And I don't
mean specially designed HPC apps which assign threads/processes to
processors or nodes. Those are special cases of a special case.
Nick Piggin
2006-08-08 15:11:58 UTC
Permalink
Post by Ulrich Drepper
Post by Eric Dumazet
The validity of the virtual address is still tested by normal get_user()
call.. If the memory was freed by a thread, then a normal EFAULT error will
be reported... eventually.
This is indeed what should be done. Private futexes are the by far
more frequent case and I bet you'd see improvements when avoiding the
mm mutex even for normal machines since futexes really are everywhere.
For shared mutexes you end up doing two lookups and that's fine IMO
as long as the first lookup is fast.
The private futex's namespace is its virtual address, so I don't see
how you can decouple that from the management of virtual addresses.

Let me get this straight: to insert a contended futex into your rbtree,
you need to hold the mmap sem to ensure that address remains valid,
then you need to take a lock which protects your rbtree. Then to wake
up a process and remove the futex, you need to take the rbtree lock. Or
to unmap any memory you also need to take the rbtree lock and ensure
there are no futexes there.

So you just add another lock for no reason, or have I got a few screws
loose myself? I don't see how you can significantly reduce lock
cacheline bouncing in a futex heavy workload if you're just going to
add another shared data structure. But if you can, sweet ;)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Ulrich Drepper
2006-08-08 15:36:23 UTC
Permalink
Post by Nick Piggin
Let me get this straight: to insert a contended futex into your rbtree,
you need to hold the mmap sem to ensure that address remains valid,
then you need to take a lock which protects your rbtree.
Why does it have to remain valid? As long as the kernel doesn't crash
on any of the operations associated with the futex syscalls let the
address space region explode, implode, whatever. It's a bug in the
program if the address region is changed while a futex is placed
there. If the futex syscall hangs forever or returns with a bogus
state (error or even success) this is perfectly acceptable. We
shouldn't slow down correct uses just to make it possible for broken
programs to receive a more detailed error description.
Nick Piggin
2006-08-08 16:22:17 UTC
Permalink
Post by Ulrich Drepper
Post by Nick Piggin
Let me get this straight: to insert a contended futex into your rbtree,
you need to hold the mmap sem to ensure that address remains valid,
then you need to take a lock which protects your rbtree.
Why does it have to remain valid? As long as the kernel doesn't crash
on any of the operations associated with the futex syscalls let the
address space region explode, implode, whatever. It's a bug in the
program if the address region is changed while a futex is placed
there. If the futex syscall hangs forever or returns with a bogus
state (error or even success) this is perfectly acceptable. We
I thought mremap (no, that's already kind of messed up); or
even just getting consistency in failures (eg. so you don't have
the situation that a futex op can succeed on a previously
unmapped region).

If you're not worried about the latter, then it might work...

I didn't initially click that the private futex API operates
purely on tokens rather than virtual memory... comments in
futex.c talk about futexes being hashed to a particular
physical page (which is the case for shared). That's whacked.

So actually you would change semantics in some weird corner
cases, like mremaping a shared futex over a private futex's
Arguably that's broken, though ;)
Post by Ulrich Drepper
shouldn't slow down correct uses just to make it possible for broken
programs to receive a more detailed error description.
No we shouldn't slow them down. I'd be interested to see whether
locking is significantly sped up with this new data structure,
though.

You might also slow down due to the fact that you'd have to do the
locking and unconditionally traverse the private futexes even for
shared futexes.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin
2006-08-08 16:26:43 UTC
Permalink
Post by Nick Piggin
No we shouldn't slow them down. I'd be interested to see whether
locking is significantly sped up with this new data structure,
though.
OTOH, maybe you don't need a new data structure. Maybe you could
use the hash and check that for a match on a private futex before
trying to find a possible shared futex.

Locking I guess becomes no more of a problem than now, and in some
cases maybe much less. So OK, I stand corrected.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Ulrich Drepper
2006-08-08 16:49:54 UTC
Permalink
Post by Nick Piggin
I thought mremap (no, that's already kind of messed up); or
even just getting consistency in failures (eg. so you don't have
the situation that a futex op can succeed on a previously
unmapped region).
If you're not worried about the latter, then it might work...
I'm not the least bit worried about this. It's 100% an application's
fault. You cannot touch an address space if it's used, e.g., for
mutexes.
Post by Nick Piggin
I didn't initially click that the private futex API operates
purely on tokens rather than virtual memory...
I haven't looked at the code in some time but I thought this got
clarified in the comments. For waiting on private mutexes we need
nothing but the address value itself. There is the FUTEX_WAKE_OP
operation which will also write to memory but this is only the waker
side and if the memory mapping is gone, just flag an error. It's
another program error which shouldn't in any way slow down normal
operations.
Eric Dumazet
2006-08-08 16:08:34 UTC
Permalink
Post by Nick Piggin
Post by Ulrich Drepper
Post by Eric Dumazet
The validity of the virtual address is still tested by normal get_user()
call.. If the memory was freed by a thread, then a normal EFAULT error will
be reported... eventually.
This is indeed what should be done. Private futexes are the by far
more frequent case and I bet you'd see improvements when avoiding the
mm mutex even for normal machines since futexes really are everywhere.
For shared mutexes you end up doing two lookups and that's fine IMO
as long as the first lookup is fast.
The private futex's namespace is its virtual address, so I don't see
how you can decouple that from the management of virtual addresses.
Let me get this straight: to insert a contended futex into your rbtree,
you need to hold the mmap sem to ensure that address remains valid,
then you need to take a lock which protects your rbtree. Then to wake
up a process and remove the futex, you need to take the rbtree lock. Or
to unmap any memory you also need to take the rbtree lock and ensure
there are no futexes there.
So you just add another lock for no reason, or have I got a few screws
loose myself? I don't see how you can significantly reduce lock
cacheline bouncing in a futex heavy workload if you're just going to
add another shared data structure. But if you can, sweet ;)
We certainly can. But if you insist of using mmap sem at all, then we have a
problem.

rbtree would not reduce cacheline bouncing, so :

We could use a hashtable (allocated on demand) of size N, N depending on
NR_CPUS for example. each chain protected by a private spinlock. If N is well
chosen, we might reduce lock cacheline bouncing. (different threads fighting
on different private futexes would have a good chance to get different
cachelines in this hashtable)

As soon a process enters 'private futex' code, the futex code allocates this
hashtable if the process has a NULL hash table (set to NULL at exec() time,
or maybe re-allocated because we want to be sure futex syscall always suceed
(no ENOMEM))

So we really can... but for 'private futexes' which are the vast majority of
futexes needed by typical program (using POSIX pshared thread mutex attribute
PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc)

Of course we would need a new syscall, and to change glibc to be able to
actually use this new private_futex syscall.

Probably a lot of work, still, but could help heavy threaded programs not
touching mmap_sem.

We might have a refcounting problem on this 'hashtable' since several threads
share this structure, but only at thread creation/destruction, not in futex
call (ie no cacheline bouncing on the refcount)

Eric
Nick Piggin
2006-08-08 16:34:49 UTC
Permalink
Post by Eric Dumazet
We certainly can. But if you insist of using mmap sem at all, then we have a
problem.
We could use a hashtable (allocated on demand) of size N, N depending on
NR_CPUS for example. each chain protected by a private spinlock. If N is well
chosen, we might reduce lock cacheline bouncing. (different threads fighting
on different private futexes would have a good chance to get different
cachelines in this hashtable)
See other mail. We already have a hash table ;)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Eric Dumazet
2006-08-08 16:49:28 UTC
Permalink
Post by Nick Piggin
Post by Eric Dumazet
We certainly can. But if you insist of using mmap sem at all, then we
have a problem.
We could use a hashtable (allocated on demand) of size N, N depending on
NR_CPUS for example. each chain protected by a private spinlock. If N is
well chosen, we might reduce lock cacheline bouncing. (different threads
fighting on different private futexes would have a good chance to get
different cachelines in this hashtable)
See other mail. We already have a hash table ;)
Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is
private to this process.

Giving the same info at FUTEX_WAKE time could avoid the kernel to make the
second pass (using only a private futex lookup), avoiding again the mmap_sem
touch in case no threads are waiting anymore on this futex.

Eric
Eric Dumazet
2006-08-08 16:59:04 UTC
Permalink
Post by Eric Dumazet
Post by Nick Piggin
Post by Eric Dumazet
We certainly can. But if you insist of using mmap sem at all, then we
have a problem.
We could use a hashtable (allocated on demand) of size N, N depending
on NR_CPUS for example. each chain protected by a private spinlock. If
N is well chosen, we might reduce lock cacheline bouncing. (different
threads fighting on different private futexes would have a good chance
to get different cachelines in this hashtable)
See other mail. We already have a hash table ;)
Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is
private to this process.
Giving the same info at FUTEX_WAKE time could avoid the kernel to make the
second pass (using only a private futex lookup), avoiding again the
mmap_sem touch in case no threads are waiting anymore on this futex.
After looking at kernel/futex.c, I realize we also can avoid the atomic ops
(and another cacheline bouncing) done in get_key_refs()/drop_key_refs(),
touching the inode i_count or mm_count refcounter)

Eric
Nick Piggin
2006-08-09 01:56:26 UTC
Permalink
Post by Eric Dumazet
Post by Nick Piggin
Post by Eric Dumazet
We certainly can. But if you insist of using mmap sem at all, then we
have a problem.
We could use a hashtable (allocated on demand) of size N, N depending on
NR_CPUS for example. each chain protected by a private spinlock. If N is
well chosen, we might reduce lock cacheline bouncing. (different threads
fighting on different private futexes would have a good chance to get
different cachelines in this hashtable)
See other mail. We already have a hash table ;)
Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is
private to this process.
Yes, but I'm saying we already have a hash table. The hash table.

I'm *not* saying you *don't* also want a private directive from userspace.
--

Send instant messages to your online friends http://au.messenger.yahoo.com
Ulrich Drepper
2006-08-08 16:58:10 UTC
Permalink
Post by Eric Dumazet
So we really can... but for 'private futexes' which are the vast majority of
futexes needed by typical program (using POSIX pshared thread mutex attribute
PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc)
Nonsense. Mutexes are by default always private. They explicitly
have to be marked as sharable. This happens using the
pthread_mutexattr_setpshared function which takes
PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second
parameter. So the former _is_ clearly used.
Post by Eric Dumazet
Of course we would need a new syscall, and to change glibc to be able to
actually use this new private_futex syscall.
No, why? The kernel already does recognize private mutexes. It just
checks whether the pages used to store it are private or mapped. This
requires some interaction with the memory subsystem but as long as no
crashes happen the data can change underneath. It's the program's
fault if it does.

On the waker side you would search the local futex hash table/tree
first and if this doesn't yield a match, search the global table.
Wakeup calls without any waiters are usually rare.
Eric Dumazet
2006-08-08 17:08:17 UTC
Permalink
Post by Ulrich Drepper
Post by Eric Dumazet
So we really can... but for 'private futexes' which are the vast majority
of futexes needed by typical program (using POSIX pshared thread mutex
attribute PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc)
Nonsense. Mutexes are by default always private. They explicitly
have to be marked as sharable. This happens using the
pthread_mutexattr_setpshared function which takes
PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second
parameter. So the former _is_ clearly used.
I was saying that PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED info is
not provided to the kernel (because futex api/implementation dont need to).
It was not an attack on glibc.
Post by Ulrich Drepper
Post by Eric Dumazet
Of course we would need a new syscall, and to change glibc to be able to
actually use this new private_futex syscall.
No, why? The kernel already does recognize private mutexes. It just
checks whether the pages used to store it are private or mapped. This
requires some interaction with the memory subsystem but as long as no
crashes happen the data can change underneath. It's the program's
fault if it does.
But if you let futex code doing the vma walk to check the private/shared
status, you still need the mmap_sem locking.

Moreover, a program can mmap() a file (shared in terms of VMA), and continue
to use a PTHREAD_PROCESS_PRIVATE mutex lying in this shared zone
(Example : shmem or hugetlb mapping, wich API might always give a 'shared'
vma)
Post by Ulrich Drepper
On the waker side you would search the local futex hash table/tree
first and if this doesn't yield a match, search the global table.
Wakeup calls without any waiters are usually rare.
If the two searches touch two different cache lines in the hash table, we
might have a performance regression.
Of course we might chose a hash function so that the same slot is accessed.

Eric
Nick Piggin
2006-08-09 01:58:26 UTC
Permalink
Post by Ulrich Drepper
Post by Eric Dumazet
Of course we would need a new syscall, and to change glibc to be able to
actually use this new private_futex syscall.
No, why? The kernel already does recognize private mutexes. It just
checks whether the pages used to store it are private or mapped. This
requires some interaction with the memory subsystem but as long as no
crashes happen the data can change underneath. It's the program's
fault if it does.
Because that requires taking mmap_sem. Avoiding that is the whole
purpose, isn't it?

--

Send instant messages to your online friends http://au.messenger.yahoo.com
Eric Dumazet
2006-08-09 06:26:27 UTC
Permalink
Based on various discussions and feedbacks, I cooked a patch that implements
the notion of private futexes (private to a process, in the spirit of POSIX
pshared PTHREAD_PROCESS_PRIVATE )

[PATCH] futex : Add new PRIVATE futex primitives for performance improvements

When a futex is privately used by a process, we dont really need to lookup the
list of vmas of the process in order to discover if the futex is backed by a
inode or by the mm struct. We dont really need to keep a refcount on the
inode or mm.

This patch introduces new futex calls, that could be used by user land (glibc
of course) when private futexes are used.

Avoiding vmas lookup means avoiding taking the mmap_sem (and forcing cacheline
bouncings).

Avoiding refcounting on underlying inode or mm struct also avoids cacheline
bouncing.

Thats two cacheline bounces avoided per FUTEX syscall

glibc could use the new futex primitives introduced here (in particular for
PTHREAD_PROCESS_PRIVATE semantic), and fallback to old one if running on
older kernel. Fallback could set a global variable with the number of syscall
so that only one failed syscall is done in the process lifetime.

Note : Compatibility should be maintained by this patch, as old applications
will use the 'SHARED' functionality, unchanged.

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
Eric Dumazet
2006-08-09 06:43:52 UTC
Permalink
(patch inlined this time)
Based on various discussions and feedbacks, I cooked a patch that implements
the notion of private futexes (private to a process, in the spirit of POSIX
pshared PTHREAD_PROCESS_PRIVATE )

[PATCH] futex : Add new PRIVATE futex primitives for performance improvements

When a futex is privately used by a process, we dont really need to lookup the
list of vmas of the process in order to discover if the futex is backed by a
inode or by the mm struct. We dont really need to keep a refcount on the
inode or mm.

This patch introduces new futex calls, that could be used by user land (glibc
of course) when private futexes are used.

Avoiding vmas lookup means avoiding taking the mmap_sem (and forcing cacheline
bouncings).

Avoiding refcounting on underlying inode or mm struct also avoids cacheline
bouncing.

Thats two cacheline bounces avoided per FUTEX syscall

glibc could use the new futex primitives introduced here (in particular for
PTHREAD_PROCESS_PRIVATE semantic), and fallback to old one if running on
older kernel. Fallback could set a global variable with the number of syscall
so that only one failed syscall is done in the process lifetime.

Note : Compatibility should be maintained by this patch, as old applications
will use the 'SHARED' functionality, unchanged.

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
--- linux-2.6.18-rc4/include/linux/futex.h 2006-08-08 22:46:13.000000000 +0200
+++ linux-2.6.18-rc4-ed/include/linux/futex.h 2006-08-08 23:23:13.000000000
+0200
@@ -15,6 +15,11 @@
#define FUTEX_LOCK_PI 6
#define FUTEX_UNLOCK_PI 7
#define FUTEX_TRYLOCK_PI 8
+#define FUTEX_WAIT_PRIVATE 9
+#define FUTEX_WAKE_PRIVATE 10
+#define FUTEX_REQUEUE_PRIVATE 11
+#define FUTEX_CMP_REQUEUE_PRIVATE 12
+#define FUTEX_WAKE_OP_PRIVATE 13

/*
* Support for robust futexes: the kernel cleans up held futexes at
--- linux-2.6.18-rc4/kernel/futex.c 2006-08-08 22:45:46.000000000 +0200
+++ linux-2.6.18-rc4-ed/kernel/futex.c 2006-08-09 07:26:19.000000000 +0200
@@ -60,8 +60,9 @@
* Don't rearrange members without looking at hash_futex().
*
* offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
*/
+#define OFF_INODE 1 /* We set bit 0 if it has a reference on inode */
+#define OFF_MMSHARED 2 /* We set bit 1 if it has a reference on mm */
union futex_key {
struct {
unsigned long pgoff;
@@ -79,6 +80,8 @@
int offset;
} both;
};
+#define FUT_SHARED 1 /* we should walk vmas */
+#define FUT_PRIVATE 0 /* private futex: no need to walk vmas*/

/*
* Priority Inheritance state:
@@ -140,7 +143,7 @@
static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];

/* Futex-fs vfsmount entry: */
-static struct vfsmount *futex_mnt;
+static struct vfsmount *futex_mnt __read_mostly;

/*
* We hash on the keys returned from get_futex_key (see below).
@@ -175,7 +178,7 @@
*
* Should be called with &current->mm->mmap_sem but NOT any spinlocks.
*/
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key, int shared)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
@@ -191,6 +194,11 @@
return -EINVAL;
address -= key->both.offset;

+ if (shared == FUT_PRIVATE) {
+ key->private.mm = mm;
+ key->private.address = address;
+ return 0;
+ }
/*
* The futex is hashed differently depending on whether
* it's in a shared or private mapping. So check vma first.
@@ -215,6 +223,7 @@
* mappings of _writable_ handles.
*/
if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+ key->both.offset += OFF_MMSHARED; /* reference taken on mm */
key->private.mm = mm;
key->private.address = address;
return 0;
@@ -224,7 +233,7 @@
* Linear file mappings are also simple.
*/
key->shared.inode = vma->vm_file->f_dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ key->both.offset += OFF_INODE; /* reference taken on inode */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
+ vma->vm_pgoff);
@@ -256,8 +265,8 @@
*/
static inline void get_key_refs(union futex_key *key)
{
- if (key->both.ptr != 0) {
- if (key->both.offset & 1)
+ if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+ if (key->both.offset & OFF_INODE)
atomic_inc(&key->shared.inode->i_count);
else
atomic_inc(&key->private.mm->mm_count);
@@ -270,8 +279,8 @@
*/
static void drop_key_refs(union futex_key *key)
{
- if (key->both.ptr != 0) {
- if (key->both.offset & 1)
+ if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+ if (key->both.offset & OFF_INODE)
iput(key->shared.inode);
else
mmdrop(key->private.mm);
@@ -650,7 +659,7 @@
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(u32 __user *uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, int nr_wake, int shared)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -658,9 +667,10 @@
union futex_key key;
int ret;

- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr, &key);
+ ret = get_futex_key(uaddr, &key, shared);
if (unlikely(ret != 0))
goto out;

@@ -682,7 +692,8 @@

spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(&current->mm->mmap_sem);
return ret;
}

@@ -692,7 +703,7 @@
*/
static int
futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
- int nr_wake, int nr_wake2, int op)
+ int nr_wake, int nr_wake2, int op, int shared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -703,10 +714,10 @@
retryfull:
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr1, &key1);
+ ret = get_futex_key(uaddr1, &key1, shared);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, &key2);
+ ret = get_futex_key(uaddr2, &key2, shared);
if (unlikely(ret != 0))
goto out;

@@ -802,7 +813,7 @@
* physical page.
*/
static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
- int nr_wake, int nr_requeue, u32 *cmpval)
+ int nr_wake, int nr_requeue, u32 *cmpval, int shared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -811,12 +822,13 @@
int ret, drop_count = 0;

retry:
- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr1, &key1);
+ ret = get_futex_key(uaddr1, &key1, shared);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, &key2);
+ ret = get_futex_key(uaddr2, &key2, shared);
if (unlikely(ret != 0))
goto out;

@@ -839,7 +851,8 @@
* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(&current->mm->mmap_sem);

ret = get_user(curval, uaddr1);

@@ -888,7 +901,8 @@
drop_key_refs(&key1);

out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(&current->mm->mmap_sem);
return ret;
}

@@ -999,7 +1013,7 @@
drop_key_refs(&q->key);
}

-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time, int
shared)
{
struct task_struct *curr = current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1010,9 +1024,10 @@

q.pi_state = NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (shared)
+ down_read(&curr->mm->mmap_sem);

- ret = get_futex_key(uaddr, &q.key);
+ ret = get_futex_key(uaddr, &q.key, shared);
if (unlikely(ret != 0))
goto out_release_sem;

@@ -1047,7 +1062,8 @@
* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(&curr->mm->mmap_sem);

ret = get_user(uval, uaddr);

@@ -1066,7 +1082,8 @@
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(&curr->mm->mmap_sem);

/*
* There might have been scheduling since the queue_me(), as we
@@ -1108,7 +1125,8 @@
queue_unlock(&q, hb);

out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(&curr->mm->mmap_sem);
return ret;
}

@@ -1134,7 +1152,7 @@
retry:
down_read(&curr->mm->mmap_sem);

- ret = get_futex_key(uaddr, &q.key);
+ ret = get_futex_key(uaddr, &q.key, FUT_SHARED);
if (unlikely(ret != 0))
goto out_release_sem;

@@ -1435,7 +1453,7 @@
*/
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr, &key);
+ ret = get_futex_key(uaddr, &key, FUT_SHARED);
if (unlikely(ret != 0))
goto out;

@@ -1551,7 +1569,7 @@
return ret;
}

-static struct file_operations futex_fops = {
+static const struct file_operations futex_fops = {
.release = futex_close,
.poll = futex_poll,
};
@@ -1600,7 +1618,7 @@
q->pi_state = NULL;

down_read(&current->mm->mmap_sem);
- err = get_futex_key(uaddr, &q->key);
+ err = get_futex_key(uaddr, &q->key, FUT_SHARED);

if (unlikely(err != 0)) {
up_read(&current->mm->mmap_sem);
@@ -1742,7 +1760,7 @@
*/
if (!pi) {
if (uval & FUTEX_WAITERS)
- futex_wake(uaddr, 1);
+ futex_wake(uaddr, 1, FUT_SHARED);
}
}
return 0;
@@ -1830,23 +1848,38 @@

switch (op) {
case FUTEX_WAIT:
- ret = futex_wait(uaddr, val, timeout);
+ ret = futex_wait(uaddr, val, timeout, FUT_SHARED);
+ break;
+ case FUTEX_WAIT_PRIVATE:
+ ret = futex_wait(uaddr, val, timeout, FUT_PRIVATE);
break;
case FUTEX_WAKE:
- ret = futex_wake(uaddr, val);
+ ret = futex_wake(uaddr, val, FUT_SHARED);
+ break;
+ case FUTEX_WAKE_PRIVATE:
+ ret = futex_wake(uaddr, val, FUT_PRIVATE);
break;
case FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, val);
break;
case FUTEX_REQUEUE:
- ret = futex_requeue(uaddr, uaddr2, val, val2, NULL);
+ ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_SHARED);
+ break;
+ case FUTEX_REQUEUE_PRIVATE:
+ ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_PRIVATE);
break;
case FUTEX_CMP_REQUEUE:
- ret = futex_requeue(uaddr, uaddr2, val, val2, &val3);
+ ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_SHARED);
+ break;
+ case FUTEX_CMP_REQUEUE_PRIVATE:
+ ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_PRIVATE);
break;
case FUTEX_WAKE_OP:
- ret = futex_wake_op(uaddr, uaddr2, val, val2, val3);
+ ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_SHARED);
+ break;
+ case FUTEX_WAKE_OP_PRIVATE:
+ ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_PRIVATE);
break;
case FUTEX_LOCK_PI:
ret = futex_lock_pi(uaddr, val, timeout, val2, 0);
Eric Dumazet
2007-03-15 19:10:34 UTC
Permalink
Hi

I'm pleased to present these patches which improve linux futex performance and
scalability, on both UP, SMP and NUMA configs.

I had this idea last year but I was not understood, probably because I gave
not enough explanations. Sorry if this mail is really long...


Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.
Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
(calling find_vma()). This validation tells us if the futex use
an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic Increment of reference counter on an inode or a mm

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
(rollback is value != expected_value, returns EWOULDBLOCK)
(various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

<block>

8) Eventually unqueue the context (but rarely, as this part
may be done by the futex_wake())

Futexes were designed to improve scalability but current implementation
has various problems :

- Central hashtable :
This means scalability problems if many processes/threads want to use
futexes at the same time.
This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
ops on mmap_sem, dirtying cache line :
- lot of cache line ping pongs on SMP configurations.

mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
Highly threaded processes might suffer from mmap_sem contention.

mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and
optimal performance if carefuly implemented. Time has come for linux to have
better threading performance.

PTHREAD_PROCESS_PRIVATE semantic allows implementation to use separate
repositories :
- One 'global' namespace for all PROCESS_SHARED futexes.
- One "per process private repository" for PROCESS_PRIVATE futexes.
This repository is NUMA aware, it is allocated the first time a process
issues a futex(XXXX_PRIVATE) call. If allocation is not possible because
of memory shortage, we just fallback using the central repository.

The goal is to permit new futex commands to avoid :
- Using the central hash table (still used by PTHREAD_PROCESS_SHARED futexes)
- Taking the mmap_sem semaphore, conflicting with other subsystems.
- Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter the underlying mm storage is.

This is why this patches :

1) Define new futex subcommands (basically adding a _PRIVATE flag)
Avoids using mmap_sem, and ref counter on inode or mm.

2) Allows each process to have a private repository (a small hash table)
where its PROCESS_PRIVATE active futexes are stored, instead of
the global repository.
if CONFIG_BASE_SMALL, we still use the global repository

3) NUMA optimization : we allocate the global hash table with vmalloc()

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fallback on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

Only PTHREAD_PROCESS_SHARED futexes should use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
450 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
337 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
332 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
214 cycles per futex(1000) call (returns ENOSYS)
186 cycles per getppid() call
187 cycles per umask() call
182 cycles per ni_syscall() call

Thank you for reading this mail

[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic
[PATCH 2/3] FUTEX : introduce private hashtables
[PATCH 3/3] FUTEX : NUMA friendly global hashtable

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
Nick Piggin
2007-03-15 20:15:16 UTC
Permalink
Post by Eric Dumazet
Hi
I'm pleased to present these patches which improve linux futex performance and
scalability, on both UP, SMP and NUMA configs.
I had this idea last year but I was not understood, probably because I gave
not enough explanations. Sorry if this mail is really long...
Yes please, this is really nice. The mmap_sem is already overworked just
doing real mm stuff, let alone making it the central point of our "scalable"
thread synchronisation syscall.

To summarise: not only will this make futexes scale much better, but it will
also take some load off mmap_sem.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Peter Zijlstra
2007-03-16 08:05:41 UTC
Permalink
Post by Eric Dumazet
Hi
I'm pleased to present these patches which improve linux futex performance and
scalability, on both UP, SMP and NUMA configs.
I had this idea last year but I was not understood, probably because I gave
not enough explanations. Sorry if this mail is really long...
I started playing with it after your last reference to it, I have some
code here (against -rt):
http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/

Which I will post once I have the found what keeps pthread_join() from
completing :-(

It basically adds a per task vma lookup cache which can also activate
the private logic without explicit use of the new interface.
Eric Dumazet
2007-03-16 09:30:17 UTC
Permalink
Post by Peter Zijlstra
Post by Eric Dumazet
Hi
I'm pleased to present these patches which improve linux futex
performance and scalability, on both UP, SMP and NUMA configs.
I had this idea last year but I was not understood, probably because I
gave not enough explanations. Sorry if this mail is really long...
I started playing with it after your last reference to it, I have some
http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/
Which I will post once I have the found what keeps pthread_join() from
completing :-(
It basically adds a per task vma lookup cache which can also activate
the private logic without explicit use of the new interface.
Hi Peter

I dont think yet another cache will help in the general case.
A typical program uses many vmas at once...

glibc has internal futexes, on a different vma than futexes declared in your
program. Each shared library is going to have its own vma for its data (and
futexes)

(244 vmas on one kmail program for example)

About your guess_futex_shared() thing, I miss the vma_anon() definition.
But if it has to walk the vmas (and take mmap_sem), you already loose the
PRIVATE benefit.
Peter Zijlstra
2007-03-16 10:10:13 UTC
Permalink
Post by Eric Dumazet
Post by Peter Zijlstra
Post by Eric Dumazet
Hi
I'm pleased to present these patches which improve linux futex
performance and scalability, on both UP, SMP and NUMA configs.
I had this idea last year but I was not understood, probably because I
gave not enough explanations. Sorry if this mail is really long...
I started playing with it after your last reference to it, I have some
http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/
Which I will post once I have the found what keeps pthread_join() from
completing :-(
It basically adds a per task vma lookup cache which can also activate
the private logic without explicit use of the new interface.
Hi Peter
I dont think yet another cache will help in the general case.
A typical program uses many vmas at once...
glibc has internal futexes, on a different vma than futexes declared in your
program. Each shared library is going to have its own vma for its data (and
futexes)
(244 vmas on one kmail program for example)
Yeah, I was just hoping a few cache entries would be enough to get the
worst of them. A benchmark will have to tell I guess.
Post by Eric Dumazet
About your guess_futex_shared() thing, I miss the vma_anon() definition.
http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch
Post by Eric Dumazet
But if it has to walk the vmas (and take mmap_sem), you already loose the
PRIVATE benefit.
It doesn't take mmap_sem, I am aware of the problems.
Eric Dumazet
2007-03-16 10:30:06 UTC
Permalink
Post by Peter Zijlstra
http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.p
atch
Oh thanks
Post by Peter Zijlstra
Post by Eric Dumazet
But if it has to walk the vmas (and take mmap_sem), you already loose the
PRIVATE benefit.
It doesn't take mmap_sem, I am aware of the problems.
Yes but the vma_anon() -> vma_cache_find() needs to read 3 cache lines on
x86_64 (sizeof(struct vma_cache) = 136)
and dirty one bit, so it might be more expensive than the mmap_sem ...
Peter Zijlstra
2007-03-16 10:36:26 UTC
Permalink
Post by Eric Dumazet
Post by Peter Zijlstra
http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.p
atch
Oh thanks
Post by Peter Zijlstra
Post by Eric Dumazet
But if it has to walk the vmas (and take mmap_sem), you already loose the
PRIVATE benefit.
It doesn't take mmap_sem, I am aware of the problems.
Yes but the vma_anon() -> vma_cache_find() needs to read 3 cache lines on
x86_64 (sizeof(struct vma_cache) = 136)
and dirty one bit, so it might be more expensive than the mmap_sem ...
I though the cacheline was 128 bytes, but point taken.
Ulrich Drepper
2007-04-04 07:16:09 UTC
Permalink
Post by Eric Dumazet
But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and
optimal performance if carefuly implemented. Time has come for linux to have
better threading performance.
Now that I've been pointed to the thread I can comment on it.

Yes, this approach makes a lot of sense. Programs which shared syn
objects without declaring them correctly are broken and deserve to
fail. It would be quite a lot of change in libpthread but it's
manageable.

I haven't tested the code nor will I likely do until I get a Fedora
kernel with it. So, convince DaveJ to take it for testing.
Eric Dumazet
2007-04-05 17:49:42 UTC
Permalink
Hi all

I'm pleased to present this patch which improves linux futexes performa=
nce and=20
scalability, merely avoiding taking mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well ? This version is against 2.6.2=
1-rc5-mm4
(so supports 64bit futexes)

In this third version I dropped the NUMA optims and process private has=
h table,
to let new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of wai=
ting=20
threads.
Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot =
to=20
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :


1) - Obtain a read lock on mmap_sem to be able to validate the user poi=
nter
=A0 =A0 =A0(calling find_vma()). This validation tells us if the futex =
uses
=A0 =A0 =A0an inode based store (mapped file), or mm based store (anony=
mous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (rollback is value !=3D expected_value,=
returns EWOULDBLOCK)
=A0 =A0 =A0 =A0 (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

=A0 =A0<block>

8) Eventually unqueue the context (but rarely, as this part
=A0may be done by the futex_wake())

=46utexes were designed to improve scalability but current implementati=
on
has various problems :

- Central hashtable :
=A0This means scalability problems if many processes/threads want to us=
e
=A0futexes at the same time.
=A0This means NUMA unbalance because this hashtable is located on one n=
ode.

- Using mmap_sem on every futex() syscall :

=A0Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing =
atomic
=A0ops on mmap_sem, dirtying cache line :
=A0 =A0 =A0 =A0 - lot of cache line ping pongs on SMP configurations.

=A0mmap_sem is also extensively used by mm code (page faults, mmap()/mu=
nmap())
=A0Highly threaded processes might suffer from mmap_sem contention.

=A0mmap_sem is also used by oprofile code. Enabling oprofile hurts thre=
aded
programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref cou=
nter:
=A0It's also a cache line ping pong on SMP. It also increases mmap_sem =
hold time
=A0because of cache misses.

Most of these scalability problems come from the fact that futexes are =
in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation=
, and
optimal performance if carefuly implemented. Time has come for linux to=
have=20
better threading performance.

The goal is to permit new futex commands to avoid :
=A0- Taking the mmap_sem semaphore, conflicting with other subsystems.
=A0- Modifying a ref_count on mm or an inode, still conflicting with mm=
or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, =
no
matter the underlying mm storage is.



If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fallback on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old bina=
ries=20
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semanti=
c,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value !=3D *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
=46or reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
---
include/linux/futex.h | 45 +++++-
kernel/futex.c | 294 +++++++++++++++++++++++++---------------
2 files changed, 230 insertions(+), 109 deletions(-)

--- linux-2.6.21-rc5-mm4/include/linux/futex.h
+++ linux-2.6.21-rc5-mm4-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
#define FUTEX_TRYLOCK_PI 8
#define FUTEX_CMP_REQUEUE_PI 9
=20
+#define FUTEX_PRIVATE_FLAG 128
+#define FUTEX_CMD_MASK ~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_F=
LAG)
+#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLA=
G)
+
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
@@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st
* Don't rearrange members without looking at hash_futex().
*
* offset is aligned to a multiple of sizeof(u32) (=3D=3D 4) by defini=
tion.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of=
key :
+ * 00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ * (no reference on an inode or mm)
+ * 01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ * mapped on a file (reference on the underlying inode)
+ * 10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ * (but private mapping on an mm, and reference taken on it)
*/
+
+#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on i=
node */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on m=
m */
+
union futex_key {
unsigned long __user *uaddr;
struct {
@@ -135,8 +157,27 @@ union futex_key {
int offset;
} both;
};
-int get_futex_key(void __user *uaddr, union futex_key *key);
+
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ * &current->mm->mmap_sem for a PROCESS_SHARED futex
+ * @key: address where result is stored.
+ *
+ * Returns an error code or 0
+ */
+int get_futex_key(void __user *uaddr, union futex_key *key,
+ struct rw_semaphore *shared);
+
+/**
+ * get_futex_key_refs - Take a reference to the resource addressed by =
a key
+ */
void get_futex_key_refs(union futex_key *key);
+
+/**
+ * drop_futex_key_refs - Drop a reference to the resource addressed by=
a key.
+ */
void drop_futex_key_refs(union futex_key *key);
=20
#ifdef CONFIG_FUTEX
--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
* Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <***@redhat.com>
* Copyright (C) 2006 Timesys Corp., Thomas Gleixner <***@timesys.co=
m>
*
+ * PRIVATE futexes by Eric Dumazet
+ * Copyright (C) 2007 Eric Dumazet <***@cosmosbay.com>
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -199,9 +202,12 @@ static inline int match_futex(union fute
* Returns: 0, or negative error code.
* The key words are stored in *key on success.
*
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * shared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
*/
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+ struct rw_semaphore *shared)
{
unsigned long address =3D (unsigned long)uaddr;
struct mm_struct *mm =3D current->mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
address -=3D key->both.offset;
=20
/*
+ * PROCESS_PRIVATE futexes are fast.
+ * As the mm cannot disappear under us and the 'key' only needs
+ * virtual address, we dont even have to find the underlying vma.
+ * Note : We do have to check 'address' is a valid user address,
+ * but access_ok() should be faster than find_vma()
+ * Note : At this point, address points to the start of page,
+ * not the real futex address, this is ok.
+ */
+ if (!shared) {
+ if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+ return -EFAULT;
+ key->private.mm =3D mm;
+ key->private.address =3D address;
+ return 0;
+ }
+ /*
* The futex is hashed differently depending on whether
* it's in a shared or private mapping. So check vma first.
*/
@@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un
* mappings of _writable_ handles.
*/
if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+ key->both.offset +=3D FUT_OFF_MMSHARED; /* reference taken on mm */
key->private.mm =3D mm;
key->private.address =3D address;
return 0;
@@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un
* Linear file mappings are also simple.
*/
key->shared.inode =3D vma->vm_file->f_path.dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ key->both.offset +=3D FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff =3D (((address - vma->vm_start) >> PAGE_SHIFT)
+ vma->vm_pgoff);
@@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
* Take a reference to the resource addressed by a key.
* Can be called while holding spinlocks.
*
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling thi=
s
- * function, if it is called at all. mmap_sem keeps key->shared.inode=
valid.
*/
inline void get_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr !=3D 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr =3D=3D 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+ case FUT_OFF_INODE:
atomic_inc(&key->shared.inode->i_count);
- else
+ break;
+ case FUT_OFF_MMSHARED:
atomic_inc(&key->private.mm->mm_count);
- }
+ break;
+ }
}
EXPORT_SYMBOL_GPL(get_futex_key_refs);
=20
@@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
*/
void drop_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr !=3D 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr =3D=3D 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+ case FUT_OFF_INODE:
iput(key->shared.inode);
- else
+ break;
+ case FUT_OFF_MMSHARED:
mmdrop(key->private.mm);
+ break;
}
}
EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -339,28 +368,40 @@ get_futex_value_locked(unsigned long *de
}
=20
/*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if shared is non NULL, current->mm->mmap_sem is already held
*/
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address, int attempt,
+ struct rw_semaphore *shared)
{
struct vm_area_struct * vma;
struct mm_struct *mm =3D current->mm;
+ int ret =3D 0;
=20
- if (attempt > 2 || !(vma =3D find_vma(mm, address)) ||
- vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+ if (attempt > 2)
return -EFAULT;
=20
- switch (handle_mm_fault(mm, vma, address, 1)) {
- case VM_FAULT_MINOR:
- current->min_flt++;
- break;
- case VM_FAULT_MAJOR:
- current->maj_flt++;
- break;
- default:
- return -EFAULT;
- }
- return 0;
+ if (!shared)
+ down_read(&mm->mmap_sem);
+
+ if (!(vma =3D find_vma(mm, address)) ||
+ vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+ ret =3D -EFAULT;
+
+ else
+ switch (handle_mm_fault(mm, vma, address, 1)) {
+ case VM_FAULT_MINOR:
+ current->min_flt++;
+ break;
+ case VM_FAULT_MAJOR:
+ current->maj_flt++;
+ break;
+ default:
+ ret =3D -EFAULT;
+ }
+ if (!shared)
+ up_read(&mm->mmap_sem);
+ return ret;
}
=20
/*
@@ -705,7 +746,8 @@ double_lock_hb(struct futex_hash_bucket=20
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(unsigned long __user *uaddr, int nr_wake)
+static int futex_wake(unsigned long __user *uaddr, int nr_wake,
+ struct rw_semaphore *shared)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -713,9 +755,10 @@ static int futex_wake(unsigned long __us
union futex_key key;
int ret;
=20
- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
=20
- ret =3D get_futex_key(uaddr, &key);
+ ret =3D get_futex_key(uaddr, &key, shared);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -737,7 +780,8 @@ static int futex_wake(unsigned long __us
=20
spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
return ret;
}
=20
@@ -807,7 +851,8 @@ retry:
*/
static int
futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *u=
addr2,
- int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+ int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+ struct rw_semaphore *shared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -825,12 +870,13 @@ retry:
/*
* First take all the futex related locks:
*/
- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, &key1, shared);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, &key2, shared);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -998,7 +1044,8 @@ out:
*/
static int
futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uadd=
r2,
- int nr_wake, int nr_wake2, int op, int futex64)
+ int nr_wake, int nr_wake2, int op, int futex64,
+ struct rw_semaphore *shared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -1007,12 +1054,13 @@ futex_wake_op(unsigned long __user *uadd
int ret, op_ret, attempt =3D 0;
=20
retryfull:
- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, &key1, shared);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, &key2, shared);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1055,15 +1103,14 @@ retry:
* futex_atomic_op_inuser needs to both read and write
* *(int __user *)uaddr2, but we can't modify it
* non-atomically. Therefore, if get_user below is not
- * enough, we need to handle the fault ourselves, while
- * still holding the mmap_sem.
+ * enough, we need to handle the fault ourselves. Make=20
+ * sure we hold mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr2,
- attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr2,
+ attempt, shared);
+ if (ret)
goto out;
- }
goto retry;
}
=20
@@ -1071,7 +1118,8 @@ retry:
* If we would have faulted, release mmap_sem,
* fault it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
=20
ret =3D futex_get_user(&dummy, uaddr2, futex64);
if (ret)
@@ -1108,7 +1156,8 @@ retry:
if (hb1 !=3D hb2)
spin_unlock(&hb2->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
return ret;
}
=20
@@ -1118,7 +1167,8 @@ out:
*/
static int
futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uadd=
r2,
- int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64=
)
+ int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64=
,
+ struct rw_semaphore *shared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -1127,12 +1177,13 @@ futex_requeue(unsigned long __user *uadd
int ret, drop_count =3D 0;
=20
retry:
- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, &key1, shared);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, &key2, shared);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1155,7 +1206,8 @@ futex_requeue(unsigned long __user *uadd
* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
=20
ret =3D futex_get_user(&curval, uaddr1, futex64);
=20
@@ -1208,7 +1260,8 @@ out_unlock:
drop_futex_key_refs(&key1);
=20
out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
return ret;
}
=20
@@ -1392,7 +1445,8 @@ static int fixup_pi_state_owner(unsigned
=20
static long futex_wait_restart(struct restart_block *restart);
static int futex_wait(unsigned long __user *uaddr, unsigned long val,
- ktime_t *abs_time, int futex64)
+ ktime_t *abs_time, int futex64,
+ struct rw_semaphore *shared)
{
struct task_struct *curr =3D current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1405,9 +1459,10 @@ static int futex_wait(unsigned long __us
=20
q.pi_state =3D NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
=20
- ret =3D get_futex_key(uaddr, &q.key);
+ ret =3D get_futex_key(uaddr, &q.key, shared);
if (unlikely(ret !=3D 0))
goto out_release_sem;
=20
@@ -1430,8 +1485,8 @@ static int futex_wait(unsigned long __us
* a wakeup when *uaddr !=3D val on entry to the syscall. This is
* rare, but normal.
*
- * We hold the mmap semaphore, so the mapping cannot have changed
- * since we looked it up in get_futex_key.
+ * for shared futexes, we hold the mmap semaphore, so the mapping
+ * cannot have changed since we looked it up in get_futex_key.
*/
ret =3D get_futex_value_locked(&uval, uaddr, futex64);
=20
@@ -1442,7 +1497,8 @@ static int futex_wait(unsigned long __us
* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
ret =3D futex_get_user(&uval, uaddr, futex64);
=20
if (!ret)
@@ -1468,7 +1524,8 @@ static int futex_wait(unsigned long __us
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
=20
/*
* There might have been scheduling since the queue_me(), as we
@@ -1568,7 +1625,8 @@ static int futex_wait(unsigned long __us
}
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
}
=20
debug_rt_mutex_free_waiter(&q.waiter);
@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
restart->arg1 =3D val;
restart->arg2 =3D (unsigned long)abs_time;
restart->arg3 =3D (unsigned long)futex64;
+ if (shared)
+ restart->arg3 |=3D 2;
return -ERESTART_RESTARTBLOCK;
}
=20
@@ -1605,7 +1665,8 @@ static int futex_wait(unsigned long __us
queue_unlock(&q, hb);
=20
out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
return ret;
}
=20
@@ -1615,10 +1676,11 @@ static long futex_wait_restart(struct re
unsigned long __user *uaddr =3D (unsigned long __user *)restart->arg0=
;
unsigned long val =3D restart->arg1;
ktime_t *abs_time =3D (ktime_t *)restart->arg2;
- int futex64 =3D (int)restart->arg3;
+ int futex64 =3D (int)restart->arg3 & 1 ;
+ struct rw_semaphore *shared =3D (restart->arg3 & 2) ? &current->mm->m=
map_sem : NULL;
=20
restart->fn =3D do_no_restart_syscall;
- return (long)futex_wait(uaddr, val, abs_time, futex64);
+ return (long)futex_wait(uaddr, val, abs_time, futex64, shared);
}
=20
=20
@@ -1674,7 +1736,7 @@ static void set_pi_futex_owner(struct fu
* races the kernel might see a 0 value of the futex too.)
*/
static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktim=
e_t *time,
- int trylock, int futex64)
+ int trylock, int futex64, struct rw_semaphore *shared)
{
struct hrtimer_sleeper timeout, *to =3D NULL;
struct task_struct *curr =3D current;
@@ -1695,9 +1757,10 @@ static int futex_lock_pi(unsigned long _
=20
q.pi_state =3D NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
=20
- ret =3D get_futex_key(uaddr, &q.key);
+ ret =3D get_futex_key(uaddr, &q.key, shared);
if (unlikely(ret !=3D 0))
goto out_release_sem;
=20
@@ -1818,7 +1881,8 @@ static int futex_lock_pi(unsigned long _
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
=20
WARN_ON(!q.pi_state);
/*
@@ -1832,7 +1896,8 @@ static int futex_lock_pi(unsigned long _
ret =3D ret ? 0 : -EWOULDBLOCK;
}
=20
- down_read(&curr->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
spin_lock(q.lock_ptr);
=20
/*
@@ -1854,7 +1919,8 @@ static int futex_lock_pi(unsigned long _
}
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
}
=20
if (!detect && ret =3D=3D -EDEADLK && 0)
@@ -1866,7 +1932,8 @@ static int futex_lock_pi(unsigned long _
queue_unlock(&q, hb);
=20
out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
return ret;
=20
uaddr_faulted:
@@ -1877,15 +1944,15 @@ static int futex_lock_pi(unsigned long _
* still holding the mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr, attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr, attempt, shared);
+ if (ret)
goto out_unlock_release_sem;
- }
goto retry_locked;
}
=20
queue_unlock(&q, hb);
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
=20
ret =3D futex_get_user(&uval, uaddr, futex64);
if (!ret && (uval !=3D -EFAULT))
@@ -1899,7 +1966,8 @@ static int futex_lock_pi(unsigned long _
* This is the in-kernel slowpath: we look up the PI state (if any),
* and do the rt-mutex unlock.
*/
-static int futex_unlock_pi(unsigned long __user *uaddr, int futex64)
+static int futex_unlock_pi(unsigned long __user *uaddr, int futex64,
+ struct rw_semaphore *shared)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -1919,9 +1987,10 @@ retry:
/*
* First take all the futex related locks:
*/
- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(shared);
=20
- ret =3D get_futex_key(uaddr, &key);
+ ret =3D get_futex_key(uaddr, &key, shared);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1980,7 +2049,8 @@ retry_locked:
out_unlock:
spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
=20
return ret;
=20
@@ -1992,15 +2062,15 @@ pi_faulted:
* still holding the mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr, attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr, attempt, shared);
+ if (ret)
goto out_unlock;
- }
goto retry_locked;
}
=20
spin_unlock(&hb->lock);
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(shared);
=20
ret =3D futex_get_user(&uval, uaddr, futex64);
if (!ret && (uval !=3D -EFAULT))
@@ -2052,6 +2122,7 @@ static int futex_fd(u32 __user *uaddr, i
struct futex_q *q;
struct file *filp;
int ret, err;
+ struct rw_semaphore *shared;
static unsigned long printk_interval;
=20
if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2093,11 +2164,12 @@ static int futex_fd(u32 __user *uaddr, i
}
q->pi_state =3D NULL;
=20
- down_read(&current->mm->mmap_sem);
- err =3D get_futex_key(uaddr, &q->key);
+ shared =3D &current->mm->mmap_sem;
+ down_read(shared);
+ err =3D get_futex_key(uaddr, &q->key, shared);
=20
if (unlikely(err !=3D 0)) {
- up_read(&current->mm->mmap_sem);
+ up_read(shared);
kfree(q);
goto error;
}
@@ -2109,7 +2181,7 @@ static int futex_fd(u32 __user *uaddr, i
filp->private_data =3D q;
=20
queue_me(q, ret, filp);
- up_read(&current->mm->mmap_sem);
+ up_read(shared);
=20
/* Now we map fd to filp, so userspace can access it */
fd_install(ret, filp);
@@ -2238,7 +2310,8 @@ retry:
*/
if (!pi) {
if (uval & FUTEX_WAITERS)
- futex_wake((unsigned long __user *)uaddr, 1);
+ futex_wake((unsigned long __user *)uaddr, 1,
+ &curr->mm->mmap_sem);
}
}
return 0;
@@ -2326,13 +2399,18 @@ long do_futex(unsigned long __user *uadd
unsigned long val2, unsigned long val3, int fut64)
{
int ret;
+ int opm =3D op & FUTEX_CMD_MASK;
+ struct rw_semaphore *shared =3D NULL;
+
+ if (!(op & FUTEX_PRIVATE_FLAG))
+ shared =3D &current->mm->mmap_sem;
=20
- switch (op) {
+ switch (opm) {
case FUTEX_WAIT:
- ret =3D futex_wait(uaddr, val, timeout, fut64);
+ ret =3D futex_wait(uaddr, val, timeout, fut64, shared);
break;
case FUTEX_WAKE:
- ret =3D futex_wake(uaddr, val);
+ ret =3D futex_wake(uaddr, val, shared);
break;
case FUTEX_FD:
if (fut64)
@@ -2342,25 +2420,25 @@ long do_futex(unsigned long __user *uadd
ret =3D futex_fd((u32 __user *)uaddr, val);
break;
case FUTEX_REQUEUE:
- ret =3D futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64);
+ ret =3D futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64, shared)=
;
break;
case FUTEX_CMP_REQUEUE:
- ret =3D futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64);
+ ret =3D futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64, shared=
);
break;
case FUTEX_WAKE_OP:
- ret =3D futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64);
+ ret =3D futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64, shared)=
;
break;
case FUTEX_LOCK_PI:
- ret =3D futex_lock_pi(uaddr, val, timeout, 0, fut64);
+ ret =3D futex_lock_pi(uaddr, val, timeout, 0, fut64, shared);
break;
case FUTEX_UNLOCK_PI:
- ret =3D futex_unlock_pi(uaddr, fut64);
+ ret =3D futex_unlock_pi(uaddr, fut64, shared);
break;
case FUTEX_TRYLOCK_PI:
- ret =3D futex_lock_pi(uaddr, 0, timeout, 1, fut64);
+ ret =3D futex_lock_pi(uaddr, 0, timeout, 1, fut64, shared);
break;
case FUTEX_CMP_REQUEUE_PI:
- ret =3D futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64);
+ ret =3D futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64, sha=
red);
break;
default:
ret =3D -ENOSYS;
@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
struct timespec ts;
ktime_t t, *tp =3D NULL;
u64 val2 =3D 0;
+ int opm =3D op & FUTEX_CMD_MASK;
=20
- if (utime && (op =3D=3D FUTEX_WAIT || op =3D=3D FUTEX_LOCK_PI)) {
+ if (utime && (opm =3D=3D FUTEX_WAIT || opm =3D=3D FUTEX_LOCK_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) !=3D 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
=20
t =3D timespec_to_ktime(ts);
- if (op =3D=3D FUTEX_WAIT)
+ if (opm =3D=3D FUTEX_WAIT)
t =3D ktime_add(ktime_get(), t);
tp =3D &t;
}
/*
- * requeue parameter in 'utime' if op =3D=3D FUTEX_REQUEUE.
+ * requeue parameter in 'utime' if opm =3D=3D FUTEX_REQUEUE.
*/
- if (op =3D=3D FUTEX_REQUEUE || op =3D=3D FUTEX_CMP_REQUEUE
- || op =3D=3D FUTEX_CMP_REQUEUE_PI)
+ if (opm =3D=3D FUTEX_REQUEUE || opm =3D=3D FUTEX_CMP_REQUEUE
+ || opm =3D=3D FUTEX_CMP_REQUEUE_PI)
val2 =3D (unsigned long) utime;
=20
return do_futex((unsigned long __user*)uaddr, op, val, tp,
@@ -2409,23 +2488,24 @@ asmlinkage long sys_futex(u32 __user *ua
struct timespec ts;
ktime_t t, *tp =3D NULL;
u32 val2 =3D 0;
+ int opm =3D op & FUTEX_CMD_MASK;
=20
- if (utime && (op =3D=3D FUTEX_WAIT || op =3D=3D FUTEX_LOCK_PI)) {
+ if (utime && (opm =3D=3D FUTEX_WAIT || opm =3D=3D FUTEX_LOCK_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) !=3D 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
=20
t =3D timespec_to_ktime(ts);
- if (op =3D=3D FUTEX_WAIT)
+ if (opm =3D=3D FUTEX_WAIT)
t =3D ktime_add(ktime_get(), t);
tp =3D &t;
}
/*
* requeue parameter in 'utime' if op =3D=3D FUTEX_REQUEUE.
*/
- if (op =3D=3D FUTEX_REQUEUE || op =3D=3D FUTEX_CMP_REQUEUE
- || op =3D=3D FUTEX_CMP_REQUEUE_PI)
+ if (opm =3D=3D FUTEX_REQUEUE || opm =3D=3D FUTEX_CMP_REQUEUE
+ || opm =3D=3D FUTEX_CMP_REQUEUE_PI)
val2 =3D (u32) (unsigned long) utime;
=20
return do_futex((unsigned long __user*)uaddr, op, val, tp,
Ulrich Drepper
2007-04-05 20:43:09 UTC
Permalink
Post by Eric Dumazet
---
include/linux/futex.h | 45 +++++-
kernel/futex.c | 294 +++++++++++++++++++++++++---------------
2 files changed, 230 insertions(+), 109 deletions(-)
I cannot vouch for the whole code but the concept is very sound and
definitely along the way the specification allows it:

Acked-by: Ulrich Drepper <***@redhat.com>
Nick Piggin
2007-04-06 01:19:05 UTC
Permalink
Hi Eric,

Thanks for doing this... It's looking good, I just have some minor
Post by Eric Dumazet
--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
*
+ * PRIVATE futexes by Eric Dumazet
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -199,9 +202,12 @@ static inline int match_futex(union fute
* Returns: 0, or negative error code.
* The key words are stored in *key on success.
*
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * shared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
*/
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+ struct rw_semaphore *shared)
Can we pass in something other than the rw_semaphore here? Seeing as
it only actually gets used as a flag, it might be nicer just to pass
a 0 or 1? And all through the call stack...

Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current->mm->mmap_sem, so it doesn't seem like
a boolean flag would hurt?
Post by Eric Dumazet
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
address -= key->both.offset;
/*
+ * PROCESS_PRIVATE futexes are fast.
+ * As the mm cannot disappear under us and the 'key' only needs
+ * virtual address, we dont even have to find the underlying vma.
+ * Note : We do have to check 'address' is a valid user address,
+ * but access_ok() should be faster than find_vma()
+ * Note : At this point, address points to the start of page,
+ * not the real futex address, this is ok.
+ */
+ if (!shared) {
+ if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+ return -EFAULT;
Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?
Post by Eric Dumazet
+ key->private.mm = mm;
+ key->private.address = address;
+ return 0;
+ }
+ /*
* The futex is hashed differently depending on whether
* it's in a shared or private mapping. So check vma first.
*/
@@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un
* mappings of _writable_ handles.
*/
if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+ key->both.offset += FUT_OFF_MMSHARED; /* reference taken on mm */
key->private.mm = mm;
key->private.address = address;
return 0;
@@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un
* Linear file mappings are also simple.
*/
key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ key->both.offset += FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
+ vma->vm_pgoff);
I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)
Post by Eric Dumazet
@@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
* Take a reference to the resource addressed by a key.
* Can be called while holding spinlocks.
*
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all. mmap_sem keeps key->shared.inode valid.
*/
inline void get_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr != 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr == 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
atomic_inc(&key->shared.inode->i_count);
- else
+ break;
atomic_inc(&key->private.mm->mm_count);
- }
+ break;
+ }
}
EXPORT_SYMBOL_GPL(get_futex_key_refs);
@@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
*/
void drop_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr != 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr == 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
iput(key->shared.inode);
- else
+ break;
mmdrop(key->private.mm);
+ break;
}
}
EXPORT_SYMBOL_GPL(drop_futex_key_refs);
I wonder if it would be worthwhile inlining and likley()ing the
private fastpath? Might make it pretty compact... I guess that's
something to worry about after glibc gets support.
Post by Eric Dumazet
@@ -339,28 +368,40 @@ get_futex_value_locked(unsigned long *de
}
/*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if shared is non NULL, current->mm->mmap_sem is already held
*/
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address, int attempt,
+ struct rw_semaphore *shared)
{
struct vm_area_struct * vma;
struct mm_struct *mm = current->mm;
+ int ret = 0;
- if (attempt > 2 || !(vma = find_vma(mm, address)) ||
- vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+ if (attempt > 2)
return -EFAULT;
- switch (handle_mm_fault(mm, vma, address, 1)) {
- current->min_flt++;
- break;
- current->maj_flt++;
- break;
- return -EFAULT;
- }
- return 0;
+ if (!shared)
+ down_read(&mm->mmap_sem);
+
+ if (!(vma = find_vma(mm, address)) ||
+ vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+ ret = -EFAULT;
+
+ else
+ switch (handle_mm_fault(mm, vma, address, 1)) {
+ current->min_flt++;
+ break;
+ current->maj_flt++;
+ break;
+ ret = -EFAULT;
+ }
+ if (!shared)
+ up_read(&mm->mmap_sem);
+ return ret;
}
/*
You've got an extra space after the if (maybe for clarity?). In this
situation I prefer putting braces around both the if and the else, and
if you get rid of that blank line, it doesn't cost you anything more ;)
Post by Eric Dumazet
@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
restart->arg1 = val;
restart->arg2 = (unsigned long)abs_time;
restart->arg3 = (unsigned long)futex64;
+ if (shared)
+ restart->arg3 |= 2;
Could you make this into a proper flags argument and use #define CONSTANTs for it?
Post by Eric Dumazet
@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
struct timespec ts;
ktime_t t, *tp = NULL;
u64 val2 = 0;
+ int opm = op & FUTEX_CMD_MASK;
What's opm stand for?
Post by Eric Dumazet
- if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+ if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) {
--
SUSE Labs, Novell Inc.
Eric Dumazet
2007-04-06 05:53:08 UTC
Permalink
Post by Nick Piggin
Hi Eric,
=20
Thanks for doing this... It's looking good, I just have some minor
Hi Nick, thanks for reviewing.
Post by Nick Piggin
=20
Post by Eric Dumazet
*/
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+ struct rw_semaphore *shared)
=20
Can we pass in something other than the rw_semaphore here? Seeing as
it only actually gets used as a flag, it might be nicer just to pass
a 0 or 1? And all through the call stack...
=20
Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current->mm->mmap_sem, so it doesn't seem like
a boolean flag would hurt?
That's a good question

current->mm->mmap_sem being calculated once is a win in itself, because=
=20
current access is not cheap.
It also does the memory access to go through part of the chain in advan=
ce,=20
before its use. It does a prefetch() equivalent for free : If current->=
mm is=20
not in CPU cache, CPU wont stall because next instructions dont depend =
on it.

This means less CPU stall in case current->mm is not in CPU cache. That=
s=20
difficult to benchmark it, but you can trust me.

A flag means :

if (flag)
up_read(&current->mm->mmap_sem)

This generates quite a bad code.

if (ptr)
up_read(ptr)

generates *much* better code.

So this is a cleanup and a runtime optimization.

I dit a similar optimization on commit 163da958ba5282cbf85e8b3dc08e4f51=
f8b01c5e

I invite you to check it :

http://git.kernel.org/?p=3Dlinux/kernel/git/torvalds/linux-2.6.git;a=3D=
commit;h=3D163da958ba5282cbf85e8b3dc08e4f51f8b01c5e
Post by Nick Piggin
=20
Post by Eric Dumazet
{
unsigned long address =3D (unsigned long)uaddr;
struct mm_struct *mm =3D current->mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
address -=3D key->both.offset;
=20
/*
+ * PROCESS_PRIVATE futexes are fast.
+ * As the mm cannot disappear under us and the 'key' only needs
+ * virtual address, we dont even have to find the underlying vm=
a.
Post by Nick Piggin
Post by Eric Dumazet
+ * Note : We do have to check 'address' is a valid user address=
,
Post by Nick Piggin
Post by Eric Dumazet
+ * but access_ok() should be faster than find_vma()
+ * Note : At this point, address points to the start of page,
+ * not the real futex address, this is ok.
+ */
+ if (!shared) {
+ if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+ return -EFAULT;
=20
Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly,=
it
Post by Nick Piggin
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?
If you check again, you'll see that address points to the start of the =
PAGE,=20
not the real u32/u64 futex address. This checks the PAGE. We can use ch=
ar,=20
short, int, long, or char[PAGE_SIZE] as long as we know a futex cannot =
span=20
two pages.
Post by Nick Piggin
Post by Eric Dumazet
*/
key->shared.inode =3D vma->vm_file->f_path.dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based ke=
y. */
Post by Nick Piggin
Post by Eric Dumazet
+ key->both.offset +=3D FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff =3D (((address - vma->vm_start) >> PAGE_S=
HIFT)
Post by Nick Piggin
Post by Eric Dumazet
+ vma->vm_pgoff);
=20
I like |=3D for adding flags, it seems less ambiguous. But I guess th=
at's
Post by Nick Piggin
a matter of opinion. Hugh seems to like +=3D, and I can't argue with =
him
Post by Nick Piggin
about style issues ;)
Previous code was doing offset++ wich means offset +=3D 1;
I didnt want to hurt Hugh :)
Post by Nick Piggin
Post by Eric Dumazet
EXPORT_SYMBOL_GPL(drop_futex_key_refs);
=20
I wonder if it would be worthwhile inlining and likley()ing the
private fastpath? Might make it pretty compact... I guess that's
something to worry about after glibc gets support.
Yes, in a future patch, in about one year :)
Post by Nick Piggin
Post by Eric Dumazet
+
+ if (!(vma =3D find_vma(mm, address)) ||
+ vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+ ret =3D -EFAULT;
+
+ else
+ switch (handle_mm_fault(mm, vma, address, 1)) {
+ current->min_flt++;
+ break;
+ current->maj_flt++;
+ break;
+ ret =3D -EFAULT;
+ }
+ if (!shared)
+ up_read(&mm->mmap_sem);
+ return ret;
}
=20
/*
=20
You've got an extra space after the if (maybe for clarity?). In this
situation I prefer putting braces around both the if and the else, an=
d
Post by Nick Piggin
if you get rid of that blank line, it doesn't cost you anything more =
;)

Oh well...
Post by Nick Piggin
=20
Post by Eric Dumazet
@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
restart->arg1 =3D val;
restart->arg2 =3D (unsigned long)abs_time;
restart->arg3 =3D (unsigned long)futex64;
+ if (shared)
+ restart->arg3 |=3D 2;
=20
Could you make this into a proper flags argument and use #define=20
CONSTANTs for it?
Yes, but I'm not sure it will improve readability.
Post by Nick Piggin
=20
Post by Eric Dumazet
@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
struct timespec ts;
ktime_t t, *tp =3D NULL;
u64 val2 =3D 0;
+ int opm =3D op & FUTEX_CMD_MASK;
=20
What's opm stand for?
I guess 'm' stands for 'mask' or 'masked' ?

Thank you
Nick Piggin
2007-04-06 11:50:16 UTC
Permalink
Post by Eric Dumazet
Post by Nick Piggin
Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current->mm->mmap_sem, so it doesn't seem like
a boolean flag would hurt?
=20
=20
That's a good question
=20
current->mm->mmap_sem being calculated once is a win in itself, becau=
se=20
Post by Eric Dumazet
current access is not cheap.
It also does the memory access to go through part of the chain in=20
advance, before its use. It does a prefetch() equivalent for free : I=
f=20
Post by Eric Dumazet
current->mm is not in CPU cache, CPU wont stall because next=20
instructions dont depend on it.
=46air enough. Current access I think should be cheap though (it is
effectively a constant), but I guess it is still improvement.
Post by Eric Dumazet
Post by Nick Piggin
Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly=
, it
Post by Eric Dumazet
Post by Nick Piggin
should depend on the size of the operation. Maybe the access_ok chec=
k
Post by Eric Dumazet
Post by Nick Piggin
should go outside get_futex_key?
=20
=20
If you check again, you'll see that address points to the start of th=
e=20
Post by Eric Dumazet
PAGE, not the real u32/u64 futex address. This checks the PAGE. We ca=
n=20
Post by Eric Dumazet
use char, short, int, long, or char[PAGE_SIZE] as long as we know a=20
futex cannot span two pages.
Ah, that works.
Post by Eric Dumazet
Post by Nick Piggin
Post by Eric Dumazet
*/
key->shared.inode =3D vma->vm_file->f_path.dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based=20
key. */
+ key->both.offset +=3D FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff =3D (((address - vma->vm_start) >> PAGE_=
SHIFT)
Post by Eric Dumazet
Post by Nick Piggin
Post by Eric Dumazet
+ vma->vm_pgoff);
I like |=3D for adding flags, it seems less ambiguous. But I guess t=
hat's
Post by Eric Dumazet
Post by Nick Piggin
a matter of opinion. Hugh seems to like +=3D, and I can't argue with=
him
Post by Eric Dumazet
Post by Nick Piggin
about style issues ;)
=20
=20
=20
Previous code was doing offset++ wich means offset +=3D 1;
But it doesn't mean you have to ;)
Post by Eric Dumazet
Post by Nick Piggin
Post by Eric Dumazet
@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
restart->arg1 =3D val;
restart->arg2 =3D (unsigned long)abs_time;
restart->arg3 =3D (unsigned long)futex64;
+ if (shared)
+ restart->arg3 |=3D 2;
Could you make this into a proper flags argument and use #define=20
CONSTANTs for it?
=20
=20
Yes, but I'm not sure it will improve readability.
Well that bit of code alone is obviously unreadable.

restart->arg3 =3D 0;
if (futex64)
restart->arg3 |=3D FUTEX_64;
if (shared)
restart->arg3 |=3D FUTEX_SHARED;

Maybe a matter of taste.
Post by Eric Dumazet
=20
Post by Nick Piggin
Post by Eric Dumazet
@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
struct timespec ts;
ktime_t t, *tp =3D NULL;
u64 val2 =3D 0;
+ int opm =3D op & FUTEX_CMD_MASK;
What's opm stand for?
=20
=20
I guess 'm' stands for 'mask' or 'masked' ?
Why not call it cmd? (ie. what it is, rather than what you have done
to derive it).

--=20
SUSE Labs, Novell Inc.
Hugh Dickins
2007-04-06 06:05:22 UTC
Permalink
Post by Nick Piggin
I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=,
Do I? You probably have a shaming example in mind (PAGE_MAPPING_ANON?
that's a hybrid case where using + and - helped minimize the casting);
but in general I'd agree with you that it's |= for setting flag bits.

Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=
Post by Nick Piggin
and I can't argue with him about style issues ;)
I feel a warm glow. Nobody ever called me a style guru before.

Hugh

p.s. Please don't interpret this as any useful contribution to
reviewing Eric's futex work: seems sensible, but I've hardly looked.
Jan Engelhardt
2007-04-06 17:41:47 UTC
Permalink
Post by Hugh Dickins
Post by Nick Piggin
I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=,
Do I? You probably have a shaming example in mind (PAGE_MAPPING_ANON?
that's a hybrid case where using + and - helped minimize the casting);
but in general I'd agree with you that it's |= for setting flag bits.
Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=
If a bit is already set, you can't set it again using +=.


Jan
--
Peter Zijlstra
2007-04-06 12:26:09 UTC
Permalink
Hi,

some thoughts on shared futexes;

Could we get rid of the mmap_sem on the shared futexes in the following
manner:

- do a page table walk to find the pte;
- get a page using pfn_to_page (skipping VM_PFNMAP)
- get the futex key from page->mapping->host and page->index
and offset from addr % PAGE_SIZE.

or given a key:

- lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
possibly loading the page using mapping->a_ops->readpage().

then:

- perform the futex operation on a kmap of the page


This should all work except for VM_PFNMAP.

Since the address is passed from userspace we cannot trust it to not
point into a VM_PFNMAP area.

However, with the RCU VMA lookup patches I'm working on we could do that
check without holding locks and without exclusive cachelines; the
question is, is that good enough?

Or is there an alternative way of determining a pfnmap given a
pfn/struct page?
Hugh Dickins
2007-04-06 13:02:44 UTC
Permalink
Post by Peter Zijlstra
some thoughts on shared futexes;
Could we get rid of the mmap_sem on the shared futexes in the following
- do a page table walk to find the pte;
("walk" meaning descent down the levels, I presume, rather than across)

I've not had time to digest your proposal, and I'm about to go out:
let me sound a warning that springs to mind, maybe it's entirely
inapproriate, but better said than kept silent.

It looks as if you're supposing that mmap_sem is needed to find_vma,
but not for going down the pagetables. It's not a simple as that:
you need to be careful that a concurrent munmap from another thread
isn't freeing pagetables from under you.

Holding (down_read) of mmap_sem is one way to protect against that.
try_to_unmap doesn't have that luxury: in its case, it's made safe
by the way free_pgtables does anon_vma_unlink and unlink_file_vma
before freeing any pagetables, so try_to_unmap etc. won't get there;
but you can't do that.

Hugh
Post by Peter Zijlstra
- get a page using pfn_to_page (skipping VM_PFNMAP)
- get the futex key from page->mapping->host and page->index
and offset from addr % PAGE_SIZE.
- lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
possibly loading the page using mapping->a_ops->readpage().
- perform the futex operation on a kmap of the page
This should all work except for VM_PFNMAP.
Since the address is passed from userspace we cannot trust it to not
point into a VM_PFNMAP area.
However, with the RCU VMA lookup patches I'm working on we could do that
check without holding locks and without exclusive cachelines; the
question is, is that good enough?
Or is there an alternative way of determining a pfnmap given a
pfn/struct page?
Peter Zijlstra
2007-04-06 13:15:00 UTC
Permalink
Post by Hugh Dickins
Post by Peter Zijlstra
some thoughts on shared futexes;
Could we get rid of the mmap_sem on the shared futexes in the following
- do a page table walk to find the pte;
("walk" meaning descent down the levels, I presume, rather than across)
indeed.
Post by Hugh Dickins
let me sound a warning that springs to mind, maybe it's entirely
inapproriate, but better said than kept silent.
It looks as if you're supposing that mmap_sem is needed to find_vma,
you need to be careful that a concurrent munmap from another thread
isn't freeing pagetables from under you.
Holding (down_read) of mmap_sem is one way to protect against that.
try_to_unmap doesn't have that luxury: in its case, it's made safe
by the way free_pgtables does anon_vma_unlink and unlink_file_vma
before freeing any pagetables, so try_to_unmap etc. won't get there;
but you can't do that.
Ah, drad.

I had hoped that the pte_lock would solve that race, but that doesn't
cover the upper levels of the tree.

Back to the drawing board..
Nick Piggin
2007-04-06 13:15:33 UTC
Permalink
Post by Peter Zijlstra
some thoughts on shared futexes;
Could we get rid of the mmap_sem on the shared futexes in the following
I'd imagine shared futexes would be much less common than private for
threaded programs... I'd say we should reevaluate things once we have
private futexes, and malloc/free stop hammering mmap_sem so hard...
Post by Peter Zijlstra
- get a page using pfn_to_page (skipping VM_PFNMAP)
- get the futex key from page->mapping->host and page->index
and offset from addr % PAGE_SIZE.
- lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
possibly loading the page using mapping->a_ops->readpage().
For shared futexes, wouldn't i_mapping be worse, because you'd be
ping-ponging the tree_lock between processes, rather than have each
use their own mmap_sem?

That also only helps for the wakeup case too, doesn't it? You have
to use the vmas to find out which inode to use to do the wait, I think?
(unless you introduce a new shared futex API).
--
SUSE Labs, Novell Inc.
Peter Zijlstra
2007-04-06 13:22:44 UTC
Permalink
Post by Nick Piggin
Post by Peter Zijlstra
some thoughts on shared futexes;
Could we get rid of the mmap_sem on the shared futexes in the following
I'd imagine shared futexes would be much less common than private for
threaded programs... I'd say we should reevaluate things once we have
private futexes, and malloc/free stop hammering mmap_sem so hard...
Indeed, private futexes are by far the most common.
Post by Nick Piggin
Post by Peter Zijlstra
- get a page using pfn_to_page (skipping VM_PFNMAP)
- get the futex key from page->mapping->host and page->index
and offset from addr % PAGE_SIZE.
- lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
possibly loading the page using mapping->a_ops->readpage().
For shared futexes, wouldn't i_mapping be worse, because you'd be
ping-ponging the tree_lock between processes, rather than have each
use their own mmap_sem?
Your lockless pagecache work would solve most of that, no?
Post by Nick Piggin
That also only helps for the wakeup case too, doesn't it? You have
to use the vmas to find out which inode to use to do the wait, I think?
(unless you introduce a new shared futex API).
one could do something like this:

struct address_space *mapping = page_mapping(page);
if (!mapping || mapping == &swapper_space)
do_private_futex();
else
do_shared_futex(mapping->host, page->index, addr % PAGE_SIZE);


But alas, it seems I overlooked that the mmap_sem also protects the page
tables as pointed out by Hugh, so this is all in fain it seems.

A well.
Nick Piggin
2007-04-06 13:40:52 UTC
Permalink
Post by Peter Zijlstra
Post by Nick Piggin
Post by Peter Zijlstra
- lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
possibly loading the page using mapping->a_ops->readpage().
For shared futexes, wouldn't i_mapping be worse, because you'd be
ping-ponging the tree_lock between processes, rather than have each
use their own mmap_sem?
Your lockless pagecache work would solve most of that, no?
Yeah it would.
Post by Peter Zijlstra
Post by Nick Piggin
That also only helps for the wakeup case too, doesn't it? You have
to use the vmas to find out which inode to use to do the wait, I think?
(unless you introduce a new shared futex API).
struct address_space *mapping = page_mapping(page);
if (!mapping || mapping == &swapper_space)
do_private_futex();
else
do_shared_futex(mapping->host, page->index, addr % PAGE_SIZE);
But alas, it seems I overlooked that the mmap_sem also protects the page
tables as pointed out by Hugh, so this is all in fain it seems.
A well.
Well if it turned out to be a real problem, we _could_ introduce a new
futex API that takes a file descriptor+offset.

The old futex API is basically just a shorthand for choosing between
shared and private, depending on what the vma at the given address
happens to be.
--
SUSE Labs, Novell Inc.
Peter Zijlstra
2007-04-06 12:31:58 UTC
Permalink
Post by Eric Dumazet
Hi all
I'm pleased to present this patch which improves linux futexes performance and
scalability, merely avoiding taking mmap_sem rwlock.
Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)
Andrew, could we get this in mm as well ? This version is against 2.6.21-rc5-mm4
(so supports 64bit futexes)
In this third version I dropped the NUMA optims and process private hash table,
to let new API come in and be tested.
Good work, Thanks!

Acked-by: Peter Zijlstra <***@chello.nl>
Eric Dumazet
2007-04-07 08:43:39 UTC
Permalink
Hi all

Updates on this take4 :

- All remarks from Nick were addressed I hope

- Current mm code have a problem with 64bit futexes, as spoted by Nick =
:

get_futex_key() does a check against sizeof(u32) regardless of futex be=
ing 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct te=
st.

History :

I'm pleased to present this patch which improves linux futexes performa=
nce and=20
scalability, merely avoiding taking mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well ? This version is against 2.6.2=
1-rc5-mm4
(so supports 64bit futexes)

In this third version I dropped the NUMA optims and process private has=
h table,
to let new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of wai=
ting=20
threads.
Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot =
to=20
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :


1) - Obtain a read lock on mmap_sem to be able to validate the user poi=
nter
=A0 =A0 =A0(calling find_vma()). This validation tells us if the futex =
uses
=A0 =A0 =A0an inode based store (mapped file), or mm based store (anony=
mous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (rollback is value !=3D expected_value,=
returns EWOULDBLOCK)
=A0 =A0 =A0 =A0 (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

=A0 =A0<block>

8) Eventually unqueue the context (but rarely, as this part
=A0may be done by the futex_wake())

=46utexes were designed to improve scalability but current implementati=
on
has various problems :

- Central hashtable :
=A0This means scalability problems if many processes/threads want to us=
e
=A0futexes at the same time.
=A0This means NUMA unbalance because this hashtable is located on one n=
ode.

- Using mmap_sem on every futex() syscall :

=A0Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing =
atomic
=A0ops on mmap_sem, dirtying cache line :
=A0 =A0 =A0 =A0 - lot of cache line ping pongs on SMP configurations.

=A0mmap_sem is also extensively used by mm code (page faults, mmap()/mu=
nmap())
=A0Highly threaded processes might suffer from mmap_sem contention.

=A0mmap_sem is also used by oprofile code. Enabling oprofile hurts thre=
aded
programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref cou=
nter:
=A0It's also a cache line ping pong on SMP. It also increases mmap_sem =
hold time
=A0because of cache misses.

Most of these scalability problems come from the fact that futexes are =
in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation=
, and
optimal performance if carefuly implemented. Time has come for linux to=
have=20
better threading performance.

The goal is to permit new futex commands to avoid :
=A0- Taking the mmap_sem semaphore, conflicting with other subsystems.
=A0- Modifying a ref_count on mm or an inode, still conflicting with mm=
or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, =
no
matter the underlying mm storage is.



If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fallback on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old bina=
ries=20
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semanti=
c,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value !=3D *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
=46or reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
---
include/linux/futex.h | 46 +++++
kernel/futex.c | 340 ++++++++++++++++++++++++++--------------
2 files changed, 268 insertions(+), 118 deletions(-)

--- linux-2.6.21-rc5-mm4/include/linux/futex.h
+++ linux-2.6.21-rc5-mm4-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
#define FUTEX_TRYLOCK_PI 8
#define FUTEX_CMP_REQUEUE_PI 9
=20
+#define FUTEX_PRIVATE_FLAG 128
+#define FUTEX_CMD_MASK ~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_F=
LAG)
+#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLA=
G)
+
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
@@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st
* Don't rearrange members without looking at hash_futex().
*
* offset is aligned to a multiple of sizeof(u32) (=3D=3D 4) by defini=
tion.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of=
key :
+ * 00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ * (no reference on an inode or mm)
+ * 01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ * mapped on a file (reference on the underlying inode)
+ * 10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ * (but private mapping on an mm, and reference taken on it)
*/
+
+#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on i=
node */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on m=
m */
+
union futex_key {
unsigned long __user *uaddr;
struct {
@@ -135,8 +157,28 @@ union futex_key {
int offset;
} both;
};
-int get_futex_key(void __user *uaddr, union futex_key *key);
+
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @size: size of futex (4 or 8)
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ * &current->mm->mmap_sem for a PROCESS_SHARED futex
+ * @key: address where result is stored.
+ *
+ * Returns an error code or 0
+ */
+int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *s=
hared,
+ union futex_key *key);
+
+/**
+ * get_futex_key_refs - Take a reference to the resource addressed by =
a key
+ */
void get_futex_key_refs(union futex_key *key);
+
+/**
+ * drop_futex_key_refs - Drop a reference to the resource addressed by=
a key.
+ */
void drop_futex_key_refs(union futex_key *key);
=20
#ifdef CONFIG_FUTEX
--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
* Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <***@redhat.com>
* Copyright (C) 2006 Timesys Corp., Thomas Gleixner <***@timesys.co=
m>
*
+ * PRIVATE futexes by Eric Dumazet
+ * Copyright (C) 2007 Eric Dumazet <***@cosmosbay.com>
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -199,11 +202,14 @@ static inline int match_futex(union fute
* Returns: 0, or negative error code.
* The key words are stored in *key on success.
*
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * fshared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
*/
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *fuaddr, int fsize, struct rw_semaphore =
*fshared,
+ union futex_key *key)
{
- unsigned long address =3D (unsigned long)uaddr;
+ unsigned long address =3D (unsigned long)fuaddr;
struct mm_struct *mm =3D current->mm;
struct vm_area_struct *vma;
struct page *page;
@@ -213,11 +219,29 @@ int get_futex_key(void __user *uaddr, un
* The futex address must be "naturally" aligned.
*/
key->both.offset =3D address % PAGE_SIZE;
- if (unlikely((key->both.offset % sizeof(u32)) !=3D 0))
+ if (unlikely((key->both.offset & (fsize - 1)) !=3D 0))
return -EINVAL;
address -=3D key->both.offset;
=20
/*
+ * PROCESS_PRIVATE futexes are fast.
+ * As the mm cannot disappear under us and the 'key' only needs
+ * virtual address, we dont even have to find the underlying vma.
+ * Note : We do have to check 'address' is a valid user address,
+ * but access_ok() should be faster than find_vma()
+ */
+ if (!fshared) {
+ /*
+ * Note : At this point, address points to the start of page,
+ * not the real futex address, this is OK.
+ */
+ if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+ return -EFAULT;
+ key->private.mm =3D mm;
+ key->private.address =3D address;
+ return 0;
+ }
+ /*
* The futex is hashed differently depending on whether
* it's in a shared or private mapping. So check vma first.
*/
@@ -232,7 +256,7 @@ int get_futex_key(void __user *uaddr, un
return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
=20
/* Save the user address in the ley */
- key->uaddr =3D uaddr;
+ key->uaddr =3D fuaddr;
=20
/*
* Private mappings are handled in a simple way.
@@ -244,6 +268,7 @@ int get_futex_key(void __user *uaddr, un
* mappings of _writable_ handles.
*/
if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+ key->both.offset |=3D FUT_OFF_MMSHARED; /* reference taken on mm */
key->private.mm =3D mm;
key->private.address =3D address;
return 0;
@@ -253,7 +278,7 @@ int get_futex_key(void __user *uaddr, un
* Linear file mappings are also simple.
*/
key->shared.inode =3D vma->vm_file->f_path.dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ key->both.offset |=3D FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff =3D (((address - vma->vm_start) >> PAGE_SHIFT)
+ vma->vm_pgoff);
@@ -281,17 +306,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
* Take a reference to the resource addressed by a key.
* Can be called while holding spinlocks.
*
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling thi=
s
- * function, if it is called at all. mmap_sem keeps key->shared.inode=
valid.
*/
inline void get_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr !=3D 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr =3D=3D 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+ case FUT_OFF_INODE:
atomic_inc(&key->shared.inode->i_count);
- else
+ break;
+ case FUT_OFF_MMSHARED:
atomic_inc(&key->private.mm->mm_count);
- }
+ break;
+ }
}
EXPORT_SYMBOL_GPL(get_futex_key_refs);
=20
@@ -301,11 +328,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
*/
void drop_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr !=3D 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr =3D=3D 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+ case FUT_OFF_INODE:
iput(key->shared.inode);
- else
+ break;
+ case FUT_OFF_MMSHARED:
mmdrop(key->private.mm);
+ break;
}
}
EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -339,28 +370,38 @@ get_futex_value_locked(unsigned long *de
}
=20
/*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if fshared is non NULL, current->mm->mmap_sem is already held
*/
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address,
+ struct rw_semaphore *fshared, int attempt)
{
struct vm_area_struct * vma;
struct mm_struct *mm =3D current->mm;
+ int ret =3D -EFAULT;
=20
- if (attempt > 2 || !(vma =3D find_vma(mm, address)) ||
- vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
- return -EFAULT;
-
- switch (handle_mm_fault(mm, vma, address, 1)) {
- case VM_FAULT_MINOR:
- current->min_flt++;
- break;
- case VM_FAULT_MAJOR:
- current->maj_flt++;
- break;
- default:
- return -EFAULT;
+ if (attempt > 2)
+ return ret;
+ if (!fshared)
+ down_read(&mm->mmap_sem);
+ vma =3D find_vma(mm, address);
+ if (vma &&
+ address >=3D vma->vm_start &&
+ (vma->vm_flags & VM_WRITE)) {
+ switch (handle_mm_fault(mm, vma, address, 1)) {
+ case VM_FAULT_MINOR:
+ ret =3D 0;
+ current->min_flt++;
+ break;
+ case VM_FAULT_MAJOR:
+ ret =3D 0;
+ current->maj_flt++;
+ break;
+ }
}
- return 0;
+ if (!fshared)
+ up_read(&mm->mmap_sem);
+ return ret;
}
=20
/*
@@ -702,20 +743,22 @@ double_lock_hb(struct futex_hash_bucket=20
}
=20
/*
- * Wake up all waiters hashed on the physical page that is mapped
- * to this virtual address:
+ * Wake up all waiters on a futex (fuaddr, futex64, fshared)
*/
-static int futex_wake(unsigned long __user *uaddr, int nr_wake)
+static int futex_wake(unsigned long __user *fuaddr, int futex64,
+ struct rw_semaphore *fshared, int nr_wake)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
struct plist_head *head;
union futex_key key;
int ret;
+ int fsize =3D futex64 ? sizeof(u64) : sizeof(u32);
=20
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &key);
+ ret =3D get_futex_key(fuaddr, fsize, fshared, &key);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -737,7 +780,8 @@ static int futex_wake(unsigned long __us
=20
spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -807,7 +851,8 @@ retry:
*/
static int
futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *u=
addr2,
- int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+ int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+ struct rw_semaphore *fshared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -817,6 +862,7 @@ futex_requeue_pi(unsigned long __user *u
struct rt_mutex_waiter *waiter, *top_waiter =3D NULL;
struct rt_mutex *lock2 =3D NULL;
int ret, drop_count =3D 0;
+ int fsize =3D futex64 ? sizeof(u64) : sizeof(u32);
=20
if (refill_pi_state_cache())
return -ENOMEM;
@@ -825,12 +871,13 @@ retry:
/*
* First take all the futex related locks:
*/
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, fsize, fshared, &key1);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, fsize, fshared, &key2);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -998,21 +1045,24 @@ out:
*/
static int
futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uadd=
r2,
- int nr_wake, int nr_wake2, int op, int futex64)
+ int nr_wake, int nr_wake2, int op, int futex64,
+ struct rw_semaphore *fshared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret, attempt =3D 0;
+ int fsize =3D futex64 ? sizeof(u64) : sizeof(u32);
=20
retryfull:
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, fsize, fshared, &key1);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, fsize, fshared, &key2);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1055,15 +1105,14 @@ retry:
* futex_atomic_op_inuser needs to both read and write
* *(int __user *)uaddr2, but we can't modify it
* non-atomically. Therefore, if get_user below is not
- * enough, we need to handle the fault ourselves, while
- * still holding the mmap_sem.
+ * enough, we need to handle the fault ourselves. Make=20
+ * sure we hold mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr2,
- attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr2,
+ fshared, attempt);
+ if (ret)
goto out;
- }
goto retry;
}
=20
@@ -1071,7 +1120,8 @@ retry:
* If we would have faulted, release mmap_sem,
* fault it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D futex_get_user(&dummy, uaddr2, futex64);
if (ret)
@@ -1108,7 +1158,8 @@ retry:
if (hb1 !=3D hb2)
spin_unlock(&hb2->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -1118,21 +1169,23 @@ out:
*/
static int
futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uadd=
r2,
- int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64=
)
+ int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64=
,
+ struct rw_semaphore *fshared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
struct plist_head *head1;
struct futex_q *this, *next;
int ret, drop_count =3D 0;
-
+ int fsize =3D futex64 ? sizeof(u64) : sizeof(u32);
retry:
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, fsize, fshared, &key1);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, fsize, fshared, &key2);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1155,7 +1208,8 @@ futex_requeue(unsigned long __user *uadd
* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D futex_get_user(&curval, uaddr1, futex64);
=20
@@ -1208,7 +1262,8 @@ out_unlock:
drop_futex_key_refs(&key1);
=20
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -1390,9 +1445,17 @@ static int fixup_pi_state_owner(unsigned
return ret;
}
=20
+/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'arg3' both futex64 and shared capabilities
+ */
+#define ARG3_FUTEX64 1
+#define ARG3_SHARED 2
+
static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait(unsigned long __user *uaddr, unsigned long val,
- ktime_t *abs_time, int futex64)
+static int futex_wait(unsigned long __user *uaddr, int futex64,
+ struct rw_semaphore *fshared,
+ unsigned long val, ktime_t *abs_time)
{
struct task_struct *curr =3D current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1402,12 +1465,14 @@ static int futex_wait(unsigned long __us
int ret;
struct hrtimer_sleeper t, *to =3D NULL;
int rem =3D 0;
+ int fsize =3D futex64 ? sizeof(u64) : sizeof(u32);
=20
q.pi_state =3D NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &q.key);
+ ret =3D get_futex_key(uaddr, fsize, fshared, &q.key);
if (unlikely(ret !=3D 0))
goto out_release_sem;
=20
@@ -1430,8 +1495,8 @@ static int futex_wait(unsigned long __us
* a wakeup when *uaddr !=3D val on entry to the syscall. This is
* rare, but normal.
*
- * We hold the mmap semaphore, so the mapping cannot have changed
- * since we looked it up in get_futex_key.
+ * for shared futexes, we hold the mmap semaphore, so the mapping
+ * cannot have changed since we looked it up in get_futex_key.
*/
ret =3D get_futex_value_locked(&uval, uaddr, futex64);
=20
@@ -1442,7 +1507,8 @@ static int futex_wait(unsigned long __us
* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
ret =3D futex_get_user(&uval, uaddr, futex64);
=20
if (!ret)
@@ -1468,7 +1534,8 @@ static int futex_wait(unsigned long __us
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
/*
* There might have been scheduling since the queue_me(), as we
@@ -1568,7 +1635,8 @@ static int futex_wait(unsigned long __us
}
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
}
=20
debug_rt_mutex_free_waiter(&q.waiter);
@@ -1597,7 +1665,13 @@ static int futex_wait(unsigned long __us
restart->arg0 =3D (unsigned long)uaddr;
restart->arg1 =3D val;
restart->arg2 =3D (unsigned long)abs_time;
- restart->arg3 =3D (unsigned long)futex64;
+ restart->arg3 =3D 0;
+#ifdef CONFIG_64BIT
+ if (futex64)
+ restart->arg3 |=3D ARG3_FUTEX64;
+#endif
+ if (fshared)
+ restart->arg3 |=3D ARG3_SHARED;
return -ERESTART_RESTARTBLOCK;
}
=20
@@ -1605,7 +1679,8 @@ static int futex_wait(unsigned long __us
queue_unlock(&q, hb);
=20
out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -1615,10 +1690,17 @@ static long futex_wait_restart(struct re
unsigned long __user *uaddr =3D (unsigned long __user *)restart->arg0=
;
unsigned long val =3D restart->arg1;
ktime_t *abs_time =3D (ktime_t *)restart->arg2;
- int futex64 =3D (int)restart->arg3;
+ int futex64 =3D 0;
+ struct rw_semaphore *fshared =3D NULL;
=20
+#ifdef CONFIG_64BIT
+ if (restart->arg3 & ARG3_FUTEX64)
+ futex64 =3D 1;
+#endif
+ if (restart->arg3 & ARG3_SHARED)
+ fshared =3D &current->mm->mmap_sem;
restart->fn =3D do_no_restart_syscall;
- return (long)futex_wait(uaddr, val, abs_time, futex64);
+ return (long)futex_wait(uaddr, futex64, fshared, val, abs_time);
}
=20
=20
@@ -1674,7 +1756,7 @@ static void set_pi_futex_owner(struct fu
* races the kernel might see a 0 value of the futex too.)
*/
static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktim=
e_t *time,
- int trylock, int futex64)
+ int trylock, int futex64, struct rw_semaphore *fshared)
{
struct hrtimer_sleeper timeout, *to =3D NULL;
struct task_struct *curr =3D current;
@@ -1682,6 +1764,7 @@ static int futex_lock_pi(unsigned long _
unsigned long uval, newval, curval;
struct futex_q q;
int ret, lock_held, attempt =3D 0;
+ int fsize =3D futex64 ? sizeof(u64) : sizeof(u32);
=20
if (refill_pi_state_cache())
return -ENOMEM;
@@ -1695,9 +1778,10 @@ static int futex_lock_pi(unsigned long _
=20
q.pi_state =3D NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &q.key);
+ ret =3D get_futex_key(uaddr, fsize, fshared, &q.key);
if (unlikely(ret !=3D 0))
goto out_release_sem;
=20
@@ -1818,7 +1902,8 @@ static int futex_lock_pi(unsigned long _
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
WARN_ON(!q.pi_state);
/*
@@ -1832,7 +1917,8 @@ static int futex_lock_pi(unsigned long _
ret =3D ret ? 0 : -EWOULDBLOCK;
}
=20
- down_read(&curr->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
spin_lock(q.lock_ptr);
=20
/*
@@ -1854,7 +1940,8 @@ static int futex_lock_pi(unsigned long _
}
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
}
=20
if (!detect && ret =3D=3D -EDEADLK && 0)
@@ -1866,7 +1953,8 @@ static int futex_lock_pi(unsigned long _
queue_unlock(&q, hb);
=20
out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
=20
uaddr_faulted:
@@ -1877,15 +1965,16 @@ static int futex_lock_pi(unsigned long _
* still holding the mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr, attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr, fshared,
+ attempt);
+ if (ret)
goto out_unlock_release_sem;
- }
goto retry_locked;
}
=20
queue_unlock(&q, hb);
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D futex_get_user(&uval, uaddr, futex64);
if (!ret && (uval !=3D -EFAULT))
@@ -1899,7 +1988,8 @@ static int futex_lock_pi(unsigned long _
* This is the in-kernel slowpath: we look up the PI state (if any),
* and do the rt-mutex unlock.
*/
-static int futex_unlock_pi(unsigned long __user *uaddr, int futex64)
+static int futex_unlock_pi(unsigned long __user *uaddr, int futex64,
+ struct rw_semaphore *fshared)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -1907,6 +1997,7 @@ static int futex_unlock_pi(unsigned long
struct plist_head *head;
union futex_key key;
int ret, attempt =3D 0;
+ int fsize =3D futex64 ? sizeof(u64) : sizeof(u32);
=20
retry:
if (futex_get_user(&uval, uaddr, futex64))
@@ -1919,9 +2010,10 @@ retry:
/*
* First take all the futex related locks:
*/
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &key);
+ ret =3D get_futex_key(uaddr, fsize, fshared, &key);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1980,7 +2072,8 @@ retry_locked:
out_unlock:
spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
return ret;
=20
@@ -1992,15 +2085,16 @@ pi_faulted:
* still holding the mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr, attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr, fshared,
+ attempt);
+ if (ret)
goto out_unlock;
- }
goto retry_locked;
}
=20
spin_unlock(&hb->lock);
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D futex_get_user(&uval, uaddr, futex64);
if (!ret && (uval !=3D -EFAULT))
@@ -2052,6 +2146,7 @@ static int futex_fd(u32 __user *uaddr, i
struct futex_q *q;
struct file *filp;
int ret, err;
+ struct rw_semaphore *fshared;
static unsigned long printk_interval;
=20
if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2093,11 +2188,12 @@ static int futex_fd(u32 __user *uaddr, i
}
q->pi_state =3D NULL;
=20
- down_read(&current->mm->mmap_sem);
- err =3D get_futex_key(uaddr, &q->key);
+ fshared =3D &current->mm->mmap_sem;
+ down_read(fshared);
+ err =3D get_futex_key(uaddr, sizeof(u32), fshared, &q->key);
=20
if (unlikely(err !=3D 0)) {
- up_read(&current->mm->mmap_sem);
+ up_read(fshared);
kfree(q);
goto error;
}
@@ -2109,7 +2205,7 @@ static int futex_fd(u32 __user *uaddr, i
filp->private_data =3D q;
=20
queue_me(q, ret, filp);
- up_read(&current->mm->mmap_sem);
+ up_read(fshared);
=20
/* Now we map fd to filp, so userspace can access it */
fd_install(ret, filp);
@@ -2238,7 +2334,8 @@ retry:
*/
if (!pi) {
if (uval & FUTEX_WAITERS)
- futex_wake((unsigned long __user *)uaddr, 1);
+ futex_wake((unsigned long __user *)uaddr,
+ 0, &curr->mm->mmap_sem, 1);
}
}
return 0;
@@ -2326,13 +2423,18 @@ long do_futex(unsigned long __user *uadd
unsigned long val2, unsigned long val3, int fut64)
{
int ret;
+ int cmd =3D op & FUTEX_CMD_MASK;
+ struct rw_semaphore *fshared =3D NULL;
+
+ if (!(op & FUTEX_PRIVATE_FLAG))
+ fshared =3D &current->mm->mmap_sem;
=20
- switch (op) {
+ switch (cmd) {
case FUTEX_WAIT:
- ret =3D futex_wait(uaddr, val, timeout, fut64);
+ ret =3D futex_wait(uaddr, fut64, fshared, val, timeout);
break;
case FUTEX_WAKE:
- ret =3D futex_wake(uaddr, val);
+ ret =3D futex_wake(uaddr, fut64, fshared, val);
break;
case FUTEX_FD:
if (fut64)
@@ -2342,25 +2444,29 @@ long do_futex(unsigned long __user *uadd
ret =3D futex_fd((u32 __user *)uaddr, val);
break;
case FUTEX_REQUEUE:
- ret =3D futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64);
+ ret =3D futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64,
+ fshared);
break;
case FUTEX_CMP_REQUEUE:
- ret =3D futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64);
+ ret =3D futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64,
+ fshared);
break;
case FUTEX_WAKE_OP:
- ret =3D futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64);
+ ret =3D futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64,
+ fshared);
break;
case FUTEX_LOCK_PI:
- ret =3D futex_lock_pi(uaddr, val, timeout, 0, fut64);
+ ret =3D futex_lock_pi(uaddr, val, timeout, 0, fut64, fshared);
break;
case FUTEX_UNLOCK_PI:
- ret =3D futex_unlock_pi(uaddr, fut64);
+ ret =3D futex_unlock_pi(uaddr, fut64, fshared);
break;
case FUTEX_TRYLOCK_PI:
- ret =3D futex_lock_pi(uaddr, 0, timeout, 1, fut64);
+ ret =3D futex_lock_pi(uaddr, 0, timeout, 1, fut64, fshared);
break;
case FUTEX_CMP_REQUEUE_PI:
- ret =3D futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64);
+ ret =3D futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64,
+ fshared);
break;
default:
ret =3D -ENOSYS;
@@ -2377,23 +2483,24 @@ sys_futex64(u64 __user *uaddr, int op, u
struct timespec ts;
ktime_t t, *tp =3D NULL;
u64 val2 =3D 0;
+ cmd opm =3D op & FUTEX_CMD_MASK;
=20
- if (utime && (op =3D=3D FUTEX_WAIT || op =3D=3D FUTEX_LOCK_PI)) {
+ if (utime && (cmd =3D=3D FUTEX_WAIT || cmd =3D=3D FUTEX_LOCK_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) !=3D 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
=20
t =3D timespec_to_ktime(ts);
- if (op =3D=3D FUTEX_WAIT)
+ if (cmd =3D=3D FUTEX_WAIT)
t =3D ktime_add(ktime_get(), t);
tp =3D &t;
}
/*
- * requeue parameter in 'utime' if op =3D=3D FUTEX_REQUEUE.
+ * requeue parameter in 'utime' if cmd =3D=3D FUTEX_REQUEUE.
*/
- if (op =3D=3D FUTEX_REQUEUE || op =3D=3D FUTEX_CMP_REQUEUE
- || op =3D=3D FUTEX_CMP_REQUEUE_PI)
+ if (cmd =3D=3D FUTEX_REQUEUE || cmd =3D=3D FUTEX_CMP_REQUEUE
+ || cmd =3D=3D FUTEX_CMP_REQUEUE_PI)
val2 =3D (unsigned long) utime;
=20
return do_futex((unsigned long __user*)uaddr, op, val, tp,
@@ -2409,23 +2516,24 @@ asmlinkage long sys_futex(u32 __user *ua
struct timespec ts;
ktime_t t, *tp =3D NULL;
u32 val2 =3D 0;
+ int cmd =3D op & FUTEX_CMD_MASK;
=20
- if (utime && (op =3D=3D FUTEX_WAIT || op =3D=3D FUTEX_LOCK_PI)) {
+ if (utime && (cmd =3D=3D FUTEX_WAIT || cmd =3D=3D FUTEX_LOCK_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) !=3D 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
=20
t =3D timespec_to_ktime(ts);
- if (op =3D=3D FUTEX_WAIT)
+ if (cmd =3D=3D FUTEX_WAIT)
t =3D ktime_add(ktime_get(), t);
tp =3D &t;
}
/*
* requeue parameter in 'utime' if op =3D=3D FUTEX_REQUEUE.
*/
- if (op =3D=3D FUTEX_REQUEUE || op =3D=3D FUTEX_CMP_REQUEUE
- || op =3D=3D FUTEX_CMP_REQUEUE_PI)
+ if (cmd =3D=3D FUTEX_REQUEUE || cmd =3D=3D FUTEX_CMP_REQUEUE
+ || cmd =3D=3D FUTEX_CMP_REQUEUE_PI)
val2 =3D (u32) (unsigned long) utime;
=20
return do_futex((unsigned long __user*)uaddr, op, val, tp,
Nick Piggin
2007-04-07 09:30:14 UTC
Permalink
Post by Eric Dumazet
Hi all
- All remarks from Nick were addressed I hope
Yeah looks very nice. Thanks for doing that.
Post by Eric Dumazet
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.
I wonder if it should be encfocing alignment to keep in on 1 page?

Otherwise,
Acked-by: Nick Piggin <***@suse.de>

I'll be away for a couple of days, but I'll look at running some performance
tests when I get back.

Thanks,
Nick
--
SUSE Labs, Novell Inc.
Eric Dumazet
2007-04-07 10:00:51 UTC
Permalink
On Sat, 07 Apr 2007 19:30:14 +1000
Post by Nick Piggin
Post by Eric Dumazet
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.
I wonder if it should be encfocing alignment to keep in on 1 page?
I believe I just did that :)

Before the patch :

Alignment was only 4 bytes for all futexes, but some user app could trigger a kernel bug (since one 64bit futex could sit on two different pages, so possible separate vmas, so the inode refcounting was wrong, and access_ok did not a correct check)

After the patch :

Alignment is 8 bytes for 64 bit futexes, 4 bytes for 32bit futexes.
All futexes are contrained to be in one single page.
Nick Piggin
2007-04-11 09:23:26 UTC
Permalink
On Wed, 11 Apr 2007 17:22:57 +1000
Post by Eric Dumazet
On Sat, 07 Apr 2007 19:30:14 +1000
Post by Nick Piggin
Post by Eric Dumazet
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.
I wonder if it should be encfocing alignment to keep in on 1 page?
I believe I just did that :)
Yes :P What I was trying to say before jumping on a plane is that
sys_futex/sys_futex64 calls should each check their own address alignment, so
the deeper parts of the call stack always know alignment is correct.
This will remove all the fsize you pass around, and also sanitise the userspace
argument much higher in the call stack, which is very preferable and more
conventional.
Maybe this isn't possible (it's very obvious, so there may be a good reason it
hasn't been done).
I had this idea as well, but considering get_futex_key() is exported in include/linux/futex.h, I believe some out-of tree thing is using it.
You're changing the API anyway, so just get rid of the declaration from futex.h
(that doesn't actually make for an export anyway, unless it is inline).

But... that isn't there in mainline. Why is it in -mm? At any rate, that makes
it a no brainer to change.
As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you : If we want to maximize performance, we could say : The check *must* be done by the caller.
Well we _control_ the API, so let's make it as clean and performant as possible
from the start.
--
SUSE Labs, Novell Inc.
Nick Piggin
2007-04-11 09:39:02 UTC
Permalink
=20
Post by Nick Piggin
But... that isn't there in mainline. Why is it in -mm?
=20
=20
This was introduced by lguest code....
I did not follow exaclty why.
OK, that's no problem, then it can remain exported but we just
have to document and audit that callers must pass in an aligned
address. We can also BUG_ON(address & ~PAGE_MASK); to handle
the security aspect.

--=20
SUSE Labs, Novell Inc.
Nick Piggin
2007-04-11 09:40:29 UTC
Permalink
Post by Nick Piggin
=20
Post by Nick Piggin
But... that isn't there in mainline. Why is it in -mm?
This was introduced by lguest code....
I did not follow exaclty why.
=20
=20
OK, that's no problem, then it can remain exported but we just
have to document and audit that callers must pass in an aligned
address. We can also BUG_ON(address & ~PAGE_MASK); to handle
the security aspect.
Err, duh no we can't because we don't know the size, which is the
whole point :P

Anyway.... just ensure callers have to fix alignment.

--=20
SUSE Labs, Novell Inc.
Nick Piggin
2007-04-11 07:22:57 UTC
Permalink
Post by Eric Dumazet
On Sat, 07 Apr 2007 19:30:14 +1000
Post by Nick Piggin
Post by Eric Dumazet
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.
I wonder if it should be encfocing alignment to keep in on 1 page?
I believe I just did that :)
Yes :P What I was trying to say before jumping on a plane is that
sys_futex/sys_futex64 calls should each check their own address alignment, so
the deeper parts of the call stack always know alignment is correct.

This will remove all the fsize you pass around, and also sanitise the userspace
argument much higher in the call stack, which is very preferable and more
conventional.

Maybe this isn't possible (it's very obvious, so there may be a good reason it
hasn't been done).
--
SUSE Labs, Novell Inc.
Eric Dumazet
2007-04-11 08:14:58 UTC
Permalink
On Wed, 11 Apr 2007 17:22:57 +1000
Post by Eric Dumazet
On Sat, 07 Apr 2007 19:30:14 +1000
Post by Nick Piggin
Post by Eric Dumazet
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.
I wonder if it should be encfocing alignment to keep in on 1 page?
I believe I just did that :)
Yes :P What I was trying to say before jumping on a plane is that
sys_futex/sys_futex64 calls should each check their own address alignment, so
the deeper parts of the call stack always know alignment is correct.
This will remove all the fsize you pass around, and also sanitise the userspace
argument much higher in the call stack, which is very preferable and more
conventional.
Maybe this isn't possible (it's very obvious, so there may be a good reason it
hasn't been done).
I had this idea as well, but considering get_futex_key() is exported in include/linux/futex.h, I believe some out-of tree thing is using it.

As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you : If we want to maximize performance, we could say : The check *must* be done by the caller.
Jakub Jelinek
2007-04-07 11:18:49 UTC
Permalink
Post by Eric Dumazet
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
That would be a user bug. 32-bit futexes have to be 32-bit aligned, 64-bit
futexes have to be 64-bit aligned.

Jakub
Eric Dumazet
2007-04-07 11:54:30 UTC
Permalink
get_futex_key() does a check against sizeof(u32) regardless of futex=
being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
=20
That would be a user bug. 32-bit futexes have to be 32-bit aligned, =
64-bit
futexes have to be 64-bit aligned.
I am not sure what you want to say.

User doing sys_futex64(0x......FFC, FUTEX_WAKE_OP, ...) and crashing ke=
rnel or=20
corrupting data is ok because its a user bug ?


User is allowed to do anything, kernel must check and protect innocents=
=2E
Ulrich Drepper
2007-04-07 16:40:06 UTC
Permalink
Post by Eric Dumazet
I am not sure what you want to say.
What Jakub meant is that it is OK for the kernel to reject using
unaligned 64-bit futexes. Just return an error in all cases (not just
in some).
Andrew Morton
2007-04-07 22:15:56 UTC
Permalink
Post by Eric Dumazet
Hi all
- All remarks from Nick were addressed I hope
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.
Cold we please have that in a separate patch? It's logically a part of the
64-bit-futex work, is it not?
Post by Eric Dumazet
+
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * &current->mm->mmap_sem for a PROCESS_SHARED futex
+ *
+ * Returns an error code or 0
+ */
+int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *shared,
+ union futex_key *key);
Thanks for documenting the interface, but please do it in the .c file at
the function's definition site.
Eric Dumazet
2007-04-10 09:21:22 UTC
Permalink
On Sat, 7 Apr 2007 15:15:56 -0700
Post by Andrew Morton
Post by Eric Dumazet
get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.
Cold we please have that in a separate patch? It's logically a part of the
64-bit-futex work, is it not?
Yes you probably want this patch to fix 64bit futexes support.

[PATCH] get_futex_key() must check proper alignement for 64bit futexes

get_futex_key() does an alignment check against sizeof(u32) regardless of futex
being 64bits or not.

So it is possible a 64bit futex spans two pages of memory, and some
malicious user code can trigger data corruption.

We must add a 'fsize' parameter to get_futex_key(), telling it the size of the
futex (4 or 8 bytes)

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
---
include/linux/futex.h | 2 -
kernel/futex.c | 78 ++++++++++++++++++++++++++--------------
2 files changed, 52 insertions(+), 28 deletions(-)

--- linux-2.6.21-rc6-mm1/kernel/futex.c
+++ linux-2.6.21-rc6-mm1-ed/kernel/futex.c
@@ -189,19 +189,22 @@ static inline int match_futex(union fute
&& key1->both.offset == key2->both.offset);
}

-/*
- * Get parameters which are the keys for a futex.
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @size: size of futex (4 or 8)
+ * @key: address where result is stored.
+ *
+ * Returns a negative error code or 0
+ * The key words are stored in *key on success.
*
* For shared mappings, it's (page->index, vma->vm_file->f_path.dentry->d_inode,
* offset_within_page). For private mappings, it's (uaddr, current->mm).
* We can usually work out the index without swapping in the page.
*
- * Returns: 0, or negative error code.
- * The key words are stored in *key on success.
- *
* Should be called with &current->mm->mmap_sem but NOT any spinlocks.
*/
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, int size, union futex_key *key)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
@@ -213,7 +216,7 @@ int get_futex_key(void __user *uaddr, un
* The futex address must be "naturally" aligned.
*/
key->both.offset = address % PAGE_SIZE;
- if (unlikely((key->both.offset % sizeof(u32)) != 0))
+ if (unlikely((key->both.offset & (size - 1)) != 0))
return -EINVAL;
address -= key->both.offset;

@@ -705,17 +708,18 @@ double_lock_hb(struct futex_hash_bucket
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(unsigned long __user *uaddr, int nr_wake)
+static int futex_wake(unsigned long __user *uaddr, int futex64, int nr_wake)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
struct plist_head *head;
union futex_key key;
int ret;
+ int fsize = futex64 ? sizeof(u64) : sizeof(u32);

down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr, &key);
+ ret = get_futex_key(uaddr, fsize, &key);
if (unlikely(ret != 0))
goto out;

@@ -817,6 +821,7 @@ futex_requeue_pi(unsigned long __user *u
struct rt_mutex_waiter *waiter, *top_waiter = NULL;
struct rt_mutex *lock2 = NULL;
int ret, drop_count = 0;
+ int fsize = futex64 ? sizeof(u64) : sizeof(u32);

if (refill_pi_state_cache())
return -ENOMEM;
@@ -827,10 +832,10 @@ retry:
*/
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr1, &key1);
+ ret = get_futex_key(uaddr1, fsize, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, &key2);
+ ret = get_futex_key(uaddr2, fsize, &key2);
if (unlikely(ret != 0))
goto out;

@@ -1005,14 +1010,15 @@ futex_wake_op(unsigned long __user *uadd
struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret, attempt = 0;
+ int fsize = futex64 ? sizeof(u64) : sizeof(u32);

retryfull:
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr1, &key1);
+ ret = get_futex_key(uaddr1, fsize, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, &key2);
+ ret = get_futex_key(uaddr2, fsize, &key2);
if (unlikely(ret != 0))
goto out;

@@ -1125,14 +1131,15 @@ futex_requeue(unsigned long __user *uadd
struct plist_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
+ int fsize = futex64 ? sizeof(u64) : sizeof(u32);

retry:
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr1, &key1);
+ ret = get_futex_key(uaddr1, fsize, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, &key2);
+ ret = get_futex_key(uaddr2, fsize, &key2);
if (unlikely(ret != 0))
goto out;

@@ -1390,9 +1397,15 @@ static int fixup_pi_state_owner(unsigned
return ret;
}

+/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'arg3' futex64 capability
+ */
+#define ARG3_FUTEX64 1
+
static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait(unsigned long __user *uaddr, unsigned long val,
- ktime_t *abs_time, int futex64)
+static int futex_wait(unsigned long __user *uaddr, int futex64,
+ unsigned long val, ktime_t *abs_time)
{
struct task_struct *curr = current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1402,12 +1415,13 @@ static int futex_wait(unsigned long __us
int ret;
struct hrtimer_sleeper t, *to = NULL;
int rem = 0;
+ int fsize = futex64 ? sizeof(u64) : sizeof(u32);

q.pi_state = NULL;
retry:
down_read(&curr->mm->mmap_sem);

- ret = get_futex_key(uaddr, &q.key);
+ ret = get_futex_key(uaddr, fsize, &q.key);
if (unlikely(ret != 0))
goto out_release_sem;

@@ -1597,7 +1611,11 @@ static int futex_wait(unsigned long __us
restart->arg0 = (unsigned long)uaddr;
restart->arg1 = val;
restart->arg2 = (unsigned long)abs_time;
- restart->arg3 = (unsigned long)futex64;
+ restart->arg3 = 0;
+#ifdef CONFIG_64BIT
+ if (futex64)
+ restart->arg3 |= ARG3_FUTEX64;
+#endif
return -ERESTART_RESTARTBLOCK;
}

@@ -1615,10 +1633,14 @@ static long futex_wait_restart(struct re
unsigned long __user *uaddr = (unsigned long __user *)restart->arg0;
unsigned long val = restart->arg1;
ktime_t *abs_time = (ktime_t *)restart->arg2;
- int futex64 = (int)restart->arg3;
+ int futex64 = 0;

+#ifdef CONFIG_64BIT
+ if (restart->arg3 & ARG3_FUTEX64)
+ futex64 = 1;
+#endif
restart->fn = do_no_restart_syscall;
- return (long)futex_wait(uaddr, val, abs_time, futex64);
+ return (long)futex_wait(uaddr, futex64, val, abs_time);
}


@@ -1682,6 +1704,7 @@ static int futex_lock_pi(unsigned long _
unsigned long uval, newval, curval;
struct futex_q q;
int ret, lock_held, attempt = 0;
+ int fsize = futex64 ? sizeof(u64) : sizeof(u32);

if (refill_pi_state_cache())
return -ENOMEM;
@@ -1697,7 +1720,7 @@ static int futex_lock_pi(unsigned long _
retry:
down_read(&curr->mm->mmap_sem);

- ret = get_futex_key(uaddr, &q.key);
+ ret = get_futex_key(uaddr, fsize, &q.key);
if (unlikely(ret != 0))
goto out_release_sem;

@@ -1907,6 +1930,7 @@ static int futex_unlock_pi(unsigned long
struct plist_head *head;
union futex_key key;
int ret, attempt = 0;
+ int fsize = futex64 ? sizeof(u64) : sizeof(u32);

retry:
if (futex_get_user(&uval, uaddr, futex64))
@@ -1921,7 +1945,7 @@ retry:
*/
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr, &key);
+ ret = get_futex_key(uaddr, fsize, &key);
if (unlikely(ret != 0))
goto out;

@@ -2094,7 +2118,7 @@ static int futex_fd(u32 __user *uaddr, i
q->pi_state = NULL;

down_read(&current->mm->mmap_sem);
- err = get_futex_key(uaddr, &q->key);
+ err = get_futex_key(uaddr, sizeof(u32), &q->key);

if (unlikely(err != 0)) {
up_read(&current->mm->mmap_sem);
@@ -2238,7 +2262,7 @@ retry:
*/
if (!pi) {
if (uval & FUTEX_WAITERS)
- futex_wake((unsigned long __user *)uaddr, 1);
+ futex_wake((unsigned long __user *)uaddr, 0, 1);
}
}
return 0;
@@ -2329,10 +2353,10 @@ long do_futex(unsigned long __user *uadd

switch (op) {
case FUTEX_WAIT:
- ret = futex_wait(uaddr, val, timeout, fut64);
+ ret = futex_wait(uaddr, fut64, val, timeout);
break;
case FUTEX_WAKE:
- ret = futex_wake(uaddr, val);
+ ret = futex_wake(uaddr, fut64, val);
break;
case FUTEX_FD:
if (fut64)
--- linux-2.6.21-rc6-mm1/include/linux/futex.h
+++ linux-2.6.21-rc6-mm1-ed/include/linux/futex.h
@@ -135,7 +135,7 @@ union futex_key {
int offset;
} both;
};
-int get_futex_key(void __user *uaddr, union futex_key *key);
+int get_futex_key(void __user *uaddr, int size, union futex_key *key);
void get_futex_key_refs(union futex_key *key);
void drop_futex_key_refs(union futex_key *key);
Eric Dumazet
2007-04-26 12:55:21 UTC
Permalink
Hi Andrew

Not sure if you prefer to wait Pierre work on futex64, so just in case,=
I prepared this patch.

Update on this take6 :

- Rebased on linux-2.6.21-rc7-mm2 , since futex64 were droped from mm


Pierre, I can resubmit another patch on top on your next patch, so plea=
se do as you prefer (ignoring or not this patch)

Thank you

History :

take5 :

- Rebased on linux-2.6.21-rc6-mm1 + get_futex_key() must check proper a=
lignement for 64bit futexes
- compile test on x86_64 (one minor typo)
- Added Rusty in CC since he may have to change drivers/lguest/io.c aga=
in, since get_futex_key() have yet another parameter (fshared). (I coul=
dnt find this file in 2.6.21-rc6-mm1 tree)


take4 :

- All remarks from Nick were addressed I hope

- Current mm code have a problem with 64bit futexes, as spoted by Nick =
:

get_futex_key() does a check against sizeof(u32) regardless of futex be=
ing 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct te=
st.

take3:

I'm pleased to present this patch which improves linux futexes performa=
nce and=20
scalability, merely avoiding taking mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

In this third version I dropped the NUMA optims and process private has=
h table,
to let new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of wai=
ting=20
threads.
Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot =
to=20
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :


1) - Obtain a read lock on mmap_sem to be able to validate the user poi=
nter
=A0 =A0 =A0(calling find_vma()). This validation tells us if the futex =
uses
=A0 =A0 =A0an inode based store (mapped file), or mm based store (anony=
mous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 (rollback is value !=3D expected_value,=
returns EWOULDBLOCK)
=A0 =A0 =A0 =A0 (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

=A0 =A0<block>

8) Eventually unqueue the context (but rarely, as this part
=A0may be done by the futex_wake())

=46utexes were designed to improve scalability but current implementati=
on
has various problems :

- Central hashtable :
=A0This means scalability problems if many processes/threads want to us=
e
=A0futexes at the same time.
=A0This means NUMA unbalance because this hashtable is located on one n=
ode.

- Using mmap_sem on every futex() syscall :

=A0Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing =
atomic
=A0ops on mmap_sem, dirtying cache line :
=A0 =A0 =A0 =A0 - lot of cache line ping pongs on SMP configurations.

=A0mmap_sem is also extensively used by mm code (page faults, mmap()/mu=
nmap())
=A0Highly threaded processes might suffer from mmap_sem contention.

=A0mmap_sem is also used by oprofile code. Enabling oprofile hurts thre=
aded
programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref cou=
nter:
=A0It's also a cache line ping pong on SMP. It also increases mmap_sem =
hold time
=A0because of cache misses.

Most of these scalability problems come from the fact that futexes are =
in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation=
, and
optimal performance if carefuly implemented. Time has come for linux to=
have=20
better threading performance.

The goal is to permit new futex commands to avoid :
=A0- Taking the mmap_sem semaphore, conflicting with other subsystems.
=A0- Modifying a ref_count on mm or an inode, still conflicting with mm=
or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, =
no
matter the underlying mm storage is.



If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fallback on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old bina=
ries=20
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semanti=
c,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value !=3D *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
=46or reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
---
include/linux/futex.h | 29 +++
kernel/futex.c | 324 +++++++++++++++++++++++++---------------
2 files changed, 236 insertions(+), 117 deletions(-)

--- linux-2.6.21-rc7-mm2/include/linux/futex.h
+++ linux-2.6.21-rc7-mm2-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
#define FUTEX_TRYLOCK_PI 8
#define FUTEX_CMP_REQUEUE_PI 9
=20
+#define FUTEX_PRIVATE_FLAG 128
+#define FUTEX_CMD_MASK ~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_F=
LAG)
+#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLA=
G)
+
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
@@ -114,8 +126,18 @@ handle_futex_death(u32 __user *uaddr, st
* Don't rearrange members without looking at hash_futex().
*
* offset is aligned to a multiple of sizeof(u32) (=3D=3D 4) by defini=
tion.
- * We set bit 0 to indicate if it's an inode-based key.
- */
+ * We use the two low order bits of offset to tell what is the kind of=
key :
+ * 00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ * (no reference on an inode or mm)
+ * 01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ * mapped on a file (reference on the underlying inode)
+ * 10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ * (but private mapping on an mm, and reference taken on it)
+*/
+
+#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on i=
node */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on m=
m */
+
union futex_key {
u32 __user *uaddr;
struct {
@@ -134,7 +156,8 @@ union futex_key {
int offset;
} both;
};
-int get_futex_key(u32 __user *uaddr, union futex_key *key);
+int get_futex_key(u32 __user *uaddr, struct rw_semaphore *shared,
+ union futex_key *key);
void get_futex_key_refs(union futex_key *key);
void drop_futex_key_refs(union futex_key *key);
=20
--- linux-2.6.21-rc7-mm2/kernel/futex.c
+++ linux-2.6.21-rc7-mm2-ed/kernel/futex.c
@@ -16,6 +16,9 @@
* Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <***@redhat.com>
* Copyright (C) 2006 Timesys Corp., Thomas Gleixner <***@timesys.co=
m>
*
+ * PRIVATE futexes by Eric Dumazet
+ * Copyright (C) 2007 Eric Dumazet <***@cosmosbay.com>
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -150,19 +153,26 @@ static inline int match_futex(union fute
&& key1->both.offset =3D=3D key2->both.offset);
}
=20
-/*
- * Get parameters which are the keys for a futex.
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ * &current->mm->mmap_sem for a PROCESS_SHARED futex
+ * @key: address where result is stored.
+ *
+ * Returns a negative error code or 0
+ * The key words are stored in *key on success.
*
* For shared mappings, it's (page->index, vma->vm_file->f_path.dentry=
->d_inode,
* offset_within_page). For private mappings, it's (uaddr, current->m=
m).
* We can usually work out the index without swapping in the page.
*
- * Returns: 0, or negative error code.
- * The key words are stored in *key on success.
- *
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * fshared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
*/
-int get_futex_key(u32 __user *uaddr, union futex_key *key)
+int get_futex_key(u32 __user *uaddr, struct rw_semaphore *fshared,
+ union futex_key *key)
{
unsigned long address =3D (unsigned long)uaddr;
struct mm_struct *mm =3D current->mm;
@@ -174,11 +184,25 @@ int get_futex_key(u32 __user *uaddr, uni
* The futex address must be "naturally" aligned.
*/
key->both.offset =3D address % PAGE_SIZE;
- if (unlikely((key->both.offset % sizeof(u32)) !=3D 0))
+ if (unlikely((address % sizeof(u32)) !=3D 0))
return -EINVAL;
address -=3D key->both.offset;
=20
/*
+ * PROCESS_PRIVATE futexes are fast.
+ * As the mm cannot disappear under us and the 'key' only needs
+ * virtual address, we dont even have to find the underlying vma.
+ * Note : We do have to check 'uaddr' is a valid user address,
+ * but access_ok() should be faster than find_vma()
+ */
+ if (!fshared) {
+ if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
+ return -EFAULT;
+ key->private.mm =3D mm;
+ key->private.address =3D address;
+ return 0;
+ }
+ /*
* The futex is hashed differently depending on whether
* it's in a shared or private mapping. So check vma first.
*/
@@ -205,6 +229,7 @@ int get_futex_key(u32 __user *uaddr, uni
* mappings of _writable_ handles.
*/
if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+ key->both.offset |=3D FUT_OFF_MMSHARED; /* reference taken on mm */
key->private.mm =3D mm;
key->private.address =3D address;
return 0;
@@ -214,7 +239,7 @@ int get_futex_key(u32 __user *uaddr, uni
* Linear file mappings are also simple.
*/
key->shared.inode =3D vma->vm_file->f_path.dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ key->both.offset |=3D FUT_OFF_INODE; /* inode-based key. */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff =3D (((address - vma->vm_start) >> PAGE_SHIFT)
+ vma->vm_pgoff);
@@ -242,16 +267,18 @@ EXPORT_SYMBOL_GPL(get_futex_key);
* Take a reference to the resource addressed by a key.
* Can be called while holding spinlocks.
*
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling thi=
s
- * function, if it is called at all. mmap_sem keeps key->shared.inode=
valid.
*/
inline void get_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr !=3D 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr =3D=3D 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+ case FUT_OFF_INODE:
atomic_inc(&key->shared.inode->i_count);
- else
+ break;
+ case FUT_OFF_MMSHARED:
atomic_inc(&key->private.mm->mm_count);
+ break;
}
}
EXPORT_SYMBOL_GPL(get_futex_key_refs);
@@ -262,11 +289,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
*/
void drop_futex_key_refs(union futex_key *key)
{
- if (key->both.ptr !=3D 0) {
- if (key->both.offset & 1)
+ if (key->both.ptr =3D=3D 0)
+ return;
+ switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+ case FUT_OFF_INODE:
iput(key->shared.inode);
- else
+ break;
+ case FUT_OFF_MMSHARED:
mmdrop(key->private.mm);
+ break;
}
}
EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -283,28 +314,38 @@ static inline int get_futex_value_locked
}
=20
/*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if fshared is non NULL, current->mm->mmap_sem is already held
*/
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address,
+ struct rw_semaphore *fshared, int attempt)
{
struct vm_area_struct * vma;
struct mm_struct *mm =3D current->mm;
+ int ret =3D -EFAULT;
=20
- if (attempt > 2 || !(vma =3D find_vma(mm, address)) ||
- vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
- return -EFAULT;
+ if (attempt > 2)
+ return ret;
=20
- switch (handle_mm_fault(mm, vma, address, 1)) {
- case VM_FAULT_MINOR:
- current->min_flt++;
- break;
- case VM_FAULT_MAJOR:
- current->maj_flt++;
- break;
- default:
- return -EFAULT;
+ if (!fshared)
+ down_read(&mm->mmap_sem);
+ vma =3D find_vma(mm, address);
+ if (vma && address >=3D vma->vm_start &&
+ (vma->vm_flags & VM_WRITE)) {
+ switch (handle_mm_fault(mm, vma, address, 1)) {
+ case VM_FAULT_MINOR:
+ ret =3D 0;
+ current->min_flt++;
+ break;
+ case VM_FAULT_MAJOR:
+ ret =3D 0;
+ current->maj_flt++;
+ break;
+ }
}
- return 0;
+ if (!fshared)
+ up_read(&mm->mmap_sem);
+ return ret;
}
=20
/*
@@ -647,7 +688,8 @@ double_lock_hb(struct futex_hash_bucket=20
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(u32 __user *uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, struct rw_semaphore *fshared,
+ int nr_wake)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -655,9 +697,10 @@ static int futex_wake(u32 __user *uaddr,
union futex_key key;
int ret;
=20
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &key);
+ ret =3D get_futex_key(uaddr, fshared, &key);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -679,7 +722,8 @@ static int futex_wake(u32 __user *uaddr,
=20
spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -746,7 +790,9 @@ retry:
* and requeue the next nr_requeue waiters following hashed on
* one physical page to another physical page (PI-futex uaddr2)
*/
-static int futex_requeue_pi(u32 __user *uaddr1, u32 __user *uaddr2,
+static int futex_requeue_pi(u32 __user *uaddr1,=20
+ struct rw_semaphore *fshared,
+ u32 __user *uaddr2,
int nr_wake, int nr_requeue, u32 *cmpval)
{
union futex_key key1, key2;
@@ -765,12 +811,13 @@ retry:
/*
* First take all the futex related locks:
*/
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, fshared, &key1);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, fshared, &key2);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -793,7 +840,8 @@ retry:
* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D get_user(curval, uaddr1);
=20
@@ -927,7 +975,8 @@ out_unlock:
drop_futex_key_refs(&key1);
=20
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -936,7 +985,8 @@ out:
* to this virtual address:
*/
static int
-futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
+futex_wake_op(u32 __user *uaddr1, struct rw_semaphore *fshared,
+ u32 __user *uaddr2,
int nr_wake, int nr_wake2, int op)
{
union futex_key key1, key2;
@@ -946,12 +996,13 @@ futex_wake_op(u32 __user *uaddr1, u32 __
int ret, op_ret, attempt =3D 0;
=20
retryfull:
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, fshared, &key1);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, fshared, &key2);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -991,11 +1042,10 @@ retry:
* still holding the mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr2,
- attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr2,
+ fshared, attempt);
+ if (ret)
goto out;
- }
goto retry;
}
=20
@@ -1003,7 +1053,8 @@ retry:
* If we would have faulted, release mmap_sem,
* fault it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D get_user(dummy, uaddr2);
if (ret)
@@ -1040,7 +1091,8 @@ retry:
if (hb1 !=3D hb2)
spin_unlock(&hb2->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -1048,7 +1100,8 @@ out:
* Requeue all waiters hashed on one physical page to another
* physical page.
*/
-static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
+static int futex_requeue(u32 __user *uaddr1, struct rw_semaphore *fsha=
red,
+ u32 __user *uaddr2,
int nr_wake, int nr_requeue, u32 *cmpval)
{
union futex_key key1, key2;
@@ -1058,12 +1111,13 @@ static int futex_requeue(u32 __user *uad
int ret, drop_count =3D 0;
=20
retry:
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr1, &key1);
+ ret =3D get_futex_key(uaddr1, fshared, &key1);
if (unlikely(ret !=3D 0))
goto out;
- ret =3D get_futex_key(uaddr2, &key2);
+ ret =3D get_futex_key(uaddr2, fshared, &key2);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1086,7 +1140,8 @@ static int futex_requeue(u32 __user *uad
* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D get_user(curval, uaddr1);
=20
@@ -1139,7 +1194,8 @@ out_unlock:
drop_futex_key_refs(&key1);
=20
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -1273,7 +1329,8 @@ static void unqueue_me_pi(struct futex_q
* The cur->mm semaphore must be held, it is released at return of th=
is
* function.
*/
-static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
+static int fixup_pi_state_owner(u32 __user *uaddr, struct rw_semaphore=
*fshared,
+ struct futex_q *q,
struct futex_hash_bucket *hb,
struct task_struct *curr)
{
@@ -1300,7 +1357,8 @@ static int fixup_pi_state_owner(u32 __us
=20
/* Unqueue and drop the lock */
unqueue_me_pi(q);
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
/*
* We own it, so we have to replace the pending owner
* TID. This must be atomic as we have preserve the
@@ -1321,8 +1379,15 @@ static int fixup_pi_state_owner(u32 __us
return ret;
}
=20
+/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'arg3' shared capability
+ */
+#define ARG3_SHARED 1
+
static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait(u32 __user *uaddr, u32 val, ktime_t *abs_time)
+static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
+ u32 val, ktime_t *abs_time)
{
struct task_struct *curr =3D current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1335,9 +1400,10 @@ static int futex_wait(u32 __user *uaddr,
=20
q.pi_state =3D NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &q.key);
+ ret =3D get_futex_key(uaddr, fshared, &q.key);
if (unlikely(ret !=3D 0))
goto out_release_sem;
=20
@@ -1360,8 +1426,8 @@ static int futex_wait(u32 __user *uaddr,
* a wakeup when *uaddr !=3D val on entry to the syscall. This is
* rare, but normal.
*
- * We hold the mmap semaphore, so the mapping cannot have changed
- * since we looked it up in get_futex_key.
+ * for shared futexes, we hold the mmap semaphore, so the mapping
+ * cannot have changed since we looked it up in get_futex_key.
*/
ret =3D get_futex_value_locked(&uval, uaddr);
=20
@@ -1372,7 +1438,8 @@ static int futex_wait(u32 __user *uaddr,
* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D get_user(uval, uaddr);
=20
@@ -1399,7 +1466,8 @@ static int futex_wait(u32 __user *uaddr,
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
/*
* There might have been scheduling since the queue_me(), as we
@@ -1469,7 +1537,8 @@ static int futex_wait(u32 __user *uaddr,
else
ret =3D rt_mutex_timed_lock(lock, to, 1);
=20
- down_read(&curr->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
spin_lock(q.lock_ptr);
=20
/*
@@ -1486,7 +1555,8 @@ static int futex_wait(u32 __user *uaddr,
=20
/* mmap_sem and hash_bucket lock are unlocked at
return of this function */
- ret =3D fixup_pi_state_owner(uaddr, &q, hb, curr);
+ ret =3D fixup_pi_state_owner(uaddr, fshared,
+ &q, hb, curr);
} else {
/*
* Catch the rare case, where the lock was released
@@ -1499,7 +1569,8 @@ static int futex_wait(u32 __user *uaddr,
}
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
}
=20
debug_rt_mutex_free_waiter(&q.waiter);
@@ -1528,6 +1599,9 @@ static int futex_wait(u32 __user *uaddr,
restart->arg0 =3D (unsigned long)uaddr;
restart->arg1 =3D (unsigned long)val;
restart->arg2 =3D (unsigned long)abs_time;
+ restart->arg3 =3D 0;
+ if (fshared)
+ restart->arg3 |=3D ARG3_SHARED;
return -ERESTART_RESTARTBLOCK;
}
=20
@@ -1535,7 +1609,8 @@ static int futex_wait(u32 __user *uaddr,
queue_unlock(&q, hb);
=20
out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
}
=20
@@ -1545,9 +1620,12 @@ static long futex_wait_restart(struct re
u32 __user *uaddr =3D (u32 __user *)restart->arg0;
u32 val =3D (u32)restart->arg1;
ktime_t *abs_time =3D (ktime_t *)restart->arg2;
+ struct rw_semaphore *fshared =3D NULL;
=20
restart->fn =3D do_no_restart_syscall;
- return (long)futex_wait(uaddr, val, abs_time);
+ if (restart->arg3 & ARG3_SHARED)
+ fshared =3D &current->mm->mmap_sem;
+ return (long)futex_wait(uaddr, fshared, val, abs_time);
}
=20
=20
@@ -1602,8 +1680,8 @@ static void set_pi_futex_owner(struct fu
* if there are waiters then it will block, it does PI, etc. (Due to
* races the kernel might see a 0 value of the futex too.)
*/
-static int futex_lock_pi(u32 __user *uaddr, int detect, ktime_t *time,
- int trylock)
+static int futex_lock_pi(u32 __user *uaddr, struct rw_semaphore *fshar=
ed,
+ int detect, ktime_t *time, int trylock)
{
struct hrtimer_sleeper timeout, *to =3D NULL;
struct task_struct *curr =3D current;
@@ -1624,9 +1702,10 @@ static int futex_lock_pi(u32 __user *uad
=20
q.pi_state =3D NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &q.key);
+ ret =3D get_futex_key(uaddr, fshared, &q.key);
if (unlikely(ret !=3D 0))
goto out_release_sem;
=20
@@ -1747,7 +1826,8 @@ static int futex_lock_pi(u32 __user *uad
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
WARN_ON(!q.pi_state);
/*
@@ -1761,7 +1841,8 @@ static int futex_lock_pi(u32 __user *uad
ret =3D ret ? 0 : -EWOULDBLOCK;
}
=20
- down_read(&curr->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
spin_lock(q.lock_ptr);
=20
/*
@@ -1770,7 +1851,7 @@ static int futex_lock_pi(u32 __user *uad
*/
if (!ret && q.pi_state->owner !=3D curr)
/* mmap_sem is unlocked at return of this function */
- ret =3D fixup_pi_state_owner(uaddr, &q, hb, curr);
+ ret =3D fixup_pi_state_owner(uaddr, fshared, &q, hb, curr);
else {
/*
* Catch the rare case, where the lock was released
@@ -1783,7 +1864,8 @@ static int futex_lock_pi(u32 __user *uad
}
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
}
=20
if (!detect && ret =3D=3D -EDEADLK && 0)
@@ -1795,7 +1877,8 @@ static int futex_lock_pi(u32 __user *uad
queue_unlock(&q, hb);
=20
out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
return ret;
=20
uaddr_faulted:
@@ -1806,15 +1889,16 @@ static int futex_lock_pi(u32 __user *uad
* still holding the mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr, attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr, fshared,
+ attempt);
+ if (ret)
goto out_unlock_release_sem;
- }
goto retry_locked;
}
=20
queue_unlock(&q, hb);
- up_read(&curr->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D get_user(uval, uaddr);
if (!ret && (uval !=3D -EFAULT))
@@ -1828,7 +1912,7 @@ static int futex_lock_pi(u32 __user *uad
* This is the in-kernel slowpath: we look up the PI state (if any),
* and do the rt-mutex unlock.
*/
-static int futex_unlock_pi(u32 __user *uaddr)
+static int futex_unlock_pi(u32 __user *uaddr, struct rw_semaphore *fsh=
ared)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -1848,9 +1932,10 @@ retry:
/*
* First take all the futex related locks:
*/
- down_read(&current->mm->mmap_sem);
+ if (fshared)
+ down_read(fshared);
=20
- ret =3D get_futex_key(uaddr, &key);
+ ret =3D get_futex_key(uaddr, fshared, &key);
if (unlikely(ret !=3D 0))
goto out;
=20
@@ -1909,7 +1994,8 @@ retry_locked:
out_unlock:
spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
return ret;
=20
@@ -1921,15 +2007,16 @@ pi_faulted:
* still holding the mmap_sem.
*/
if (attempt++) {
- if (futex_handle_fault((unsigned long)uaddr, attempt)) {
- ret =3D -EFAULT;
+ ret =3D futex_handle_fault((unsigned long)uaddr, fshared,
+ attempt);
+ if (ret)
goto out_unlock;
- }
goto retry_locked;
}
=20
spin_unlock(&hb->lock);
- up_read(&current->mm->mmap_sem);
+ if (fshared)
+ up_read(fshared);
=20
ret =3D get_user(uval, uaddr);
if (!ret && (uval !=3D -EFAULT))
@@ -1981,6 +2068,7 @@ static int futex_fd(u32 __user *uaddr, i
struct futex_q *q;
struct file *filp;
int ret, err;
+ struct rw_semaphore *fshared;
static unsigned long printk_interval;
=20
if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2022,11 +2110,12 @@ static int futex_fd(u32 __user *uaddr, i
}
q->pi_state =3D NULL;
=20
- down_read(&current->mm->mmap_sem);
- err =3D get_futex_key(uaddr, &q->key);
+ fshared =3D &current->mm->mmap_sem;
+ down_read(fshared);
+ err =3D get_futex_key(uaddr, fshared, &q->key);
=20
if (unlikely(err !=3D 0)) {
- up_read(&current->mm->mmap_sem);
+ up_read(fshared);
kfree(q);
goto error;
}
@@ -2038,7 +2127,7 @@ static int futex_fd(u32 __user *uaddr, i
filp->private_data =3D q;
=20
queue_me(q, ret, filp);
- up_read(&current->mm->mmap_sem);
+ up_read(fshared);
=20
/* Now we map fd to filp, so userspace can access it */
fd_install(ret, filp);
@@ -2167,7 +2256,7 @@ retry:
*/
if (!pi) {
if (uval & FUTEX_WAITERS)
- futex_wake(uaddr, 1);
+ futex_wake(uaddr, &curr->mm->mmap_sem, 1);
}
}
return 0;
@@ -2223,7 +2312,8 @@ void exit_robust_list(struct task_struct
return;
=20
if (pending)
- handle_futex_death((void __user *)pending + futex_offset, curr, pip)=
;
+ handle_futex_death((void __user *)pending + futex_offset,
+ curr, pip);
=20
while (entry !=3D &head->list) {
/*
@@ -2253,38 +2343,43 @@ long do_futex(u32 __user *uaddr, int op,
u32 __user *uaddr2, u32 val2, u32 val3)
{
int ret;
+ int cmd =3D op & FUTEX_CMD_MASK;
+ struct rw_semaphore *fshared =3D NULL;
+
+ if (!(op & FUTEX_PRIVATE_FLAG))
+ fshared =3D &current->mm->mmap_sem;
=20
- switch (op) {
+ switch (cmd) {
case FUTEX_WAIT:
- ret =3D futex_wait(uaddr, val, timeout);
+ ret =3D futex_wait(uaddr, fshared, val, timeout);
break;
case FUTEX_WAKE:
- ret =3D futex_wake(uaddr, val);
+ ret =3D futex_wake(uaddr, fshared, val);
break;
case FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret =3D futex_fd(uaddr, val);
break;
case FUTEX_REQUEUE:
- ret =3D futex_requeue(uaddr, uaddr2, val, val2, NULL);
+ ret =3D futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL);
break;
case FUTEX_CMP_REQUEUE:
- ret =3D futex_requeue(uaddr, uaddr2, val, val2, &val3);
+ ret =3D futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3);
break;
case FUTEX_WAKE_OP:
- ret =3D futex_wake_op(uaddr, uaddr2, val, val2, val3);
+ ret =3D futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
break;
case FUTEX_LOCK_PI:
- ret =3D futex_lock_pi(uaddr, val, timeout, 0);
+ ret =3D futex_lock_pi(uaddr, fshared, val, timeout, 0);
break;
case FUTEX_UNLOCK_PI:
- ret =3D futex_unlock_pi(uaddr);
+ ret =3D futex_unlock_pi(uaddr, fshared);
break;
case FUTEX_TRYLOCK_PI:
- ret =3D futex_lock_pi(uaddr, 0, timeout, 1);
+ ret =3D futex_lock_pi(uaddr, fshared, 0, timeout, 1);
break;
case FUTEX_CMP_REQUEUE_PI:
- ret =3D futex_requeue_pi(uaddr, uaddr2, val, val2, &val3);
+ ret =3D futex_requeue_pi(uaddr, fshared, uaddr2, val, val2, &val3);
break;
default:
ret =3D -ENOSYS;
@@ -2300,23 +2395,24 @@ asmlinkage long sys_futex(u32 __user *ua
struct timespec ts;
ktime_t t, *tp =3D NULL;
u32 val2 =3D 0;
+ int cmd =3D op & FUTEX_CMD_MASK;
=20
- if (utime && (op =3D=3D FUTEX_WAIT || op =3D=3D FUTEX_LOCK_PI)) {
+ if (utime && (cmd =3D=3D FUTEX_WAIT || cmd =3D=3D FUTEX_LOCK_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) !=3D 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
=20
t =3D timespec_to_ktime(ts);
- if (op =3D=3D FUTEX_WAIT)
+ if (cmd =3D=3D FUTEX_WAIT)
t =3D ktime_add(ktime_get(), t);
tp =3D &t;
}
/*
- * requeue parameter in 'utime' if op =3D=3D FUTEX_REQUEUE.
+ * requeue parameter in 'utime' if cmd =3D=3D FUTEX_REQUEUE.
*/
- if (op =3D=3D FUTEX_REQUEUE || op =3D=3D FUTEX_CMP_REQUEUE
- || op =3D=3D FUTEX_CMP_REQUEUE_PI)
+ if (cmd =3D=3D FUTEX_REQUEUE || cmd =3D=3D FUTEX_CMP_REQUEUE
+ || cmd =3D=3D FUTEX_CMP_REQUEUE_PI)
val2 =3D (u32) (unsigned long) utime;
=20
return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
Pierre Peiffer
2007-04-26 13:35:26 UTC
Permalink
Post by Eric Dumazet
Hi Andrew
=20
Not sure if you prefer to wait Pierre work on futex64, so just in cas=
e, I prepared this patch.
Post by Eric Dumazet
=20
=20
- Rebased on linux-2.6.21-rc7-mm2 , since futex64 were droped from mm
=20
=20
Pierre, I can resubmit another patch on top on your next patch, so pl=
ease do as you prefer (ignoring or not this patch)

Thank you for taking care of this.
But I think your patch is more mature than the futex64 patch; So, it's =
ok for=20
me, and it's probably simpler and better for "the community" (and for A=
ndrew's=20
work) to put this patch first, because futex64 still may need several r=
eworks...

--=20
Pierre

Eric Dumazet
2007-03-15 19:13:16 UTC
Permalink
[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic

This first patch introduces XXX_PRIVATE futexes operations.

When a process uses a XXX_PRIVATE futex primitive, kernel can avoid
to take a read lock on mmap_sem, to find the vma that contains the futex,
to learn if it is associated to an inode (shared) or the mm (private to
process)

We also avoid taking a reference on the found inode or the mm.

Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
ops on mmap_sem, dirtying cache line :
- lot of cache line ping pongs on SMP configurations.

mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
Highly threaded processes might suffer from mmap_sem contention.

mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
because of cache misses.

This first patch is possible because, for one process using
PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their
virtual address, no matter the underlying mm storage is. The case of multiple
virtual addresses mapped on the same physical address is just insane : "Dont
do it on PROCESS_PRIVATE futexes, please ?"

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fallback on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK, so that
only one syscall might fail.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : SHARED futexes can be used by old binaries *and* new binaries,
because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
---
include/linux/futex.h | 12 +
kernel/futex.c | 273 +++++++++++++++++++++++++---------------
2 files changed, 188 insertions(+), 97 deletions(-)
Eric Dumazet
2007-03-15 19:16:25 UTC
Permalink
[PATCH 2/3] FUTEX : introduce private hashtables

This patch introduces a separate hashtable per process to store _PRIVATE
futexes.
This hashtable is dynamically allocated on the first _PRIVATE futex syscall.
If memory cannot be allocated, the process will use the global hashtable.

Using a separate hashtable has the advantage of lowering the contention on the
global hashtable. NUMA should benefits of this separation because the
allocation should respect the mm policy of the process.

Code is using kmalloc()/vmalloc() depending on the size of spinlocks. For
normal setup, size of the private hashtable should be 768 bytes on 32bit
arches, 1536 bytes on 64bit arches.

Private hashtable is freed() when process exits.

Signed-off-by: Eric Dumazet <***@cosmosbay.com>
---
include/linux/futex.h | 4 +
include/linux/sched.h | 7 ++
kernel/fork.c | 1
kernel/futex.c | 112 ++++++++++++++++++++++++++++++++++++++--
4 files changed, 120 insertions(+), 4 deletions(-)
Nick Piggin
2007-03-15 20:25:53 UTC
Permalink
Post by Eric Dumazet
[PATCH 2/3] FUTEX : introduce private hashtables
This patch introduces a separate hashtable per process to store _PRIVATE
futexes.
This hashtable is dynamically allocated on the first _PRIVATE futex syscall.
If memory cannot be allocated, the process will use the global hashtable.
Using a separate hashtable has the advantage of lowering the contention on the
global hashtable. NUMA should benefits of this separation because the
allocation should respect the mm policy of the process.
Code is using kmalloc()/vmalloc() depending on the size of spinlocks. For
normal setup, size of the private hashtable should be 768 bytes on 32bit
arches, 1536 bytes on 64bit arches.
Private hashtable is freed() when process exits.
I do disagree with this patch, though.

There should be little contention on the memory in the global hash anyway,
because we can roughly reduce contention as a factor of hash-size/cacheline-size.

What we will have are cache misses on the global table... but we're going to
get cache misses on those private tables as well. Also, you never know what
the use cases are going to be... there could be an application with thousands
of threads and mutexes in which case your private hash could be too small... I
think w e want to stay away from per-mm _anything_ with private futexes, which
is the direction that your patch 1 takes us.

I would just avoid the complexity and setup/teardown costs, and just use a
vmalloc'ed global hash for NUMA.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Ulrich Drepper
2007-03-15 21:09:35 UTC
Permalink
Post by Nick Piggin
There should be little contention on the memory in the global hash anyway,
because we can roughly reduce contention as a factor of hash-size/cacheline-size.
What we will have are cache misses on the global table... but we're going to
get cache misses on those private tables as well.
I'm thinking about NUMA cases. If you have private tables for a
process which is pinned to some cluster in a NUMA machine the table is
local to the node. If you have a global table you cannot optimize
your application for such a situation because at least some of the
pages of the global table are remote.
Nick Piggin
2007-03-15 21:29:05 UTC
Permalink
Post by Ulrich Drepper
Post by Nick Piggin
There should be little contention on the memory in the global hash anyway,
because we can roughly reduce contention as a factor of
hash-size/cacheline-size.
What we will have are cache misses on the global table... but we're going to
get cache misses on those private tables as well.
I'm thinking about NUMA cases. If you have private tables for a
process which is pinned to some cluster in a NUMA machine the table is
local to the node. If you have a global table you cannot optimize
your application for such a situation because at least some of the
pages of the global table are remote.
That's true, but it also might be able to be improved in other ways.

At least once we get the basic support in the kernel, and glibc picks
it up, then we have a better base to evaluate these more exotic changes
against.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
William Lee Irwin III
2007-03-15 22:59:55 UTC
Permalink
Post by Nick Piggin
I would just avoid the complexity and setup/teardown costs, and just
use a vmalloc'ed global hash for NUMA.
This patch is not the way to go, but neither are vmalloc()'d global
hashtables. When you just happen to hash to the wrong node, you're in
for quasi-unreproducible poor performance. The size is never right, at
which point RCU resizing is required with all its overhead and memory
freeing delays and failure to resize (even if only to contract) under
pressure. Better would be to use a different data structure admitting
locality of reference and adaptively sizing itself, furthermore
localized to the appropriate sharing domain. For file-backed futexes,
this would be the struct address_space. For anonymous-backed futexes,
this would be the COW sharing group, which an anon_vma could almost be
used to represent. Using an object to properly represent the COW
sharing group (i.e. Hugh's struct anon) would do the trick, and one
might as well move the rmap code over to it while we're at it since the
anon_vma scanning tricks are all pointless overhead once the COW
sharing group is accurately tracked (the scanning around for nearby vmas
with ->anon_vma set is not great anyway, though the overhead is hidden
in the noise of large teardown and setup operations; inheriting on
fork() is much simpler and faster).

In such a manner localization is accomplished while no interface
extensions are required.


-- wli
Eric Dumazet
2007-03-15 19:20:43 UTC
Permalink
[PATCH 3/3] FUTEX : NUMA friendly global hashtable

On NUMA machines, we should get better performance using a big futex
hashtable, allocated with vmalloc() so that it is spreaded on several nodes.

I chose a static size of four pages. (Very big NUMA machines have 64k page
size)

This patch should have a temporary effect, as most futexes are expected to be
stored in process private tables. We probably can drop it in five years :)


Signed-off-by: Eric Dumazet <***@cosmosbay.com>
---
kernel/futex.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
Ravikiran G Thirumalai
2006-08-09 00:13:03 UTC
Permalink
Post by Eric Dumazet
Post by Andi Kleen
Post by Ravikiran G Thirumalai
Current futex hash scheme is not the best for NUMA. The futex hash
table is an array of struct futex_hash_bucket, which is just a spinlock
and a list_head -- this means multiple spinlocks on the same cacheline
and on NUMA machines, on the same internode cacheline. If futexes of two
unrelated threads running on two different nodes happen to hash onto
adjacent hash buckets, or buckets on the same internode cacheline, then
we have the internode cacheline bouncing between nodes.
When I did some testing with a (arguably far too lock intensive) benchmark
on a bigger box I got most bouncing cycles not in the futex locks itself,
but in the down_read on the mm semaphore.
This is true, even with a normal application (not a biased benchmark) and
using oprofile. mmap_sem is the killer.
Not if two threads of two different process (so no same mmap_sem) hash onto
futexes on the same cacheline. But agreed, mmap_sem needs to be fixed too.
If everyone agrees on a per-process hash table for private futexes, then
we will work on that approach.

Thanks,
Kiran
Continue reading on narkive:
Loading...