Discussion:
[patch] futex requeueing feature, futex-requeue-2.5.69-D3
(too old to reply)
Ingo Molnar
2003-05-19 09:31:51 UTC
Permalink
the attached patch addresses a futex related SMP scalability problem of
glibc. A number of regressions have been reported to the NTPL mailing list
when going to many CPUs, for applications that use condition variables and
the pthread_cond_broadcast() API call. Using this functionality, testcode
shows a slowdown from 0.12 seconds runtime to over 237 seconds (!)
runtime, on 4-CPU systems.

pthread condition variables use two futex-backed mutex-alike locks: an
internal one for the glibc CV state itself, and a user-supplied mutex
which the API guarantees to take in certain codepaths. (Unfortunately the
user-supplied mutex cannot be used to protect the CV state, so we've got
to deal with two locks.)

The cause of the slowdown is a 'swarm effect': if lots of threads are
blocked on a condition variable, and pthread_cond_broadcast() is done,
then glibc first does a FUTEX_WAKE on the cv-internal mutex, then down a
mutex_down() on the user-supplied mutex. Ie. a swarm of threads is created
which all race to serialize on the user-supplied mutex. The more threads
are used, the more likely it becomes that the scheduler will balance them
over to other CPUs - where they just schedule, try to lock the mutex, and
go to sleep. This 'swarm effect' is purely technical, a side-effect of
glibc's use of futexes, and the imperfect coupling of the two locks.

the solution to this problem is to not wake up the swarm of threads, but
'requeue' them from the CV-internal mutex to the user-supplied mutex. The
attached patch adds the FUTEX_REQUEUE feature FUTEX_REQUEUE requeues N
threads from futex address A to futex address B:

sys_futex(uaddr, FUTEX_REQUEUE, nr_wake, NULL, uaddr2);

the 'val' parameter to sys_futex (nr_wake) is the # of woken up threads.
This way glibc can wake up a single thread (which will take the
user-mutex), and can requeue the rest, with a single system-call.

Ulrich Drepper has implemented FUTEX_REQUEUE support in glibc, and a
number of people have tested it over the past couple of weeks. Here are
the measurements done by Saurabh Desai:

System: 4xPIII 700MHz

./cond-perf -r 100 -n 200: 1p 2p 4p
Default NPTL: 0.120s 0.211s 237.407s
requeue NPTL: 0.124s 0.156s 0.040s

./cond-perf -r 1000 -n 100:
Default NPTL: 0.276s 0.412s 0.530s
requeue NPTL: 0.349s 0.503s 0.550s

./pp -v -n 128 -i 1000 -S 32768:
Default NPTL: 128 games in 1.111s 1.270s 16.894s
requeue NPTL: 128 games in 1.111s 1.959s 2.426s

./pp -v -n 1024 -i 10 -S 32768:
Default NPTL: 1024 games in 0.181s 0.394s incompleted 2m+
requeue NPTL: 1024 games in 0.166s 0.254s 0.341s

the speedup with increasing number of threads is quite significant, in the
128 threads, case it's more than 8 times. In the cond-perf test, on 4 CPUs
it's almost infinitely faster than the 'swarm of threads' catastrophy
triggered by the old code.

there's a slowdown on UP, which is expected: on UP the O(1) scheduler
implicitly serializes all active threads on the runqueue, and doesnt
degrade under lots of threads. On SMP the 'point of breakdown' depends on
the precise amount of time needed for the threads to become rated as
'cache-cold' by the load-balancer.

(the patch adds a new futex syscall parameter (uaddr2), which is a
compatible extension of sys_futex. Old NPTL applications will continue to
work without any impact, only the FUTEX_REQUEUE codepath uses the new
parameter.)

Ingo

--- linux/include/linux/futex.h.orig
+++ linux/include/linux/futex.h
@@ -5,7 +5,8 @@
#define FUTEX_WAIT (0)
#define FUTEX_WAKE (1)
#define FUTEX_FD (2)
+#define FUTEX_REQUEUE (3)

-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime);
+extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2);

#endif
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -457,7 +457,7 @@ void mm_release(struct task_struct *tsk,
* not set up a proper pointer then tough luck.
*/
put_user(0, tidptr);
- sys_futex(tidptr, FUTEX_WAKE, 1, NULL);
+ sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL);
}
}

--- linux/kernel/compat.c.orig
+++ linux/kernel/compat.c
@@ -214,7 +214,7 @@ asmlinkage long compat_sys_sigprocmask(i
extern long do_futex(unsigned long, int, int, unsigned long);

asmlinkage long compat_sys_futex(u32 *uaddr, int op, int val,
- struct compat_timespec *utime)
+ struct compat_timespec *utime, u32 *uaddr2)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
@@ -224,7 +224,7 @@ asmlinkage long compat_sys_futex(u32 *ua
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout);
+ return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
}

asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit *rlim);
--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -2,6 +2,9 @@
* Fast Userspace Mutexes (which I call "Futexes!").
* (C) Rusty Russell, IBM 2002
*
+ * Generalized futexes, futex requeueing, misc fixes by Ingo Molnar
+ * (C) Copyright 2003 Red Hat Inc, All Rights Reserved
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -9,9 +12,6 @@
* "The futexes are also cursed."
* "But they come in a choice of three flavours!"
*
- * Generalized futexes for every mapping type, Ingo Molnar, 2002
- *
- *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -216,6 +216,61 @@ static void futex_vcache_callback(vcache
spin_unlock(&futex_lock);
}

+/*
+ * Requeue all waiters hashed on one physical page to another
+ * physical page.
+ */
+static int futex_requeue(unsigned long uaddr1, int offset1, unsigned long uaddr2, int offset2, int num)
+{
+ struct list_head *i, *next, *head1, *head2;
+ struct page *page1, *page2;
+ int ret = 0;
+
+ lock_futex_mm();
+
+ page1 = __pin_page(uaddr1 - offset1);
+ if (!page1) {
+ unlock_futex_mm();
+ return -EFAULT;
+ }
+ page2 = __pin_page(uaddr2 - offset2);
+ if (!page2) {
+ unlock_futex_mm();
+ return -EFAULT;
+ }
+
+ head1 = hash_futex(page1, offset1);
+ head2 = hash_futex(page2, offset2);
+
+ list_for_each_safe(i, next, head1) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->page == page1 && this->offset == offset1) {
+ list_del_init(i);
+ __detach_vcache(&this->vcache);
+ if (++ret <= num) {
+ wake_up_all(&this->waiters);
+ if (this->filp)
+ send_sigio(&this->filp->f_owner, this->fd, POLL_IN);
+ } else {
+ unpin_page(this->page);
+ __pin_page_atomic (page2);
+ list_add_tail(i, head2);
+ __attach_vcache(&this->vcache, uaddr2, current->mm, futex_vcache_callback);
+ this->offset = offset2;
+ this->page = page2;
+ }
+ }
+ }
+
+ unlock_futex_mm();
+
+ unpin_page(page1);
+ unpin_page(page2);
+
+ return ret;
+}
+
static inline void __queue_me(struct futex_q *q, struct page *page,
unsigned long uaddr, int offset,
int fd, struct file *filp)
@@ -425,9 +480,9 @@ out:
return ret;
}

-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout)
+long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
{
- unsigned long pos_in_page;
+ unsigned long pos_in_page, pos_in_page2;
int ret;

pos_in_page = uaddr % PAGE_SIZE;
@@ -443,6 +498,14 @@ long do_futex(unsigned long uaddr, int o
case FUTEX_WAKE:
ret = futex_wake(uaddr, pos_in_page, val);
break;
+ case FUTEX_REQUEUE:
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page2 % sizeof(u32))
+ return -EINVAL;
+ ret = futex_requeue(uaddr, pos_in_page, uaddr2, pos_in_page2, val);
+ break;
case FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, pos_in_page, val);
@@ -453,7 +516,7 @@ long do_futex(unsigned long uaddr, int o
return ret;
}

-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime)
+asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
@@ -463,7 +526,7 @@ asmlinkage long sys_futex(u32 __user *ua
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout);
+ return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
}

static struct super_block *
Christoph Hellwig
2003-05-19 10:10:28 UTC
Permalink
Post by Ingo Molnar
the solution to this problem is to not wake up the swarm of threads, but
'requeue' them from the CV-internal mutex to the user-supplied mutex. The
attached patch adds the FUTEX_REQUEUE feature FUTEX_REQUEUE requeues N
sys_futex(uaddr, FUTEX_REQUEUE, nr_wake, NULL, uaddr2);
the 'val' parameter to sys_futex (nr_wake) is the # of woken up threads.
This way glibc can wake up a single thread (which will take the
user-mutex), and can requeue the rest, with a single system-call.
Urgg, yet another sys_futex extension. Could you please split all
these totally different cases into separate syscalls instead?
Post by Ingo Molnar
+ wake_up_all(&this->waiters);
+ if (this->filp)
+ send_sigio(&this->filp->f_owner, this->fd, POLL_IN);
+ } else {
+ unpin_page(this->page);
+ __pin_page_atomic (page2);
+ list_add_tail(i, head2);
+ __attach_vcache(&this->vcache, uaddr2, current->mm, futex_vcache_callback);
Please linewrap after 80 lines, thanks.
Post by Ingo Molnar
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page2 % sizeof(u32))
+ return -EINVAL;
Who guarantess that the alignment of u32 is always the same as it's size?
Ingo Molnar
2003-05-19 10:16:02 UTC
Permalink
[...] Could you please split all these totally different cases into
separate syscalls instead?
sure, i'm all for it, but in a different pass, and after syncing up with
glibc. An API cleanup like this should have been done upon the
introduction of futexes, why didnt you comment on this then? Splitting off
FUTEX_REQUEUE in isolation is quite pointless.
Post by Ingo Molnar
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page2 % sizeof(u32))
+ return -EINVAL;
Who guarantess that the alignment of u32 is always the same as it's size?
glibc. We do not want to handle all the misaligned cases for obvious
reasons. The use of u32 (instead of a native word) is a bit unfortunate on
64-bit systems but now a reality.

Ingo
Christoph Hellwig
2003-05-19 11:49:10 UTC
Permalink
Post by Ingo Molnar
sure, i'm all for it, but in a different pass, and after syncing up with
glibc. An API cleanup like this should have been done upon the
introduction of futexes, why didnt you comment on this then? Splitting off
FUTEX_REQUEUE in isolation is quite pointless.
Maybe I don't spend all my time watching the futex API? :) Okay,
let's make a deal, you add a new syscall for this case and I'll fix
up the older ones in a patch that's ontop of yours?
Post by Ingo Molnar
Post by Christoph Hellwig
Post by Ingo Molnar
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page2 % sizeof(u32))
+ return -EINVAL;
Who guarantess that the alignment of u32 is always the same as it's size?
glibc. We do not want to handle all the misaligned cases for obvious
reasons. The use of u32 (instead of a native word) is a bit unfortunate on
64-bit systems but now a reality.
Sorry if the question wasn't clear, but who guarantess that the alignment
of u32 is the same as it's size? You test of the size of u32, not it's
alignment even if they usually are the same.
Ingo Molnar
2003-05-19 12:33:01 UTC
Permalink
Maybe I don't spend all my time watching the futex API? :) Okay, let's
make a deal, you add a new syscall for this case and I'll fix up the
older ones in a patch that's ontop of yours?
well, my point was that it's not an issue introduced by the FUTEX_REQUEUE
change. Here's a patch ontop of my last futex patch, which adds all the
interface cleanups. Changes:

- separate out sys_futex_wait, sys_futex_wake and sys_futex_requeue.

- inline the futex_wait/wake/requeue functionality, as it should be.

- start the phasing out of FUTEX_FD. This i believe is quite unclean and
unrobust, because it attaches a new concept (futexes) to a very old
(polling) concept. We want futex support in kernel-AIO, not in the
polling APIs. AFAIK only NGPT uses FUTEX_FD.
Post by Ingo Molnar
Post by Christoph Hellwig
Who guarantess that the alignment of u32 is always the same as it's size?
glibc. We do not want to handle all the misaligned cases for obvious
reasons. The use of u32 (instead of a native word) is a bit unfortunate on
64-bit systems but now a reality.
Sorry if the question wasn't clear, but who guarantess that the
alignment of u32 is the same as it's size? You test of the size of u32,
not it's alignment even if they usually are the same.
glibc (the main user and API-multiplexer of futexes) ensures that futex
variables are aligned on sizeof(u32).

Ingo

--- linux/include/linux/futex.h.orig
+++ linux/include/linux/futex.h
@@ -2,11 +2,13 @@
#define _LINUX_FUTEX_H

/* Second argument to futex syscall */
-#define FUTEX_WAIT (0)
-#define FUTEX_WAKE (1)
-#define FUTEX_FD (2)
-#define FUTEX_REQUEUE (3)

-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2);
+asmlinkage long sys_futex_wait(u32 __user *__uaddr,int val,
+ struct timespec __user *utime);
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int val);
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1, u32 __user *__uaddr2,
+ int nr_wake);

#endif
--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -837,7 +837,7 @@ ENTRY(sys_call_table)
.long sys_fremovexattr
.long sys_tkill
.long sys_sendfile64
- .long sys_futex /* 240 */
+ .long old_futex /* 240 */
.long sys_sched_setaffinity
.long sys_sched_getaffinity
.long sys_set_thread_area
@@ -865,6 +865,8 @@ ENTRY(sys_call_table)
.long sys_clock_gettime /* 265 */
.long sys_clock_getres
.long sys_clock_nanosleep
-
+ .long sys_futex_wait
+ .long sys_futex_wake
+ .long sys_futex_requeue /* 270 */

nr_syscalls=(.-sys_call_table)/4
--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -347,7 +347,7 @@ static int futex_wait(unsigned long uadd
* The get_user() above might fault and schedule so we
* cannot just set TASK_INTERRUPTIBLE state when queueing
* ourselves into the futex hash. This code thus has to
- * rely on the FUTEX_WAKE code doing a wakeup after removing
+ * rely on the futex_wake() code doing a wakeup after removing
* the waiter from the list.
*/
add_wait_queue(&q.waiters, &wait);
@@ -481,9 +481,13 @@ out:
return ret;
}

-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
+#define OLD_FUTEX_WAIT (0)
+#define OLD_FUTEX_WAKE (1)
+#define OLD_FUTEX_FD (2)
+
+static long do_old_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
{
- unsigned long pos_in_page, pos_in_page2;
+ unsigned long pos_in_page;
int ret;

pos_in_page = uaddr % PAGE_SIZE;
@@ -493,21 +497,13 @@ long do_futex(unsigned long uaddr, int o
return -EINVAL;

switch (op) {
- case FUTEX_WAIT:
+ case OLD_FUTEX_WAIT:
ret = futex_wait(uaddr, pos_in_page, val, timeout);
break;
- case FUTEX_WAKE:
+ case OLD_FUTEX_WAKE:
ret = futex_wake(uaddr, pos_in_page, val);
break;
- case FUTEX_REQUEUE:
- pos_in_page2 = uaddr2 % PAGE_SIZE;
-
- /* Must be "naturally" aligned */
- if (pos_in_page2 % sizeof(u32))
- return -EINVAL;
- ret = futex_requeue(uaddr, pos_in_page, uaddr2, pos_in_page2, val);
- break;
- case FUTEX_FD:
+ case OLD_FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, pos_in_page, val);
break;
@@ -517,17 +513,63 @@ long do_futex(unsigned long uaddr, int o
return ret;
}

-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
+asmlinkage long old_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;

- if ((op == FUTEX_WAIT) && utime) {
+ if ((op == OLD_FUTEX_WAIT) && utime) {
+ if (copy_from_user(&t, utime, sizeof(t)) != 0)
+ return -EFAULT;
+ timeout = timespec_to_jiffies(&t) + 1;
+ }
+ return do_old_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
+}
+
+asmlinkage long sys_futex_wait(u32 __user *__uaddr, int val, struct timespec __user *utime)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+ unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
+ struct timespec t;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ if (utime) {
if (copy_from_user(&t, utime, sizeof(t)) != 0)
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
+ return futex_wait(uaddr, pos_in_page, val, timeout);
+}
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int val)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ return futex_wake(uaddr, pos_in_page, val);
+}
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1,
+ u32 __user *__uaddr2, int nr_wake)
+{
+ unsigned long uaddr1 = (unsigned long)__uaddr1,
+ uaddr2 = (unsigned long)__uaddr2;
+ unsigned long pos_in_page1 = uaddr1 % PAGE_SIZE,
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if ((pos_in_page1 | pos_in_page2) % sizeof(u32))
+ return -EINVAL;
+
+ return futex_requeue(uaddr1, pos_in_page1, uaddr2, pos_in_page2, nr_wake);
}

static struct super_block *
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -457,7 +457,7 @@ void mm_release(struct task_struct *tsk,
* not set up a proper pointer then tough luck.
*/
put_user(0, tidptr);
- sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL);
+ sys_futex_wake(tidptr, 1);
}
}
Ingo Molnar
2003-05-19 12:51:50 UTC
Permalink
updated patch (the inlining changes were missing):

--- linux/include/linux/futex.h.orig
+++ linux/include/linux/futex.h
@@ -2,11 +2,13 @@
#define _LINUX_FUTEX_H

/* Second argument to futex syscall */
-#define FUTEX_WAIT (0)
-#define FUTEX_WAKE (1)
-#define FUTEX_FD (2)
-#define FUTEX_REQUEUE (3)

-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2);
+asmlinkage long sys_futex_wait(u32 __user *__uaddr,int val,
+ struct timespec __user *utime);
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int val);
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1, u32 __user *__uaddr2,
+ int nr_wake);

#endif
--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -837,7 +837,7 @@ ENTRY(sys_call_table)
.long sys_fremovexattr
.long sys_tkill
.long sys_sendfile64
- .long sys_futex /* 240 */
+ .long old_futex /* 240 */
.long sys_sched_setaffinity
.long sys_sched_getaffinity
.long sys_set_thread_area
@@ -865,6 +865,8 @@ ENTRY(sys_call_table)
.long sys_clock_gettime /* 265 */
.long sys_clock_getres
.long sys_clock_nanosleep
-
+ .long sys_futex_wait
+ .long sys_futex_wake
+ .long sys_futex_requeue /* 270 */

nr_syscalls=(.-sys_call_table)/4
--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -98,13 +98,13 @@ static inline struct list_head *hash_fut
*
* Must be called with (and returns with) all futex-MM locks held.
*/
-static inline
-struct page *__pin_page_atomic (struct page *page)
+static inline struct page *__pin_page_atomic (struct page *page)
{
if (!PageReserved(page))
get_page(page);
return page;
}
+
static struct page *__pin_page(unsigned long addr)
{
struct mm_struct *mm = current->mm;
@@ -155,7 +155,7 @@ static inline void unpin_page(struct pag
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(unsigned long uaddr, int offset, int num)
+static inline int futex_wake(unsigned long uaddr, int offset, int num)
{
struct list_head *i, *next, *head;
struct page *page;
@@ -220,7 +220,8 @@ static void futex_vcache_callback(vcache
* Requeue all waiters hashed on one physical page to another
* physical page.
*/
-static int futex_requeue(unsigned long uaddr1, int offset1, unsigned long uaddr2, int offset2, int num)
+static inline int futex_requeue(unsigned long uaddr1, int offset1,
+ unsigned long uaddr2, int offset2, int num)
{
struct list_head *i, *next, *head1, *head2;
struct page *page1 = NULL, *page2 = NULL;
@@ -308,7 +309,7 @@ static inline int unqueue_me(struct fute
return ret;
}

-static int futex_wait(unsigned long uaddr,
+static inline int futex_wait(unsigned long uaddr,
int offset,
int val,
unsigned long time)
@@ -347,7 +348,7 @@ static int futex_wait(unsigned long uadd
* The get_user() above might fault and schedule so we
* cannot just set TASK_INTERRUPTIBLE state when queueing
* ourselves into the futex hash. This code thus has to
- * rely on the FUTEX_WAKE code doing a wakeup after removing
+ * rely on the futex_wake() code doing a wakeup after removing
* the waiter from the list.
*/
add_wait_queue(&q.waiters, &wait);
@@ -481,9 +482,13 @@ out:
return ret;
}

-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
+#define OLD_FUTEX_WAIT (0)
+#define OLD_FUTEX_WAKE (1)
+#define OLD_FUTEX_FD (2)
+
+static long do_old_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
{
- unsigned long pos_in_page, pos_in_page2;
+ unsigned long pos_in_page;
int ret;

pos_in_page = uaddr % PAGE_SIZE;
@@ -493,21 +498,13 @@ long do_futex(unsigned long uaddr, int o
return -EINVAL;

switch (op) {
- case FUTEX_WAIT:
+ case OLD_FUTEX_WAIT:
ret = futex_wait(uaddr, pos_in_page, val, timeout);
break;
- case FUTEX_WAKE:
+ case OLD_FUTEX_WAKE:
ret = futex_wake(uaddr, pos_in_page, val);
break;
- case FUTEX_REQUEUE:
- pos_in_page2 = uaddr2 % PAGE_SIZE;
-
- /* Must be "naturally" aligned */
- if (pos_in_page2 % sizeof(u32))
- return -EINVAL;
- ret = futex_requeue(uaddr, pos_in_page, uaddr2, pos_in_page2, val);
- break;
- case FUTEX_FD:
+ case OLD_FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, pos_in_page, val);
break;
@@ -517,17 +514,63 @@ long do_futex(unsigned long uaddr, int o
return ret;
}

-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
+asmlinkage long old_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;

- if ((op == FUTEX_WAIT) && utime) {
+ if ((op == OLD_FUTEX_WAIT) && utime) {
if (copy_from_user(&t, utime, sizeof(t)) != 0)
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
+ return do_old_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
+}
+
+asmlinkage long sys_futex_wait(u32 __user *__uaddr, int val, struct timespec __user *utime)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+ unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
+ struct timespec t;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ if (utime) {
+ if (copy_from_user(&t, utime, sizeof(t)) != 0)
+ return -EFAULT;
+ timeout = timespec_to_jiffies(&t) + 1;
+ }
+ return futex_wait(uaddr, pos_in_page, val, timeout);
+}
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int val)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ return futex_wake(uaddr, pos_in_page, val);
+}
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1,
+ u32 __user *__uaddr2, int nr_wake)
+{
+ unsigned long uaddr1 = (unsigned long)__uaddr1,
+ uaddr2 = (unsigned long)__uaddr2;
+ unsigned long pos_in_page1 = uaddr1 % PAGE_SIZE,
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if ((pos_in_page1 | pos_in_page2) % sizeof(u32))
+ return -EINVAL;
+
+ return futex_requeue(uaddr1, pos_in_page1, uaddr2, pos_in_page2, nr_wake);
}

static struct super_block *
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -457,7 +457,7 @@ void mm_release(struct task_struct *tsk,
* not set up a proper pointer then tough luck.
*/
put_user(0, tidptr);
- sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL);
+ sys_futex_wake(tidptr, 1);
}
}
bert hubert
2003-05-19 14:47:17 UTC
Permalink
Post by Ingo Molnar
- start the phasing out of FUTEX_FD. This i believe is quite unclean and
unrobust, because it attaches a new concept (futexes) to a very old
(polling) concept. We want futex support in kernel-AIO, not in the
polling APIs. AFAIK only NGPT uses FUTEX_FD.
I for one would want the ability to select, poll and epoll on a futex while
also being notified of availability of data on sockets. I see no alternative
even, except for messing with signals or running select with a small
timeout, introducing needless latency.

It may be weird, but it does work in practice. 'Unrobust' would be a problem
but I fail to see how this is unclean.

Thanks.
--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
Ingo Molnar
2003-05-19 16:02:20 UTC
Permalink
Post by bert hubert
I for one would want the ability to select, poll and epoll on a futex
while also being notified of availability of data on sockets. I see no
alternative even, except for messing with signals or running select with
a small timeout, introducing needless latency.
It may be weird, but it does work in practice. 'Unrobust' would be a
problem but I fail to see how this is unclean.
ok, i was flaming away mindlessly. New patch on the way.

Ingo
Linus Torvalds
2003-05-19 15:18:19 UTC
Permalink
Post by Ingo Molnar
- start the phasing out of FUTEX_FD. This i believe is quite unclean and
unrobust, because it attaches a new concept (futexes) to a very old
(polling) concept. We want futex support in kernel-AIO, not in the
polling APIs. AFAIK only NGPT uses FUTEX_FD.
This sounds like a bad idea.

Expecting "select()" and "poll()" to go away, and calling them "unclean
and unrobust" is just silly and stupid. There are a hell of a lot more
programs using select-loops out there than there are AIO versions, and I'd
argue that AIO is likely to be the much more "unrobust" solution, and
probably doesn't even scale any better than using epoll.

In fact, it's hard to see any real advantages of aio over a sane polling
loop, as long as the polling doesn't have some O(n) overhead (in other
words, as long as you use epoll).

So stop pushing your own agenda and broken morals down other peoples
throats, and re-do this patch properly.

Linus
Ingo Molnar
2003-05-19 16:06:50 UTC
Permalink
Post by Ingo Molnar
- start the phasing out of FUTEX_FD. This i believe is quite unclean and
unrobust, [...]
FUTEX_FD is an instant DoS, it allows the pinning of one page per file
descriptor, per thread. With a default limit of 1024 open files per
thread, and 256 threads (on a sane/conservative setup), this means 1 GB of
RAM can be pinned down by a normal unprivileged user.

Any suggestions how to fix it - or am i missing something why it should
not be considered a problem?

updated patch attached - it now has proper sys_futex_fd() support as well.

Ingo

--- linux/include/linux/futex.h.orig
+++ linux/include/linux/futex.h
@@ -2,11 +2,15 @@
#define _LINUX_FUTEX_H

/* Second argument to futex syscall */
-#define FUTEX_WAIT (0)
-#define FUTEX_WAKE (1)
-#define FUTEX_FD (2)
-#define FUTEX_REQUEUE (3)

-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2);
+asmlinkage long sys_futex_wait(u32 __user *__uaddr,int val,
+ struct timespec __user *utime);
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int nr_wake);
+
+asmlinkage long sys_futex_fd(u32 __user *__uaddr, int signal);
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1, u32 __user *__uaddr2,
+ int nr_wake);

#endif
--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -837,7 +837,7 @@ ENTRY(sys_call_table)
.long sys_fremovexattr
.long sys_tkill
.long sys_sendfile64
- .long sys_futex /* 240 */
+ .long old_futex /* 240 */
.long sys_sched_setaffinity
.long sys_sched_getaffinity
.long sys_set_thread_area
@@ -865,6 +865,9 @@ ENTRY(sys_call_table)
.long sys_clock_gettime /* 265 */
.long sys_clock_getres
.long sys_clock_nanosleep
-
+ .long sys_futex_wait
+ .long sys_futex_wake
+ .long sys_futex_fd /* 270 */
+ .long sys_futex_requeue

nr_syscalls=(.-sys_call_table)/4
--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -98,13 +98,13 @@ static inline struct list_head *hash_fut
*
* Must be called with (and returns with) all futex-MM locks held.
*/
-static inline
-struct page *__pin_page_atomic (struct page *page)
+static inline struct page *__pin_page_atomic (struct page *page)
{
if (!PageReserved(page))
get_page(page);
return page;
}
+
static struct page *__pin_page(unsigned long addr)
{
struct mm_struct *mm = current->mm;
@@ -155,7 +155,7 @@ static inline void unpin_page(struct pag
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(unsigned long uaddr, int offset, int num)
+static inline int futex_wake(unsigned long uaddr, int offset, int num)
{
struct list_head *i, *next, *head;
struct page *page;
@@ -220,7 +220,8 @@ static void futex_vcache_callback(vcache
* Requeue all waiters hashed on one physical page to another
* physical page.
*/
-static int futex_requeue(unsigned long uaddr1, int offset1, unsigned long uaddr2, int offset2, int num)
+static inline int futex_requeue(unsigned long uaddr1, int offset1,
+ unsigned long uaddr2, int offset2, int num)
{
struct list_head *i, *next, *head1, *head2;
struct page *page1 = NULL, *page2 = NULL;
@@ -308,7 +309,7 @@ static inline int unqueue_me(struct fute
return ret;
}

-static int futex_wait(unsigned long uaddr,
+static inline int futex_wait(unsigned long uaddr,
int offset,
int val,
unsigned long time)
@@ -347,7 +348,7 @@ static int futex_wait(unsigned long uadd
* The get_user() above might fault and schedule so we
* cannot just set TASK_INTERRUPTIBLE state when queueing
* ourselves into the futex hash. This code thus has to
- * rely on the FUTEX_WAKE code doing a wakeup after removing
+ * rely on the futex_wake() code doing a wakeup after removing
* the waiter from the list.
*/
add_wait_queue(&q.waiters, &wait);
@@ -481,9 +482,13 @@ out:
return ret;
}

-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
+#define OLD_FUTEX_WAIT (0)
+#define OLD_FUTEX_WAKE (1)
+#define OLD_FUTEX_FD (2)
+
+static long do_old_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
{
- unsigned long pos_in_page, pos_in_page2;
+ unsigned long pos_in_page;
int ret;

pos_in_page = uaddr % PAGE_SIZE;
@@ -493,21 +498,13 @@ long do_futex(unsigned long uaddr, int o
return -EINVAL;

switch (op) {
- case FUTEX_WAIT:
+ case OLD_FUTEX_WAIT:
ret = futex_wait(uaddr, pos_in_page, val, timeout);
break;
- case FUTEX_WAKE:
+ case OLD_FUTEX_WAKE:
ret = futex_wake(uaddr, pos_in_page, val);
break;
- case FUTEX_REQUEUE:
- pos_in_page2 = uaddr2 % PAGE_SIZE;
-
- /* Must be "naturally" aligned */
- if (pos_in_page2 % sizeof(u32))
- return -EINVAL;
- ret = futex_requeue(uaddr, pos_in_page, uaddr2, pos_in_page2, val);
- break;
- case FUTEX_FD:
+ case OLD_FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, pos_in_page, val);
break;
@@ -517,17 +514,75 @@ long do_futex(unsigned long uaddr, int o
return ret;
}

-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
+asmlinkage long old_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;

- if ((op == FUTEX_WAIT) && utime) {
+ if ((op == OLD_FUTEX_WAIT) && utime) {
if (copy_from_user(&t, utime, sizeof(t)) != 0)
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
+ return do_old_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
+}
+
+asmlinkage long sys_futex_wait(u32 __user *__uaddr, int val, struct timespec __user *utime)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+ unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
+ struct timespec t;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ if (utime) {
+ if (copy_from_user(&t, utime, sizeof(t)) != 0)
+ return -EFAULT;
+ timeout = timespec_to_jiffies(&t) + 1;
+ }
+ return futex_wait(uaddr, pos_in_page, val, timeout);
+}
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int nr_wake)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ return futex_wake(uaddr, pos_in_page, nr_wake);
+}
+
+asmlinkage long sys_futex_fd(u32 __user *__uaddr, int signal)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ return futex_fd(uaddr, pos_in_page, signal);
+}
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1,
+ u32 __user *__uaddr2, int nr_wake)
+{
+ unsigned long uaddr1 = (unsigned long)__uaddr1,
+ uaddr2 = (unsigned long)__uaddr2;
+ unsigned long pos_in_page1 = uaddr1 % PAGE_SIZE,
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if ((pos_in_page1 | pos_in_page2) % sizeof(u32))
+ return -EINVAL;
+
+ return futex_requeue(uaddr1, pos_in_page1, uaddr2, pos_in_page2, nr_wake);
}

static struct super_block *
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -457,7 +457,7 @@ void mm_release(struct task_struct *tsk,
* not set up a proper pointer then tough luck.
*/
put_user(0, tidptr);
- sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL);
+ sys_futex_wake(tidptr, 1);
}
}
Jamie Lokier
2003-05-19 23:33:53 UTC
Permalink
Post by Ingo Molnar
Post by Ingo Molnar
- start the phasing out of FUTEX_FD. This i believe is quite unclean and
unrobust, [...]
FUTEX_FD is an instant DoS, it allows the pinning of one page per file
descriptor, per thread. With a default limit of 1024 open files per
thread, and 256 threads (on a sane/conservative setup), this means 1 GB of
RAM can be pinned down by a normal unprivileged user.
The correct solution [;)] is EP_FUTEX - allow a futex to be specified
as the source of an epoll event.

-- Jamie
Rusty Russell
2003-05-20 00:39:38 UTC
Permalink
Post by Jamie Lokier
Post by Ingo Molnar
FUTEX_FD is an instant DoS, it allows the pinning of one page per file
descriptor, per thread. With a default limit of 1024 open files per
thread, and 256 threads (on a sane/conservative setup), this means 1 GB of
RAM can be pinned down by a normal unprivileged user.
The correct solution [;)] is EP_FUTEX - allow a futex to be specified
as the source of an epoll event.
Hi Jamie!

No, you don't understand (or maybe you were being facetious 8).

A normal futex blocks: ie. you can only have one futex being slept on
per process, so pinning a page is probably fine.

The NGPT guys wanted to allow waiting on more than one futex per
process. If you want to allow this through *any* mechanism, you have to:

1) Allocate something for each futex. The normal case uses the stack,
so never fails.

2) Control the number which can be used at once.

The current implementation needs a tighter #2.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Davide Libenzi
2003-05-20 01:14:38 UTC
Permalink
Post by Jamie Lokier
Post by Ingo Molnar
FUTEX_FD is an instant DoS, it allows the pinning of one page per file
descriptor, per thread. With a default limit of 1024 open files per
thread, and 256 threads (on a sane/conservative setup), this means 1 GB of
RAM can be pinned down by a normal unprivileged user.
The correct solution [;)] is EP_FUTEX - allow a futex to be specified
as the source of an epoll event.
Futexes do support f_op->poll(), isn't it Rusty ? If so, they're supported
by epoll ...



- Davide
Jamie Lokier
2003-05-20 01:44:03 UTC
Permalink
Post by Davide Libenzi
Post by Jamie Lokier
Post by Ingo Molnar
FUTEX_FD is an instant DoS, it allows the pinning of one page per file
descriptor, per thread. With a default limit of 1024 open files per
thread, and 256 threads (on a sane/conservative setup), this means 1 GB of
RAM can be pinned down by a normal unprivileged user.
The correct solution [;)] is EP_FUTEX - allow a futex to be specified
as the source of an epoll event.
Futexes do support f_op->poll(), isn't it Rusty ? If so, they're supported
by epoll ...
Yes, they do and it should work (I haven't tried, though).

There is a practical problem when waiting on a futex in multiple
threads using epoll: you need a separate fd per waiter, rather than an
fd per waited-on futex. This is because some uses of futexes depend
on waiters being woken in the exact order that they were queued.

To get this ordering, every waiter must allocate its own fd. The only
practical way to do this is allocate the fd just prior to waiting, and
release it afterwards.

As you can imagine, with many threads this implies a lot of fds, which
are in limited supply, and a high rate of allocation and deallocation,
which may be relatively slow.

-- Jamie
Davide Libenzi
2003-05-20 16:54:27 UTC
Permalink
Post by Jamie Lokier
Yes, they do and it should work (I haven't tried, though).
There is a practical problem when waiting on a futex in multiple
threads using epoll: you need a separate fd per waiter, rather than an
fd per waited-on futex. This is because some uses of futexes depend
on waiters being woken in the exact order that they were queued.
Not really Jamie. See a Futex event is not much different from a network
one. When the event shows up, one thread will pick it up (epoll_wait) and
will handle it. A futex event will very likely be a green light for some
sort of resource usage that whatever thread can pick up and handle.



- Davide
Jamie Lokier
2003-05-20 17:59:22 UTC
Permalink
Post by Davide Libenzi
Post by Jamie Lokier
Yes, they do and it should work (I haven't tried, though).
There is a practical problem when waiting on a futex in multiple
threads using epoll: you need a separate fd per waiter, rather than an
fd per waited-on futex. This is because some uses of futexes depend
on waiters being woken in the exact order that they were queued.
Not really Jamie. See a Futex event is not much different from a network
one. When the event shows up, one thread will pick it up (epoll_wait) and
will handle it. A futex event will very likely be a green light for some
sort of resource usage that whatever thread can pick up and handle.
That is true when a futex represents some item of work to do, or
readiness such as data coming in or going it. Then it is quite
reasonable to think of it like a pollable fd.

But futexes are also used to represent contention for shared
resources, and the properties needed for this are quite different.

See futex_up_fair() in Rusty's futex-2.2 library. That depends
crucially on getting exactly the right number of wakeup tokens passed,
and in the order the waiters blocked.

To pick an example, consider a dynamic web server which is
occasionally asked to render complex images, but where most of the
content is easily generated. You might have 30 concurrent page
serving threads, but want to limit the number of threads which are
generating a particularly complex response to 3 at a time because of
memory constraints.

You cannot program this as putting things on a work queue and having
arbitrary threads pick them off, unless you are prepared to represent
the problem as an explicit state machine, where the intermediate
states can be represented as a data structure. If the state is too
complex for that, which is often a reason for using threads in the
first place, that is not an option.

So it is important that _those_ kind of waiting threads are woken in
something approximating the order they went to sleep. Otherwise there
is a danger of certain requests being starved unfairly.

To implement this you can use a simple futex, or a significantly more
complex, and slower, queue structure.

Using the futex, you either need an fd per waiter to get this
fairness, when the waiters are using epoll to listen for more than one
event, _or_ you need epoll to be clever and preserve the sleep-to-wake
ordering over a single futex. I favour that last solution.

-- Jamie
Davide Libenzi
2003-05-21 23:56:56 UTC
Permalink
Post by Jamie Lokier
Post by Davide Libenzi
Not really Jamie. See a Futex event is not much different from a network
one. When the event shows up, one thread will pick it up (epoll_wait) and
will handle it. A futex event will very likely be a green light for some
sort of resource usage that whatever thread can pick up and handle.
That is true when a futex represents some item of work to do, or
readiness such as data coming in or going it. Then it is quite
reasonable to think of it like a pollable fd.
But futexes are also used to represent contention for shared
resources, and the properties needed for this are quite different.
See futex_up_fair() in Rusty's futex-2.2 library. That depends
crucially on getting exactly the right number of wakeup tokens passed,
and in the order the waiters blocked.
To pick an example, consider a dynamic web server which is
occasionally asked to render complex images, but where most of the
content is easily generated. You might have 30 concurrent page
serving threads, but want to limit the number of threads which are
generating a particularly complex response to 3 at a time because of
memory constraints.
You cannot program this as putting things on a work queue and having
arbitrary threads pick them off, unless you are prepared to represent
the problem as an explicit state machine, where the intermediate
states can be represented as a data structure. If the state is too
complex for that, which is often a reason for using threads in the
first place, that is not an option.
Jamie, if you use the one thread per connection model you don't need epoll
at all. The std poll() scale very well with one fd :) If you're using a
N:1 (N connections over one thread/task) you definitely need either a
state machine or something like coroutines. With futex->poll() (under
epoll) you can easily create the object you need to solve your server
problem. You create an object (structure) that contain the futex and a
list of coroutines/state-structures. You drop the pointer to this
structure inside data.ptr of the epoll fd data. When you get an event you
invoke/proceed with a FIFO order. When an object wait on this "mutex" you
drop the calling coroutine/state-structure inside the list tail. All those
ops (getting data from epoll event, removing list head, adding to tail)
are lightning fast. Actually, strictly speaking, you don't even need a
futex to do this with a N:1 model :) But you do need it for a N:M (M > 1) model.



- Davide
Jamie Lokier
2003-05-20 01:50:47 UTC
Permalink
Post by Rusty Russell
2) Control the number which can be used at once.
The current implementation needs a tighter #2.
Alternatively, make a page swappable even while it's being waited on
somehow. Can the vcache help with that?

-- Jamie
Rusty Russell
2003-05-20 00:15:23 UTC
Permalink
Post by Ingo Molnar
Post by Ingo Molnar
- start the phasing out of FUTEX_FD. This i believe is quite unclean and
unrobust, [...]
FUTEX_FD is an instant DoS, it allows the pinning of one page per file
descriptor, per thread. With a default limit of 1024 open files per
thread, and 256 threads (on a sane/conservative setup), this means 1 GB of
RAM can be pinned down by a normal unprivileged user.
Yes. There was a patch which limited it, never got applied.

The real solution is not to pin the page: I pinned the page originally
to prevent dealing with addresses changing due to swap out, but you
found the COW bug and that blew away that theory anyway 8)

I think the vcache callbacks or similar could be extended to cover the
swap out/swap in case.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Rusty Russell
2003-05-20 00:08:46 UTC
Permalink
Post by Christoph Hellwig
Urgg, yet another sys_futex extension. Could you please split all
these totally different cases into separate syscalls instead?
I don't mind either way, but wouldn't that be pointless incompatible
churn?

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
V***@vt.edu
2003-05-20 00:31:58 UTC
Permalink
Post by Rusty Russell
Post by Christoph Hellwig
Urgg, yet another sys_futex extension. Could you please split all
these totally different cases into separate syscalls instead?
I don't mind either way, but wouldn't that be pointless incompatible
churn?
Doesn't matter, the churn hasn't been reflected in glibc-kernheaders yet
either way. Once kernheaders gets updated to match, you're stuck with whatever
it is at that point, so churn now while you have a chance. ;)

(Sorry, I couldn't resist)
Andrew Morton
2003-05-19 10:23:25 UTC
Permalink
Post by Ingo Molnar
+ page1 = __pin_page(uaddr1 - offset1);
+ if (!page1) {
+ unlock_futex_mm();
+ return -EFAULT;
+ }
+ page2 = __pin_page(uaddr2 - offset2);
+ if (!page2) {
+ unlock_futex_mm();
+ return -EFAULT;
+ }
page1 is leaked.
Ingo Molnar
2003-05-19 10:30:11 UTC
Permalink
Post by Andrew Morton
page1 is leaked.
doh, indeed. -D4 patch attached. It also fixes the line-wrap noticed by
Christoph Hellwig. Patch applies, compiles and boots fine.

Ingo

--- linux/include/linux/futex.h.orig
+++ linux/include/linux/futex.h
@@ -5,7 +5,8 @@
#define FUTEX_WAIT (0)
#define FUTEX_WAKE (1)
#define FUTEX_FD (2)
+#define FUTEX_REQUEUE (3)

-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime);
+extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2);

#endif
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -457,7 +457,7 @@ void mm_release(struct task_struct *tsk,
* not set up a proper pointer then tough luck.
*/
put_user(0, tidptr);
- sys_futex(tidptr, FUTEX_WAKE, 1, NULL);
+ sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL);
}
}

--- linux/kernel/compat.c.orig
+++ linux/kernel/compat.c
@@ -214,7 +214,7 @@ asmlinkage long compat_sys_sigprocmask(i
extern long do_futex(unsigned long, int, int, unsigned long);

asmlinkage long compat_sys_futex(u32 *uaddr, int op, int val,
- struct compat_timespec *utime)
+ struct compat_timespec *utime, u32 *uaddr2)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
@@ -224,7 +224,7 @@ asmlinkage long compat_sys_futex(u32 *ua
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout);
+ return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
}

asmlinkage long sys_setrlimit(unsigned int resource, struct rlimit *rlim);
--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -2,6 +2,9 @@
* Fast Userspace Mutexes (which I call "Futexes!").
* (C) Rusty Russell, IBM 2002
*
+ * Generalized futexes, futex requeueing, misc fixes by Ingo Molnar
+ * (C) Copyright 2003 Red Hat Inc, All Rights Reserved
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -9,9 +12,6 @@
* "The futexes are also cursed."
* "But they come in a choice of three flavours!"
*
- * Generalized futexes for every mapping type, Ingo Molnar, 2002
- *
- *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -216,6 +216,62 @@ static void futex_vcache_callback(vcache
spin_unlock(&futex_lock);
}

+/*
+ * Requeue all waiters hashed on one physical page to another
+ * physical page.
+ */
+static int futex_requeue(unsigned long uaddr1, int offset1, unsigned long uaddr2, int offset2, int num)
+{
+ struct list_head *i, *next, *head1, *head2;
+ struct page *page1 = NULL, *page2 = NULL;
+ int ret = 0;
+
+ lock_futex_mm();
+
+ page1 = __pin_page(uaddr1 - offset1);
+ if (!page1)
+ goto out;
+ page2 = __pin_page(uaddr2 - offset2);
+ if (!page2)
+ goto out;
+
+ head1 = hash_futex(page1, offset1);
+ head2 = hash_futex(page2, offset2);
+
+ list_for_each_safe(i, next, head1) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->page == page1 && this->offset == offset1) {
+ list_del_init(i);
+ __detach_vcache(&this->vcache);
+ if (++ret <= num) {
+ wake_up_all(&this->waiters);
+ if (this->filp)
+ send_sigio(&this->filp->f_owner,
+ this->fd, POLL_IN);
+ } else {
+ unpin_page(this->page);
+ __pin_page_atomic (page2);
+ list_add_tail(i, head2);
+ __attach_vcache(&this->vcache, uaddr2,
+ current->mm, futex_vcache_callback);
+ this->offset = offset2;
+ this->page = page2;
+ }
+ }
+ }
+
+out:
+ unlock_futex_mm();
+
+ if (page1)
+ unpin_page(page1);
+ if (page2)
+ unpin_page(page2);
+
+ return ret;
+}
+
static inline void __queue_me(struct futex_q *q, struct page *page,
unsigned long uaddr, int offset,
int fd, struct file *filp)
@@ -425,9 +481,9 @@ out:
return ret;
}

-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout)
+long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout, unsigned long uaddr2)
{
- unsigned long pos_in_page;
+ unsigned long pos_in_page, pos_in_page2;
int ret;

pos_in_page = uaddr % PAGE_SIZE;
@@ -443,6 +499,14 @@ long do_futex(unsigned long uaddr, int o
case FUTEX_WAKE:
ret = futex_wake(uaddr, pos_in_page, val);
break;
+ case FUTEX_REQUEUE:
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page2 % sizeof(u32))
+ return -EINVAL;
+ ret = futex_requeue(uaddr, pos_in_page, uaddr2, pos_in_page2, val);
+ break;
case FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, pos_in_page, val);
@@ -453,7 +517,7 @@ long do_futex(unsigned long uaddr, int o
return ret;
}

-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime)
+asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime, u32 __user *uaddr2)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
@@ -463,7 +527,7 @@ asmlinkage long sys_futex(u32 __user *ua
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout);
+ return do_futex((unsigned long)uaddr, op, val, timeout, (unsigned long)uaddr2);
}

static struct super_block *
Rusty Russell
2003-05-20 00:04:31 UTC
Permalink
Post by Ingo Molnar
the attached patch addresses a futex related SMP scalability problem of
glibc. A number of regressions have been reported to the NTPL mailing list
when going to many CPUs, for applications that use condition variables and
the pthread_cond_broadcast() API call. Using this functionality, testcode
shows a slowdown from 0.12 seconds runtime to over 237 seconds (!)
runtime, on 4-CPU systems.
I gave feedback on this before, but didn't get a response.

1) Overload the last futex arg (change from timeval * to void *),
don't add YA arg at the end.

2) Use __alignof__(u32) not sizeof(u32). Sure, they're the same, but
you really mean __alignof__ here.

I'm glad you finally put your name at the top of the file...

Thanks!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Ulrich Drepper
2003-05-20 00:40:54 UTC
Permalink
Post by Rusty Russell
1) Overload the last futex arg (change from timeval * to void *),
don't add YA arg at the end.
It wasn't Ingo's idea. I suggested it. Overloading parameter types is
evil. This isn't an issue anymore if the extension is implemented as a
new syscall which certainly is better.
Post by Rusty Russell
2) Use __alignof__(u32) not sizeof(u32). Sure, they're the same, but
you really mean __alignof__ here.
I would always use sizeof() in this situation. alignof is a property
the compiler can potentially relax. On x86 it could easily be defined
to 1 in case someone wants to save memory. OK, not really useful for
x86, but some embedded archs might be in such a situation. Using
sizeof() makes sure that the variable is always 4-byte aligned. For
atomic operations this certainly is a good minimal requirement.

If you want to really correct you might have to introduce something like
atomic_alignof().

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
Rusty Russell
2003-05-20 01:46:45 UTC
Permalink
Post by Ulrich Drepper
Post by Rusty Russell
1) Overload the last futex arg (change from timeval * to void *),
don't add YA arg at the end.
It wasn't Ingo's idea. I suggested it. Overloading parameter types is
evil. This isn't an issue anymore if the extension is implemented as a
new syscall which certainly is better.
People are using the interface, so I don't think changing it because
"it's nicer this way" is worthwhile: Ingo's "new syscall" patch has
backwards compat code for the old syscalls. That's fugly 8(

So I'd say if we're going to multiplex, then overloading is easiest,
and in fact already done for FUTEX_FD.

But it's an issue over which sane people can disagree, IMHO.
Post by Ulrich Drepper
Post by Rusty Russell
2) Use __alignof__(u32) not sizeof(u32). Sure, they're the same, but
you really mean __alignof__ here.
I would always use sizeof() in this situation.
Comment says: /* Must be "naturally" aligned */. This used to be true
in a much earlier version of the code, now AFAICT the requirement test
should be:

/* Handling futexes on multiple pages? -ETOOHARD */
if (pos_in_page + sizeof(u32) > PAGE_SIZE)
return -EINVAL;

Ingo?

Thanks for the comments!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Ulrich Drepper
2003-05-20 02:11:51 UTC
Permalink
Post by Rusty Russell
People are using the interface, so I don't think changing it because
"it's nicer this way" is worthwhile: Ingo's "new syscall" patch has
backwards compat code for the old syscalls.
But not the new requeue variant and this is what needs the extra
parameter. So the parameter question is not issue.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
Ingo Molnar
2003-05-20 06:27:03 UTC
Permalink
Post by Ulrich Drepper
It wasn't Ingo's idea. I suggested it. Overloading parameter types is
evil. This isn't an issue anymore if the extension is implemented as a
new syscall which certainly is better.
People are using the interface, [...]
it's a _new_ argument, only used by new glibc. Like i mentioned it in the
original announcement, it's fully backwards compatible with any older app.
This is a minimum requirement for a live syscall!
[...] so I don't think changing it because "it's nicer this way" is
worthwhile: Ingo's "new syscall" patch has backwards compat code for the
old syscalls. That's fugly 8(
yes, but the damage has been done already, and now we've got to start the
slow wait for the old syscall to flush out of our tree. It will a few
years to get rid of the compat code, but we better start now. hch is
perfectly right that the old futex multiplexer interface is quite ugly,
the requeue op only made this even more prominent.
Comment says: /* Must be "naturally" aligned */. This used to be true
in a much earlier version of the code, now AFAICT the requirement test
/* Handling futexes on multiple pages? -ETOOHARD */
if (pos_in_page + sizeof(u32) > PAGE_SIZE)
return -EINVAL;
yes - but i'd rather enforce this for every futex, than to hit it in every
1000th app that manages to misalign a futex _and_ lay it across two pages.

Also, it's only x86 that guarantees atomic instructions on misaligned
futexes (even then it comes with a cycle penalty), are you sure this also
works on other architectures? So i'd rather be a bit more strict with this
requirement.

Ingo
Christoph Hellwig
2003-05-20 06:56:27 UTC
Permalink
Post by Ingo Molnar
yes, but the damage has been done already, and now we've got to start the
slow wait for the old syscall to flush out of our tree.
Actually it should go away before 2.6.0. sys_futex never was part of a
released stable kernel so having the old_ version around is silly. I
Think it's enough time until 2.6 hits the roads for people to have those
vendor libc flushed out that use it. (and sys_futex still isn't used
in the glibc CVS, only in the addon nptl package with pre-1 release
numbers.)
Rusty Russell
2003-05-20 08:57:39 UTC
Permalink
Post by Christoph Hellwig
Post by Ingo Molnar
yes, but the damage has been done already, and now we've got to start the
slow wait for the old syscall to flush out of our tree.
Actually it should go away before 2.6.0. sys_futex never was part of a
released stable kernel so having the old_ version around is silly.
Hmm, in that case I'd say "just break it", and I'd be all in favour of
demuxing the syscall.

But I think vendors have backported and released futexes, which is why
Ingo did this...

(Although you might be right about "by the time 2.6 is out" 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Ingo Molnar
2003-05-20 09:03:36 UTC
Permalink
Post by Rusty Russell
Post by Christoph Hellwig
Actually it should go away before 2.6.0. sys_futex never was part of a
released stable kernel so having the old_ version around is silly.
Hmm, in that case I'd say "just break it", and I'd be all in favour of
demuxing the syscall.
have you all gone nuts??? It's not an option to break perfectly working
binaries out there. Hell, we didnt even reorder the new NPTL
syscalls/extensions 1-2 kernel releases after the fact. Please grow up!

the interface should have been gotten right initially. We are all guilty
of it - now lets face the consequences. It's only a couple of lines of
code in a well isolated place of the file so i dont know what the fuss is
about. I havent even added FUTEX_REQUEUE to the old API.
Post by Rusty Russell
But I think vendors have backported and released futexes, which is why
Ingo did this...
of course. And which brought the productization of futexes in the first
place.

Ingo
Christoph Hellwig
2003-05-20 09:51:16 UTC
Permalink
Post by Ingo Molnar
have you all gone nuts??? It's not an option to break perfectly working
binaries out there.
Of course it is. Linux has enough problem problems due to past mainline
stupidities, now we don't need to codify vendor braindamages aswell. E.g
mainline doesn't have the RH AS aio vsyscall crap or suse's get dev_t behind
/dev/console stuff either. If Red Hat thinks it needs to live with more interfaces
than what the stable kernel release provide their on their own luck.

And it's a lot of time to 2.6 anyway, don't tell me Jakub isn't smart enough to
get a glibc rpm out by then that works with the new and the old futex stuff.
Ingo Molnar
2003-05-20 12:56:42 UTC
Permalink
Post by Christoph Hellwig
Post by Ingo Molnar
have you all gone nuts??? It's not an option to break perfectly working
binaries out there.
Of course it is. Linux has enough problem problems due to past mainline
stupidities, now we don't need to codify vendor braindamages aswell.
it's not vendor braindamage. First, sys_futex() was pretty much given as
an API, contributed by IBM (Rusty) around 2.5.8, mainly for NGPT. The
sys_futex() API would have gone into 2.6.0 with flying colors as-is if
NPTL hadnt been around.

You probably wouldnt even have noticed this uncleanliness in the API, if
not for my requeue patch from yesterday. (which is another feature to make
futexes more useful, also driven by NPTL, not some evil ploy.)

face it, there will always be mistakes when creating APIs. But removing
them arbitrarily just makes the mistake worse, two mistakes do not make a
correct move.

most of the time we catch crap before it gets into the mainline kernel.
Sometimes it takes months to notice, sometimes it takes years. I certainly
saw no-one complaining since sys_futex() went into 2.5.8 or so, more than
a year ago!

i'd be the last one to complain about some system-utility API being
changed, but here we are talking about tons of user installations.
Post by Christoph Hellwig
And it's a lot of time to 2.6 anyway, don't tell me Jakub isn't smart
enough to get a glibc rpm out by then that works with the new and the
old futex stuff.
you dont understand, do you? There are very valid and perfectly working
glibc installations [and maybe NGPT installations, futex users, etc.] out
there that will break if you remove sys_futex(). No amount of rpm hacking
will fix them up.

what do you think the reaction would have been, if half a year ago, when i
sent the first NPTL patches, i had started off by rewriting the
sys_futex() API and break the hell out of NGPT installations, just for the
fun of having a slightly cleaner API. We'd still be stuck with
LinuxThreads, i'm quite sure.

hell, this is 20 lines of code and no strings attached, and i've done all
the work already, it's not a big issue. So stop this purist crap ...

full futex patch against BK-curr attached: it fixes the page count leaks,
cleans up the code, splits up the futex ops into separate syscalls, keeps
the compatibility syscall and adds the requeueing feature. FUTEX_FD is
still fully supported. (and there's a sys_futex_fd syscall) (The patch
also fixes some leftover uaddr2 parameter passing present in the previous
patches.)

(the patch compiles, boots & works on x86/SMP, with NPTL enabled and
disabled as well.)

i think it might be more productive to put all this energy into comments
directed at the new APIs this patch adds.

Eg. is this alignment check the best one we can get? Is this particular
requeue API really the one we want? Plus now is the right time to add or
remove any functionality from the new syscalls. Next time around it will
be too late.

Ingo

--- linux/include/linux/futex.h.orig
+++ linux/include/linux/futex.h
@@ -2,10 +2,15 @@
#define _LINUX_FUTEX_H

/* Second argument to futex syscall */
-#define FUTEX_WAIT (0)
-#define FUTEX_WAKE (1)
-#define FUTEX_FD (2)

-extern asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime);
+asmlinkage long sys_futex_wait(u32 __user *__uaddr,int val,
+ struct timespec __user *utime);
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int nr_wake);
+
+asmlinkage long sys_futex_fd(u32 __user *__uaddr, int signal);
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1, u32 __user *__uaddr2,
+ int nr_wake);

#endif
--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -837,7 +837,7 @@ ENTRY(sys_call_table)
.long sys_fremovexattr
.long sys_tkill
.long sys_sendfile64
- .long sys_futex /* 240 */
+ .long old_futex /* 240 */
.long sys_sched_setaffinity
.long sys_sched_getaffinity
.long sys_set_thread_area
@@ -865,6 +865,9 @@ ENTRY(sys_call_table)
.long sys_clock_gettime /* 265 */
.long sys_clock_getres
.long sys_clock_nanosleep
-
+ .long sys_futex_wait
+ .long sys_futex_wake
+ .long sys_futex_fd /* 270 */
+ .long sys_futex_requeue

nr_syscalls=(.-sys_call_table)/4
--- linux/kernel/compat.c.orig
+++ linux/kernel/compat.c
@@ -211,7 +211,7 @@ asmlinkage long compat_sys_sigprocmask(i
return ret;
}

-extern long do_futex(unsigned long, int, int, unsigned long);
+extern long do_old_futex(unsigned long uaddr, int op, int val, unsigned long timeout);

asmlinkage long compat_sys_futex(u32 *uaddr, int op, int val,
struct compat_timespec *utime)
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -457,7 +457,7 @@ void mm_release(struct task_struct *tsk,
* not set up a proper pointer then tough luck.
*/
put_user(0, tidptr);
- sys_futex(tidptr, FUTEX_WAKE, 1, NULL);
+ sys_futex_wake(tidptr, 1);
}
}

--- linux/kernel/futex.c.orig
+++ linux/kernel/futex.c
@@ -2,6 +2,9 @@
* Fast Userspace Mutexes (which I call "Futexes!").
* (C) Rusty Russell, IBM 2002
*
+ * Generalized futexes, futex requeueing, misc fixes by Ingo Molnar
+ * (C) Copyright 2003 Red Hat Inc, All Rights Reserved
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -9,9 +12,6 @@
* "The futexes are also cursed."
* "But they come in a choice of three flavours!"
*
- * Generalized futexes for every mapping type, Ingo Molnar, 2002
- *
- *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -93,19 +93,18 @@ static inline struct list_head *hash_fut
FUTEX_HASHBITS)];
}

-/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
-static inline void tell_waiter(struct futex_q *q)
-{
- wake_up_all(&q->waiters);
- if (q->filp)
- send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
-}
-
/*
* Get kernel address of the user page and pin it.
*
* Must be called with (and returns with) all futex-MM locks held.
*/
+static inline struct page *__pin_page_atomic (struct page *page)
+{
+ if (!PageReserved(page))
+ get_page(page);
+ return page;
+}
+
static struct page *__pin_page(unsigned long addr)
{
struct mm_struct *mm = current->mm;
@@ -116,11 +115,8 @@ static struct page *__pin_page(unsigned
* Do a quick atomic lookup first - this is the fastpath.
*/
page = follow_page(mm, addr, 0);
- if (likely(page != NULL)) {
- if (!PageReserved(page))
- get_page(page);
- return page;
- }
+ if (likely(page != NULL))
+ return __pin_page_atomic(page);

/*
* No luck - need to fault in the page:
@@ -150,16 +146,11 @@ repeat_lookup:
return page;
}

-static inline void unpin_page(struct page *page)
-{
- put_page(page);
-}
-
/*
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(unsigned long uaddr, int offset, int num)
+static inline int futex_wake(unsigned long uaddr, int offset, int num)
{
struct list_head *i, *next, *head;
struct page *page;
@@ -181,7 +172,9 @@ static int futex_wake(unsigned long uadd
if (this->page == page && this->offset == offset) {
list_del_init(i);
__detach_vcache(&this->vcache);
- tell_waiter(this);
+ wake_up_all(&this->waiters);
+ if (this->filp)
+ send_sigio(&this->filp->f_owner, this->fd, POLL_IN);
ret++;
if (ret >= num)
break;
@@ -189,7 +182,7 @@ static int futex_wake(unsigned long uadd
}

unlock_futex_mm();
- unpin_page(page);
+ put_page(page);

return ret;
}
@@ -208,7 +201,9 @@ static void futex_vcache_callback(vcache
spin_lock(&futex_lock);

if (!list_empty(&q->list)) {
+ put_page(q->page);
q->page = new_page;
+ __pin_page_atomic(new_page);
list_del(&q->list);
list_add_tail(&q->list, head);
}
@@ -216,6 +211,63 @@ static void futex_vcache_callback(vcache
spin_unlock(&futex_lock);
}

+/*
+ * Requeue all waiters hashed on one physical page to another
+ * physical page.
+ */
+static inline int futex_requeue(unsigned long uaddr1, int offset1,
+ unsigned long uaddr2, int offset2, int num)
+{
+ struct list_head *i, *next, *head1, *head2;
+ struct page *page1 = NULL, *page2 = NULL;
+ int ret = 0;
+
+ lock_futex_mm();
+
+ page1 = __pin_page(uaddr1 - offset1);
+ if (!page1)
+ goto out;
+ page2 = __pin_page(uaddr2 - offset2);
+ if (!page2)
+ goto out;
+
+ head1 = hash_futex(page1, offset1);
+ head2 = hash_futex(page2, offset2);
+
+ list_for_each_safe(i, next, head1) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->page == page1 && this->offset == offset1) {
+ list_del_init(i);
+ __detach_vcache(&this->vcache);
+ if (++ret <= num) {
+ wake_up_all(&this->waiters);
+ if (this->filp)
+ send_sigio(&this->filp->f_owner,
+ this->fd, POLL_IN);
+ } else {
+ put_page(this->page);
+ __pin_page_atomic (page2);
+ list_add_tail(i, head2);
+ __attach_vcache(&this->vcache, uaddr2,
+ current->mm, futex_vcache_callback);
+ this->offset = offset2;
+ this->page = page2;
+ }
+ }
+ }
+
+out:
+ unlock_futex_mm();
+
+ if (page1)
+ put_page(page1);
+ if (page2)
+ put_page(page2);
+
+ return ret;
+}
+
static inline void __queue_me(struct futex_q *q, struct page *page,
unsigned long uaddr, int offset,
int fd, struct file *filp)
@@ -252,7 +304,7 @@ static inline int unqueue_me(struct fute
return ret;
}

-static int futex_wait(unsigned long uaddr,
+static inline int futex_wait(unsigned long uaddr,
int offset,
int val,
unsigned long time)
@@ -273,14 +325,17 @@ static int futex_wait(unsigned long uadd
}
__queue_me(&q, page, uaddr, offset, -1, NULL);

- unlock_futex_mm();
-
- /* Page is pinned, but may no longer be in this address space. */
+ /*
+ * Page is pinned, but may no longer be in this address space.
+ * It cannot schedule, so we access it with the spinlock held.
+ */
if (get_user(curval, (int *)uaddr) != 0) {
+ unlock_futex_mm();
ret = -EFAULT;
goto out;
}
if (curval != val) {
+ unlock_futex_mm();
ret = -EWOULDBLOCK;
goto out;
}
@@ -288,13 +343,15 @@ static int futex_wait(unsigned long uadd
* The get_user() above might fault and schedule so we
* cannot just set TASK_INTERRUPTIBLE state when queueing
* ourselves into the futex hash. This code thus has to
- * rely on the FUTEX_WAKE code doing a wakeup after removing
+ * rely on the futex_wake() code doing a wakeup after removing
* the waiter from the list.
*/
add_wait_queue(&q.waiters, &wait);
set_current_state(TASK_INTERRUPTIBLE);
- if (!list_empty(&q.list))
+ if (!list_empty(&q.list)) {
+ unlock_futex_mm();
time = schedule_timeout(time);
+ }
set_current_state(TASK_RUNNING);
/*
* NOTE: we don't remove ourselves from the waitqueue because
@@ -310,7 +367,7 @@ out:
/* Were we woken up anyway? */
if (!unqueue_me(&q))
ret = 0;
- unpin_page(page);
+ put_page(q.page);

return ret;
}
@@ -320,7 +377,7 @@ static int futex_close(struct inode *ino
struct futex_q *q = filp->private_data;

unqueue_me(q);
- unpin_page(q->page);
+ put_page(q->page);
kfree(filp->private_data);
return 0;
}
@@ -416,11 +473,76 @@ static int futex_fd(unsigned long uaddr,
page = NULL;
out:
if (page)
- unpin_page(page);
+ put_page(page);
return ret;
}

-long do_futex(unsigned long uaddr, int op, int val, unsigned long timeout)
+asmlinkage long sys_futex_wait(u32 __user *__uaddr, int val, struct timespec __user *utime)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+ unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
+ struct timespec t;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ if (utime) {
+ if (copy_from_user(&t, utime, sizeof(t)) != 0)
+ return -EFAULT;
+ timeout = timespec_to_jiffies(&t) + 1;
+ }
+ return futex_wait(uaddr, pos_in_page, val, timeout);
+}
+
+asmlinkage long sys_futex_wake(u32 __user *__uaddr, int nr_wake)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ return futex_wake(uaddr, pos_in_page, nr_wake);
+}
+
+asmlinkage long sys_futex_fd(u32 __user *__uaddr, int signal)
+{
+ unsigned long uaddr = (unsigned long)__uaddr;
+ unsigned long pos_in_page = uaddr % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if (pos_in_page % sizeof(u32))
+ return -EINVAL;
+
+ return futex_fd(uaddr, pos_in_page, signal);
+}
+
+asmlinkage long sys_futex_requeue(u32 __user *__uaddr1,
+ u32 __user *__uaddr2, int nr_wake)
+{
+ unsigned long uaddr1 = (unsigned long)__uaddr1,
+ uaddr2 = (unsigned long)__uaddr2;
+ unsigned long pos_in_page1 = uaddr1 % PAGE_SIZE,
+ pos_in_page2 = uaddr2 % PAGE_SIZE;
+
+ /* Must be "naturally" aligned */
+ if ((pos_in_page1 | pos_in_page2) % sizeof(u32))
+ return -EINVAL;
+
+ return futex_requeue(uaddr1, pos_in_page1, uaddr2, pos_in_page2, nr_wake);
+}
+
+/*
+ * Compatibility code:
+ */
+#define OLD_FUTEX_WAIT (0)
+#define OLD_FUTEX_WAKE (1)
+#define OLD_FUTEX_FD (2)
+
+long do_old_futex(unsigned long uaddr, int op, int val, unsigned long timeout)
{
unsigned long pos_in_page;
int ret;
@@ -432,13 +554,13 @@ long do_futex(unsigned long uaddr, int o
return -EINVAL;

switch (op) {
- case FUTEX_WAIT:
+ case OLD_FUTEX_WAIT:
ret = futex_wait(uaddr, pos_in_page, val, timeout);
break;
- case FUTEX_WAKE:
+ case OLD_FUTEX_WAKE:
ret = futex_wake(uaddr, pos_in_page, val);
break;
- case FUTEX_FD:
+ case OLD_FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, pos_in_page, val);
break;
@@ -448,17 +570,17 @@ long do_futex(unsigned long uaddr, int o
return ret;
}

-asmlinkage long sys_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime)
+asmlinkage long old_futex(u32 __user *uaddr, int op, int val, struct timespec __user *utime)
{
struct timespec t;
unsigned long timeout = MAX_SCHEDULE_TIMEOUT;

- if ((op == FUTEX_WAIT) && utime) {
+ if ((op == OLD_FUTEX_WAIT) && utime) {
if (copy_from_user(&t, utime, sizeof(t)) != 0)
return -EFAULT;
timeout = timespec_to_jiffies(&t) + 1;
}
- return do_futex((unsigned long)uaddr, op, val, timeout);
+ return do_old_futex((unsigned long)uaddr, op, val, timeout);
}

static struct super_block *
Christoph Hellwig
2003-05-20 14:08:26 UTC
Permalink
Post by Ingo Molnar
it's not vendor braindamage.
I didn't sayd sys_futex is vendor braindamage. The problem is that vendors
think they need to put all the unstable features into their release tree.
There's a reaaon those aren';t in the released stable kernels.
Post by Ingo Molnar
you dont understand, do you? There are very valid and perfectly working
glibc installations [and maybe NGPT installations, futex users, etc.] out
there that will break if you remove sys_futex(). No amount of rpm hacking
will fix them up.
And they'll break if you run _any_ released stable kernel, so what?
Let the vendors fix the mess they introduced by backporting unstable features
instead of burdening it up on mainline.
Ingo Molnar
2003-05-20 16:02:07 UTC
Permalink
Post by Christoph Hellwig
Post by Ingo Molnar
you dont understand, do you? There are very valid and perfectly working
glibc installations [and maybe NGPT installations, futex users, etc.] out
there that will break if you remove sys_futex(). No amount of rpm hacking
will fix them up.
And they'll break if you run _any_ released stable kernel, so what?
you havent ever used Ulrich's nptl-enabled glibc, have you? It will boot
on any 2.4.1+ kernel, with and without nptl/tls support. It switches the
threading implementation depending on the kernel features it detects.

Ingo
Christoph Hellwig
2003-05-20 19:55:12 UTC
Permalink
Post by Ingo Molnar
you havent ever used Ulrich's nptl-enabled glibc, have you? It will boot
on any 2.4.1+ kernel, with and without nptl/tls support. It switches the
threading implementation depending on the kernel features it detects.
I have built a nptl-enabled glibc and no, it's doesn't work on 2.4 at all.
Christoph Hellwig
2003-05-21 06:31:12 UTC
Permalink
Post by Christoph Hellwig
Post by Ingo Molnar
you havent ever used Ulrich's nptl-enabled glibc, have you? It will boot
on any 2.4.1+ kernel, with and without nptl/tls support. It switches the
threading implementation depending on the kernel features it detects.
I have built a nptl-enabled glibc and no, it's doesn't work on 2.4 at all.
It is because you only compiled it with nptl support.
I know. I didn't ever claim you can install multiple glibc version.
In recent (nptl enabled) Redhat glibc's glibc is build two times.
For x86 it's even built three times..
Martin Schlemmer
2003-05-21 05:06:57 UTC
Permalink
Post by Christoph Hellwig
Post by Ingo Molnar
you havent ever used Ulrich's nptl-enabled glibc, have you? It will boot
on any 2.4.1+ kernel, with and without nptl/tls support. It switches the
threading implementation depending on the kernel features it detects.
I have built a nptl-enabled glibc and no, it's doesn't work on 2.4 at all.
It is because you only compiled it with nptl support.

In recent (nptl enabled) Redhat glibc's glibc is build two times.
1) without nptl
2) with nptl

The version without nptl support is then installed into the 'default'
location. The other is installed into /lib/tls/ and /usr/lib/tls/.
ld.so is then hacked (maybe in mainline glibc now?) to load the tls
enabled versions of the libraries only if the kernel/hardware support
it.


Regards,
--
Martin Schlemmer
Linus Torvalds
2003-05-20 15:46:09 UTC
Permalink
Post by Christoph Hellwig
Post by Ingo Molnar
have you all gone nuts??? It's not an option to break perfectly working
binaries out there.
Of course it is.
NO.

Christoph, get a grip. Ingo is 100% right.

IT IS NEVER ACCEPTABLE TO BREAK USER LEVEL BINARIES! In particular, we do
_not_ do it just because of some sense of "aesthetics".

If you want "aesthetics", go play with microkernels, or other academic
projects. They don't care about their users, they care about their ideas.
The end result is, in my opinion, CRAP.

Linux has never been that way. The _founding_ principle of Linux was
"compatibility". Don't ever forget that. The user comes first. ALWAYS.
Pretty code is a desirable feature, but if prettifying the code breaks
user apps, it doesn't get prettified.

Repeat after me: the goodness of an operating system is not in how pretty
it is, but in how well it supports the user.

Make it your mantra.

Linus
Ingo Oeser
2003-05-20 09:12:28 UTC
Permalink
Hi,
Post by Christoph Hellwig
Post by Ingo Molnar
yes, but the damage has been done already, and now we've got to start the
slow wait for the old syscall to flush out of our tree.
Actually it should go away before 2.6.0. sys_futex never was part of a
released stable kernel so having the old_ version around is silly. I
Think it's enough time until 2.6 hits the roads for people to have those
vendor libc flushed out that use it. (and sys_futex still isn't used
in the glibc CVS, only in the addon nptl package with pre-1 release
numbers.)
This sounds reasonable. The people who shipped that already to
customers have so heavily patched kernels, that this simple patch
for the old sys_futex shouldn't really matter.

But cluttering the kernel with an API/ABI that was born in the
same development cycle, where it has been obsoleted sounds not
worth the bytes it consumes. And so we have a syscall slot
available for the next development cycle.

Or did anybody promise that this was final already? Usally this
promise comes with x.even.y not with x.odd.y .

Regards

Ingo Oeser
Linus Torvalds
2003-05-20 15:41:53 UTC
Permalink
Post by Christoph Hellwig
Actually it should go away before 2.6.0. sys_futex never was part of a
released stable kernel so having the old_ version around is silly.
That's not a valid argument. That's an _excuse_.

Guys, binary compatibility is important. It's important enough that if
something got extensively used on development kernels, it's _still_ a hell
of a lot more important than most other things around. The _only_ things
that trump binary compatibility are

- developer sanity (ie it has to be truly mindbogglingly hard to support
the old interface)
- stability (ie if the old interface was so badly designed that it can't
be done right - mmap on /proc/<pid>/mem was one of these).
- it's been deprecated over at least one full stable release.

Something like "it's only been in the development kernels" is simply not
an issue. The only thing that matters is whether it is used by various
binaries or not. And I think futexes are used a lot by glibc..

Linus
Rusty Russell
2003-05-20 08:55:29 UTC
Permalink
Post by Ingo Molnar
yes, but the damage has been done already, and now we've got to start the
slow wait for the old syscall to flush out of our tree. It will a few
years to get rid of the compat code, but we better start now. hch is
perfectly right that the old futex multiplexer interface is quite ugly,
the requeue op only made this even more prominent.
It's a judgement call: how simple it is to change vs. the amount of
damage done by not changing it.

I don't think it's worth changing, but I don't think we're going to
convince each other.
Post by Ingo Molnar
Post by Rusty Russell
Comment says: /* Must be "naturally" aligned */. This used to be true
in a much earlier version of the code, now AFAICT the requirement test
/* Handling futexes on multiple pages? -ETOOHARD */
if (pos_in_page + sizeof(u32) > PAGE_SIZE)
return -EINVAL;
yes - but i'd rather enforce this for every futex, than to hit it in every
1000th app that manages to misalign a futex _and_ lay it across two pages.
Good point. I'd prefer to fix the comment though, since it's not
true. How about changing it to something like:

/* We can't handle futexes across multiple pages: best to reject
any crazy alignment to save the users from themselves. */
Post by Ingo Molnar
Also, it's only x86 that guarantees atomic instructions on misaligned
futexes (even then it comes with a cycle penalty), are you sure this also
works on other architectures? So i'd rather be a bit more strict with this
requirement.
Sure. My point was that this comment is actually from a v. early futex
version where the kernel actually did the atomic ops itself.

Hope that clarifies!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Ingo Molnar
2003-05-20 06:19:03 UTC
Permalink
Post by Rusty Russell
1) Overload the last futex arg (change from timeval * to void *),
don't add YA arg at the end.
the right approach is the splitup of the arguments, done in the later
cleanup patch. But i'd rather add yet another argument than some nasty
overload of arguments ...
Post by Rusty Russell
2) Use __alignof__(u32) not sizeof(u32). Sure, they're the same, but
you really mean __alignof__ here.
ok. Could they ever be realistically different?

Ingo
Rusty Russell
2003-05-21 02:34:50 UTC
Permalink
Ingo's last patch is just fine. New functionality should not be added
to the to-be-obsoleted interfaces.
Perhaps I was reading too much into Linus' mail, but I read it as
"don't obsolete the old interface and introduce a new one just because
of some sense of aesthetics".

ie. it's not a to-be-obsoleted interface.

I think we're long past this being a productive thread, though.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Ingo Molnar
2003-05-21 09:48:49 UTC
Permalink
Perhaps I was reading too much into Linus' mail, but I read it as "don't
obsolete the old interface and introduce a new one just because of some
sense of aesthetics".
no. The concept is: "dont cause the user any pain". Reshuffling the
syscall internally and providing new interfaces for the feature to be
exposed in a cleaner way is perfectly OK as long as this does not hurt
anything else - and it does not in this case. New glibc will use the new
syscalls and will fall back to the old one if they are -ENOSYS, so new
glibc will work on older kernels as well. Old glibc will work with old
kernels and new kernels as well. This is being done for other interfaces
currently, this is the only mechanism to 'flush out' old syscalls
gracefully.

the newer a kernel and a glibc is, the less 'fallback' happens - the
faster the application is, so this compatibility scheme makes perfect
sense both support-wise and technically. (not counting the speedup caused
by the interface cleanups themselves)

not adding FUTEX_REQUEUE to the old syscall is the right solution - it was
never exposed to anyone in this way, so old glibc doesnt know about it -
and new glibc will never use the old interfaces.

any other disagreements?

Ingo
Christoph Hellwig
2003-05-21 10:24:11 UTC
Permalink
Post by Ingo Molnar
no. The concept is: "dont cause the user any pain". Reshuffling the
syscall internally and providing new interfaces for the feature to be
exposed in a cleaner way is perfectly OK as long as this does not hurt
anything else - and it does not in this case. New glibc will use the new
syscalls and will fall back to the old one if they are -ENOSYS, so new
glibc will work on older kernels as well. Old glibc will work with old
kernels and new kernels as well. This is being done for other interfaces
currently, this is the only mechanism to 'flush out' old syscalls
gracefully.
I don't think anyone disagreed with that (at least not me). The only
thing I argued is that we don't need the usual flush-out period of two
stable series because this syscall never was implemented in any relead
stable kernel but rather a much shorter one so that this feature never
hits a released stable kernel.

Linus disagrees strongly so we'll have to keep this crap around for five
years - that's just yet another bit of bloat growing everyones kernel
but no irreperable damage.
Rusty Russell
2003-05-22 00:30:54 UTC
Permalink
Post by Ingo Molnar
Perhaps I was reading too much into Linus' mail, but I read it as "don't
obsolete the old interface and introduce a new one just because of some
sense of aesthetics".
no. The concept is: "dont cause the user any pain". Reshuffling the
syscall internally and providing new interfaces for the feature to be
exposed in a cleaner way is perfectly OK as long as this does not hurt
anything else
I understand what you're saying, but I disagree.

See, I don't think the current interface is too ugly to live.
Especially since you don't need to introduce a new arg to implement
FUTEX_REQUEUE, so there's no immediate problem with it.

Do you understand what I'm saying? If so, we can agree to disagree,
and it doesn't matter, because Linus will take your patch 8)

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Ingo Molnar
2003-05-22 09:15:20 UTC
Permalink
See, I don't think the current interface is too ugly to live. Especially
since you don't need to introduce a new arg to implement FUTEX_REQUEUE,
so there's no immediate problem with it.
i think overloading a completely unrelated pointer (ie. overloading the
timespec pointer with a futex pointer, depending on the value of the 'op'
flag), within an already overcrowded multiplexing syscall is a 100%
non-starter to begin with.

really, i dont see what your problem with the new syscalls are. Futexes
started out as a special, limited hack for userspace threading, but by
today they have become a first-class concept: generic, user-accessible
waitqueues identified by the piece of VM they are located at. Very
flexible, very useful and very fast.

eg. under LinuxThreads, all the pthread_* functions were assuming a shared
VM. glibc functions like pthread_mutex_lock() only worked for 'threaded'
setups - ie. threads sharing pagetables via CLONE_VM. LinuxThreads had to
use thread pointers, lists and other constructs that made the library only
useful for the limited 'threaded' case.

today, with NPTL + glibc, all these pthread APIs are 100% available for
processes as well: all that is needed are the variables to be shared
between two processes. Ie. it works for shared mmap()-ed regions, or SysV
regions. Just #include <pthread.h> and use these rich locking APIs in your
process-based applications. No need to go to threads to be able to use the
pthread locking APIs. This, as far as i know, is a 'first' amongst unices.

While doing all this stuff we found a new, generic operation needed for
these 'user-accissible hashed waitqueues': the ability to requeue waiting
threads between two waitqueues, so that userspace can implement locking
primitives more flexibly. This is a clean, generic operation nicely
extending the already existing ability to wait on a waitqueue and to wake
up a thread in a waitqueue. Glibc will happily use all the new syscalls.
They are even slightly faster, and are much cleaner - and there's zero
slowdown for code that used the old API. Plus historically we've
constantly moved away from multiplexer syscalls. So it's precisely the
right time to do this, and it's a win-win situation all over.

all that is needed now is some actual review of the new APIs from the
conceptual angle (i've done that and i think they are okay, but more eyes
see more), so that we make sure these are good and we wont need to discard
any aspect of them anytime soon.

Ingo
Rusty Russell
2003-05-22 10:35:53 UTC
Permalink
Post by Ingo Molnar
really, i dont see what your problem with the new syscalls are.
That's clear. But I just changed my mind 8)

Because if you're going to demux the syscall, it might be worth
looking at FUTEX_FD: an "expected val" arg there might be worthwhile,
because to use it currently I think NPTL does:

try to get futex
fd = sys_futex(FUTEX_FD...)
try to get futex again because of race

Have the futex_fd act like futex_wait, ie. return -EWOULDBLOCK if the
value != expected value.
Post by Ingo Molnar
all that is needed now is some actual review of the new APIs from the
conceptual angle (i've done that and i think they are okay, but more eyes
see more), so that we make sure these are good and we wont need to discard
any aspect of them anytime soon.
Sorry, I didn't comment because I thought your explanation, concept,
analysis and code were all very neat.

Thanks,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Martin Wirth
2003-05-22 11:23:05 UTC
Permalink
Post by Ingo Molnar
all that is needed now is some actual review of the new APIs from the
conceptual angle (i've done that and i think they are okay, but more
eyes see more), so that we make sure these are good and we wont need to
discard any aspect of them anytime soon.
What about adding an u32 flags field to each one of the new futex
sys_calls. This gives you more freedom for future extensions without
changing the API again. Possible uses may be:

- Specify the futexes to be mm-local: By using the pair mm* and vaddr as
key it is possible to have process local futexes living on the same
hash with the following advantages:
1. no page_table lock contention (I implemented mm-local futexes
for my application after I noticed long latencies on SMP where a
high prio tasks spun in futex_wake while another task was doing
mmap/munmap on a second processor).
2. no vcache pollution (I guess 99% of all futexes will not be in
shared memory)
3. Slightly faster, since no page pinning is needed

- Specify queueing or unqueueing in priority order


Martin


P.S.

By the way, your latest futex patch still contains the bogus
if (!list_empty(&q.list)) conditional, that's always true since
you hold the locks at this point and no one can have removed us
Post by Ingo Molnar
add_wait_queue(&q.waiters, &wait);
set_current_state(TASK_INTERRUPTIBLE);
- if (!list_empty(&q.list))
+ if (!list_empty(&q.list)) {
+ unlock_futex_mm();
time = schedule_timeout(time);
+ }
Of course the test would be (and was) pretty necessary if you drop the
locks before the get_user(...) call. And I must admit that I still
can't see why you need to hold the locks across get_user. Even if it's
save to do so at least every automatic code checker will bark at this point.
Ingo Molnar
2003-05-22 11:34:42 UTC
Permalink
Post by Martin Wirth
- Specify the futexes to be mm-local: By using the pair mm* and vaddr as
key it is possible to have process local futexes living on the same
1. no page_table lock contention (I implemented mm-local futexes
for my application after I noticed long latencies on SMP where a
high prio tasks spun in futex_wake while another task was doing
mmap/munmap on a second processor).
2. no vcache pollution (I guess 99% of all futexes will not be in
shared memory)
3. Slightly faster, since no page pinning is needed
there's a patch in the works to change futex hashing to not require page
pinning, without making futexes process-local. This should also fix the
FUTEX_FD problems.

basically the idea is that struct page remains to be a good hash whenever
the page is present, and the swapout pte value is a good hash key when the
page is swapped out. So we basically can make futexes swappable.

this gets rid of pinned pages and pagetable locking, without the need for
API-level changes.
Post by Martin Wirth
By the way, your latest futex patch still contains the bogus if
(!list_empty(&q.list)) conditional, that's always true since you hold
ok, i'll fix this! I knew i missed something ...

Ingo
William Lee Irwin III
2003-05-22 12:04:24 UTC
Permalink
Post by Ingo Molnar
there's a patch in the works to change futex hashing to not require page
pinning, without making futexes process-local. This should also fix the
FUTEX_FD problems.
basically the idea is that struct page remains to be a good hash whenever
the page is present, and the swapout pte value is a good hash key when the
page is swapped out. So we basically can make futexes swappable.
this gets rid of pinned pages and pagetable locking, without the need for
API-level changes.
Someone's already doing it? I'll back off if so.


-- wli

Andi Kleen
2003-05-22 11:38:01 UTC
Permalink
Post by Martin Wirth
mmap/munmap on a second processor).
2. no vcache pollution (I guess 99% of all futexes will not be in
shared memory)
- Could also be specified as a mutex attribute in pthreads/NPTL to get faster
mutexes (or perhaps a environment variable to allow users easy tuning)

-Andi
Loading...