Discussion:
[PATCH] futex: Ensure get_futex_key_refs() always implies a barrier
Catalin Marinas
2014-10-17 16:38:49 UTC
Permalink
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.

However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).

Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.

Signed-off-by: Catalin Marinas <***@arm.com>
Reported-by: Matteo Franchin <***@arm.com>
Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
Cc: <***@vger.kernel.org>
Cc: Davidlohr Bueso <***@hp.com>
Cc: Linus Torvalds <***@linux-foundation.org>
Cc: Darren Hart <***@linux.intel.com>
Cc: Thomas Gleixner <***@linutronix.de>
Cc: Peter Zijlstra <***@infradead.org>
Cc: Ingo Molnar <***@kernel.org>
Cc: Paul E. McKenney <***@linux.vnet.ibm.com>
---
kernel/futex.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af2ffe8..f3a3a071283c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
case FUT_OFF_MMSHARED:
futex_get_mm(key); /* implies MB (B) */
break;
+ default:
+ smp_mb(); /* explicit MB (B) */
}
}
Mike Galbraith
2014-10-18 06:54:45 UTC
Permalink
Post by Catalin Marinas
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.
However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).
Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.
How 'bout that, you just triggered my "watch this pot" alarm.

https://lkml.org/lkml/2014/10/8/406

The hang I encountered with stockfish only ever happened on one specific
box. Linus/Thomas said it I was likely a problem with the futex usage,
but it suspiciously deterministic, so I put this on the "watch out for
further evidence" back burner.

The barrier fixing up my problematic box smells a lot like evidence.
Post by Catalin Marinas
Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
---
kernel/futex.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af2ffe8..f3a3a071283c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
futex_get_mm(key); /* implies MB (B) */
break;
+ smp_mb(); /* explicit MB (B) */
}
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Mike Galbraith
2014-10-18 07:09:14 UTC
Permalink
(fixes Davidlohr bounce)
Post by Mike Galbraith
Post by Catalin Marinas
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.
However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).
Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.
How 'bout that, you just triggered my "watch this pot" alarm.
https://lkml.org/lkml/2014/10/8/406
The hang I encountered with stockfish only ever happened on one specific
box. Linus/Thomas said it I was likely a problem with the futex usage,
but it suspiciously deterministic, so I put this on the "watch out for
further evidence" back burner.
The barrier fixing up my problematic box smells a lot like evidence.
Post by Catalin Marinas
Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
---
kernel/futex.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af2ffe8..f3a3a071283c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
futex_get_mm(key); /* implies MB (B) */
break;
+ smp_mb(); /* explicit MB (B) */
}
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
2014-10-18 15:28:23 UTC
Permalink
On Fri, Oct 17, 2014 at 11:54 PM, Mike Galbraith
Post by Mike Galbraith
The barrier fixing up my problematic box smells a lot like evidence.
Is this a "tested-by"? Did you actuallyu verify that the patch ends up
fixing the problem you saw?

Linus
Mike Galbraith
2014-10-18 16:15:29 UTC
Permalink
Post by Linus Torvalds
On Fri, Oct 17, 2014 at 11:54 PM, Mike Galbraith
Post by Mike Galbraith
The barrier fixing up my problematic box smells a lot like evidence.
Is this a "tested-by"? Did you actuallyu verify that the patch ends up
fixing the problem you saw?
Yup, it definitely fixed it up.

Weird that it didn't show at all on the 2 socket 20 core box the problem
I was looking into was reported on, but was 100% busted on the similar 2
socket 28 core box I borrowed to have a peek, and only that box. My
(crippled/slow) 64 core DL980 was perfectly happy, as were my 3 home
boxen, not a peep from anywhere but that one 28 core box.

Hohum.. Tested-by: Mike Galbraith <***@gmail.com>

-Mike
Davidlohr Bueso
2014-10-18 07:33:00 UTC
Permalink
Post by Catalin Marinas
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.
However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending().
Good catch, glad I ran into this thread (my email recently changed).
Private process futex (PTHREAD_PROCESS_PRIVATE) have no reference on an
inode or mm so it would need the explicit barrier in those cases.
Post by Catalin Marinas
The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).
Yeah missing wakeups are a strong sign of a problem with the
hb_waiters_pending() side.
Post by Catalin Marinas
Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.
Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
---
kernel/futex.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af2ffe8..f3a3a071283c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
futex_get_mm(key); /* implies MB (B) */
break;
+ smp_mb(); /* explicit MB (B) */
}
Should we comment that this default is for the private futex case?
Otherwise:

Acked-by: Davidlohr Bueso <***@stgolabs.net>
Davidlohr Bueso
2014-10-18 19:58:34 UTC
Permalink
Post by Davidlohr Bueso
Post by Catalin Marinas
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.
However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending().
Good catch, glad I ran into this thread (my email recently changed).
Private process futex (PTHREAD_PROCESS_PRIVATE) have no reference on an
inode or mm so it would need the explicit barrier in those cases.
And [get/put]_futex_keys() shouldn't even be called for private futexes.
The following patch had some very minor testing on a 60 core box last
night, but passes both Darren's and perf's tests. So I *think* this is
right, but lack of sleep and I overall just don't trust them futexes!

8<----------------------------------------------------------
From: Davidlohr Bueso <***@stgolabs.net>
Date: Sat, 18 Oct 2014 12:30:37 -0700
Subject: [PATCH 2/1] futex: No key referencing for private futexes

Because private futexes do not hold references on either
an inode or mm, they should not be calling key referencing
operations (even though they are practically a nop). However,
we need to call the get part only because we need the barrier
in order to maintain correct ordering guarantees for the lockless
waiter checking. In addition, we can avoid calling the put part
for private futexes altogether, as it serves no purpose in the
ordering.

This patch 1) documents the situation, 2) explicitly avoids calling
drop_futex_key_refs() when calling put_futex_keys() for private futexes
and 3) changes the interface of the function to pass the 'fshared'
variable, similarly to get_futex_key_refs(). In theory this should apply
to all drop_futex_key_refs() callers, but just keep it simple and apply
it as the get/put alternatives when calling futex(2).

Signed-off-by: Davidlohr Bueso <***@suse.de>
---
kernel/futex.c | 99 ++++++++++++++++++++++++++++++++--------------------------
1 file changed, 55 insertions(+), 44 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af..21f7e41 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -415,6 +415,11 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
if (!fshared) {
key->private.mm = mm;
key->private.address = address;
+ /*
+ * Private futexes do not hold reference on an inode or
+ * mm, therefore the only purpose of calling get_futex_key_refs
+ * is because we need the barrier for the lockless waiter check.
+ */
get_futex_key_refs(key); /* implies MB (B) */
return 0;
}
@@ -530,9 +535,14 @@ out:
return err;
}

-static inline void put_futex_key(union futex_key *key)
+static inline void put_futex_key(int fshared, union futex_key *key)
{
- drop_futex_key_refs(key);
+ /*
+ * See comment in get_futex_key() about key
+ * referencing when dealing with private futexes.
+ */
+ if (fshared)
+ drop_futex_key_refs(key);
}

/**
@@ -1202,12 +1212,12 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
- int ret;
+ int ret, fshared = flags & FLAGS_SHARED;

if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_READ);
+ ret = get_futex_key(uaddr, fshared, &key, VERIFY_READ);
if (unlikely(ret != 0))
goto out;

@@ -1238,7 +1248,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)

spin_unlock(&hb->lock);
out_put_key:
- put_futex_key(&key);
+ put_futex_key(fshared, &key);
out:
return ret;
}
@@ -1254,13 +1264,13 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
- int ret, op_ret;
+ int ret, op_ret, fshared = flags & FLAGS_SHARED;

retry:
- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
+ ret = get_futex_key(uaddr1, fshared, &key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
+ ret = get_futex_key(uaddr2, fshared, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1292,11 +1302,11 @@ retry_private:
if (ret)
goto out_put_keys;

- if (!(flags & FLAGS_SHARED))
+ if (!fshared)
goto retry_private;

- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
goto retry;
}

@@ -1331,9 +1341,9 @@ retry_private:
out_unlock:
double_unlock_hb(hb1, hb2);
out_put_keys:
- put_futex_key(&key2);
+ put_futex_key(fshared, &key2);
out_put_key1:
- put_futex_key(&key1);
+ put_futex_key(fshared, &key1);
out:
return ret;
}
@@ -1485,10 +1495,11 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
u32 *cmpval, int requeue_pi)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
- int drop_count = 0, task_count = 0, ret;
+ int drop_count = 0, task_count = 0;
struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
struct futex_q *this, *next;
+ int ret, fshared = flags & FLAGS_SHARED;

if (requeue_pi) {
/*
@@ -1528,10 +1539,10 @@ retry:
pi_state = NULL;
}

- ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
+ ret = get_futex_key(uaddr1, fshared, &key1, VERIFY_READ);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2,
+ ret = get_futex_key(uaddr2, fshared, &key2,
requeue_pi ? VERIFY_WRITE : VERIFY_READ);
if (unlikely(ret != 0))
goto out_put_key1;
@@ -1565,11 +1576,11 @@ retry_private:
if (ret)
goto out_put_keys;

- if (!(flags & FLAGS_SHARED))
+ if (!fshared)
goto retry_private;

- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
goto retry;
}
if (curval != *cmpval) {
@@ -1619,8 +1630,8 @@ retry_private:
case -EFAULT:
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -1634,8 +1645,8 @@ retry_private:
*/
double_unlock_hb(hb1, hb2);
hb_waiters_dec(hb2);
- put_futex_key(&key2);
- put_futex_key(&key1);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
cond_resched();
goto retry;
default:
@@ -1721,9 +1732,9 @@ out_unlock:
drop_futex_key_refs(&key1);

out_put_keys:
- put_futex_key(&key2);
+ put_futex_key(fshared, &key2);
out_put_key1:
- put_futex_key(&key1);
+ put_futex_key(fshared, &key1);
out:
if (pi_state != NULL)
free_pi_state(pi_state);
@@ -2098,7 +2109,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
- int ret;
+ int ret, fshared = flags & FLAGS_SHARED;

/*
* Access the page AFTER the hash-bucket is locked.
@@ -2119,7 +2130,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
* while the syscall executes.
*/
retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, VERIFY_READ);
+ ret = get_futex_key(uaddr, fshared, &q->key, VERIFY_READ);
if (unlikely(ret != 0))
return ret;

@@ -2135,10 +2146,10 @@ retry_private:
if (ret)
goto out;

- if (!(flags & FLAGS_SHARED))
+ if (!fshared)
goto retry_private;

- put_futex_key(&q->key);
+ put_futex_key(fshared, &q->key);
goto retry;
}

@@ -2149,7 +2160,7 @@ retry_private:

out:
if (ret)
- put_futex_key(&q->key);
+ put_futex_key(fshared, &q->key);
return ret;
}

@@ -2256,7 +2267,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
struct hrtimer_sleeper timeout, *to = NULL;
struct futex_hash_bucket *hb;
struct futex_q q = futex_q_init;
- int res, ret;
+ int res, ret, fshared = flags & FLAGS_SHARED;

if (refill_pi_state_cache())
return -ENOMEM;
@@ -2270,7 +2281,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, int detect,
}

retry:
- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
+ ret = get_futex_key(uaddr, fshared, &q.key, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

@@ -2294,7 +2305,7 @@ retry_private:
* - The user space value changed.
*/
queue_unlock(hb);
- put_futex_key(&q.key);
+ put_futex_key(fshared, &q.key);
cond_resched();
goto retry;
default:
@@ -2348,7 +2359,7 @@ out_unlock_put_key:
queue_unlock(hb);

out_put_key:
- put_futex_key(&q.key);
+ put_futex_key(fshared, &q.key);
out:
if (to)
destroy_hrtimer_on_stack(&to->timer);
@@ -2361,10 +2372,10 @@ uaddr_faulted:
if (ret)
goto out_put_key;

- if (!(flags & FLAGS_SHARED))
+ if (!fshared)
goto retry_private;

- put_futex_key(&q.key);
+ put_futex_key(fshared, &q.key);
goto retry;
}

@@ -2379,7 +2390,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
union futex_key key = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb;
struct futex_q *match;
- int ret;
+ int ret, fshared = flags & FLAGS_SHARED;

retry:
if (get_user(uval, uaddr))
@@ -2390,7 +2401,7 @@ retry:
if ((uval & FUTEX_TID_MASK) != vpid)
return -EPERM;

- ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+ ret = get_futex_key(uaddr, fshared, &key, VERIFY_WRITE);
if (ret)
return ret;

@@ -2431,12 +2442,12 @@ retry:

out_unlock:
spin_unlock(&hb->lock);
- put_futex_key(&key);
+ put_futex_key(fshared, &key);
return ret;

pi_faulted:
spin_unlock(&hb->lock);
- put_futex_key(&key);
+ put_futex_key(fshared, &key);

ret = fault_in_user_writeable(uaddr);
if (!ret)
@@ -2544,7 +2555,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
struct futex_hash_bucket *hb;
union futex_key key2 = FUTEX_KEY_INIT;
struct futex_q q = futex_q_init;
- int res, ret;
+ int res, ret, fshared = flags & FLAGS_SHARED;

if (uaddr == uaddr2)
return -EINVAL;
@@ -2571,7 +2582,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
RB_CLEAR_NODE(&rt_waiter.tree_entry);
rt_waiter.task = NULL;

- ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
+ ret = get_futex_key(uaddr2, fshared, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
goto out;

@@ -2673,9 +2684,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
}

out_put_keys:
- put_futex_key(&q.key);
+ put_futex_key(fshared, &q.key);
out_key2:
- put_futex_key(&key2);
+ put_futex_key(fshared, &key2);

out:
if (to) {
--
1.8.4.5
Linus Torvalds
2014-10-18 20:50:18 UTC
Permalink
Post by Davidlohr Bueso
And [get/put]_futex_keys() shouldn't even be called for private futexes.
The following patch had some very minor testing on a 60 core box last
night, but passes both Darren's and perf's tests. So I *think* this is
right, but lack of sleep and I overall just don't trust them futexes!
Hmm. I don't see the advantage of making the code more complex in
order to avoid the functions that are no-ops for the !fshared case?

IOW, as far as I can tell, this patch doesn't actually really *change*
anything. Am I missing something?

Linus
Davidlohr Bueso
2014-10-19 02:16:18 UTC
Permalink
Post by Linus Torvalds
Post by Davidlohr Bueso
And [get/put]_futex_keys() shouldn't even be called for private futexes.
The following patch had some very minor testing on a 60 core box last
night, but passes both Darren's and perf's tests. So I *think* this is
right, but lack of sleep and I overall just don't trust them futexes!
Hmm. I don't see the advantage of making the code more complex in
order to avoid the functions that are no-ops for the !fshared case?
IOW, as far as I can tell, this patch doesn't actually really *change*
anything. Am I missing something?
Right, all we do is avoid a NOP, but I don't see how this patch makes
the code more complex. In fact, the whole idea is to make it easier to
read and makes the key referencing differences between shared and
private futexes crystal clear, hoping to mitigate future bugs.

Thanks,
Davidlohr
Thomas Gleixner
2014-10-20 09:11:40 UTC
Permalink
Post by Davidlohr Bueso
Post by Linus Torvalds
Post by Davidlohr Bueso
And [get/put]_futex_keys() shouldn't even be called for private futexes.
The following patch had some very minor testing on a 60 core box last
night, but passes both Darren's and perf's tests. So I *think* this is
right, but lack of sleep and I overall just don't trust them futexes!
Hmm. I don't see the advantage of making the code more complex in
order to avoid the functions that are no-ops for the !fshared case?
IOW, as far as I can tell, this patch doesn't actually really *change*
anything. Am I missing something?
Right, all we do is avoid a NOP, but I don't see how this patch makes
the code more complex. In fact, the whole idea is to make it easier to
read and makes the key referencing differences between shared and
private futexes crystal clear, hoping to mitigate future bugs.
I tend to disagree. The current code is symetric versus get/drop and
you make it unsymetric by avoiding the drop call with a pointless
extra conditional in all call sites.

I really had to look twice to figure out that the patch is correct,
but I really cannot see any value and definitely have a hard time how
this makes the code clearer and would prevent future bugs.

I rather keep it symetric and document the NOP property for private
futexes in both get and drop.

Thanks,

tglx
Catalin Marinas
2014-10-20 10:49:26 UTC
Permalink
Post by Thomas Gleixner
Post by Davidlohr Bueso
Post by Linus Torvalds
Post by Davidlohr Bueso
And [get/put]_futex_keys() shouldn't even be called for private futexes.
The following patch had some very minor testing on a 60 core box last
night, but passes both Darren's and perf's tests. So I *think* this is
right, but lack of sleep and I overall just don't trust them futexes!
Hmm. I don't see the advantage of making the code more complex in
order to avoid the functions that are no-ops for the !fshared case?
IOW, as far as I can tell, this patch doesn't actually really *change*
anything. Am I missing something?
Right, all we do is avoid a NOP, but I don't see how this patch makes
the code more complex. In fact, the whole idea is to make it easier to
read and makes the key referencing differences between shared and
private futexes crystal clear, hoping to mitigate future bugs.
I tend to disagree. The current code is symetric versus get/drop and
you make it unsymetric by avoiding the drop call with a pointless
extra conditional in all call sites.
Since you mention symmetry, something like below makes the barriers more
explicit. However, it removes the implied barrier for get_futex_key_refs
and the other calling places need to be verified (requeue_futex and
requeue_pi_futex; if barrier is not required here, the result may be
slightly more optimal, not sure you would see the difference though).


diff --git a/kernel/futex.c b/kernel/futex.c
index f3a3a071283c..5b9d857d0816 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -143,9 +143,7 @@
*
* Where (A) orders the waiters increment and the futex value read through
* atomic operations (see hb_waiters_inc) and where (B) orders the write
- * to futex and the waiters read -- this is done by the barriers in
- * get_futex_key_refs(), through either ihold or atomic_inc, depending on the
- * futex type.
+ * to futex and the waiters read (see hb_waiters_pending).
*
* This yields the following case (where X:=waiters, Y:=futex):
*
@@ -262,12 +260,6 @@ static struct futex_hash_bucket *futex_queues;
static inline void futex_get_mm(union futex_key *key)
{
atomic_inc(&key->private.mm->mm_count);
- /*
- * Ensure futex_get_mm() implies a full barrier such that
- * get_futex_key() implies a full barrier. This is relied upon
- * as full barrier (B), see the ordering comment above.
- */
- smp_mb__after_atomic();
}

/*
@@ -297,6 +289,10 @@ static inline void hb_waiters_dec(struct futex_hash_bucket *hb)

static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
{
+ /*
+ * Full barrier (B), see the ordering comment above.
+ */
+ smp_mb__before_atomic();
#ifdef CONFIG_SMP
return atomic_read(&hb->waiters);
#else
@@ -338,13 +334,11 @@ static void get_futex_key_refs(union futex_key *key)

switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
- ihold(key->shared.inode); /* implies MB (B) */
+ __iget(key->shared.inode);
break;
case FUT_OFF_MMSHARED:
- futex_get_mm(key); /* implies MB (B) */
+ futex_get_mm(key);
break;
- default:
- smp_mb(); /* explicit MB (B) */
}
}

@@ -417,7 +411,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
if (!fshared) {
key->private.mm = mm;
key->private.address = address;
- get_futex_key_refs(key); /* implies MB (B) */
+ get_futex_key_refs(key);
return 0;
}

@@ -524,7 +518,7 @@ again:
key->shared.pgoff = basepage_index(page);
}

- get_futex_key_refs(key); /* implies MB (B) */
+ get_futex_key_refs(key);

out:
unlock_page(page_head);
Linus Torvalds
2014-10-20 15:32:00 UTC
Permalink
On Mon, Oct 20, 2014 at 3:49 AM, Catalin Marinas
Post by Catalin Marinas
Since you mention symmetry, something like below makes the barriers more
explicit.
diff --git a/kernel/futex.c b/kernel/futex.c
index f3a3a071283c..5b9d857d0816 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -143,9 +143,7 @@
static inline void futex_get_mm(union futex_key *key)
{
atomic_inc(&key->private.mm->mm_count);
- /*
- * Ensure futex_get_mm() implies a full barrier such that
- * get_futex_key() implies a full barrier. This is relied upon
- * as full barrier (B), see the ordering comment above.
- */
- smp_mb__after_atomic();
}
So the thing is, this means that we can't take advantage of the fact
that "atomic_inc" is already an atomic. So this is just a performance
Post by Catalin Marinas
static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
{
+ /*
+ * Full barrier (B), see the ordering comment above.
+ */
+ smp_mb__before_atomic();
#ifdef CONFIG_SMP
return atomic_read(&hb->waiters);
This is just entirely broken.

"atomic_read()" isn't really an "atomic op" at all. despite the name,
it's just a read that is basically ACCESS_ONCE.

So smp_mb__before_atomic() doesn't work for atomic_read(), and the
code is nonsensical and doesn't work. It would need to be a full
memory barrier.

Linus
Catalin Marinas
2014-10-20 16:00:09 UTC
Permalink
Post by Linus Torvalds
On Mon, Oct 20, 2014 at 3:49 AM, Catalin Marinas
Post by Catalin Marinas
Since you mention symmetry, something like below makes the barriers more
explicit.
diff --git a/kernel/futex.c b/kernel/futex.c
index f3a3a071283c..5b9d857d0816 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -143,9 +143,7 @@
static inline void futex_get_mm(union futex_key *key)
{
atomic_inc(&key->private.mm->mm_count);
- /*
- * Ensure futex_get_mm() implies a full barrier such that
- * get_futex_key() implies a full barrier. This is relied upon
- * as full barrier (B), see the ordering comment above.
- */
- smp_mb__after_atomic();
}
So the thing is, this means that we can't take advantage of the fact
that "atomic_inc" is already an atomic. So this is just a performance
OK, I looked at this from an ARM perspective only and it would not make
any difference. But it seems that MIPS makes a distinction between
"before" and "after" barriers with "before" defined as wmb in some
configuration, so the hunk below would break it.
Post by Linus Torvalds
Post by Catalin Marinas
static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
{
+ /*
+ * Full barrier (B), see the ordering comment above.
+ */
+ smp_mb__before_atomic();
#ifdef CONFIG_SMP
return atomic_read(&hb->waiters);
This is just entirely broken.
"atomic_read()" isn't really an "atomic op" at all. despite the name,
it's just a read that is basically ACCESS_ONCE.
So smp_mb__before_atomic() doesn't work for atomic_read(), and the
code is nonsensical and doesn't work. It would need to be a full
memory barrier.
Looking at the semantics of smp_mb__*_atomic(), it would indeed have to
be a full smp_mb() here.
--
Catalin
Darren Hart
2014-10-18 19:32:21 UTC
Permalink
Post by Catalin Marinas
Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616bd07f (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.
However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).
Verified that this is:

a) how it is documented to work
b) not how it actually works currently

Nice catch indeed.

...
Post by Catalin Marinas
diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af2ffe8..f3a3a071283c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
futex_get_mm(key); /* implies MB (B) */
break;
A comment here indicating this covers the PROCESS_PRIVATE futex case
would be welcome, given the complexity involved.
Post by Catalin Marinas
+ smp_mb(); /* explicit MB (B) */
Also, the "Basic" futex operation and ordering guarantees documentation
currently reads:

* Where (A) orders the waiters increment and the futex value read through
* atomic operations (see hb_waiters_inc) and where (B) orders the write
* to futex and the waiters read -- this is done by the barriers in
* get_futex_key_refs(), through either ihold or atomic_inc, depending
on the
* futex type.

Which is not incomplete (lacking the explicit smp_mb()) added by this
patch. Perhaps the MB implementation of get_futex_key_refs() need not be
explicitly enumerated here?
--
Darren Hart
Intel Open Source Technology Center
Davidlohr Bueso
2014-10-18 20:19:50 UTC
Permalink
Post by Darren Hart
Which is not incomplete (lacking the explicit smp_mb()) added by this
patch. Perhaps the MB implementation of get_futex_key_refs() need not be
explicitly enumerated here?
Agreed, how about this:

diff --git a/kernel/futex.c b/kernel/futex.c
index 21f7e41..7a0805a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -143,9 +143,8 @@
*
* Where (A) orders the waiters increment and the futex value read through
* atomic operations (see hb_waiters_inc) and where (B) orders the write
- * to futex and the waiters read -- this is done by the barriers in
- * get_futex_key_refs(), through either ihold or atomic_inc, depending on the
- * futex type.
+ * to futex and the waiters read -- this is done by the barriers for both
+ * shared and private futexes in get_futex_key_refs().
*
* This yields the following case (where X:=waiters, Y:=futex):
*
Darren Hart
2014-10-18 20:25:07 UTC
Permalink
Post by Catalin Marinas
Post by Darren Hart
Which is not incomplete (lacking the explicit smp_mb()) added by this
patch. Perhaps the MB implementation of get_futex_key_refs() need not
be
Post by Darren Hart
explicitly enumerated here?
diff --git a/kernel/futex.c b/kernel/futex.c
index 21f7e41..7a0805a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -143,9 +143,8 @@
*
* Where (A) orders the waiters increment and the futex value read through
* atomic operations (see hb_waiters_inc) and where (B) orders the write
- * to futex and the waiters read -- this is done by the barriers in
- * get_futex_key_refs(), through either ihold or atomic_inc, depending on the
- * futex type.
+ * to futex and the waiters read -- this is done by the barriers for both
+ * shared and private futexes in get_futex_key_refs().
Works for me.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Catalin Marinas
2014-10-20 10:15:24 UTC
Permalink
Post by Catalin Marinas
Post by Darren Hart
Which is not incomplete (lacking the explicit smp_mb()) added by this
patch. Perhaps the MB implementation of get_futex_key_refs() need not be
explicitly enumerated here?
diff --git a/kernel/futex.c b/kernel/futex.c
index 21f7e41..7a0805a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -143,9 +143,8 @@
*
* Where (A) orders the waiters increment and the futex value read through
* atomic operations (see hb_waiters_inc) and where (B) orders the write
- * to futex and the waiters read -- this is done by the barriers in
- * get_futex_key_refs(), through either ihold or atomic_inc, depending on the
- * futex type.
+ * to futex and the waiters read -- this is done by the barriers for both
+ * shared and private futexes in get_futex_key_refs().
*
Looks fine to me. Since Linus already picked the original patch, if you
plan to send an update for the comments please also mention the
"explicit MB (B) for private futexes" in get_futex_key_refs().

Thanks.
--
Catalin
Loading...