[PATCH RFC] sched: Revert delayed_put_task_struct() and fix use after free

Post by Kirill Tkhai
@@ -2852,6 +2852,7 @@ static void __sched __schedule(void)
if (likely(prev != next)) {
rq->nr_switches++;
+ WARN_ON_ONCE(atomic_read(&prev->usage) == 1);

I think you know this, but let me clarify just in case that this WARN()
is wrong, prev->usage == 1 is fine if the task does its last schedule()
and it was already (auto)reaped.

Post by Kirill Tkhai
This means the final put_task_struct() happens against RCU rules.

Well, yes, it doesn't use delayed_put_pid(). But this should be fine,
this drops the extra reference created by dup_task_struct().

However,

Post by Kirill Tkhai
Regarding to scheduler this may be a reason of use-after-free.
task_numa_compare() schedule()
rcu_read_lock() ...
cur = ACCESS_ONCE(dst_rq->curr) ...
... rq->curr = next;
... context_switch()
... finish_task_switch()
... put_task_struct()
... __put_task_struct()
... free_task_struct()
task_numa_assign() ...
get_task_struct() ...

Agreed. I don't understand this code (will try to take another look later),
but at first glance this looks wrong.

At least the code like

rcu_read_lock();
get_task_struct(foreign_rq->curr);
rcu_read_unlock();

is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

Post by Kirill Tkhai
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1854,11 +1854,12 @@ extern void free_task(struct task_struct *tsk);
#define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0)
extern void __put_task_struct(struct task_struct *t);
+extern void __put_task_struct_cb(struct rcu_head *rhp);
static inline void put_task_struct(struct task_struct *t)
{
if (atomic_dec_and_test(&t->usage))
- __put_task_struct(t);
+ call_rcu(&t->rcu, __put_task_struct_cb);
}
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/exit.c b/kernel/exit.c
index 5d30019..326eae7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -159,15 +159,15 @@ static void __exit_signal(struct task_struct *tsk)
}
}
-static void delayed_put_task_struct(struct rcu_head *rhp)
+void __put_task_struct_cb(struct rcu_head *rhp)
{
struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
perf_event_delayed_put(tsk);
trace_sched_process_free(tsk);
- put_task_struct(tsk);
+ __put_task_struct(tsk);
}
-
+EXPORT_SYMBOL_GPL(__put_task_struct_cb);
void release_task(struct task_struct *p)
{
@@ -207,7 +207,7 @@ void release_task(struct task_struct *p)
write_unlock_irq(&tasklist_lock);
release_thread(p);
- call_rcu(&p->rcu, delayed_put_task_struct);
+ put_task_struct(p);
p = leader;
if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..4d3ac3c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -249,7 +249,6 @@ void __put_task_struct(struct task_struct *tsk)
if (!profile_handoff_task(tsk))
free_task(tsk);
}
-EXPORT_SYMBOL_GPL(__put_task_struct);
void __init __weak arch_task_cache_init(void) { }

Hmm. I am not sure I understand how this patch can actually fix this problem.
It seems that it is still possible that get_task_struct() can be called after
call_rcu(__put_task_struct_cb) ? But perhaps I misread this patch.

And I think it adds another problem. Suppose we have a zombie which already
called schedule() in TASK_DEAD state. IOW, its ->usage == 1, its parent will
free this task when it calls sys_wait().

With this patch the code like

rcu_read_lock();
for_each_process(p) {
if (pred(p) {
get_task_struct(p);
return p;
}
}
rcu_read_unlock();

becomes unsafe: we can race with release_task(p) and get_task_struct() can
can be called when prev->usage is already 0 and this task_struct can be freed
omce you drop rcu_read_lock().

Oleg.

Oleg Nesterov

2014-10-15 19:40:44 UTC

Post by Oleg Nesterov

Agreed. I don't understand this code (will try to take another look later),
but at first glance this looks wrong.
At least the code like
rcu_read_lock();
get_task_struct(foreign_rq->curr);
rcu_read_unlock();
is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

Yes, but perhaps in this particular case another simple fix makes more
sense. The patch below needs a comment to explain that we check PF_EXITING
because:

1. It doesn't make sense to migrate the exiting task. Although perhaps
we could check ->mm == NULL instead.

But let me repeat that I do not understand this code, I am not sure
we can equally treat is_idle_task() and PF_EXITING here...

2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
won't be called until we drop rcu_read_lock(), and thus get_task_struct()
is safe.

And. it seems that there is another problem? Can't task_h_load(cur) race
with itself if 2 CPU's call task_numa_migrate() and inspect the same rq
in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
doesn't look "atomic". In fact I am not even sure about task_h_load(env->p),
p == current but we do not disable preemption.

What do you think?

Oleg.

--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur = NULL;

/*

Kirill Tkhai

2014-10-15 21:46:07 UTC

Yeah, you're sure about initial patch. Thanks for signal explanation.

Post by Oleg Nesterov

Agreed. I don't understand this code (will try to take another look later),
but at first glance this looks wrong.
At least the code like
rcu_read_lock();
get_task_struct(foreign_rq->curr);
rcu_read_unlock();
is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

Yes, but perhaps in this particular case another simple fix makes more
sense. The patch below needs a comment to explain that we check PF_EXITING
1. It doesn't make sense to migrate the exiting task. Although perhaps
we could check ->mm == NULL instead.
But let me repeat that I do not understand this code, I am not sure
we can equally treat is_idle_task() and PF_EXITING here...
2. If PF_EXITING is not set (or ->mm != NULL) then delayed_put_task_struct()
won't be called until we drop rcu_read_lock(), and thus get_task_struct()
is safe.

Cool! Elegant fix. We set PF_EXITING in exit_signals(), which is earlier
than release_task() is called.

Shouldn't we use smp_rmb/smp_wmb here?

Post by Oleg Nesterov
And. it seems that there is another problem? Can't task_h_load(cur) race
with itself if 2 CPU's call task_numa_migrate() and inspect the same rq
in parallel? Again, I don't understand this code, but update_cfs_rq_h_load()
doesn't look "atomic". In fact I am not even sure about task_h_load(env->p),
p == current but we do not disable preemption.
What do you think?

We use it completely unlocked, so nothing good is here. Also we work
with pointers.

As I understand in update_cfs_rq_h_load() we go from bottom to top,
and then from top to bottom. We set cfs_rq::h_load_next to be able
to do top-bottom passage (top is a root of "tree").

Yeah, this "way" may be overwritten by competitor. Also, task may change
its cfs_rq.

Post by Oleg Nesterov
--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur = NULL;
/*

Looks like, we have to use the same fix for task_numa_group().

grp = rcu_dereference(tsk->numa_group);

Below we dereference grp->nr_tasks.

Also, the same in rt.c and deadline.c, but we do no take second
reference there. Wrong pointer dereference is not possible there,
not so bad.

Kirill

Kirill Tkhai

2014-10-15 22:02:38 UTC

Post by Kirill Tkhai
Yeah, you're sure about initial patch. Thanks for signal explanation.

Post by Oleg Nesterov

Agreed. I don't understand this code (will try to take another look later),
but at first glance this looks wrong.
At least the code like
rcu_read_lock();
get_task_struct(foreign_rq->curr);
rcu_read_unlock();
is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

Cool! Elegant fix. We set PF_EXITING in exit_signals(), which is earlier
than release_task() is called.
Shouldn't we use smp_rmb/smp_wmb here?

We use it completely unlocked, so nothing good is here. Also we work
with pointers.
As I understand in update_cfs_rq_h_load() we go from bottom to top,
and then from top to bottom. We set cfs_rq::h_load_next to be able
to do top-bottom passage (top is a root of "tree").
Yeah, this "way" may be overwritten by competitor. Also, task may change
its cfs_rq.

Wrong, it's not a task... Brain is sleepy, it's better tomorrow.

Looks like, we have to use the same fix for task_numa_group().
grp = rcu_dereference(tsk->numa_group);
Below we dereference grp->nr_tasks.
Also, the same in rt.c and deadline.c, but we do no take second
reference there. Wrong pointer dereference is not possible there,
not so bad.
Kirill

Peter Zijlstra

2014-10-16 07:59:50 UTC

Looks like, we have to use the same fix for task_numa_group().

Don't think so, task_numa_group() is only called from task_numa_fault()
which is on 'current' and neither idle and PF_EXITING should be
faulting.

Kirill Tkhai

2014-10-16 08:16:44 UTC

=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 09:59 +0200, Peter Zijlstra =D0=BF=

Post by Oleg Nesterov
--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
=20
rcu_read_lock();
cur =3D ACCESS_ONCE(dst_rq->curr);
- if (cur->pid =3D=3D 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur =3D NULL;
=20
/*
=20

=20
Looks like, we have to use the same fix for task_numa_group().

=20
Don't think so, task_numa_group() is only called from task_numa_fault=

()

which is on 'current' and neither idle and PF_EXITING should be
faulting.

Isn't task_numa_group() fully preemptible?

It seems cpu_rq(cpu)->curr is not always equal to p.

Peter Zijlstra

2014-10-16 09:43:33 UTC

Post by Kirill Tkhai
=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 09:59 +0200, Peter Zijlstra =D0=

=20
Looks like, we have to use the same fix for task_numa_group().

=20
Don't think so, task_numa_group() is only called from task_numa_fau=

lt()

which is on 'current' and neither idle and PF_EXITING should be
faulting.

=20
Isn't task_numa_group() fully preemptible?

Not seeing how that is relevant.

Post by Kirill Tkhai
It seems cpu_rq(cpu)->curr is not always equal to p.

It should be afaict:

task_numa_fault()
p =3D current;

task_numa_group(p, ..);

And like said, idle tasks and PF_EXITING task should never get (numa)
faults for they should never be touching userspace.

Kirill Tkhai

2014-10-16 09:50:02 UTC

=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 11:43 +0200, Peter Zijlstra =D0=BF=

Post by Kirill Tkhai
=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 09:59 +0200, Peter Zijlstra =

Post by Oleg Nesterov
--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct ta=

Post by Oleg Nesterov
=20
rcu_read_lock();
cur =3D ACCESS_ONCE(dst_rq->curr);
- if (cur->pid =3D=3D 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur =3D NULL;
=20
/*
=20

=20
Looks like, we have to use the same fix for task_numa_group().

=20
Don't think so, task_numa_group() is only called from task_numa_f=

ault()

which is on 'current' and neither idle and PF_EXITING should be
faulting.

=20
Isn't task_numa_group() fully preemptible?

=20
Not seeing how that is relevant.
=20

Post by Kirill Tkhai
It seems cpu_rq(cpu)->curr is not always equal to p.

=20
=20
task_numa_fault()
p =3D current;
=20
task_numa_group(p, ..);
=20
And like said, idle tasks and PF_EXITING task should never get (numa)
faults for they should never be touching userspace.

I mean p can be moved to other cpu.

tsk =3D ACCESS_ONCE(cpu_rq(cpu)->curr);

tsk is not p, (i.e current) here.

Kirill Tkhai

2014-10-16 09:51:55 UTC

=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 13:50 +0400, Kirill Tkhai =D0=BF=

Post by Kirill Tkhai
=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 11:43 +0200, Peter Zijlstra =D0=

=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 09:59 +0200, Peter Zijlstr=

Post by Oleg Nesterov
--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct =

tas

Post by Oleg Nesterov
=20
rcu_read_lock();
cur =3D ACCESS_ONCE(dst_rq->curr);
- if (cur->pid =3D=3D 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur =3D NULL;
=20
/*
=20

=20
Looks like, we have to use the same fix for task_numa_group()=

=2E

=20
Don't think so, task_numa_group() is only called from task_numa=

_fault()

which is on 'current' and neither idle and PF_EXITING should be
faulting.

=20
Isn't task_numa_group() fully preemptible?

=20
Not seeing how that is relevant.
=20

It seems cpu_rq(cpu)->curr is not always equal to p.

=20
=20
task_numa_fault()
p =3D current;
=20
task_numa_group(p, ..);
=20
And like said, idle tasks and PF_EXITING task should never get (num=

Post by Peter Zijlstra
faults for they should never be touching userspace.

=20
I mean p can be moved to other cpu.
=20
tsk =3D ACCESS_ONCE(cpu_rq(cpu)->curr);
=20
tsk is not p, (i.e current) here.

Maybe I undestand wrong and preemption is disabled in memory fault?

Kirill Tkhai

2014-10-16 10:04:53 UTC

=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 13:51 +0400, Kirill Tkhai =D0=BF=

Post by Kirill Tkhai
=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 13:50 +0400, Kirill Tkhai =D0=BF=

Post by Kirill Tkhai
=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 11:43 +0200, Peter Zijlstra =

=D0=92 =D0=A7=D1=82, 16/10/2014 =D0=B2 09:59 +0200, Peter Zijls=

Post by Oleg Nesterov
--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struc=

t tas

Post by Oleg Nesterov
=20
rcu_read_lock();
cur =3D ACCESS_ONCE(dst_rq->curr);
- if (cur->pid =3D=3D 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur =3D NULL;
=20
/*
=20

=20
Looks like, we have to use the same fix for task_numa_group=

().

=20
Don't think so, task_numa_group() is only called from task_nu=

ma_fault()

which is on 'current' and neither idle and PF_EXITING should =

faulting.

=20
Isn't task_numa_group() fully preemptible?

=20
Not seeing how that is relevant.
=20

It seems cpu_rq(cpu)->curr is not always equal to p.

=20
=20
task_numa_fault()
p =3D current;
=20
task_numa_group(p, ..);
=20
And like said, idle tasks and PF_EXITING task should never get (n=

uma)

Post by Peter Zijlstra
faults for they should never be touching userspace.

=20
I mean p can be moved to other cpu.
=20
tsk =3D ACCESS_ONCE(cpu_rq(cpu)->curr);
=20
tsk is not p, (i.e current) here.

=20
Maybe I undestand wrong and preemption is disabled in memory fault?

Ah, I found pagefault_disable(). No questions.

Oleg Nesterov

2014-10-17 21:34:40 UTC

Post by Kirill Tkhai
Cool! Elegant fix. We set PF_EXITING in exit_signals(), which is earlier
than release_task() is called.

OK, thanks, I am sending the patch...

Post by Kirill Tkhai
Shouldn't we use smp_rmb/smp_wmb here?

No, we do not. call_rcu(delayed_put_pid) itself implies the barrier on
all CPUs. IOW, by the time RCU actually calls delayed_put_pid() every
CPU must see all memory changes which were done before call_rcu() was
called. And otoh, all rcu-read-lock critical sections which could miss
PF_EXITING should be already finished.

Oleg.

Peter Zijlstra

2014-10-16 07:56:50 UTC

Post by Oleg Nesterov
What do you think?
Oleg.
--- x/kernel/sched/fair.c
+++ x/kernel/sched/fair.c
@@ -1165,7 +1165,7 @@ static void task_numa_compare(struct tas
rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ if (is_idle_task(cur) || (curr->flags & PF_EXITING))
cur = NULL;
/*

That makes sense, is_idle_task() is indeed the right function there, and
PF_EXITING avoids doing work where it doesn't make sense.

Peter Zijlstra

2014-10-16 08:01:06 UTC

Post by Oleg Nesterov
At least the code like
rcu_read_lock();
get_task_struct(foreign_rq->curr);
rcu_read_unlock();
is certainly wrong. And _probably_ the problem should be fixed here. Perhaps
we can add try_to_get_task_struct() which does atomic_inc_not_zero() ...

There is an rcu_read_lock() around it through task_numa_compare().

Oleg Nesterov

2014-10-16 22:05:40 UTC

There is an rcu_read_lock() around it through task_numa_compare().

Yes, and the code above has rcu_read_lock() too. But it doesn't help
as Kirill pointed out.

Sorry, didn't have time today to read other emails in this thread,
will do tomorrow and (probably) send the patch which adds PF_EXITING
check.

Oleg.

Oleg Nesterov

2014-10-17 21:36:41 UTC

The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().

And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp.

Reported-by: Kirill Tkhai <***@parallels.com>
Signed-off-by: Oleg Nesterov <***@redhat.com>
---
kernel/sched/fair.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0090e8c..52049b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa_env *env,

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe; note that rcu_read_lock() can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if (is_idle_task(cur) || (cur->flags & PF_EXITING))
cur = NULL;

/*

--
1.5.5.1

Kirill Tkhai

2014-10-18 08:15:01 UTC

The lockless get_task_struct(tsk) is only safe if tsk =3D=3D current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp.
---
=9Akernel/sched/fair.c | =9A=9A=9A8 +++++++-
=9A1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0090e8c..52049b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1158,7 +1158,13 @@ static void task_numa_compare(struct task_numa=

_env *env,

=9A=9A=9A=9A=9A=9A=9A=9A=9Arcu_read_lock();
=9A=9A=9A=9A=9A=9A=9A=9A=9Acur =3D ACCESS_ONCE(dst_rq->curr);
- if (cur->pid =3D=3D 0) /* idle */
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe; note that rcu_read_lock() can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if (is_idle_task(cur) || (cur->flags & PF_EXITING))
=9A=9A=9A=9A=9A=9A=9A=9A=9A=9A=9A=9A=9A=9A=9A=9A=9Acur =3D NULL;
=9A=9A=9A=9A=9A=9A=9A=9A=9A/*

Oleg, I've looked once again, and now it's not good for me.
Where is the guarantee this memory hasn't been allocated again?
If so, PF_EXITING is not of the task we are interesting, but it's
not a task's even.

rcu_read_lock() ... ...
cur =3D ACCESS_ONCE(dst_rq->curr); ... ...
<interrupt> rq->curr =3D next; ...
<interrupt> put_prev_task() ...
<interrupt> __put_prev_task ...
<interrupt> kmem_cache_free() ...
<interrupt> ... <alocat=
ed again>
<interrupt> ... memset(=
, 0, )
<interrupt> ... ...
if (cur->flags & PF_EXITING) ... ...
<no> ... ...
get_task_struct() ... ...

Kirill

Kirill Tkhai

2014-10-18 08:33:27 UTC

=9AThe lockless get_task_struct(tsk) is only safe if tsk =3D=3D curr=

ent

=9Aand didn't pass exit_notify(), or if this tsk was found on a rcu
=9Aprotected list (say, for_each_process() or find_task_by_vpid()).
=9AIOW, it is only safe if release_task() was not called before we
=9Atake rcu_read_lock(), in this case we can rely on the fact that
=9Adelayed_put_pid() can not drop the (potentially) last reference
=9Auntil rcu_read_unlock().
=9AAnd as Kirill pointed out task_numa_compare()->task_numa_assign()
=9Apath does get_task_struct(dst_rq->curr) and this is not safe. The
=9Atask_struct itself can't go away, but rcu_read_lock() can't save
=9Aus from the final put_task_struct() in finish_task_switch(); this
=9Areference goes away without rcu gp.
=9A---
=9A=9Akernel/sched/fair.c | =9A=9A=9A8 +++++++-
=9A=9A1 files changed, 7 insertions(+), 1 deletions(-)
=9Adiff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
=9Aindex 0090e8c..52049b9 100644
=9A--- a/kernel/sched/fair.c
=9A+++ b/kernel/sched/fair.c

numa_env *env,

=9A=9A=9A=9A=9A=9A=9A=9A=9A=9Arcu_read_lock();
=9A=9A=9A=9A=9A=9A=9A=9A=9A=9Acur =3D ACCESS_ONCE(dst_rq->curr);
=9A- if (cur->pid =3D=3D 0) /* idle */
=9A+ /*
=9A+ * No need to move the exiting task, and this ensures that ->cur=

=9A+ * wasn't reaped and thus get_task_struct() in task_numa_assign(=

)

=9A+ * is safe; note that rcu_read_lock() can't protect from the fin=