Discussion:
[PULL REQUEST]: ima-appraisal patches
Mimi Zohar
2012-04-18 13:04:29 UTC
Hi James,

As the last IMA-appraisal posting on 3/29 addressed Al's
performance/maintenance concerns about deferring the __fput(), and there
have been no additional comments, please consider pulling the
IMA-appraisal patches.

The linux-integrity.git also contains the two prereqs:
vfs: fix IMA lockdep circular locking dependency (Acked by Eric)
vfs: iversion truncate bug fix (currently in linux-next, via Andrew)

The following changes since commit eadc10b3e17f00681f7bfb2ed6e4aee39ad93f03:

vfs: extend vfs_removexattr locking (2012-04-18 07:06:55 -0400)

are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git next-ima-appraisal

thanks,

Mimi

Dmitry Kasatkin (3):
ima: free securityfs violations file
ima: allocating iint improvements
ima: digital signature verification support

Mimi Zohar (8):
vfs: move ima_file_free before releasing the file
ima: integrity appraisal extension
ima: add appraise action keywords and default rules
ima: replace iint spinlock with rwlock/read_lock
ima: add inode_post_setattr call
ima: add ima_inode_setxattr/removexattr function and calls
ima: defer calling __fput()
ima: add support for different security.ima data types

Documentation/ABI/testing/ima_policy | 25 ++-
Documentation/kernel-parameters.txt | 8 +
fs/attr.c | 2 +
fs/file_table.c | 7 +-
include/linux/ima.h | 32 +++
include/linux/integrity.h | 7 +-
include/linux/xattr.h | 3 +
security/integrity/evm/evm_main.c | 3 +
security/integrity/iint.c | 64 +++----
security/integrity/ima/Kconfig | 15 ++
security/integrity/ima/Makefile | 2 +
security/integrity/ima/ima.h | 37 ++++-
security/integrity/ima/ima_api.c | 56 ++++--
security/integrity/ima/ima_appraise.c | 344 +++++++++++++++++++++++++++++++++
security/integrity/ima/ima_crypto.c | 8 +-
security/integrity/ima/ima_fs.c | 1 +
security/integrity/ima/ima_main.c | 89 ++++++---
security/integrity/ima/ima_policy.c | 181 +++++++++++++-----
security/integrity/integrity.h | 11 +-
security/security.c | 6 +
20 files changed, 754 insertions(+), 147 deletions(-)
create mode 100644 security/integrity/ima/ima_appraise.c


--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
James Morris
2012-04-18 15:02:05 UTC
Post by Mimi Zohar
Hi James,
As the last IMA-appraisal posting on 3/29 addressed Al's
performance/maintenance concerns about deferring the __fput(), and there
have been no additional comments, please consider pulling the
IMA-appraisal patches.
I thought Dmitry was still waiting on comments from Al for this.



- James
--
James Morris
<***@namei.org>
Mimi Zohar
2012-04-18 18:07:52 UTC
Post by James Morris
Post by Mimi Zohar
Hi James,
As the last IMA-appraisal posting on 3/29 addressed Al's
performance/maintenance concerns about deferring the __fput(), and there
have been no additional comments, please consider pulling the
IMA-appraisal patches.
I thought Dmitry was still waiting on comments from Al for this.
Al NAK'ed the original 'ima: defer calling __fput()' patch from 3/22,
but we addressed both of Al's performance and maintenance concerns in the
last posting on 3/29, which was three weeks ago. We haven't heard
back ...
ima_file_free(), which is called from __fput(), updates the file data
hash stored as an extended attribute to reflect file changes. If a
file is closed before it is munmapped, __fput() is called with the
mmap_sem taken. With IMA-appraisal enabled, this results in an
mmap_sem/i_mutex lockdep warning. ima_defer_fput() increments the f_count
to defer calling __fput() until after the mmap_sem is released.

The number of __fput() calls needing to be deferred is minimal. Only
those files in policy that were closed prior to the munmap and were
mmapped for write need to defer the __fput().

With this patch, on a clean F16 install, from boot to login, only
5 out of ~100,000 mmap_sem-held fput() calls were deferred.

Changelog v4:
- ima_defer_fput() performance improvements
- move ima_defer_fput() call from remove_vma() to __fput()
- only defer mmapped files opened for write
- remove unnecessary ima_fput_mempool_destroy()

Mimi

Al Viro
2012-04-18 18:39:38 UTC
Post by Mimi Zohar
ima_file_free(), which is called on __fput(), updates the file data
hash stored as an extended attribute to reflect file changes. If a
file is closed before it is munmapped, __fput() is called with the
mmap_sem taken. With IMA-appraisal enabled, this results in an
mmap_sem/i_mutex lockdep. ima_defer_fput() increments the f_count to
defer the __fput() being called until after the mmap_sem is released.
The number of __fput() calls needing to be deferred is minimal. Only
those files in policy, that were closed prior to the munmap and were
mmapped write, need to defer the __fput().
With this patch, on a clean F16 install, from boot to login, only
5 out of ~100,000 mmap_sem held fput() calls were deferred.
Assuming that it's commit 3cee52ffe8ca925bb1e96f804daa87f7e2e34e46
Author: Mimi Zohar <***@linux.vnet.ibm.com>
Date: Fri Feb 24 06:23:12 2012 -0500

ima: defer calling __fput()
in your tree, the NAK still stands. For starters, you are creating
different locking rules for IMA-enabled builds and for everything else.
Moreover, this deferral is done only for files opened for write; the
rules are convoluted as hell *and* inviting abuses.

NAKed at least until you come up with formal proof that there's no other
lock where fput() would be possible and ->i_mutex was not allowed. This
is not a way to go; that kind of kludges leads to locking code that is
impossible to reason about.

PS: BTW, what the hell is "fput already scheduled" codepath about?
Why is it pr_info() and not an outright BUG_ON()?
Mimi Zohar
2012-04-18 20:56:05 UTC
Post by Al Viro
Post by Mimi Zohar
ima_file_free(), which is called on __fput(), updates the file data
hash stored as an extended attribute to reflect file changes. If a
file is closed before it is munmapped, __fput() is called with the
mmap_sem taken. With IMA-appraisal enabled, this results in an
mmap_sem/i_mutex lockdep. ima_defer_fput() increments the f_count to
defer the __fput() being called until after the mmap_sem is released.
The number of __fput() calls needing to be deferred is minimal. Only
those files in policy, that were closed prior to the munmap and were
mmapped write, need to defer the __fput().
With this patch, on a clean F16 install, from boot to login, only
5 out of ~100,000 mmap_sem held fput() calls were deferred.
Assuming that it's commit 3cee52ffe8ca925bb1e96f804daa87f7e2e34e46
Date: Fri Feb 24 06:23:12 2012 -0500
ima: defer calling __fput()
in your tree, the NAK still stands. For starters, you are creating
different locking rules for IMA-enabled builds and for everything else.
Moreover, this deferral is done only for files opened for write; the
rules are convoluted as hell *and* inviting abuses.
Yes, that is the updated version. For performance, we limited deferring
the __fput() to only those files that could possibly change - opened for
write, closed before being munmapped, and for which IMA-appraisal
maintains a file data hash as an xattr. If the main concern is
different locking rules when IMA is enabled, then we could remove the
IMA criteria and rename ima_defer_fput() to something more generic.

As for "*and* inviting abuses", I'm not sure what you mean.
Post by Al Viro
NAKed at least until you come up with formal proof that there's no other
lock where fput() would be possible and ->i_mutex was not allowed. This
is not a way to go; that kind of kludges leads to locking code that is
impossible to reason about.
On __fput(), we need to update the security.ima xattr with a hash of the
file data. The original thread discussion suggested changing the xattr
locking. The filesystems seem to do their own xattr locking, but in
fs/xattr.c the i_mutex is taken before accessing the inode
setxattr/removexattr ops.

hm, lockdep isn't complaining about anything else. Not sure if that
qualifies as formal proof.
Post by Al Viro
PS: BTW, what the hell is "fput already scheduled" codepath about?
Why is it pr_info() and not an outright BUG_ON()?
I'll fix this.

thanks,

Mimi

Al Viro
2012-04-20 00:43:03 UTC
Post by Mimi Zohar
Has the discussion here moved from deferring the __fput() for the
mmap_sem/i_mutex lockdep side case, to taking the i_mutex in __fput() in
general? Lockdep has not reported any problems, other than for the
mmap_sem/i_mutex scenario.
Post by Al Viro
This
is not a way to go; that kind of kludges leads to locking code that is
impossible to reason about.
Are you referring to deferring the __fput() or taking the i_mutex in
__fput() in general?
I'm referring to having "are we holding ->mmap_sem" logics anywhere in fput().
Note that your "lockdep has not reported..." is a symptom of the same
problem - it's *NOT* enough to show the lack of deadlocks from the change
of locking rules. And after that change we'll get even worse situation,
since proving the safety will become harder (and even more so if we end
up adding other cases when we need to defer).

Summary for those who'd been spared the earlier fun: IMA stuff wants to grab
->i_mutex on the final fput() in some cases. That violates the locking
order in at least one way - final fput() may come under ->mmap_sem, in
which case grabbing ->i_mutex would be a Bloody Bad Idea(tm). "Solution"
proposed: a bit of spectacularly ugly logic checking in final fput() whether
we have ->mmap_sem locked. If we do, said final fput() becomes non-final
and is done asynchronously. This is fundamentally flawed, of course,
since "is ->mmap_sem unlocked" is used as "is it safe to have ->i_mutex
grabbed", and it's both unproven and brittle as hell even if it happens
to be true right now (and I wouldn't bet on that, TBH).

If it had been IMA alone, I would've cheerfully told them where to stuff
the whole thing. Unfortunately, fput() *is* lock-heavy even without them.
Note that it can easily end up triggering such things as final
deactivate_super(). Now, ->mmap_sem is irrelevant here - after all,
any inodes involved won't be mmapped, or deactivate_super() wouldn't
be final. However, fput() is called e.g. under rtnl_lock() and I'm
not at all sure that something like NFS won't try to grab it from its
->kill_sb().

So the question is, do we have any reasonable way to get to the situation
where the __fput() would only be done without any locks held? It might
be worth trying. What we CAN'T do:
* defer all __fput() (i.e. always do final fput() async). Obviously
no-go, for performance reasons alone.
* check some predicate about the set of locks held and defer if it's
false. That's what IMA folks have tried to do; no-go for the reasons described
above.
* add a new helper (fput_carefully() or something like that) that would
defer final __fput(), leaving fput() with the current behaviour. And convert
the potentially unsafe callers to it (starting with everything that holds
->mmap_sem). No-go since it's not maintainable - a change pretty far away
from a function that does (currently safe) fput() can end up requiring the
conversion to fput_carefully(). Too easy to miss, so this will decay and it
won't be easy to verify correctness several years down the road.

However, there's an approach that might be feasible. Most of the time
the final fput() *is* done without any locks held and there's a very
large subclass of those call sites - those that come via fput_light().
What we could do, and what might be maintainable is:
* prohibit fput_light() with locks held. Right now we are very
close to that (or already there - I haven't finished checking).
* convert low-hanging fget/fput in syscalls to fget_light/fput_light.
Makes sense anyway.
* split off fput_nodefer() (identical to what fput() is right now),
make fput_light() call it instead of fput(). Switch path_lookupat() and
path_openat() to fput_nodefer() as well. Ditto for sys_socketpair() and
sys_accept4().
* make fput() (now almost never leading to __fput()) defer the sucker
in very rare cases when it ends up dropping the last reference.
At that point we would have __fput() always done without any locks held,
which would remove all restrictions on the locks that can be taken from it.

Comments?

Disclaimer: I have not finished going through the call sites (note that
there are wrappers - e.g. sockfd_put() *and* sockfd_lookup()), so there might
be obstacles. In particular, I'm still not sure about SCM_RIGHTS datagrams
handling - IIRC, we had an interesting kludge in there, with a kinda-sorta
deferral/batching. BTW, I wonder about deadlocks around that one -
drivers/vhost/net.c is doing ->recvmsg() while holding vq->mutex and if
an SCM_RIGHTS datagram comes in, we'll get a bunch of fput() done. Which
can lead to final umount of a network filesystem, etc. If that thing can
lead to handle_rx() on the same sucker, we have a deadlock...
Linus Torvalds
2012-04-20 02:31:01 UTC
However, there's an approach that might be feasible. Most of the time
the final fput() *is* done without any locks held and there's a very
large subclass of those call sites - those that come via fput_light().
	* prohibit fput_light() with locks held. Right now we are very
close to that (or already there - I haven't finished checking).
	* convert low-hanging fget/fput in syscalls to fget_light/fput_light.
Makes sense anyway.
Many of them would make sense, yes (looking at vfs_fstatat(), etc.).

But a lot of fput() calls come from close() -> filp_close() -> fput().

And the "fput_light()" model *only* works together with fget_light()
as it is now.

So I do think you need some other model. Of course, we can just do
"fput_light(file, 1)" instead - that seems pretty ugly, though. But
just making "fput()" do a defer on the last count sounds actively
*wrong* for things like close(), which may actually have serious
consistency guarantees (i.e. the process doing the close() may "know"
that it is the last user, and depend on the fact that the close() did
actually delete the inode, etc.).

Linus
Al Viro
2012-04-20 02:54:38 UTC
Post by Linus Torvalds
However, there's an approach that might be feasible. Most of the time
the final fput() *is* done without any locks held and there's a very
large subclass of those call sites - those that come via fput_light().
	* prohibit fput_light() with locks held. Right now we are very
close to that (or already there - I haven't finished checking).
	* convert low-hanging fget/fput in syscalls to fget_light/fput_light.
Makes sense anyway.
Many of them would make sense, yes (looking at vfs_fstatat() etc.
But a lot of fput() calls come from close() -> filp_close -> fput().
And the "fput_light()" model *only* works together with fget_light()
as it is now.
So I do think you need some other model. Of course, we can just do
"fput_light(file, 1)" instead - that seems pretty ugly, though. But
just making "fput()" do a defer on the last count sounds actively
*wrong* for things like close(), which may actually have serious
consistency guarantees (i.e. the process doing the close() may "know"
that it is the last user, and depend on the fact that the close() did
actually delete the inode, etc.).
Umm... I really wonder if we *want* filp_close() under any kind of
locks. You are right - it should not be deferred. I haven't finished
checking the callers of that puppy, but if we really do it while holding
any kind of lock, we are asking for trouble. So I'd rather switch
filp_close() to use of fput_nodefer() if that turns out to be possible.

FWIW, the set of primitives I'm thinking of right now is

__fput(file) - same as now
schedule_fput(file) - takes the only reference to file and schedules __fput()

fput_nodefer(file)
{
	if (atomic_long_dec_and_test(&file->f_count))
		__fput(file);
}

fput(file)
{
	if (unlikely(!fput_atomic(file)))
		schedule_fput(file);
}

fput_light(file, need_fput)
{
	if (need_fput)
		fput_nodefer(file);
}

fput_light_defer(file, need_fput)	/* for callers in some weird ioctls,
					   might not be needed at all */
{
	if (need_fput)
		fput(file);
}

and filp_close() would, if that turns out to be possible, call fput_nodefer()
instead of fput(). If we *do* have places where we need deferral in
filp_close() (and I'm fairly sure that any such place is a deadlock right
now), well, we'll need a variant of filp_close() sans the call of fput...()
and those places would call that, followed by full (deferring) fput().
Linus Torvalds
2012-04-20 02:58:57 UTC
Umm... I really wonder if we *want* filp_close() under any kind of
locks. You are right - it should not be deferred. I haven't finished
checking the callers of that puppy, but if we really do it while holding
any kind of lock, we are asking for trouble. So I'd rather switch
filp_close() to use of fput_nodefer() if that turns out to be possible.

Ok, fair enough, looks like a reasonable plan to me.

Linus
Al Viro
2012-04-20 08:09:14 UTC
Post by Linus Torvalds
Umm... I really wonder if we *want* filp_close() under any kind of
locks. You are right - it should not be deferred. I haven't finished
checking the callers of that puppy, but if we really do it while holding
any kind of lock, we are asking for trouble. So I'd rather switch
filp_close() to use of fput_nodefer() if that turns out to be possible.
Ok, fair enough, looks like a reasonable plan to me.
Actually, scratch that; I think I have a better variant:
* switch to fget_light/fput_light where possible; it's not needed
for the rest, but is useful anyway
* move the guts of filp_close() (everything prior to fput() it does
in the end) into a new helper, turning filp_close() into a couple of calls
(inlined, at that). Equivalent transformation.
* add fput_nodefer(); does the same thing fput() does now. Make
fput() defer the call of __fput(). Add a filp_close_nodefer() - same as
filp_close(), but the second call in it is fput_nodefer(), not fput(). And
switch the callers that are known to be done outside of any locks to
filp_close_nodefer(). Switch accept4()/socketpair() to fput_nodefer()
(for freshly created struct file in case of cleanup on failure exit).

Note that we do *not* need to bother with fput_light() - even if it does
fput(), that fput() won't usually be the final one. We'd need it to
race with close() or dup2() from another thread for that to happen. So
we only need to review the callers of filp_close() and there are much
fewer of those.

We also get something else out of that - AFAICS, the kludge in __scm_destroy()
can be killed after that. We did it to prevent recursion on fput(), right?
Now that recursion will be gone...

I'd probably not bother with trying to be clever in doing the deferral itself -
the point is to make it rare, so it's not a hot path anyway. We can play
with per-CPU lists, etc., but in this case dumber is better.

Comments? I'm half-asleep right now, so I might be missing something
obvious...
Linus Torvalds
2012-04-20 15:56:57 UTC
Note that we do *not* need to bother with fput_light() - even if it does
fput(), that fput() won't usually be the final one.
Ack. Most of the time the fput_light()->fput will just decrement the use count.
We also get something else out of that - AFAICS, the kludge in __scm_destroy()
can be killed after that. We did it to prevent recursion on fput(), right?
Now that recursion will be gone...
Hmm.. That points out that we may have a *lot* of these pending final
fput's, though. So the deferral queueing should be fairly light. What
were your particular plans for it?

This actually sounds like a fairly good usage-case for Oleg's new
task_work_add() thing. That would defer the final fput, but at the
same time guarantee that it gets done before returning to user space -
in case there are any issues with synchronous actions. Have you looked
at Oleg's series? You weren't cc'd because it didn't affect you, but
look at lkml for "task_work_add()" to find it.

NOTE! If pure kernel threads do fput() deferral (and maybe they do -
I'm thinking nfsd etc), then the task-work thing might need some extra
thought.

Linus
Al Viro
2012-04-20 16:08:48 UTC
Post by Linus Torvalds
Post by Al Viro
We also get something else out of that - AFAICS, the kludge in __scm_destroy()
can be killed after that. We did it to prevent recursion on fput(), right?
Now that recursion will be gone...
Hmm.. That points out that we may have a *lot* of these pending final
fput's, though. So the deferral queueing should be fairly light. What
were your particular plans for it?
Doing removal from per-sb list immediately (i.e. before possible
deferral; we skip ones with zero ->f_count when we walk the list
anyway), then in case we decide to defer just move them to per-CPU
list and schedule work on that CPU, with handler that will pull the
corresponding list out and do the rest of __fput() for everything
in that list. No extra locking, just preempt_disable() around the
"move to per-CPU list" bit. Or a per-CPU spinlock with worker not
being tied to a specific CPU and told which CPU's list to work with.
How does CPU hotplug interact with work scheduled on a CPU about to
be taken down, BTW?
Al Viro
2012-04-20 16:42:39 UTC
Post by Al Viro
Doing removal from per-sb list immediately (i.e. before possible
deferral; we skip ones with zero ->f_count when we walk the list
anyway), then in case we decide to defer just move them to per-CPU
list and schedule work on that CPU, with handler that will pull the
corresponding list out and do the rest of __fput() for everything
in that list. No extra locking, just preempt_disable() around the
"move to per-CPU list" bit. Or a per-CPU spinlock with worker not
being tied to specific CPU and told which CPU's list to work with.
How does CPU hotplug interact with work scheduled on CPU about to
be taken down, BTW?
Actually, I like the per-CPU spinlock variant better; the thing is,
with that scheme we get normal fput() (i.e. non-nodefer variant)
non-blocking. How about this:

__fput() loses file_sb_list_del() call

fput(file)
{
	if (atomic_long_dec_and_test(...)) {
		unsigned long flags;
		struct foo *p;

		file_sb_list_del(file);
		p = &get_cpu_var(deferral_lists);
		spin_lock_irqsave(&p->lock, flags);
		list_move(&file->f_u.fu_list, &p->list);
		spin_unlock_irqrestore(&p->lock, flags);
		schedule_work(&p->work);
		put_cpu_var(deferral_lists);
	}
}

fput_nodefer(file)
{
	if (atomic_long_dec_and_test(...)) {
		file_sb_list_del(file);
		__fput(file);
	}
}

do_deferred_fput_work(work)
{
	struct foo *p = container_of(work, struct foo, work);
	LIST_HEAD(list);

	spin_lock_irq(&p->lock);
	list_splice_init(&p->list, &list);
	spin_unlock_irq(&p->lock);

	while (!list_empty(&list)) {
		struct file *file = list_first_entry(&list, struct file, f_u.fu_list);
		list_del_init(&file->f_u.fu_list);
		__fput(file);
	}
}

Voila - now only fput_nodefer() is blocking! fput() can be used from
any context that way, which should kill e.g. a kludge in fs/aio.c.

Comments?
Linus Torvalds
2012-04-20 17:21:35 UTC
Post by Al Viro
Actually, I like the per-CPU spinlock variant better; the thing is,
with that scheme we get normal fput() (i.e. non-nodefer variant)
What's the advantage of a per-cpu lock?

If you make the work be per-cpu, then you're better with no locking at
all: just disable interrupts (which you do anyway).

And if you want to use a spinlock, don't bother with the percpu side.

The thing I do not like about the schedule_work approach is that it
(a) totally hides the real cost - which is the scheduling - and (b)
it is so asynchronous that it will happen potentially long after the
task dropped the reference.

And seriously - that is user-visible behavior.

For example, think about this *common* pattern:

open+mmap+close+unlink+munmap

which would trigger the whole deferred fput, but also triggers the
actual real unlink() at fput time.

Right now, you can have that kind of thing in a program and
immediately unmount the filesystem afterwards (replace "unmount" with
"cannot see silly-renamed files" etc).

The "totally asynchronous deferral" literally *breaks* semantics.

Sure, it won't be noticeable in 99.99% of all cases, and I doubt you
can trigger much of a test for it. But it's potential real breakage,
and it's going to be hard to ever see. And then when it *does* happen,
it's going to be totally impossible to debug.

It's not just the "last unlink" thing that gets delayed. It's things
like file locking. It's "drop_file_write_access()". It's whatever
random thing that file does at "release()". It's a ton of things like
that. Delaying them has user-visible actions.

That's a whole can of complexities and worries outside of the kernel
interface that you are completely ignoring - just because you are
trying to solve the *simple* complexity of locking interaction
entirely within the kernel.

I think that's a bit myopic. We don't even *know* what the problems
with the async approach might be. Your "simple" solution is simple
only inside the kernel.

This is why I suggested you look at Oleg's patches. If we guarantee
that things won't be delayed past re-entering user mode, all those
issues go away.

Linus
Al Viro
2012-04-20 18:07:48 UTC
Post by Linus Torvalds
Post by Al Viro
Actually, I like the per-CPU spinlock variant better; the thing is,
with that scheme we get normal fput() (i.e. non-nodefer variant)
What's the advantage of a per-cpu lock?
If you make the work be per-cpu, then you're better with no locking at
all: just disable interrupts (which you do anyway).
Point taken.
Post by Linus Torvalds
The thing I do not like about the schedule_work approach is that it
(a) totally hides the real cost - which is the scheduling - and (b)
it is so asynchronous that it will happen potentially long after the
task dropped the reference.
[snip]
Post by Linus Torvalds
This is why I suggested you look at Oleg's patches. If we guarantee
that things won't be delayed past re-entering user mode, all those
issues go away.
I've looked at them. One obvious problem is that tracehook_notify_resume()
is not universally called. AFAICS, hexagon, m68k, microblaze, score, um
and xtensa never call tracehook_notify_resume(). Out of those, hexagon is
manually checking TIF_NOTIFY_RESUME and does key_replace_session_keyring(),
so the call could be easily added into the same place; the rest of those
guys don't even look at TIF_NOTIFY_RESUME anywhere near their signal*.c
and m68k/um/xtensa don't even have it defined, let alone handled. So this
stuff depends on some amount of asm glue hacking on several architectures ;-/

Another is that final fput() can, indeed, happen in kernel threads, so
pure switch to task_work_add() won't be enough. I agree that having this
stuff flushed before we return to userland would be a good thing; the
question is, can we easily tell if there will be such a return to userland?
Al Viro
2012-04-23 18:01:50 UTC
Post by Al Viro
Post by Linus Torvalds
This is why I suggested you look at Oleg's patches. If we guarantee
that things won't be delayed past re-entering user mode, all those
issues go away.
I've looked at them. One obvious problem is that tracehook_notify_resume()
is not universally called. AFAICS, hexagon, m68k, microblaze, score, um
and xtensa never call tracehook_notify_resume(). Out of those, hexagon is
manually checking TIF_NOTIFY_RESUME and does key_replace_session_keyring(),
so the call could be easily added into the same place; the rest of those
guys don't even look at TIF_NOTIFY_RESUME anywhere near their signal*.c
and m68k/um/xtensa don't even have it defined, let alone handled. So this
stuff depends on some amount of asm glue hacking on several architectures ;-/
BTW, I've looked into dealing with that; I think I have a tentative solution
for all these architectures.
* hexagon: just needs tracehook_notify_resume() added, everything
else is already in place
* score: TIF_NOTIFY_RESUME is defined *and* included into the
"we need to call do_notify_resume()" logics in assembler glue; just need
to add the usual boilerplate into said do_notify_resume()
* um: glue in question is in C; easily dealt with, I can do that
(and test the results) tonight
* m68k: that'll need some glue changes; AFAICS, the easiest solution
is to put TIF_NOTIFY_RESUME into bit 5 - then nommu glue needs no changes
at all, and entry_mm.S needs two "jmi do_signal_return" replaced with
"jne do_signal_return"; the code before those shifts bit 6 to MSBit and
currently bits 0--5 are unused. Replacing "most significant bit is set" with
"some bits are set" would do the right thing, AFAICT - make the sucker
go into do_signal() handling if either TIF_SIGPENDING (bit 6) or
TIF_NOTIFY_RESUME (bit 5) is set (at that point it has already checked that
TIF_NEED_RESCHEDULE is not set). On top of that it will need the obvious
changes in do_signal() itself - boilerplate added and current contents
made conditional on TIF_SIGPENDING being set. I can only test that
on aranym, though - all real m68k hardware I have is pining for fjords right
now.
* microblaze: TIF_NOTIFY_RESUME is defined, but not hooked anywhere.
Fortunately, the glue is easy enough there - all relevant spots have the
same form
lwi r11, r11, TI_FLAGS; /* get flags in thread info */
andi r11, r11, _TIF_SIGPENDING;
beqi r11, 1f; /* Signals to handle, handle them */
and replacing that _TIF_SIGPENDING with _TIF_SIGPENDING | _TIF_NOTIFY_RESUME
will do the right thing; of course, do_signal() itself will need to be
taught about TIF_NOTIFY_RESUME - same as in case of m68k. No hardware, no
emulators set up, but then it's less intrusive in the glue part than m68k
counterpart.
* xtensa: TIF_NOTIFY_RESUME needs to be defined (bit 7 would do,
AFAICS) and there the glue does need some change:
l32i a4, a2, TI_FLAGS

_bbsi.l a4, TIF_NEED_RESCHED, 3f
_bbci.l a4, TIF_SIGPENDING, 4f
should be replaced with (if I'm not misreading their ISA documentation)
l32i a4, a2, TI_FLAGS

_bbsi.l a4, TIF_NEED_RESCHED, 3f
_bbsi.l a4, TIF_SIGPENDING, 2f
_bbci.l a4, TIF_NOTIFY_RESUME, 4f
2:
(and do_signal() changes, of course). That's the most intrusive one and
again, I've neither hw nor emulators for that sucker.

I'll post the patches for all of those tonight; if everything ends up working,
at least we can get rid of the ifdefs on TIF_NOTIFY_RESUME.

Oleg, where does your tree live? I've walked through the signal handling
on all targets over this weekend (and it's still not complete - there are
fun bugs re multiple sigframes and restart handling on many of those) and
my current queue is at git.kernel.org/pub/scm/linux/kernel/git/viro/signal;
I don't promise that it even builds on everything, so it's *not* a pull
request. Besides, it's still growing and will be reordered... The issues
dealt with by now:
[done] don't open-code force_sigsegv()
[done] don't open-code block_sigmask()
[done] If we have failed to set sigframe up, we should send SIGSEGV with
force_sigsegv() (or force_sigsegv_info()) and leave the sigmask alone;
otherwise we need to call block_sigmask() and clear RESTORE_SIGMASK
[done] looking at SA_ONESHOT is pointless (it's a rudiment of very old things;
these days kernel/signal.c does it properly and arch/* code doesn't need to
and actually can't - it only gets a local copy filled, so clearing sa_handler
in it is bloody pointless)
[done] all sigsuspend variants should use sigsuspend() helper (i.e. be based
on use of ->saved_sigmask; see the first patch in queue introducing the helper
in question)
[done] restart_block.fn should be reset on sigreturn, not on signal delivery
[done] sigreturn variants should use set_current_blocked(); make sure to remove KILL/STOP from the set first
[mostly done; parisc and blackfin left] check __get_user()/__put_user() results

I really want to get arch/*/*/*signal* more or less in sync wrt fixes; having
TIF_NOTIFY_RESUME working fits nicely into that. I'd appreciate a look through
that stuff...

PS: I don't think I'll be posting any pull requests on that tree; it's just
a staging ground for future linux-arch patchbomb(s).
Oleg Nesterov
2012-04-23 18:37:34 UTC
Post by Al Viro
Post by Al Viro
Post by Linus Torvalds
This is why I suggested you look at Oleg's patches. If we guarantee
that things won't be delayed past re-entering user mode, all those
issues go away.
I've looked at them. One obvious problem is that it tracehook_notify_resume()
is not universally called. AFAICS, hexagon, m68k, microblaze, score, um
and xtensa never call tracehook_notify_resume().
Yes,
Post by Al Viro
Post by Al Viro
Out of those, hexagon is
manually checking TIF_NOTIFY_RESUME and does key_replace_session_keyring(),
so the call could be easily added into the same place;
Yes, I sent the trivial patch.
Post by Al Viro
Post by Al Viro
the rest of those
guys don't even look at TIF_NOTIFY_RESUME anywhere near their signal*.c
and m68k/um/xtensa don't even have it defined, let alone handled. So this
stuff depends on some amount of asm glue hacking on several architectures ;-/
which I don't understand even remotely :/ But I'll try to understand at
least something.
Post by Al Viro
Oleg, where does your tree live?
Cough. Nowhere. I still can't restore my korg account.
Post by Al Viro
I've walked through the signal handling
on all targets over this weekend (and it's still not complete - there are
fun bugs re multiple sigframes and restart handling on many of those) and
my current queue is at git.kernel.org/pub/scm/linux/kernel/git/viro/signal;
Thanks. I don't think you need my help, but I'll certainly try to look this
week anyway.

Oleg.
Al Viro
2012-04-24 07:26:17 UTC
Post by Al Viro
BTW, I've looked into dealing with that; I think I have a tentative solution
for all these architectures.
* hexagon: just needs tracehook_notify_resume() added, everything
else is already in place
* score: TIF_NOTIFY_RESUME is defined *and* included into the
"we need to call do_notify_resume()" logics in assembler glue; just need
to add the usual boilerplate into said do_notify_resume()
* um: glue in question is in C; easily dealt with, I can do that
(and test the results) tonight
* m68k: that'll need some glue changes; AFAICS, the easiest solution
is to put TIF_NOTIFY_RESUME into bit 5 - then nommu glue needs no changes
at all, and entry_mm.S needs two "jmi do_signal_return" replaced with
"jne do_signal_return"; the code before those shifts bit 6 to MSBit and
currently bits 0--5 are unused. Replacing "most significant bit is set" with
"some bits are set" would do the right thing, AFAICT - make the sucker
go into do_signal() handling if either TIF_SIGPENDING (bit 6) or
TIF_NOTIFY_RESUME (bit 5) is set (at that point it has already checked that
TIF_NEED_RESCHED is not set). On top of that it will need the obvious
changes in do_signal() itself - boilerplate added and current contents
made conditional on TIF_SIGPENDING being set. I can only test that
on aranym, though - all real m68k hardware I have is pining for fjords right
now.
* microblaze: TIF_NOTIFY_RESUME is defined, but not hooked anywhere.
Fortunately, the glue is easy enough there - all relevant spots have the
same form
lwi r11, r11, TI_FLAGS; /* get flags in thread info */
andi r11, r11, _TIF_SIGPENDING;
beqi r11, 1f; /* Signals to handle, handle them */
and replacing that _TIF_SIGPENDING with _TIF_SIGPENDING | _TIF_NOTIFY_RESUME
will do the right thing; of course, do_signal() itself will need to be
taught about TIF_NOTIFY_RESUME - same as in case of m68k. No hardware, no
emulators set up, but then it's less intrusive in the glue part than m68k
counterpart.
* xtensa: TIF_NOTIFY_RESUME needs to be defined (bit 7 would do,
AFAICS) and there the glue does need some change:
l32i a4, a2, TI_FLAGS
_bbsi.l a4, TIF_NEED_RESCHED, 3f
_bbci.l a4, TIF_SIGPENDING, 4f
should be replaced with (if I'm not misreading their ISA documentation)
l32i a4, a2, TI_FLAGS
_bbsi.l a4, TIF_NEED_RESCHED, 3f
_bbsi.l a4, TIF_SIGPENDING, 2f
_bbci.l a4, TIF_NOTIFY_RESUME, 4f
2:
(and do_signal() changes, of course). That's the most intrusive one and
again, I've neither hw nor emulators for that sucker.
I'll post the patches for all of those tonight; if everything ends up working,
at least we can get rid of the ifdefs on TIF_NOTIFY_RESUME.
Untested variants pushed into signal.git#master; will test tomorrow. In
the meanwhile, any code review (and testing of the entire thing on as many
targets as possible) would be very welcome.

Next target in signal fixes: handling of multiple signal arrivals. We really
need to handle all of them (i.e. build a stack frame for each, with sigcontext
pointing to the entry into the previous one); otherwise we risk
things like e.g. SIGSEGV on failure to build a sigframe being handled at
a random point - we handle one arriving signal, send ourselves SIGSEGV in the
process and return to userland (without actually going into the signal
handler stack frame we'd failed to build). Broken architectures: blackfin,
cris, h8300, hexagon, microblaze and probably ia64...

And then there's a lovely issue with syscall restarts - both on multiple
arriving signals (we want the restart to apply on the _first_ signal being
processed, TYVM, since the rest of those signals are not interrupting
a syscall - conceptually, they are interrupting the previous signal handlers
at the point of entry) and on {rt_,}sigreturn(2) (where we might get a value
in the register normally used to return sys_whatever() value that would look
like one of restart-related special errors, except that it's simply what that
register used to have when e.g. a timer interrupt had hit while we had a signal
pending; hell to debug, since it looks e.g. like one register in userland
process getting its value randomly replaced with -EINTR if it happened to
contain -ERESTARTSYS when an interrupt had happened). I'd fixed one like
that on arm a couple of years ago, but AFAICS we still have several on
other architectures ;-/

BTW, another fun issue is FUBAR assembler variant of rt_sigprocmask() on
itanic; it's missing the fixes done to set_current_blocked() in commit
e6fa16ab "signal: sigprocmask() should do retarget_shared_pending()".
I'm _not_ about to transpose those fixes into ia64 assembler, thank you
very much - Itanic maintainers are whole-heartedly welcome to deal with
that one. The horror in question is fsys_rt_sigprocmask()...
Al Viro
2012-04-25 03:06:59 UTC
Post by Al Viro
And then there's a lovely issue with syscall restarts - both on multiple
arriving signals (we want the restart to apply on the _first_ signal being
processed, TYVM, since the rest of those signals are not interrupting
a syscall - conceptually, they are interrupting the previous signal handlers
at the point of entry) and on {rt_,}sigreturn(2) (where we might get a value
in the register normally used to return sys_whatever() value that would look
like one of restart-related special errors, except that it's simply what that
register used to have when e.g. a timer interrupt had hit while we had a signal
pending; hell to debug, since it looks e.g. like one register in userland
process getting its value randomly replaced with -EINTR if it happened to
contain -ERESTARTSYS when an interrupt had happened). I'd fixed one like
that on arm a couple of years ago, but AFAICS we still have several on
other architectures ;-/
FWIW, there's an interesting question rmk has brought up. Consider the
following scenario (on any architecture):
sigsuspend(2) sees a signal and returns ERESTARTNOHAND.
do_signal() is called and calls get_signal_to_deliver() and gets 0,
for whatever reason.
We decide to restart, return address adjusted, syscall number
returned to the right register in pt_regs. In the meanwhile, no matter what
state interrupts used to have before, get_signal_to_deliver() has enabled
them when returning, so we'll need to reload thread flags. And we find that
another signal has arrived in the meanwhile.
OK, do_signal() is called again, and this time we have a handler for
the arrived signal. We form a stack frame and return to userland, into the
beginning of the handler. We don't even look at the restart-related logics
this time around, due to the usual logics protecting us from double restarts.
Handler is executed, up to rt_sigreturn(2).
We decode the sigcontext, restore pt_regs and return to userland.
Right into the beginning of interrupted sigsuspend()

So we have sigsuspend() hit by a signal we have a handler for. Handler is
executed and we are stuck in sigsuspend() again, all because a signal without
a handler has arrived just before that one - close enough for our signal to
come right after get_signal_to_deliver() has returned zero to do_signal().

AFAICS, that's a clear bug. Arrival of a signal that has userland handler
and that isn't blocked by the mask given to sigsuspend() should terminate
sigsuspend(). Sure, we can weasel around the wording in POSIX (it says
"sigsuspend() will return after the signal-catching function returns" without
explicitly saying that it shouldn't sit around waiting for other signals
to arrive), but it's obviously against the intent of the standard (as well as
expectations of any programmer). Note that if the handler-less signal had
arrived a bit earlier, we would've returned to userland and reentered the
kernel before the arrival of our signal. Then it *would* terminate
sigsuspend() - handler would be executed and we would've returned to caller
of sigsuspend() with EINTR as return value. If they came simultaneously,
the same thing would've happened - get_signal_to_deliver() would have
returned non-zero the first time around and that would be it. It's just the
right timing for the second signal that triggers that race.

Solution proposed last summer when that had been noticed by arm folks was
more or less along the lines of
* new thread flag, checked after we'd seen that no SIGPENDING et.al.
is there. If it's set, we clear it, do syscall restart work as we would for
handlerless signal and recheck the flags if we had to do something like
__put_user() in process (arm might have to do that for ERESTART_RESTARTBLOCK)[1]
* do_signal() would set that flag if
+ anti double-restart logics would not have prevented
restarts
+ error value was ERESTART_...
* no restart work on "no signal" path in do_signal()
* if we have a handler and the flag is set, clear it and do what
we normally do for restarts (including the "has ptrace mangled registers
in a way that would prevent restarts in the current code" logics for
architectures that have such logics - arm and sparc, at least).

I would've probably made that TS_... instead of TIF_... at least for something
like x86 - it's obviously thread-synchronous.

[1] it wasn't covered in that thread, but if a signal arrives during that
__put_user() we really won't care - the usual logics in sigreturn() will
make sure that when we return from handler into the resulting restart_syscall(2)
we'll have it immediately fail with -EINTR.

Comments? AFAICS, the bug is arch-independent; it's not just arm and it's
worse than the example mentioned last year - sigsuspend() is a lot more
common than ppoll() and behaviour clearly goes against what sigsuspend()
exists for.
Oleg Nesterov
2012-04-25 12:37:46 UTC
Post by Al Viro
FWIW, there's an interesting question rmk has brought up. Consider the
sigsuspend(2) sees a signal and returns ERESTARTNOHAND.
do_signal() is called and calls get_signal_to_deliver() and gets 0,
for whatever reason.
We decide to restart, return address adjusted, syscall number
returned to the right register in pt_regs. In the meanwhile, no matter what
state interrupts used to have before, get_signal_to_deliver() has enabled
them when returning
Afaics this doesn't really matter, TIF_SIGPENDING can be set by another CPU
once get_signal_to_deliver() drops ->siglock.
Post by Al Viro
, so we'll need to reload thread flags. And we find that
another signal has arrived in the meanwhile.
OK, do_signal() is called again, and this time we have a handler for
the arrived signal. We form a stack frame and return to userland, into the
beginning of the handler. We don't even look at the restart-related logics
this time around, due to the usual logics protecting us from double restarts.
Handler is executed, up to rt_sigreturn(2).
We decode the sigcontext, restore pt_regs and return to userland.
Right into the beginning of interrupted sigsuspend()
So we have sigsuspend() hit by a signal we have a handler for. Handler is
executed and we are stuck in sigsuspend() again, all because a signal without
a handler has arrived just before that one - close enough for our signal to
come right after get_signal_to_deliver() has returned zero to do_signal().
Yes, this (and the similar races) were already discussed a couple of times.
In short, regs->ax = -ERESTART* and ->ip doesn't survive after do_signal().
In this case the syscall was already restarted after the first do_signal()
even if we do not return to user-mode yet.
Post by Al Viro
AFAICS, that's a clear bug.
I do not know. So far it was decided that we do not really care, but
I won't argue if we decide to change the current behaviour.
Post by Al Viro
Arrival of a signal that has userland handler
and that isn't blocked by the mask given to sigsuspend() should terminate
sigsuspend().
Yes. But note that do_signal() restores the old sigmask. This means that
the signal we get after the first do_signal() was not blocked before
sigsuspend() was called. So, to some extent, we can pretend that the
handler was executed before sigsuspend() and it was never restarted.

IOW, I tend to agree with the comments from Roland, see for example

HR timers prevent an itimer from generating EINTR?
http://marc.info/?t=125210012600005

[RESEND] [RFC][PATCH X86_32 1/2]: Call do_notify_resume() with interrupts enabled
http://marc.info/?t=131955450100004

But let me repeat that I never really understood if this is "by design"
or not.
Post by Al Viro
Solution proposed last summer when that had been noticed by arm folks was
more or less along the lines of
* new thread flag, checked after we'd seen that no SIGPENDING et.al.
is there. If it's set, we clear it, do syscall restart work as we would for
handlerless signal and recheck the flags if we had to do something like
__put_user() in process (arm might have to do that for ERESTART_RESTARTBLOCK)[1]
* do_signal() would set that flag if
+ anti double-restart logics would not have prevented
restarts
+ error value was ERESTART_...
* no restart work on "no signal" path in do_signal()
* if we have a handler and the flag is set, clear it and do what
we normally do for restarts (including the "has ptrace mangled registers
in a way that would prevent restarts in the current code" logics for
architectures that have such logics - arm and sparc, at least).
Hmm. Not sure I understand this in details. But at first glance,
"do_signal() would set that flag" is not enough. We have a similar problem
if we dequeue a SA_RESTART signal first, then another signal without SA_RESTART.
Or I simply misunderstood.

Oleg.
Al Viro
2012-04-25 12:50:42 UTC
Post by Oleg Nesterov
Post by Al Viro
Arrival of a signal that has userland handler
and that isn't blocked by the mask given to sigsuspend() should terminate
sigsuspend().
Yes. But note that do_signal() restores the old sigmask. This means that
the signal we get after the first do_signal() was not blocked before
sigsuspend() was called. So, to some extent, we can pretend that the
handler was executed before sigsuspend() and it was never restarted.
Signal might have already arrived by the time we restore sigmask. So no,
it might have been blocked prior to sigsuspend().

I agree that relying on disabled interrupts doesn't work - objection would
have worked if it was just "what if we get NEED_RESCHED and while we are
scheduled away a signal arrives", but this scenario doesn't depend on
that. We definitely want interrupts enabled before we start playing with
do_notify_resume(), especially if things like deferred fput, etc. end up
there as well.
Oleg Nesterov
2012-04-25 13:03:29 UTC
Post by Al Viro
Post by Oleg Nesterov
Post by Al Viro
Arrival of a signal that has userland handler
and that isn't blocked by the mask given to sigsuspend() should terminate
sigsuspend().
Yes. But note that do_signal() restores the old sigmask. This means that
the signal we get after the first do_signal() was not blocked before
sigsuspend() was called. So, to some extent, we can pretend that the
handler was executed before sigsuspend() and it was never restarted.
Signal might have already arrived by the time we restore sigmask.
Yes, and it sets TIF_SIGPENDING, but unless I missed something this doesn't
matter.
Post by Al Viro
So no,
it might have been blocked prior to sigsuspend().
If it was not blocked, then the next do_signal()->get_signal_to_deliver()
returns 0 and clears TIF_SIGPENDING. After that we finally re-enter
sys_sigsuspend() and (assuming it unblocks this sig) notice this pending
signal again and return -EINTR eventually.

Oleg.
Al Viro
2012-04-25 13:32:38 UTC
Post by Oleg Nesterov
Yes, and it sets TIF_SIGPENDING, but unless I missed something this doesn't
matter.
Post by Al Viro
So no,
it might have been blocked prior to sigsuspend().
If it was not blocked, then the next do_signal()->get_signal_to_deliver()
returns 0 and clears TIF_SIGPENDING. After that we finally re-enter
sys_sigsuspend() and (assuming it unblocks this sig) notice this pending
signal again and return -EINTR eventually.
Point... Still, since we are talking about an arbitrarily wide window (the
damn thing is waiting for signals to arrive, after all) this doesn't
sound good; sure, we might have been hit by SIGSTOP just before entering
that sigsuspend() with SIGCONT arriving only when the signal in question
had arrived, ending up with the handler running before sigsuspend(), but... IMO it's
a QoI problem at the very least.

As for SA_RESTART/!SA_RESTART mixes, if SA_RESTART comes first we should
just take that restart and pretend that the second signal has arrived at
the very beginning of handler, I think. Assuming I understood you correctly,
that is.
Oleg Nesterov
2012-04-25 14:52:39 UTC
Post by Al Viro
Point... Still, since we are talking about an arbitrarily wide window (the
damn thing is waiting for signals to arrive, after all) this doesn't
sound good;
...
IMO it's
a QoI problem at the very least.
and looks confusing, agreed.
Post by Al Viro
As for SA_RESTART/!SA_RESTART mixes, if SA_RESTART comes first we should
just take that restart and pretend that the second signal has arrived at
the very beginning of handler, I think.
Yes. My point was, this confuses the user-space developers too. And this
case is equally unclear to me wrt should we (at least try to) change this
or not.

Oleg.
Oleg Nesterov
2012-04-25 15:46:11 UTC
Post by Oleg Nesterov
Post by Al Viro
Point... Still, since we are talking about an arbitrarily wide window (the
damn thing is waiting for signals to arrive, after all) this doesn't
sound good;
...
IMO it's
a QoI problem at the very least.
and looks confusing, agreed.
OK, I didn't really try to think, and somehow I simply can't wake up today.
But perhaps we can do something like the following? We add a new syscall

sys_eintr(void)
{
return -EINTR;
}

(perhaps not strictly needed, perhaps we can reuse sys_restart_syscall)

Now,

--- x/arch/x86/kernel/signal.c
+++ x/arch/x86/kernel/signal.c
@@ -711,6 +711,13 @@ handle_signal(unsigned long sig, siginfo
regs->ax = regs->orig_ax;
regs->ip -= 2;
break;
+
+ case -EINTR:
+ break;
+
+ default:
+ if (regs->orig_ax == NR_eintr)
+ regs->ax = NR_eintr;
}
}

@@ -791,6 +798,7 @@ static void do_signal(struct pt_regs *re
case -ERESTARTSYS:
case -ERESTARTNOINTR:
regs->ax = regs->orig_ax;
+ regs->orig_ax = NR_eintr;
regs->ip -= 2;
break;


this ignores ERESTART_RESTARTBLOCK for simplicity.

And I am not sure this can't confuse the tools like strace...

Oleg.
Al Viro
2012-04-25 16:10:02 UTC
Post by Oleg Nesterov
OK, I didn't really try to think, and somehow I simply can't wake up today.
But perhaps we can do something like the following? We add a new syscall
sys_eintr(void)
{
return -EINTR;
}
(perhaps not strictly needed, perhaps we can reuse sys_restart_syscall)
Now,
--- x/arch/x86/kernel/signal.c
+++ x/arch/x86/kernel/signal.c
@@ -711,6 +711,13 @@ handle_signal(unsigned long sig, siginfo
regs->ax = regs->orig_ax;
regs->ip -= 2;
break;
+
+ case -EINTR:
+ break;
+
+ default:
+ if (regs->orig_ax == NR_eintr)
+ regs->ax = NR_eintr;
}
}
@@ -791,6 +798,7 @@ static void do_signal(struct pt_regs *re
regs->ax = regs->orig_ax;
+ regs->orig_ax = NR_eintr;
regs->ip -= 2;
break;
this ignores ERESTART_RESTARTBLOCK for simplicity.
Ehh... You do realize that it's that simple only on architectures that
have syscall number in a register? arm/parisc/unicore32 have really
painful code dealing with restart_syscall(2); building trampoline on
stack and all such... Take a look at e.g. insert_restart_trampoline()
in arch/parisc/kernel/signal.c (BTW, it doesn't check for put_user()
failures while building that thing).

FWIW, I wonder if e.g. arm would be better off with the following
trick: new flag added to _TIF_SYSCALL_WORK, with __sys_trace checking
it and if it's set, clearing it and simulating sys_restart_syscall().
ERESTART_RESTARTBLOCK would become practically identical to ERESTARTNOHAND,
except that if we decide to restart it we would set the flag. All the
trampoline crap goes away that way and restart logics gets simpler on
any platform doing that kind of thing.

I'm not sure that SA_RESTART case is actually worth bothering with -
AFAICS, your mechanism won't cover SA_RESTART/SA_RESTART/!SA_RESTART
combinations anyway, and I'm not at all sure that we _want_ to change
user-visible behaviour. Basically, you have
syscall interrupted
SA_RESTART signal arrives, handler1 entered
!SA_RESTART signal arrives, handler2 entered
handler2 returns
handler1 returns
syscall restarts
and to get behaviour independent of relative timing of those two signals
you'd have to replace the last one with "syscall fails with EINTR".
It's a user-visible change and I'm not at all sure that we won't break
userland code with it.
Oleg Nesterov
2012-04-25 17:02:30 UTC
Post by Al Viro
Post by Oleg Nesterov
--- x/arch/x86/kernel/signal.c
+++ x/arch/x86/kernel/signal.c
@@ -711,6 +711,13 @@ handle_signal(unsigned long sig, siginfo
regs->ax = regs->orig_ax;
regs->ip -= 2;
break;
+
+ case -EINTR:
+ break;
+
+ default:
+ if (regs->orig_ax == NR_eintr)
+ regs->ax = NR_eintr;
}
}
@@ -791,6 +798,7 @@ static void do_signal(struct pt_regs *re
regs->ax = regs->orig_ax;
+ regs->orig_ax = NR_eintr;
regs->ip -= 2;
break;
this ignores ERESTART_RESTARTBLOCK for simplicity.
Ehh... You do realize that it's that simple only on architectures that
have syscall number in a register?
No, I do not ;) I simply know nothing about other architectures.
Post by Al Viro
Take a look at e.g. insert_restart_trampoline()
in arch/parisc/kernel/signal.c
Will do. Not sure I will be able to understand it though. And btw I still
didn't look at the signal patches in your tree, I've just returned from
vacation today.
Post by Al Viro
I'm not sure that SA_RESTART case is actually worth bothering with -
Neither me. But,
Post by Al Viro
AFAICS, your mechanism won't cover SA_RESTART/SA_RESTART/!SA_RESTART
combinations anyway,
Confused. The hack above doesn't even try to cover this case (and it is
obviously wrong in ERESTARTSYS/NOINTR case), I only tried to illustrate
the idea.
Post by Al Viro
syscall interrupted
SA_RESTART signal arrives, handler1 entered
!SA_RESTART signal arrives, handler2 entered
handler2 returns
handler1 returns
syscall restarts
And there is another case:

syscall interrupted, it returns ERESTARTSYS.

do_signal() sees no signal, restarts the syscall

!SA_RESTART comes before we return to user-mode

This doesn't really differ from sigsuspend/ERESTARTNOHAND we discussed
before.

Oleg.
Al Viro
2012-04-25 17:51:13 UTC
Post by Oleg Nesterov
syscall interrupted, it returns ERESTARTSYS.
do_signal() sees no signal, restarts the syscall
!SA_RESTART comes before we return to user-mode
This doesn't really differ from sigsuspend/ERESTARTNOHAND we discussed
before.
... and dealt with the same way:
NEED_RESTART flag is set the first time around
no handler for the first signal, so we do nothing else
handler is found for the second signal
NEED_RESTART flag is present, so
we clear NEED_RESTART
we look at errno and see ERESTARTSYS
we check sa_flags and see !SA_RESTART
we set return value to -EINTR and proceed to set sigframe up
on the exit to userland we see NEED_RESTART not set, so off we go
I think I hadn't been clear enough; here's pseudocode for what I meant

do_signal()
{
if (we have any business doing restarts)
// note: we won't get here on subsequent calls of do_signal()
// due to the checks above; same logics that currently prevents
// double restarts
set NEED_RESTART flag
sig = get_signal_to_deliver(...)
if (sig) {
if (NEED_RESTART set) {
clear NEED_RESTART
same thing we do at that spot now - restart or EINTR
handle_signal(...)
...
return;
}
}
/* no handler */
if (test_and_clear_...(RESTORE_SIGMASK))
set_current_blocked(&current->saved_sigmask);
}
and in asm glue, *after* checking for SIGPENDING/NOTIFY_RESUME, check
NEED_RESTART and if it's set do what we currently do for restarts on
handlerless signal.

BTW, if we combine that with ERESTART_RESTARTBLOCK trick on arm/parisc/unicore,
there won't be much work to do for restarts in that case; in x86ese it
would be simply
if (regs->ax == -ERESTART_RESTARTBLOCK)
set_thread_flag(TIF_EMULATE_RESTART_SYSCALL)
regs->ax = regs->orig_ax;
regs->ip -= 2;
all of that conditional on TS_NEED_RESTART...
Martin Schwidefsky
2012-04-26 07:15:19 UTC
On Wed, 25 Apr 2012 18:51:13 +0100
Post by Al Viro
do_signal()
{
if (we have any business doing restarts)
// note: we won't get here on subsequent calls of do_signal()
// due to the checks above; same logics that currently prevents
// double restarts
set NEED_RESTART flag
sig = get_signal_to_deliver(...)
if (sig) {
if (NEED_RESTART set) {
clear NEED_RESTART
same thing we do at that spot now - restart or EINTR
handle_signal(...)
...
return;
}
}
/* no handler */
if (test_and_clear_...(RESTORE_SIGMASK))
set_current_blocked(&current->saved_sigmask);
}
and in asm glue, *after* checking for SIGPENDING/NOTIFY_RESUME, check
NEED_RESTART and if it's set do what we currently do for restarts on
handlerless signal.
You need to be careful with inferior calls there. gdb likes to play games
with the registers inside the get_signal_to_deliver call, it wants to be
able to jump out of an interrupted system call, do its inferior call in
the debuggee and then return to the interrupted system call.
You would have to read, modify & restore the NEED_RESTART flag in gdb
over an inferior call.
--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
David Miller
2012-04-26 07:25:28 UTC
From: Martin Schwidefsky <***@de.ibm.com>
Date: Thu, 26 Apr 2012 09:15:19 +0200
Post by Martin Schwidefsky
You need to be careful with inferior calls there. gdb likes to play games
with the registers inside the get_signal_to_deliver call, it wants to be
able to jump out of an interrupted system call, do its inferior call in
the debuggee and then return to the interrupted system call.
You would have to read, modify & restore the NEED_RESTART flag in gdb
over an inferior call.
Right and this is what orig_ax and friends are used for on x86.
Oleg Nesterov
2012-04-26 13:52:55 UTC
Post by Martin Schwidefsky
On Wed, 25 Apr 2012 18:51:13 +0100
Post by Al Viro
do_signal()
{
if (we have any business doing restarts)
// note: we won't get here on subsequent calls of do_signal()
// due to the checks above; same logics that currently prevents
// double restarts
set NEED_RESTART flag
sig = get_signal_to_deliver(...)
if (sig) {
if (NEED_RESTART set) {
clear NEED_RESTART
same thing we do at that spot now - restart or EINTR
handle_signal(...)
...
return;
}
}
/* no handler */
if (test_and_clear_...(RESTORE_SIGMASK))
set_current_blocked(&current->saved_sigmask);
}
and in asm glue, *after* checking for SIGPENDING/NOTIFY_RESUME, check
NEED_RESTART and if it's set do what we currently do for restarts on
handlerless signal.
You need to be careful with inferior calls there. gdb likes to play games
with the registers inside the get_signal_to_deliver call, it wants to be
able to jump out of an interrupted system call, do its inferior call in
the debuggee and then return to the interrupted system call.
Ah.
Post by Martin Schwidefsky
You would have to read, modify & restore the NEED_RESTART flag in gdb
over an inferior call.
I am not sure, but perhaps this is not really needed...

But at least this means that "if (we have any business doing restarts)"
above is meaningless before get_signal_to_deliver().


And I am confused, off-topic question... How is it possible to
"then return to the interrupted system call" if that system call
returned -ERESTART_RESTARTBLOCK but the inferior call in turn
does the system call which changes restart_block->fn/etc ?

Oleg.
Martin Schwidefsky
2012-04-26 14:31:19 UTC
On Thu, 26 Apr 2012 15:52:55 +0200
Post by Oleg Nesterov
Post by Martin Schwidefsky
You need to be careful with inferior calls there. gdb likes to play games
with the registers inside the get_signal_to_deliver call, it wants to be
able to jump out of an interrupted system call, do its inferior call in
the debuggee and then return to the interrupted system call.
Ah.
Post by Martin Schwidefsky
You would have to to read, modify & restore the NEED_RESTART flag in gdb
over an inferior call.
I am not sure, but perhaps this is not really needed...
But at least this means that "if (we have any business doing restarts)"
above is meaningless before get_signal_to_deliver().
And I am confused, an off-topic question... How is it possible to
"then return to the interrupted system call" if that system call
returned -ERESTART_RESTARTBLOCK but the inferior call in turn
makes a system call which changes restart_block->fn/etc.?
Returning from an inferior gdb call to a system call that needs to be
restarted with -ERESTART_RESTARTBLOCK is broken on all architectures
as far as I can tell. And it would be hard to fix it, basically you
have to save and restore the restart block.
--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
Oleg Nesterov
2012-04-26 13:22:03 UTC
Permalink
Post by Al Viro
do_signal()
{
if (we have any business doing restarts)
// note: we won't get here on subsequent calls of do_signal()
// due to the checks above; same logic that currently prevents
// double restarts
set NEED_RESTART flag
sig = get_signal_to_deliver(...)
if (sig) {
if (NEED_RESTART set) {
clear NEED_RESTART
same thing we do at that spot now - restart or EINTR
handle_signal(...)
...
return;
}
}
/* no handler */
if (test_and_clear_...(RESTORE_SIGMASK))
set_current_blocked(&current->saved_sigmask);
}
OK,
Post by Al Viro
and in asm glue,
Ah. So we are going to change the ret_from_sys_call-like code.
Post by Al Viro
check
NEED_RESTART and if it's set do what we currently do for restarts on
handlerless signal.
and probably we should clear NEED_RESTART?

OK, thanks, I am starting to understand.

However, perhaps I missed something, but this doesn't look 100% correct.
Although even _if_ I am right, I guess this is a purely theoretical problem.

Suppose that we restart the syscall which returned ERESTARTNOHAND. I mean,
the task has already returned to the user-mode and CPU is going to execute
the syscall insn.

What if, say, reschedule interrupt comes before the task enters the kernel
again? The new signal can come before this task returns to the user-mode,
ret_from_intr: will notice it and it is still possible to run a signal
handler before re-entering the syscall.

No?

Oleg.
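The NEED_RESTART flow being debated above can be mocked up in plain userspace C to check the control flow; the flag helpers, `do_signal_sketch()`, and its arguments are stand-ins invented for this illustration, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the thread-info flag machinery. */
enum { TIF_NEED_RESTART = 1 };
static unsigned long thread_flags;

static void set_flag(unsigned long f) { thread_flags |= f; }

static bool test_and_clear_flag(unsigned long f)
{
    bool was = (thread_flags & f) != 0;
    thread_flags &= ~f;
    return was;
}

#define ERESTARTNOHAND 514 /* kernel-internal restart code */

/* Returns true when a handler was invoked; false means "handlerless",
 * and the asm glue must then consume TIF_NEED_RESTART itself. */
static bool do_signal_sketch(long syscall_retval, int pending_sig)
{
    if (syscall_retval == -ERESTARTNOHAND) /* "any business doing restarts" */
        set_flag(TIF_NEED_RESTART);

    if (pending_sig) {
        if (test_and_clear_flag(TIF_NEED_RESTART)) {
            /* same thing we do at that spot now - restart or EINTR */
        }
        /* handle_signal(...) would run here */
        return true;
    }
    /* no handler: the flag stays set for the asm glue */
    return false;
}
```

With a handler, the flag is consumed inside do_signal(); without one, it survives for the return path, which is exactly the case the asm-glue check is meant to cover.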
Oleg Nesterov
2012-04-25 13:32:09 UTC
Permalink
Post by Oleg Nesterov
Post by Al Viro
Post by Oleg Nesterov
Post by Al Viro
Arrival of a signal that has userland handler
and that isn't blocked by the mask given to sigsuspend() should terminate
sigsuspend().
Yes. But note that do_signal() restores the old sigmask. This means that
the signal we get after the first do_signal() was not blocked before
sigsuspend() was called. So, to some extent, we can pretend that the
handler was executed before sigsuspend() and it was never restarted.
Signal might have already arrived by the time we restore sigmask.
Yes, and it sets TIF_SIGPENDING, but unless I missed something this doesn't
matter.
Post by Al Viro
So no,
it might have been blocked prior to sigsuspend().
If it was not blocked,
^^^^^^^
I meant, If it WAS blocked.
Post by Oleg Nesterov
then the next do_signal()->get_signal_to_deliver()
returns 0 and clears TIF_SIGPENDING. After that we finally re-enter
sys_sigsuspend() and (assuming it unblocks this sig) notice this pending
signal again and return -EINTR eventually.
Oleg.
Oleg Nesterov
2012-04-26 18:37:42 UTC
Permalink
Post by Al Viro
Untested variants pushed into signal.git#master; will test tomorrow. In
the meanwhile, any code review (and testing of the entire thing on as many
targets as possible) would be very welcome.
I started to read these patches today, will continue tomorrow. Somehow
I got stuck at f1fcb14721b4f1e65387d4563311f15f0bd33684, please see the
question below. And a couple of minor nits.




b4b620b87fd2f388cf4c13fea21f31bed7c9a1b0 new helper: sigsuspend()

Looks obviously correct but I do not understand this chunk in kernel.c,

+ #ifndef __ARCH_HAS_SYS_RT_SIGSUSPEND
+ /**
+ * sys_rt_sigsuspend - replace the signal mask for a value with the
+
#ifdef __ARCH_WANT_SYS_RT_SIGSUSPEND

So this checks the (never used/defined?) __ARCH_HAS_SYS_RT_SIGSUSPEND
but comments out __ARCH_WANT_SYS_RT_SIGSUSPEND. Looks like a typo.





6b78370886e4f61187404b7737a831281bde35e8 xtensa: switch to generic rt_sigsuspend(2)
and
d978bf9dd41728dd60fe2269493fe8f21d28eef3 h8300: switch to saved_sigmask-based sigsuspend/rt_sigsuspend

(off-topic, but do_signal()->try_to_freeze() looks unneeded and wrong)

+ /* If there's no signal to deliver, we just restore the saved mask. */
+ if (test_thread_flag(TIF_RESTORE_SIGMASK)) {
+ clear_thread_flag(TIF_RESTORE_SIGMASK);
+ sigprocmask(SIG_SETMASK, &current->saved_sigmask, NULL);
^^^^^^^^^^^

set_current_blocked(&current->saved_sigmask) looks better.






f1fcb14721b4f1e65387d4563311f15f0bd33684 alpha: tidy signal delivery up

Everything looks fine, but I have the off-topic question. The changelog
says:

* checking for TIF_SIGPENDING is enough; set_restart_sigmask() sets this
one as well.

Agreed, but why does set_restore_sigmask() set TIF_SIGPENDING? It should
never be used without signal_pending() == T.

IOW, do you know a reason why this patch

--- x/arch/x86/include/asm/thread_info.h
+++ x/arch/x86/include/asm/thread_info.h
@@ -264,7 +264,7 @@ static inline void set_restore_sigmask(v
{
struct thread_info *ti = current_thread_info();
ti->status |= TS_RESTORE_SIGMASK;
- set_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags);
+ WARN_ON(!test_bit(TIF_SIGPENDING, (unsigned long *)&ti->flags));
}

static inline bool is_ia32_task(void)

is not correct?

OK, say, sys_sigsuspend() does

current->state = TASK_INTERRUPTIBLE;
schedule();
set_restore_sigmask();
return -ERESTARTNOHAND;

so set_bit(TIF_SIGPENDING) saves us from the "spurious wakeup". But is
it really possible?

We had bugs in ptrace some time ago (and iirc this is why sys_pause
checks signal_pending), but is there any reason today why the
TASK_INTERRUPTIBLE task can return from schedule() without SIGPENDING?
(of course, ignoring the case when this task was added to some
wait_queue_head_t).


I am just curious. Perhaps set_restore_sigmask() sets SIGPENDING just
to be safer, but otoh this can hide the problem.

Oleg.
Al Viro
2012-04-26 23:19:42 UTC
Permalink
Post by Oleg Nesterov
b4b620b87fd2f388cf4c13fea21f31bed7c9a1b0 new helper: sigsuspend()
Looks obviously correct but I do not understand this chunk in kernel.c,
+ #ifndef __ARCH_HAS_SYS_RT_SIGSUSPEND
+ /**
+ * sys_rt_sigsuspend - replace the signal mask for a value with the
+
#ifdef __ARCH_WANT_SYS_RT_SIGSUSPEND
So this checks the (never used/defined?) __ARCH_HAS_SYS_RT_SIGSUSPEND
but comments out __ARCH_WANT_SYS_RT_SIGSUSPEND. Looks like a typo.
Buggered manual cherry-pick - earlier there was a patch inverting
__ARCH_WANT_SYS_RT_SIGSUSPEND (only mips did _not_ have it after the
whole series, and only because it has sys_rt_sigsuspend() of its own
with unusual prototype). I ended up dropping it and mishandled conflict
resolution. Fixed.
Post by Oleg Nesterov
6b78370886e4f61187404b7737a831281bde35e8 xtensa: switch to generic rt_sigsuspend(2)
and
d978bf9dd41728dd60fe2269493fe8f21d28eef3 h8300: switch to saved_sigmask-based sigsuspend/rt_sigsuspend
(off-topic, but do_signal()->try_to_freeze() looks unneeded and wrong)
Yes, get_signal_to_deliver() will do it. I'd killed some instances in the
last round of signal fixes (2010), but never got around to doing that for
all architectures.
Post by Oleg Nesterov
+ /* If there's no signal to deliver, we just restore the saved mask. */
+ if (test_thread_flag(TIF_RESTORE_SIGMASK)) {
+ clear_thread_flag(TIF_RESTORE_SIGMASK);
+ sigprocmask(SIG_SETMASK, &current->saved_sigmask, NULL);
^^^^^^^^^^^
set_current_blocked(&current->saved_sigmask) looks better.
In principle, yes. FWIW, I think that the entire thing should be a helper
to go along with set_restore_sigmask(). With
if (test_and_clear_thread_flag(TIF_RESTORE_SIGMASK))
set_current_blocked(&current->saved_sigmask);
as default implementation. The only question is where should it go -
asm/thread_info.h is not a good place due to header dependencies.
Kinda-sorta solution - in thread_info.h
{set,clear,test,test_and_clear}_restore_sigmask()
and in linux/signal.h
#ifdef HAVE_SET_RESTORE_SIGMASK
static inline void restore_saved_sigmask(void)
{
if (test_and_clear_restore_sigmask())
set_current_blocked(&current->saved_sigmask);
}
static inline sigset_t *sigmask_to_save(void)
{
sigset_t *res = &current->blocked;
if (unlikely(test_restore_sigmask()))
res = &current->saved_sigmask;
return res;
}
#endif

Speaking of other helpers, pulling ppc restore_sigmask() into signal.h
(as static inline) might be a good idea. Every sigreturn instance is
open-coding it... We need saner names, though; this set is too easy to
confuse with each other.
Post by Oleg Nesterov
f1fcb14721b4f1e65387d4563311f15f0bd33684 alpha: tidy signal delivery up
Everything looks fine, but I have the off-topic question. The changelog
* checking for TIF_SIGPENDING is enough; set_restart_sigmask() sets this
one as well.
Agreed, but why does set_restore_sigmask() set TIF_SIGPENDING? It should
never be used without signal_pending() == T.
Umm... Probably, and as far as I can see all callers are only reached if
we have SIGPENDING, but that requires at least documenting what's going on.
Oleg Nesterov
2012-04-27 17:24:44 UTC
Permalink
Post by Al Viro
Post by Oleg Nesterov
+ /* If there's no signal to deliver, we just restore the saved mask. */
+ if (test_thread_flag(TIF_RESTORE_SIGMASK)) {
+ clear_thread_flag(TIF_RESTORE_SIGMASK);
+ sigprocmask(SIG_SETMASK, &current->saved_sigmask, NULL);
^^^^^^^^^^^
set_current_blocked(&current->saved_sigmask) looks better.
In principle, yes. FWIW, I think that the entire thing should be a helper
to go along with set_restore_sigmask(). With
if (test_and_clear_thread_flag(TIF_RESTORE_SIGMASK))
set_current_blocked(&current->saved_sigmask);
as default implementation. The only question is where should it go -
asm/thread_info.h is not a good place due to header dependencies.
Kinda-sorta solution - in thread_info.h
{set,clear,test,test_and_clear}_restore_sigmask()
and in linux/signal.h
#ifdef HAVE_SET_RESTORE_SIGMASK
static inline void restore_saved_sigmask(void)
{
if (test_and_clear_restore_sigmask())
set_current_blocked(&current->saved_sigmask);
}
static inline sigset_t *sigmask_to_save(void)
{
sigset_t *res = &current->blocked;
if (unlikely(test_restore_sigmask()))
res = &current->saved_sigmask;
return res;
}
Perhaps... but test_*_restore_sigmask() depends on TIF_ or TS_
Post by Al Viro
Speaking of other helpers, pulling ppc restore_sigmask() into signal.h
(as static inline) might be a good idea.
Agreed.
Post by Al Viro
Every sigreturn instance is
open-coding it...
Not only sigreturn. Just look at sigprocmask(SIG_SETMASK) callers, almost
all should be converted to use set_current_blocked(). For example, pselect.
Post by Al Viro
We need saner names, though; this set is too easy to
confuse with each other.
Say, set_current_blocked_careful().
Post by Al Viro
Post by Oleg Nesterov
Agreed, but why does set_restore_sigmask() set TIF_SIGPENDING? It should
never be used without signal_pending() == T.
Umm... Probably, and as far as I can see all callers are only reached if
we have SIGPENDING, but that requires at least documenting what's going on.
WARN_ON(!test_bit(TIF_SIGPENDING)) looks like the perfect documentation ;)




OK, I have read the whole series. You do understand that I can't comment
on the changes in asm, but I managed to convince myself that I can more or
less understand them, and I see nothing wrong.

The only comment I have,
38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
forgets to unexport do_signal().



The last thing. Matt, could you please look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
you already sent some of these changes (use set_current_blocked/block_sigmask).
Perhaps they are already in -mm or linux-next?

Oleg.
Oleg Nesterov
2012-04-27 17:54:34 UTC
Permalink
Post by Oleg Nesterov
Not only sigreturn. Just look at sigprocmask(SIG_SETMASK) callers, almost
all should be converted to use set_current_blocked(). For example, pselect.
which btw doesn't need sigsaved.
Post by Oleg Nesterov
The last thing. Matt, could you please look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
you already sent some of these changes (use set_current_blocked/block_sigmask).
Perhaps they are already in -mm or linux-next?
Forgot to add Matt, sorry for noise...

Oleg.
Matt Fleming
2012-05-02 10:37:50 UTC
Permalink
Post by Oleg Nesterov
Post by Oleg Nesterov
Not only sigreturn. Just look at sigprocmask(SIG_SETMASK) callers, almost
all should be converted to use set_current_blocked(). For example, pselect.
which btw doesn't need sigsaved.
Post by Oleg Nesterov
The last thing. Matt, could you please look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
you already sent some of these changes (use set_current_blocked/block_sigmask).
Perhaps they are already in -mm or linux-next?
Forgot to add Matt, sorry for noise...
Any of the changes that aren't in Linus' tree will be in -mm and
linux-next. The ones currently in linux-next are,

07d969a parisc: use set_current_blocked() and block_sigmask()
5becb45 frv: use set_current_blocked() and block_sigmask()
3608417 blackfin: use set_current_blocked() and block_sigmask()
cd21f1a unicore32: use block_sigmask()
2bb36fd h8300: use set_current_blocked() and block_sigmask()
a624f4f score: use set_current_blocked() and block_sigmask()
2326e2e score: don't mask signals if we fail to setup signal stack
b2f6181 microblaze: use set_current_blocked() and block_sigmask()
f557a56 microblaze: fix signal masking
9ff7b4a microblaze: no need to reset handler if SA_ONESHOT
6cb49f5 microblaze: don't reimplement force_sigsegv()
bd9767f ia64: use set_current_blocked() and block_sigmask()
cdbc96c cris: use set_current_blocked() and block_sigmask()
295f127 mn10300: use set_current_blocked() and block_sigmask()
524987e m68k: use set_current_blocked() and block_sigmask()
f0c96a3 m32r: use set_current_blocked() and block_sigmask()
f0f0acd avr32: use block_sigmask()
ef64059 avr32: don't mask signals in the error path

... and this one from you, Oleg, that was sent as part of my series,

3e6c120 avr32: use set_current_blocked() in handle_signal/sys_rt_sigreturn
--
Matt Fleming, Intel Open Source Technology Center
Al Viro
2012-05-02 14:14:01 UTC
Permalink
Post by Matt Fleming
Post by Oleg Nesterov
Post by Oleg Nesterov
The last thing. Matt, could you please look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
you already sent some of these changes (use set_current_blocked/block_sigmask).
Perhaps they are already in -mm or linux-next?
Forgot to add Matt, sorry for noise...
Any of the changes that aren't in Linus' tree will be in -mm and
linux-next. The ones currently in linux-next are,
07d969a parisc: use set_current_blocked() and block_sigmask()
5becb45 frv: use set_current_blocked() and block_sigmask()
3608417 blackfin: use set_current_blocked() and block_sigmask()
cd21f1a unicore32: use block_sigmask()
2bb36fd h8300: use set_current_blocked() and block_sigmask()
a624f4f score: use set_current_blocked() and block_sigmask()
2326e2e score: don't mask signals if we fail to setup signal stack
b2f6181 microblaze: use set_current_blocked() and block_sigmask()
f557a56 microblaze: fix signal masking
9ff7b4a microblaze: no need to reset handler if SA_ONESHOT
6cb49f5 microblaze: don't reimplement force_sigsegv()
bd9767f ia64: use set_current_blocked() and block_sigmask()
cdbc96c cris: use set_current_blocked() and block_sigmask()
295f127 mn10300: use set_current_blocked() and block_sigmask()
524987e m68k: use set_current_blocked() and block_sigmask()
f0c96a3 m32r: use set_current_blocked() and block_sigmask()
f0f0acd avr32: use block_sigmask()
ef64059 avr32: don't mask signals in the error path
... and this one from you, Oleg, that was sent as part of my series,
3e6c120 avr32: use set_current_blocked() in handle_signal/sys_rt_sigreturn
OK, those commits cherry-picked into the beginning of the queue, the
rest rebased on top of them, dropping my duplicates.
Al Viro
2012-04-27 18:45:29 UTC
Permalink
Post by Oleg Nesterov
Post by Al Viro
static inline sigset_t *sigmask_to_save(void)
{
sigset_t *res = &current->blocked;
if (unlikely(test_restore_sigmask()))
res = &current->saved_sigmask;
return res;
}
Perhaps... but test_*_restore_sigmask() depends on TIF_ or TS_
... and is in thread_info.h; you might want to pull it again ;-)
Right now the signal.git#master is at 349b4565ad9e9b891245590319567c0a042046d9
Post by Oleg Nesterov
Post by Al Viro
Umm... Probably, and as far as I can see all callers are only reached if
we have SIGPENDING, but that requires at least documenting what's going on.
WARN_ON(!test_bit(TIF_SIGPENDING)) looks like the perfect documentation ;)
OK, I'm convinced. Done.

BTW, I've a better name than set_current_blocked_carefully(); the thing
is, only 3 callers out of 63 are _not_ immediately preceded by excluding
SIGKILL/SIGSTOP out of the set. So I've copied the existing variant
to __set_current_blocked() and folded that sigdelsetmask(...) into the
set_current_blocked().
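That split can be sketched in userspace C (types and `ksigmask` are stubs renamed here to avoid clashing with libc; the real `__set_current_blocked()` also takes the siglock and retargets shared pending signals):

```c
#include <assert.h>

/* Stub for the kernel's sigset_t; one word is enough for a sketch. */
typedef struct { unsigned long sig; } ksigset_t;

#ifndef SIGKILL
#define SIGKILL 9
#endif
#ifndef SIGSTOP
#define SIGSTOP 19
#endif
#define ksigmask(n) (1UL << ((n) - 1))

static ksigset_t current_blocked;

static void sigdelsetmask(ksigset_t *set, unsigned long mask)
{
    set->sig &= ~mask;
}

/* The raw setter: trusts its argument, like the pre-split helper. */
static void __set_current_blocked(const ksigset_t *newset)
{
    current_blocked = *newset;
}

/* The common-case wrapper: folds in the SIGKILL/SIGSTOP exclusion
 * that 60 of the 63 callers were doing by hand. */
static void set_current_blocked(ksigset_t *newset)
{
    sigdelsetmask(newset, ksigmask(SIGKILL) | ksigmask(SIGSTOP));
    __set_current_blocked(newset);
}
```

The three callers that really do need SIGKILL/SIGSTOP in the set (sigreturn paths restoring a saved mask verbatim, for instance) call `__set_current_blocked()` directly.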
Post by Oleg Nesterov
The only comment I have,
38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
forgets to unexport do_signal().
Meh... The thing is, there are _two_ of them. signal_mm.c and signal_no.c
badly need merging, with common stuff moved to signal.c. I really hate
to see a static function that isn't called anywhere in the file it's defined
in, leaving one to trace which files include that file. Running into that
in USB code is bloody annoying and I'd rather not add to that pile. It's
certainly valid C, but it's hell on the casual reader...

BTW, I'm somewhat tempted to do the following: *ALL* calls of
tracehook_signal_handler() are now immediately preceded by block_signals().
Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
it could be painlessly added after the remaining block_signals() instances.
How about *folding* block_signals() (along with clear_restore_sigmask())
into tracehook_signal_handler()? I don't know if anyone has conflicting
plans for that sucker; Roland?

Even more ambitious variant: take the "setting sigframe up has
failed, send ourselves SIGSEGV" in there as well (adding an extra argument
to tell which one it is). Hell knows; I know at least one place where
such failure does _not_ lead to SIGSEGV, but there we do do_exit(SIGILL)
right in setup_..._frame() and do_exit() never returns. There might be
other such suckers...
Post by Oleg Nesterov
The last thing. Matt, could you please look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
you already sent some of these changes (use set_current_blocked/block_sigmask).
Perhaps they are already in -mm or linux-next?
Geert Uytterhoeven
2012-04-27 19:14:53 UTC
Permalink
Post by Oleg Nesterov
The only comment I have,
38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
Post by Oleg Nesterov
forgets to unexport do_signal().
Meh... The thing is, there are _two_ of them. signal_mm.c and signal_no.c
badly need merging, with common stuff moved to signal.c. I really hate

Greg recently posted a patch to merge them:
http://www.spinics.net/lists/linux-m68k/msg04995.html

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ***@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
						-- Linus Torvalds
Al Viro
2012-04-27 19:34:13 UTC
Permalink
Post by Geert Uytterhoeven
Post by Oleg Nesterov
The only comment I have,
38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
forgets to unexport do_signal().
Meh... The thing is, there are _two_ of them. signal_mm.c and signal_no.c
badly need merging, with common stuff moved to signal.c. I really hate
http://www.spinics.net/lists/linux-m68k/msg04995.html
Nice... BTW, could you comment on
m68k: don't open-code block_sigmask()
m68k: use set_current_blocked in sigreturn/rt_sigreturn
m68k-nommu: do_signal() is only called when returning to user mode
m68k: add TIF_NOTIFY_RESUME and handle it.
in signal.git tree? The last one is really interesting...

Just pick the current tree - one that was there yesterday had a dumb typo
in thread_info.h part, so it wouldn't even compile (TIF_NOTIFE_RESUME
defined instead of TIF_NOTIFY_RESUME ;-/)
Al Viro
2012-04-29 22:51:00 UTC
Permalink
Post by Al Viro
Post by Geert Uytterhoeven
Post by Oleg Nesterov
The only comment I have,
38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
forgets to unexport do_signal().
Meh... The thing is, there are _two_ of them. signal_mm.c and signal_no.c
badly need merging, with common stuff moved to signal.c. I really hate
http://www.spinics.net/lists/linux-m68k/msg04995.html
Nice... BTW, could you comment on
m68k: don't open-code block_sigmask()
m68k: use set_current_blocked in sigreturn/rt_sigreturn
m68k-nommu: do_signal() is only called when returning to user mode
m68k: add TIF_NOTIFY_RESUME and handle it.
in signal.git tree? The last one is really interesting...
Just pick the current tree - one that was there yesterday had a dumb typo
in thread_info.h part, so it wouldn't even compile (TIF_NOTIFE_RESUME
defined instead of TIF_NOTIFY_RESUME ;-/)
BTW, Greg's patch put into the queue in the very beginning, everything rebased
on top of it.
Greg Ungerer
2012-04-30 06:39:15 UTC
Permalink
Post by Al Viro
Post by Al Viro
Post by Geert Uytterhoeven
Post by Oleg Nesterov
The only comment I have,
38671a3e831ed7327affb24d715f98bb99c80e56 m68k: add TIF_NOTIFY_RESUME and handle it
forgets to unexport do_signal().
Meh... The thing is, there are _two_ of them. signal_mm.c and signal_no.c
badly need merging, with common stuff moved to signal.c. I really hate
http://www.spinics.net/lists/linux-m68k/msg04995.html
Nice... BTW, could you comment on
m68k: don't open-code block_sigmask()
m68k: use set_current_blocked in sigreturn/rt_sigreturn
m68k-nommu: do_signal() is only called when returning to user mode
m68k: add TIF_NOTIFY_RESUME and handle it.
in signal.git tree? The last one is really interesting...
Just pick the current tree - one that was there yesterday had a dumb typo
in thread_info.h part, so it wouldn't even compile (TIF_NOTIFE_RESUME
defined instead of TIF_NOTIFY_RESUME ;-/)
BTW, Greg's patch put into the queue in the very beginning, everything rebased
on top of it.
Those changes currently sitting in your signal tree look ok to me.

Acked-by: Greg Ungerer <***@uclinux.org>

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer -- Principal Engineer EMAIL: ***@snapgear.com
SnapGear Group, McAfee PHONE: +61 7 3435 2888
8 Gardner Close FAX: +61 7 3217 5323
Milton, QLD, 4064, Australia WEB: http://www.SnapGear.com
Al Viro
2012-04-27 19:42:58 UTC
Permalink
Post by Oleg Nesterov
The last thing. Matt, could you please look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal.git ? It seems to me
you already sent some of these changes (use set_current_blocked/block_sigmask).
Perhaps they are already in -mm or linux-next?
BTW, I'll gladly pick Matt's patches in front of that queue (and drop my
duplicates), but I doubt that signal.git is suitable for -next right now.
After some review from linux-arch folks - maybe, but right now it's too
dangerous and untested.
Roland McGrath
2012-04-27 20:20:02 UTC
Permalink
Post by Al Viro
BTW, I'm somewhat tempted to do the following: *ALL* calls of
tracehook_signal_handler() are now immediately preceded by block_signals().
Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
it could be painlessly added after the remaining block_signals() instances.
How about *folding* block_signals() (along with clear_restore_sigmask())
into tracehook_signal_handler()? I don't know if anyone has conflicting
plans for that sucker; Roland?
I'm not actually doing much in the way of kernel work these days so I
certainly don't have any actual plans and it's pretty much all up to Oleg.
I'm also not really following this thread in detail, as I've dropped too
much context in the last year or two to be of a whole lot of help without
spending a lot of time recovering knowledge.

But I will say that the intent of tracehook_signal_handler has always
been what its kerneldoc says: Call it once when handler setup is
complete (exactly once per signal delivery, so potentially multiple
times before actually returning to user mode). Though it is indeed a
no-op today when stepping==0, we want its use to continue to conform
to that exact definition so that one day we could add e.g. a
PTRACE_EVENT_SIGNAL_HANDLED feature just by hacking tracehook.h and
not have to go back into every arch's signal code and recover
understanding of how the call is being used. (It was more than enough
work to do that once when I broke out and documented the tracehook.h
interfaces the first time.) You know, as if we thought modularity
were a useful notion.


Thanks,
Roland
Al Viro
2012-04-27 21:12:44 UTC
Permalink
Post by Roland McGrath
But I will say that the intent of tracehook_signal_handler has always
been what its kerneldoc says: Call it once when handler setup is
complete (exactly once per signal delivery, so potentially multiple
times before actually returning to user mode). Though it is indeed a
no-op today when stepping==0, we want its use to continue to conform
to that exact definition so that one day we could add e.g. a
PTRACE_EVENT_SIGNAL_HANDLED feature just by hacking tracehook.h and
not have to go back into every arch's signal code and recover
understanding of how the call is being used. (It was more than enough
work to do that once when I broke out and documented the tracehook.h
interfaces the first time.) You know, as if we thought modularity
were a useful notion.
OK... FWIW, it sounds like an argument for using it (or a function in
kernel/signal.c that would call it) on all architectures.

Note that there's an extra complication on alpha and itanic - we have more
than just struct pt_regs * there. If we care about the rest of registers
(struct switch_stack on alpha), we probably need to do something about that.
Hell knows... On alpha, in particular, we always get switch_stack argument
of do_notify_resume() et.al. at the constant offset below pt_regs one -
mov $sp, $16
bsr $1, do_switch_stack
mov $sp, $17
is how it's done. So there we could switch to use of container_of(). On
ia64 we have struct sigscratch with pt_regs embedded into it, so there
container_of() is also reasonable. I wonder how alpha and itanic deal
with do_coredump(), BTW - it gets pt_regs, but not the rest...

Another fun story: x86 and mips have do_notify_resume() with void *unused as
its second argument. It used to be sigset_t *oldset and it remains unused
since 2006 or so. Three years later arch/score went in - with the same
void *unused as the second argument of do_notify_resume(). Gotta love the
cargo-cult programming...

BTW, what's the story with 'cookie' argument of get_signal_to_deliver()?
Everything passes NULL in it and nothing actually looks at the value
passed (it's passed further without being looked at, until it reaches
ptrace_signal_deliver(), which ignores it completely on all architectures).
AFAICS, it's a rudiment of something that was used only on sparc and got
discarded as hopeless in commit 28e6103665301ce60634e8a77f0b657c6cc099de
Author: David S. Miller <***@davemloft.net>
Date: Sun May 11 02:07:19 2008 -0700
sparc: Fix debugger syscall restart interactions.
Is there anything else to it? If not, I'd rather get rid of that thing...
Frankly, I'm somewhat tempted to add something like
struct k_sigcontext {
int sig;
siginfo_t info;
struct k_sigaction ka;
}; and turn get_signal_to_deliver into
bool get_signal_to_deliver(struct k_sigcontext *ctx, struct pt_regs *regs),
with ctx passed through handle_signal/setup_frame/tracehook_signal_deliver
instead of the same triple shoved into three arguments, in more or less
random order as it's done now. Perhaps let tracehook_signal_deliver() keep
its type and semantics and introduce
void signal_delivered(struct k_sigcontext *, struct pt_regs *, int stepping)
that would be called by all handle_signal() after successful frame setup
and would call tracehook_... itself, after having done block_sigmask()...
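As a rough illustration only, that proposal could look like the sketch below (types are stubbed for userspace; the variant that eventually went upstream used a `struct ksignal` and a `signal_delivered()` helper of a slightly different shape):

```c
#include <assert.h>

/* Stubbed kernel types; only the shape matters here. */
struct ksiginfo { int si_signo; };
struct k_sigaction_stub { void (*handler)(int); };
struct pt_regs_stub { long ax; };

/* One bundle instead of three loose arguments in random order. */
struct k_sigcontext {
    int sig;
    struct ksiginfo info;
    struct k_sigaction_stub ka;
};

static int last_delivered_sig;

/* Called by handle_signal() after successful frame setup: would do
 * block_sigmask(&ctx->ka, ctx->sig), clear_restore_sigmask(), and
 * then call tracehook_signal_handler() itself. */
static void signal_delivered(struct k_sigcontext *ctx,
                             struct pt_regs_stub *regs, int stepping)
{
    (void)regs;
    (void)stepping;
    last_delivered_sig = ctx->sig;
}
```

The point of the bundle is purely ergonomic: every arch's handle_signal()/setup_frame() currently threads the same sig/info/ka triple through in slightly different argument orders.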
Roland McGrath
2012-04-27 21:27:29 UTC
Permalink
Post by Al Viro
OK... FWIW, it sounds like an argument for using it (or a function in
kernel/signal.c that would call it) on all architectures.
That was certainly always the intent. But I myself only touched the arches
that I was prepared to test reasonably.
Post by Al Viro
Note that there's an extra complication on alpha and itanic - we have more
than just struct pt_regs * there. If we care about the rest of registers
(struct switch_stack on alpha), we probably need to do something about that.
The presumed plan so far was just that anybody would use user_regset for
any serious tracer/debugger stuff. That is certainly much more costly on
ia64 than passing along a few more values. If anybody ever cares, the
tracehook functions' signatures can always be adjusted in the future.
Post by Al Viro
I wonder how alpha and itanic deal with do_coredump(), BTW - it gets
pt_regs, but not the rest...
The expectation was that every arch would eventually switch on
CORE_DUMP_USE_REGSET. (Looks like so far 12 do and ~16 don't.)
Certainly avoiding the overhead of user_regset for core dumping is not
worth any new code complexity or extra arch hooks, since that overhead
even on the worst-case arch (ia64) has got to be marginal in comparison
to all the memory-copying and i/o going on. For imagined potential
tracing/fancier-debugging cases that might be used in high-throughput
ways the question would be different, but such uses still remain to be
implemented.


Thanks,
Roland
Al Viro
2012-04-27 23:15:26 UTC
Permalink
Post by Roland McGrath
The expectation was that every arch would eventually switch on
CORE_DUMP_USE_REGSET. (Looks like so far 12 do and so ~16 don't.)
Certainly avoiding the overhead of user_regset for core dumping is not
worth any new code complexity or extra arch hooks, since that overhead
even on the worst-case arch (ia64) has got to be marginal in comparison
to all the memory-copying and i/o going on. For imagined potential
tracing/fancier-debugging cases that might be used in high-throughput
ways the question would be different, but such uses still remain to be
implemented.
BTW, speaking of tracehook - you do realize that there are some
architectures where check for user_mode() in do_signal() is not
useless? I.e. there do_notify_resume() _can_ be called when
returning to kernel mode. And that'll get you tracehook_notify_resume()
called when you probably wouldn't want it to be; key_replace_session_keyring()
call is not desirable in that situation and the stuff Oleg wants
to add in tracehook_notify_resume() won't be happy with that either...

I think all such architectures need that check lifted to do_notify_resume()
(and the rest needs it killed, of course). Including x86, by the look
of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
!user_mode(regs), but I'm not entirely sure of that. arm is in about the
same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
there like that (they all check it in the asm glue). mips probably might,
unless I'm misreading their ret_from_fork()... Fun.
Al Viro
2012-04-27 23:32:35 UTC
Permalink
Post by Al Viro
I think all such architectures need that check lifted to do_notify_resume()
(and the rest needs it killed, of course). Including x86, by the look
of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
!user_mode(regs), but I'm not entirely sure of that. arm is in about the
same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
there like that (they all check it in the asm glue). mips probably might,
unless I'm misreading their ret_from_fork()... Fun.
Speaking of user_mode() oddities - may I politely inquire what had
been smoked to inspire this (in arch/s390/kernel/signal.c):
/* This is the legacy signal stack switching. */
else if (!user_mode(regs) &&
!(ka->sa.sa_flags & SA_RESTORER) &&
ka->sa.sa_restorer) {
sp = (unsigned long) ka->sa.sa_restorer;
}
especially when all paths leading to that come through do_signal() that does
if (!user_mode(regs))
return;
on the same regs. It had been like that since 2.3.99pre8 when s390 went
into the tree... It looks vaguely similar to i386
/* This is the legacy signal stack switching. */
if ((regs->ss & 0xffff) != __USER_DS &&
!(ka->sa.sa_flags & SA_RESTORER) &&
ka->sa.sa_restorer)
sp = (unsigned long) ka->sa.sa_restorer;
but there the code is at least not unreachable...
Al Viro
2012-04-29 04:12:05 UTC
Permalink
Post by Al Viro
Post by Al Viro
I think all such architectures need that check lifted to do_notify_resume()
(and the rest needs it killed, of course). Including x86, by the look
of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
!user_mode(regs), but I'm not entirely sure of that. arm is in about the
same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
there like that (they all check it in the asm glue). mips probably might,
unless I'm misreading their ret_from_fork()... Fun.
Speaking of user_mode() oddities - may I politely inquire what had
been smoked to inspire this (in arch/s390/kernel/signal.c):
/* This is the legacy signal stack switching. */
else if (!user_mode(regs) &&
!(ka->sa.sa_flags & SA_RESTORER) &&
ka->sa.sa_restorer) {
sp = (unsigned long) ka->sa.sa_restorer;
}
especially when all paths leading to that come through do_signal() that does
if (!user_mode(regs))
return;
on the same regs. It had been like that since 2.3.99pre8 when s390 went
into the tree... It looks vaguely similar to i386
/* This is the legacy signal stack switching. */
if ((regs->ss & 0xffff) != __USER_DS &&
!(ka->sa.sa_flags & SA_RESTORER) &&
ka->sa.sa_restorer)
sp = (unsigned long) ka->sa.sa_restorer;
but there the code is at least not unreachable...
While we are at it, can we *ever* reach do_signal() (nevermind deep in its
guts) with !user_mode(regs)?
AFAICS, for 31bit possible paths are:
do_signal()
<- sysc_sigpending
<- sysc_work
<- sysc_tif, having checked for user_mode(%r11)
<- io_sigpending
<- io_work_tif
<- io_work_user
<- io_work, having checked for user_mode(%r11)

and identical for 64bit. *All* paths into do_signal() go through
tm __PT_PSW+1(%r11),0x01 # returning to user ?
and proceed towards do_signal() only if the bit is set. Which is precisely
what user_mode(%r11) is...

What the hell is going on in that code?
Al Viro
2012-04-27 23:50:23 UTC
Permalink
Post by Al Viro
I think all such architectures need that check lifted to do_notify_resume()
(and the rest needs it killed, of course). Including x86, by the look
of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
!user_mode(regs), but I'm not entirely sure of that. arm is in about the
same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
there like that (they all check it in the asm glue). mips probably might,
unless I'm misreading their ret_from_fork()... Fun.
I'm completely unfamiliar with tile assembler, so I might be missing something
there, but AFAICS the following sequence can lead to do_signal() being
called for a kernel thread:
kernel_thread()
ret_from_fork() in child
.Lresume_userspace
do_work_pending()
do_signal()
and unlike many other, tile does _not_ check for !user_mode() anywhere
relevant... Broken?
Chris Metcalf
2012-04-28 18:51:43 UTC
Permalink
Calling interrupt_return will check the privilege of the context we're
returning to, to avoid the possibility of kernel threads doing any kind
of userspace actions (including signal handling) after a fork.

Signed-off-by: Chris Metcalf <***@tilera.com>
---
Al, thanks for noticing this. I've queued it up for 3.4.

Do you have a case that might provoke the signal behavior in the
unpatched code? The patched code passes our internal regressions.

arch/tile/kernel/intvec_32.S | 2 +-
arch/tile/kernel/intvec_64.S | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 5d56a1e..d0f48ca 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -1274,7 +1274,7 @@ STD_ENTRY(ret_from_fork)
FEEDBACK_REENTER(ret_from_fork)
{
movei r30, 0 /* not an NMI */
- j .Lresume_userspace /* jump into middle of interrupt_return */
+ j interrupt_return
}
STD_ENDPROC(ret_from_fork)

diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 49d9d66..252924b 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -1120,7 +1120,7 @@ STD_ENTRY(ret_from_fork)
FEEDBACK_REENTER(ret_from_fork)
{
movei r30, 0 /* not an NMI */
- j .Lresume_userspace /* jump into middle of interrupt_return */
+ j interrupt_return
}
STD_ENDPROC(ret_from_fork)
--
1.6.5.2
Al Viro
2012-04-28 20:55:17 UTC
Permalink
Post by Chris Metcalf
Calling interrupt_return will check the privilege of the context we're
returning to, to avoid the possibility of kernel threads doing any kind
of userspace actions (including signal handling) after a fork.
---
Al, thanks for noticing this. I've queued it up for 3.4.
Do you have a case that might provoke the signal behavior in the
unpatched code? The patched code passes our internal regressions.
arch/tile/kernel/intvec_32.S | 2 +-
arch/tile/kernel/intvec_64.S | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 5d56a1e..d0f48ca 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -1274,7 +1274,7 @@ STD_ENTRY(ret_from_fork)
FEEDBACK_REENTER(ret_from_fork)
{
movei r30, 0 /* not an NMI */
- j .Lresume_userspace /* jump into middle of interrupt_return */
+ j interrupt_return
}
STD_ENDPROC(ret_from_fork)
Umm... I'm not sure that it's correct. For one thing, ret_from_fork is
used both for kernel threads and for plain old fork(2). In the latter
case you want .Lresume_userspace, not interrupt_return. For another,
there's kernel_execve() and if it fails (binary doesn't exist/has wrong
format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
unchanged, i.e. the kernel one...

Frankly, with the way you have that stuff done I'd rather do just this:

int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
{
if (!user_mode(regs))
return 0;
....
}

and be done with that. Unless I'm seriously misreading your code it'll do
the right thing with no changes to asm glue. As for the reproducer, just
guess the PID of modprobe when you are e.g. trying to mount a filesystem
with fs driver modular and not loaded; fork(), have parent wait a bit
and call mount(), while the child keeps sending something more or less
innocent (SIGCHLD, for example) to the guessed PID. And either have
/sbin/modprobe chmod -x before doing that (you'll need to remember to
chmod it back before reboot, of course) or just
mount --bind /dev/null /sbin/modprobe. Either way, kernel_execve() will
fail. And if you manage to hit the sucker just as it's being spawned,
you'll get the kernel_thread() codepath as well.

FWIW, I like what you've done with do_work_pending() - it's much cleaner
than usual loops and tests in assembler. The only question is, what's
going on with
push_extra_callee_saves r0
you are doing there - seems possibly over the top for situations when
SIGPENDING isn't set and, more seriously, what if you go through that
loop many times? You slap them again and again into pt_regs, overwriting
anything ptrace() might've done to r34..r51, right?

Smells like something that should be done only once, not on each iteration
through the loop...
Chris Metcalf
2012-04-28 21:46:13 UTC
Permalink
Post by Al Viro
Post by Chris Metcalf
Calling interrupt_return will check the privilege of the context we're
returning to, to avoid the possibility of kernel threads doing any kind
of userspace actions (including signal handling) after a fork.
---
Al, thanks for noticing this. I've queued it up for 3.4.
Do you have a case that might provoke the signal behavior in the
unpatched code? The patched code passes our internal regressions.
arch/tile/kernel/intvec_32.S | 2 +-
arch/tile/kernel/intvec_64.S | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 5d56a1e..d0f48ca 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -1274,7 +1274,7 @@ STD_ENTRY(ret_from_fork)
FEEDBACK_REENTER(ret_from_fork)
{
movei r30, 0 /* not an NMI */
- j .Lresume_userspace /* jump into middle of interrupt_return */
+ j interrupt_return
}
STD_ENDPROC(ret_from_fork)
Umm... I'm not sure that it's correct. For one thing, ret_from_fork is
used both for kernel threads and for plain old fork(2). In the latter
case you want .Lresume_userspace, not interrupt_return.
It's OK, since we will jump to .Lresume_userspace from interrupt_return in
the latter case:

STD_ENTRY(interrupt_return)
/* If we're resuming to kernel space, don't check thread flags. */
{
[...]
PTREGS_PTR(r29, PTREGS_OFFSET_EX1)
}
ld r29, r29
andi r29, r29, SPR_EX_CONTEXT_1_1__PL_MASK /* mask off ICS */
{
beqzt r29, .Lresume_userspace
[...]
}

The struct pt_regs "ex1" field will reliably tell us whether we came from
kernel or userspace context. Certainly for fork() this will indicate
userspace, since it's the return register info for the syscall. And for
kernel_thread() we explicitly set up ex1 to indicate kernel privilege as well.
Post by Al Viro
For another,
there's kernel_execve() and if it fails (binary doesn't exist/has wrong
format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
unchanged, i.e. the kernel one...
This is a good point. The current syscall return path goes directly to
.Lresume_userspace, which will screw up kernel syscalls. I think the right
answer is still to jump to interrupt_return from those cases, though. In
particular, this gets rid of the existing cruftiness where we have to
document that a local label (.Lresume_userspace) can be the target of jumps
from outside the containing function.
Post by Al Viro
As for the reproducer, just
guess the PID of modprobe when you are e.g. trying to mount a filesystem
with fs driver modular and not loaded; fork(), have parent wait a bit
and call mount(), while the child keeps sending something more or less
innocent (SIGCHLD, for example) to the guessed PID. And either have
/sbin/modprobe chmod -x before doing that (you'll need to remember to
chmod it back before reboot, of course) or just
mount --bind /dev/null /sbin/modprobe. Either way, kernel_execve() will
fail. And if you manage to hit the sucker just as it's being spawned,
you'll get the kernel_thread() codepath as well.
FWIW, I like what you've done with do_work_pending() - it's much cleaner
than usual loops and tests in assembler. The only question is, what's
going on with
push_extra_callee_saves r0
you are doing there - seems possibly over the top for situations when
SIGPENDING isn't set and, more seriously, what if you go through that
loop many times? You slap them again and again into pt_regs, overwriting
anything ptrace() might've done to r34..r51, right?
Yes, that's a good observation. I should hoist the push of callee-saves to
before the loop. I'll put out a new patch that incorporates both of those
changes.

Thanks!
--
Chris Metcalf, Tilera Corp.
http://www.tilera.com
Al Viro
2012-04-29 00:55:06 UTC
Permalink
Post by Chris Metcalf
It's OK, since we will jump to .Lresume_userspace from interrupt_return in
the latter case:
STD_ENTRY(interrupt_return)
/* If we're resuming to kernel space, don't check thread flags. */
{
[...]
PTREGS_PTR(r29, PTREGS_OFFSET_EX1)
}
ld r29, r29
andi r29, r29, SPR_EX_CONTEXT_1_1__PL_MASK /* mask off ICS */
{
beqzt r29, .Lresume_userspace
[...]
}
The struct ptregs "ex1" field will reliably tell us whether we came from
kernel or userspace context. Certainly for fork() this will indicate
userspace, since it's the return register info for the syscall. And for
kernel_thread() we explicitly set up ex1 to indicate kernel privilege as well.
Post by Al Viro
For another,
there's kernel_execve() and if it fails (binary doesn't exist/has wrong
format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
unchanged, i.e. the kernel one...
This is a good point. The current syscall return path goes directly to
.Lresume_userspace, which will screw up kernel syscalls. I think the right
answer is still to jump to interrupt_return from those cases, though. In
particular, this gets rid of the existing cruftiness where we have to
document that a local label (.Lresume_userspace) can be the target of jumps
from outside the containing function.
Point, but... Are you sure you want to add extra work to a very hot path?
Leaving the "we don't have any pending work to do" path unburdened by that check
would reduce the overhead a lot.
Chris Metcalf
2012-04-29 03:49:21 UTC
Permalink
Post by Al Viro
Post by Chris Metcalf
Post by Al Viro
For another,
there's kernel_execve() and if it fails (binary doesn't exist/has wrong
format/etc.) you'll get to .Lresume_userspace with EX1_PL(regs->ex1)
unchanged, i.e. the kernel one...
This is a good point. The current syscall return path goes directly to
.Lresume_userspace, which will screw up kernel syscalls. I think the right
answer is still to jump to interrupt_return from those cases, though. In
particular, this gets rid of the existing cruftiness where we have to
document that a local label (.Lresume_userspace) can be the target of jumps
from outside the containing function.
Point, but... Are you sure you want to add extra work to a very hot path?
Leaving the "we don't have any pending work to do" path unburdened by that check
would reduce the overhead a lot.
I suppose that's plausible; it's only 5 cycles of work (including the one
cache-hot load), but might also be an extra icache line load, which could
be quite a bit slower.

I think the way to do this may be to both take your suggestion about
checking user_mode() in do_work_pending(), and also handle the looping over
the flag bits there as well. Then we can load the caller-saves once, prior
to calling do_work_pending(), and when we return we're guaranteed to have
interrupts disabled and TIF_ALLWORK clear.
--
Chris Metcalf, Tilera Corp.
http://www.tilera.com
Chris Metcalf
2012-04-28 18:51:43 UTC
Permalink
First, we were at risk of handling thread-info flags, in particular
do_signal(), when returning from kernel space. This could happen
after a failed kernel_execve(), or when forking a kernel thread.
The fix is to test in do_work_pending() for user_mode() and return
immediately if so; we already had this test for one of the flags,
so I just hoisted it to the top of the function.

Second, if a ptraced process updated the callee-saved registers
in the ptregs struct and then processed another thread-info flag, we
would overwrite the modifications with the original callee-saved
registers. To fix this, we add a register to note if we've already
saved the registers once, and skip doing it on additional passes
through the loop. To avoid a performance hit from the couple of
extra instructions involved, I modified the GET_THREAD_INFO() macro
to be guaranteed to be one instruction, then bundled it with adjacent
instructions, yielding an overall net savings.

Reported-By: Al Viro <***@ZenIV.linux.org.uk>
Signed-off-by: Chris Metcalf <***@tilera.com>
---
arch/tile/include/asm/thread_info.h | 9 ++++++-
arch/tile/kernel/intvec_32.S | 41 +++++++++++++++++++++++-----------
arch/tile/kernel/intvec_64.S | 38 +++++++++++++++++++++++--------
arch/tile/kernel/process.c | 7 ++++-
4 files changed, 68 insertions(+), 27 deletions(-)

diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index bc4f562..7594764 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -100,9 +100,14 @@ extern void cpu_idle_on_new_stack(struct thread_info *old_ti,

#else /* __ASSEMBLY__ */

-/* how to get the thread information struct from ASM */
+/*
+ * How to get the thread information struct from assembly.
+ * Note that we use different macros since different architectures
+ * have different semantics in their "mm" instruction and we would
+ * like to guarantee that the macro expands to exactly one instruction.
+ */
#ifdef __tilegx__
-#define GET_THREAD_INFO(reg) move reg, sp; mm reg, zero, LOG2_THREAD_SIZE, 63
+#define EXTRACT_THREAD_INFO(reg) mm reg, zero, LOG2_THREAD_SIZE, 63
#else
#define GET_THREAD_INFO(reg) mm reg, sp, zero, LOG2_THREAD_SIZE, 31
#endif
diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index 5d56a1e..6943515 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -839,6 +839,18 @@ STD_ENTRY(interrupt_return)
FEEDBACK_REENTER(interrupt_return)

/*
+ * Use r33 to hold whether we have already loaded the callee-saves
+ * into ptregs. We don't want to do it twice in this loop, since
+ * then we'd clobber whatever changes are made by ptrace, etc.
+ * Get base of stack in r32.
+ */
+ {
+ GET_THREAD_INFO(r32)
+ movei r33, 0
+ }
+
+.Lretry_work_pending:
+ /*
* Disable interrupts so as to make sure we don't
* miss an interrupt that sets any of the thread flags (like
* need_resched or sigpending) between sampling and the iret.
@@ -848,9 +860,6 @@ STD_ENTRY(interrupt_return)
IRQ_DISABLE(r20, r21)
TRACE_IRQS_OFF /* Note: clobbers registers r0-r29 */

- /* Get base of stack in r32; note r30/31 are used as arguments here. */
- GET_THREAD_INFO(r32)
-

/* Check to see if there is any work to do before returning to user. */
{
@@ -866,16 +875,18 @@ STD_ENTRY(interrupt_return)

/*
* Make sure we have all the registers saved for signal
- * handling or single-step. Call out to C code to figure out
- * exactly what we need to do for each flag bit, then if
- * necessary, reload the flags and recheck.
+ * handling, notify-resume, or single-step. Call out to C
+ * code to figure out exactly what we need to do for each flag bit,
+ * then if necessary, reload the flags and recheck.
*/
- push_extra_callee_saves r0
{
PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
- jal do_work_pending
+ bnz r33, 1f
}
- bnz r0, .Lresume_userspace
+ push_extra_callee_saves r0
+ movei r33, 1
+1: jal do_work_pending
+ bnz r0, .Lretry_work_pending

/*
* In the NMI case we
@@ -1180,10 +1191,12 @@ handle_syscall:
add r20, r20, tp
lw r21, r20
addi r21, r21, 1
- sw r20, r21
+ {
+ sw r20, r21
+ GET_THREAD_INFO(r31)
+ }

/* Trace syscalls, if requested. */
- GET_THREAD_INFO(r31)
addi r31, r31, THREAD_INFO_FLAGS_OFFSET
lw r30, r31
andi r30, r30, _TIF_SYSCALL_TRACE
@@ -1362,7 +1375,10 @@ handle_ill:
3:
/* set PC and continue */
lw r26, r24
- sw r28, r26
+ {
+ sw r28, r26
+ GET_THREAD_INFO(r0)
+ }

/*
* Clear TIF_SINGLESTEP to prevent recursion if we execute an ill.
@@ -1370,7 +1386,6 @@ handle_ill:
* need to clear it here and can't really impose on all other arches.
* So what's another write between friends?
*/
- GET_THREAD_INFO(r0)

addi r1, r0, THREAD_INFO_FLAGS_OFFSET
{
diff --git a/arch/tile/kernel/intvec_64.S b/arch/tile/kernel/intvec_64.S
index 49d9d66..30ae76e 100644
--- a/arch/tile/kernel/intvec_64.S
+++ b/arch/tile/kernel/intvec_64.S
@@ -647,6 +647,20 @@ STD_ENTRY(interrupt_return)
FEEDBACK_REENTER(interrupt_return)

/*
+ * Use r33 to hold whether we have already loaded the callee-saves
+ * into ptregs. We don't want to do it twice in this loop, since
+ * then we'd clobber whatever changes are made by ptrace, etc.
+ */
+ {
+ movei r33, 0
+ move r32, sp
+ }
+
+ /* Get base of stack in r32. */
+ EXTRACT_THREAD_INFO(r32)
+
+.Lretry_work_pending:
+ /*
* Disable interrupts so as to make sure we don't
* miss an interrupt that sets any of the thread flags (like
* need_resched or sigpending) between sampling and the iret.
@@ -656,9 +670,6 @@ STD_ENTRY(interrupt_return)
IRQ_DISABLE(r20, r21)
TRACE_IRQS_OFF /* Note: clobbers registers r0-r29 */

- /* Get base of stack in r32; note r30/31 are used as arguments here. */
- GET_THREAD_INFO(r32)
-

/* Check to see if there is any work to do before returning to user. */
{
@@ -674,16 +685,18 @@ STD_ENTRY(interrupt_return)

/*
* Make sure we have all the registers saved for signal
- * handling or single-step. Call out to C code to figure out
+ * handling or notify-resume. Call out to C code to figure out
* exactly what we need to do for each flag bit, then if
* necessary, reload the flags and recheck.
*/
- push_extra_callee_saves r0
{
PTREGS_PTR(r0, PTREGS_OFFSET_BASE)
- jal do_work_pending
+ bnez r33, 1f
}
- bnez r0, .Lresume_userspace
+ push_extra_callee_saves r0
+ movei r33, 1
+1: jal do_work_pending
+ bnez r0, .Lretry_work_pending

/*
* In the NMI case we
@@ -968,11 +981,16 @@ handle_syscall:
shl16insli r20, r20, hw0(irq_stat + IRQ_CPUSTAT_SYSCALL_COUNT_OFFSET)
add r20, r20, tp
ld4s r21, r20
- addi r21, r21, 1
- st4 r20, r21
+ {
+ addi r21, r21, 1
+ move r31, sp
+ }
+ {
+ st4 r20, r21
+ EXTRACT_THREAD_INFO(r31)
+ }

/* Trace syscalls, if requested. */
- GET_THREAD_INFO(r31)
addi r31, r31, THREAD_INFO_FLAGS_OFFSET
ld r30, r31
andi r30, r30, _TIF_SYSCALL_TRACE
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 2d5ef61..54e6c64 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -567,6 +567,10 @@ struct task_struct *__sched _switch_to(struct task_struct *prev,
*/
int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
{
+ /* If we enter in kernel mode, do nothing and exit the caller loop. */
+ if (!user_mode(regs))
+ return 0;
+
if (thread_info_flags & _TIF_NEED_RESCHED) {
schedule();
return 1;
@@ -589,8 +593,7 @@ int do_work_pending(struct pt_regs *regs, u32 thread_info_flags)
return 1;
}
if (thread_info_flags & _TIF_SINGLESTEP) {
- if ((regs->ex1 & SPR_EX_CONTEXT_1_1__PL_MASK) == 0)
- single_step_once(regs);
+ single_step_once(regs);
return 0;
}
panic("work_pending: bad flags %#x\n", thread_info_flags);
--
1.6.5.2
Al Viro
2012-04-28 02:42:08 UTC
Permalink
Post by Al Viro
I think all such architectures need that check lifted to do_notify_resume()
(and the rest needs it killed, of course). Including x86, by the look
of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
!user_mode(regs), but I'm not entirely sure of that. arm is in about the
same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
there like that (they all check it in the asm glue). mips probably might,
unless I'm misreading their ret_from_fork()... Fun.
It's actually worse than I thought - we can't just lift that check
to do_notify_resume() and be done with that. Suppose do_signal() does
get called on e.g. i386 or arm with !user_mode(regs). What'll happen next?

We have TIF_SIGPENDING set in thread flags - otherwise we wouldn't get
there at all. OK, do_signal() doesn't do anything and returns. So does
do_notify_resume(). And we are back into the loop in asm glue, rereading
the thread flags (still unchanged), checking if anything is to be done
(yes, it is - TIF_SIGPENDING is still set), calling do_notify_resume(),
ad infinitum.

Lifting the check into do_notify_resume() will not help at all, obviously.

AFAICS we can get hit by that. At least i386, arm and mips have
ret_from_fork going straight to "return from syscall" path, no checks for
return to user mode done. And process created by kernel_thread() will
go there. It's a narrow race, but AFAICS it's not impossible to hit -
guess the PID of kernel thread to be launched, send it a signal and hit
the moment before it gets to executing the payload.

It's probably not exploitable unless you are root, since most of the
threads are spawned either by kthreadd or by khelper, both running as
root. OTOH, there might be other places leading to the same fun - e.g.
kernel_execve() goes through the normal syscall return path almost on
everything and in case of failure it returns to kernel mode. Again,
that one is unlikely to be exploitable (it only happens from root-owned
threads), but I'm not sure if anything else gets there; IIRC, there had
been an effort to get rid of issuing syscalls via int/syscall/trap/whatnot,
but I don't remember how far did it go, especially under arch...

Comments?
Al Viro
2012-04-28 03:32:45 UTC
Permalink
Post by Al Viro
Post by Al Viro
I think all such architectures need that check lifted to do_notify_resume()
(and the rest needs it killed, of course). Including x86, by the look
of it - we _probably_ can't get there with TIF_NOTIFY_RESUME and
!user_mode(regs), but I'm not entirely sure of that. arm is in about the
same situation; alpha, ppc{32,64}, sparc{32,64} and m68k really can't get
there like that (they all check it in the asm glue). mips probably might,
unless I'm misreading their ret_from_fork()... Fun.
It's actually worse than I thought - we can't just lift that check
to do_notify_resume() and be done with that. Suppose do_signal() does
get called on e.g. i386 or arm with !user_mode(regs). What'll happen next?
We have TIF_SIGPENDING set in thread flags - otherwise we wouldn't get
there at all. OK, do_signal() doesn't do anything and returns. So does
do_notify_resume(). And we are back into the loop in asm glue, rereading
the thread flags (still unchanged), checking if anything is to be done
(yes, it is - TIF_SIGPENDING is still set), calling do_notify_resume(),
ad infinitum.
Lifting the check into do_notify_resume() will not help at all, obviously.
AFAICS we can get hit by that. At least i386, arm and mips have
ret_from_fork going straight to "return from syscall" path, no checks for
return to user mode done. And process created by kernel_thread() will
go there. It's a narrow race, but AFAICS it's not impossible to hit -
guess the PID of kernel thread to be launched, send it a signal and hit
the moment before it gets to executing the payload.
It's probably not exploitable unless you are root, since most of the
threads are spawned either by kthreadd or by khelper, both running as
root. OTOH, there might be other places leading to the same fun - e.g.
kernel_execve() goes through the normal syscall return path almost on
everything and in case of failure it returns to kernel mode. Again,
that one is unlikely to be exploitable (it only happens from root-owned
threads), but I'm not sure if anything else gets there; IIRC, there had
been an effort to get rid of issuing syscalls via int/syscall/trap/whatnot,
but I don't remember how far did it go, especially under arch...
Actually, it looks like on i386 the loop will be broken by checks in
resume_userspace_sig, so the worst thing that might happen would be
a bogus call of tracehook_notify_resume() if it's possible to get there
with TIF_NOTIFY_RESUME for kernel thread. No such luck on arm, though...
To be honest, I'd rather check for user_mode() before calling
do_notify_resume() and go away to no_work_pending if it's true. For arm
and i386 that would probably look like this, and I'd really, *really*
like review and comments on that. amd64 is, AFAICS, careful enough to
avoid hitting do_notify_resume() when returning into the kernel mode -
its implementations of ret_from_fork and kernel_execve take care to avoid
that.

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 82aaf0a..e147619 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -57,6 +57,9 @@ work_pending:
* TIF_SIGPENDING or TIF_NOTIFY_RESUME must've been set if we got here
*/
mov r0, sp @ 'regs'
+ ldr r2, [sp, #S_PSR]
+ tst r2, #15
+ be no_work_pending
mov r2, why @ 'syscall'
tst r1, #_TIF_SIGPENDING @ delivering a signal?
movne why, #0 @ prevent further restarts
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index cd41742..f7b7a1c 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -641,15 +641,6 @@ static void do_signal(struct pt_regs *regs, int syscall)
int signr;

/*
- * We want the common case to go fast, which
- * is why we may in certain cases get here from
- * kernel mode. Just return without doing anything
- * if so.
- */
- if (!user_mode(regs))
- return;
-
- /*
* If we were from a system call, check for system call restarting...
*/
if (syscall) {
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 7b784f4..e858462 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -321,7 +321,6 @@ ret_from_exception:
preempt_stop(CLBR_ANY)
ret_from_intr:
GET_THREAD_INFO(%ebp)
-resume_userspace_sig:
#ifdef CONFIG_VM86
movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
movb PT_CS(%esp), %al
@@ -628,9 +627,13 @@ work_notifysig: # deal with pending signals and
# vm86-space
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ movb PT_CS(%esp), %bl
+ andl $SEGMENT_RPL_MASK, %ebx
+ cmpl $USER_RPL, %ebx
+ jb resume_kernel
xorl %edx, %edx
call do_notify_resume
- jmp resume_userspace_sig
+ jmp resume_userspace

ALIGN
work_notifysig_v86:
@@ -643,9 +646,13 @@ work_notifysig_v86:
#endif
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ movb PT_CS(%esp), %bl
+ andl $SEGMENT_RPL_MASK, %ebx
+ cmpl $USER_RPL, %ebx
+ jb resume_kernel
xorl %edx, %edx
call do_notify_resume
- jmp resume_userspace_sig
+ jmp resume_userspace
END(work_pending)

# perform syscall exit tracing
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 595969f..c4aa7c5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -738,16 +738,6 @@ static void do_signal(struct pt_regs *regs)
siginfo_t info;
int signr;

- /*
- * We want the common case to go fast, which is why we may in certain
- * cases get here from kernel mode. Just return without doing anything
- * if so.
- * X86_32: vm86 regs switched out by assembly code before reaching
- * here, so testing against kernel CS suffices.
- */
- if (!user_mode(regs))
- return;
-
signr = get_signal_to_deliver(&info, &ka, regs, NULL);
if (signr > 0) {
/* Whee! Actually deliver the signal. */
Al Viro
2012-04-28 03:36:06 UTC
Permalink
Post by Al Viro
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 82aaf0a..e147619 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
* TIF_SIGPENDING or TIF_NOTIFY_RESUME must've been set if we got here
*/
+ ldr r2, [sp, #S_PSR]
+ tst r2, #15
+ be no_work_pending
bne, actually. Apologies for braino...
Oleg Nesterov
2012-04-29 16:33:10 UTC
Permalink
Ah, I didn't notice this email...
Post by Al Viro
Actually, it looks like on i386 the loop will be broken by checks in
resume_userspace_sig,
Yes,
Post by Al Viro
so the worst thing that might happen would be
a bogus call of tracehook_notify_resume() if it's possible to get there
with TIF_NOTIFY_RESUME for kernel thread.
Afaics, a kernel thread should never have TIF_NOTIFY_RESUME set. Except
(afaics!) when a user-space task with TIF_NOTIFY_RESUME does kernel_thread(),
and the flag is copied by setup_thread_stack(). But this should be forbidden,
and we are going to kill daemonize(); probably it already has no callers.
Post by Al Viro
To be honest, I'd rather check for user_mode() before calling
do_notify_resume() and go away to no_work_pending if it's true.
Agreed! I tried to suggest this when 29a2e283 was discussed, but my asm
skills are close to zero.
Post by Al Viro
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
preempt_stop(CLBR_ANY)
GET_THREAD_INFO(%ebp)
#ifdef CONFIG_VM86
movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
movb PT_CS(%esp), %al
@@ -628,9 +627,13 @@ work_notifysig: # deal with pending signals and
# vm86-space
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ movb PT_CS(%esp), %bl
+ andl $SEGMENT_RPL_MASK, %ebx
+ cmpl $USER_RPL, %ebx
+ jb resume_kernel
xorl %edx, %edx
call do_notify_resume
- jmp resume_userspace_sig
+ jmp resume_userspace
ALIGN
#endif
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ movb PT_CS(%esp), %bl
+ andl $SEGMENT_RPL_MASK, %ebx
+ cmpl $USER_RPL, %ebx
+ jb resume_kernel
xorl %edx, %edx
call do_notify_resume
- jmp resume_userspace_sig
+ jmp resume_userspace
END(work_pending)
# perform syscall exit tracing
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 595969f..c4aa7c5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -738,16 +738,6 @@ static void do_signal(struct pt_regs *regs)
siginfo_t info;
int signr;
- /*
- * We want the common case to go fast, which is why we may in certain
- * cases get here from kernel mode. Just return without doing anything
- * if so.
- * X86_32: vm86 regs switched out by assembly code before reaching
- * here, so testing against kernel CS suffices.
- */
- if (!user_mode(regs))
- return;
-
Can't review, but like very much ;)

Oleg.
Oleg Nesterov
2012-04-29 16:18:18 UTC
Permalink
First of all, let me repeat I am useless when it comes to the low-level
or asm code. I can be easily wrong, and I am going to ask the questions.
Post by Al Viro
It's actually worse than I thought - we can't just lift that check
to do_notify_resume() and be done with that. Suppose do_signal() does
get called on e.g. i386 or arm with !user_mode(regs). What'll happen next?
We have TIF_SIGPENDING set in thread flags - otherwise we wouldn't get
there at all. OK, do_signal() doesn't do anything and returns. So does
do_notify_resume(). And we are back into the loop in asm glue, rereading
the thread flags (still unchanged), checking if anything is to be done
(yes, it is - TIF_SIGPENDING is still set), calling do_notify_resume(),
ad infinitum.
Lifting the check into do_notify_resume() will not help at all, obviously.
AFAICS we can get hit by that.
Please look at 29a2e2836ff9ea65a603c89df217f4198973a74f
x86-32: Fix endless loop when processing signals for kernel tasks
Post by Al Viro
At least i386, arm and mips have
ret_from_fork going straight to "return from syscall" path, no checks for
return to user mode done. And process created by kernel_thread() will
go there.
Looks like, the patch above fixes that.

But we call do_notify_resume() first; it would be nice to avoid this
and remove the user_mode() check in do_signal(), or lift it into
do_notify_resume().
Post by Al Viro
It's a narrow race, but AFAICS it's not impossible to hit -
guess the PID of kernel thread to be launched, send it a signal and hit
the moment before it gets to executing the payload.
Yes. But note that the kernel threads run with all signals ignored.

This is still possible, but a kernel thread should do allow_signal()
and then call kernel_thread() (not kthread_create).



Question. So far as I know, on x86 do_notify_resume() && !user_mode()
is only possible on a 32-bit system, and the possible callers are
ret_from_fork or kernel_execve (if it fails).

Plus, perhaps CONFIG_VM86 makes a difference?

Could you please clarify?

Oleg.
Al Viro
2012-04-29 18:05:35 UTC
Permalink
Post by Oleg Nesterov
Please look at 29a2e2836ff9ea65a603c89df217f4198973a74f
x86-32: Fix endless loop when processing signals for kernel tasks
Post by Al Viro
At least i386, arm and mips have
ret_from_fork going straight to "return from syscall" path, no checks for
return to user mode done. And process created by kernel_thread() will
go there.
Looks like, the patch above fixes that.
Yes, found that shortly after posting. No such luck for arm, though...
Post by Oleg Nesterov
But, we call do_notify_resume() first, it would be nice to avoid this
and remove the user_mode() check in do_signal() or lift into
do_notify_resume().
See the proposed patch. Does exactly that - lifts all the way to caller
of do_notify_resume(), buggers off if test fits.
Post by Oleg Nesterov
Question. So far as I know, on x86 do_notify_resume() && !user_mode()
is only possible on a 32-bit system, and the possible callers are
ret_from_fork or kernel_execve (if it fails).
Plus, perhaps CONFIG_VM86 makes a difference?
Could you please clarify?
VM86 doesn't make a difference; we form normal pt_regs for it in case
we have a pending signal, but after that has been done, we don't need
to care. Look:
* NOTIFY_RESUME handling doesn't need to be done unless we are
returning to userland. IOW, the first step is to lift that if (!user_mode(...
into do_notify_resume(). Agreed?
* Now, if do_notify_resume() does nothing in case !user_mode(regs),
let's lift that check to (32bit) caller. What we have right now is
do_notify_resume(%esp, NULL, %ecx)
goto resume_userspace_sig;
resume_userspace_sig:
if (!user_mode_vm(%esp))
goto resume_kernel;
resume_userspace:
So after lifting the check we get
if (user_mode(%esp))
do_notify_resume(%esp, NULL, %ecx)
goto resume_userspace_sig;
resume_userspace_sig:
if (!user_mode_vm(%esp))
goto resume_kernel;
resume_userspace:
but user_mode(regs) being true means that user_mode_vm(regs) is also true,
so this code is equivalent to
if (!user_mode(%esp))
goto resume_kernel;
do_notify_resume(%esp, NULL, %ecx)
goto resume_userspace;
(with stuff around resume_userspace_sig left without changes).

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 7b784f4..e858462 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -321,7 +321,6 @@ ret_from_exception:
preempt_stop(CLBR_ANY)
ret_from_intr:
GET_THREAD_INFO(%ebp)
-resume_userspace_sig:
#ifdef CONFIG_VM86
movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
movb PT_CS(%esp), %al
@@ -628,9 +627,13 @@ work_notifysig: # deal with pending signals and
# vm86-space
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ movb PT_CS(%esp), %bl
+ andl $SEGMENT_RPL_MASK, %ebx
+ cmpl $USER_RPL, %ebx
+ jb resume_kernel
xorl %edx, %edx
call do_notify_resume
- jmp resume_userspace_sig
+ jmp resume_userspace

ALIGN
work_notifysig_v86:
@@ -643,9 +646,13 @@ work_notifysig_v86:
#endif
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+ movb PT_CS(%esp), %bl
+ andl $SEGMENT_RPL_MASK, %ebx
+ cmpl $USER_RPL, %ebx
+ jb resume_kernel
xorl %edx, %edx
call do_notify_resume
- jmp resume_userspace_sig
+ jmp resume_userspace
END(work_pending)

# perform syscall exit tracing
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 595969f..c4aa7c5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -738,16 +738,6 @@ static void do_signal(struct pt_regs *regs)
siginfo_t info;
int signr;

- /*
- * We want the common case to go fast, which is why we may in certain
- * cases get here from kernel mode. Just return without doing anything
- * if so.
- * X86_32: vm86 regs switched out by assembly code before reaching
- * here, so testing against kernel CS suffices.
- */
- if (!user_mode(regs))
- return;
-
signr = get_signal_to_deliver(&info, &ka, regs, NULL);
if (signr > 0) {
/* Whee! Actually deliver the signal. */
Al Viro
2012-05-02 17:24:53 UTC
Permalink
* add that regs->orig_p0 = -1; to handle_restart()
* add jump .Lresume_userspace_1 after the call of do_notify_resume()
* instead of RESTORE_CONTEXT/rti in ret_from_fork and kernel_execve()
have them jump to .Lresume_userspace.
arrrgh... The last one is unfortunately not so simple - you have that
in the guts of _system_call, which is called via pseudo_long_call,
so it would have to be a call into the middle of _system_call, followed
by that RESTORE_CONTEXT/rti...
Oleg Nesterov
2012-05-02 18:30:57 UTC
Permalink
Post by Al Viro
* Now, if do_notify_resume() does nothing in case !user_mode(regs),
let's lift that check to (32bit) caller. What we have right now is
do_notify_resume(%esp, NULL, %ecx)
goto resume_userspace_sig;
if (!user_mode_vm(%esp))
goto resume_kernel;
So after lifting the check we get
if (user_mode(%esp))
do_notify_resume(%esp, NULL, %ecx)
goto resume_userspace_sig;
if (!user_mode_vm(%esp))
goto resume_kernel;
but user_mode(regs) being true means that user_mode_vm(regs) is also true,
so this code is equivalent to
if (!user_mode(%esp))
goto resume_kernel;
do_notify_resume(%esp, NULL, %ecx)
goto resume_userspace;
(with stuff around resume_userspace_sig left without changes).
Yes, thanks, this looks correct.

I've read the new patches in your tree. Again, I do not have any
useful comment, but a couple of questions.

And just in case... I will be completely offline till May 9.


----------------------------------------
046a099ad7b3791a7f9dfbe56ac1263bda8b1974 arm: if there's no handler we need to restore sigmask, syscall or no syscall

with or without this patch, set_current_blocked(->saved_sigmask) doesn't
look exactly right after force_sigsegv(), this can block SIGSEGV.

And force_sigsegv(sig => 0) looks strange, but this is off-topic.

And the question, I am just curious...

OTOH, I am not sure I understand the "int syscall" argument correctly;
I'll assume it means the same as "regs->orig_ax > 0" on x86. In this
case it is not clear to me how "!syscall && TIF_RESTORE_SIGMASK" is
possible.

x86 does this outside of the "if (syscall_get_nr(current, regs)" block
too. Probably this makes sense because a debugger can change orig_ax in
between?

(The same for the next db7fddb9574c175aabdbcaa74b736bb3d1665a8e change
in unicore32)

----------------------------------------
415a12e79ebfa703a5ec91c85cb29f6ecc844aa1 most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set

Cosmetic nit. With this patch we have

void set_current_blocked(sigset_t *newset)
{
struct task_struct *tsk = current;
sigdelsetmask(newset, sigmask(SIGKILL) | sigmask(SIGSTOP));
spin_lock_irq(&tsk->sighand->siglock);
__set_task_blocked(tsk, newset);
spin_unlock_irq(&tsk->sighand->siglock);
}

but it could simply do

void set_current_blocked(sigset_t *newset)
{
sigdelsetmask(newset, sigmask(SIGKILL) | sigmask(SIGSTOP));
__set_current_blocked(newset);
}

-----------------------------------------
fa04e22b239aa035f3ae77151e26b03400303245 FRV: Shrink TIF_WORK_MASK [ver #2]

Off-topic/stupid question. Even though I know nothing about arch/frv, this looks
like a nice change to me because

#define _TIF_WORK_MASK 0x0000FFFE
#define _TIF_ALLWORK_MASK 0x0000FFFF

looks very confusing imho. I mean, it is not clear which bits we actually
want to check.

Can't we (cough, you ;) also cleanup _TIF_WORK_MASK/_TIF_ALLWORK_MASK on x86?
Oleg Nesterov
2012-04-29 16:41:55 UTC
Permalink
Post by Al Viro
BTW, I'm somewhat tempted to do the following: *ALL* calls of
tracehook_signal_handler() are now immediately preceded by block_signals().
Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
it could be painlessly added after the remaining block_signals() instances.
How about *folding* block_signals() (along with clear_restore_sigmask())
into tracehook_signal_handler()?
Oh, please no. Imho, these two have nothing to do with each other.

Besides, at least on x86 tracehook_signal_handler's logic is not exactly
right and should be fixed.

And we are going to kill tracehook.h. Personally I do not think this
is a good idea, but the fact of the matter is that the tracehooks are
already destroyed.

Oleg.
Al Viro
2012-04-29 18:09:26 UTC
Permalink
Post by Oleg Nesterov
Post by Al Viro
BTW, I'm somewhat tempted to do the following: *ALL* calls of
tracehook_signal_handler() are now immediately preceded by block_signals().
Moreover, tracehook_signal_handler(...., 0) is currently a no-op, so
it could be painlessly added after the remaining block_signals() instances.
How about *folding* block_signals() (along with clear_restore_sigmask())
into tracehook_signal_handler()?
Oh, please no. Imho, these two have nothing to do with each other.
Besides, at least on x86 tracehook_signal_handler's logic is not exactly
right and should be fixed.
Details, please...
Post by Oleg Nesterov
And we are going to kill tracehook.h. Personally I do not think this
is a good idea, but the fact of the matter is that the tracehooks are
already destroyed.
See signal.git#master. I ended up with
signal_delivered(<tracehook_signal_handler arguments>)
and that sucker does both things. The funny thing is, block_sigmask()
(apologies for typo above) is not used anywhere else...
Oleg Nesterov
2012-04-29 18:25:10 UTC
Permalink
Post by Al Viro
Post by Oleg Nesterov
Besides, at least on x86 tracehook_signal_handler's logic is not exactly
right and should be fixed.
Details, please...
See http://marc.info/?t=127550678000005 and
https://bugzilla.kernel.org/show_bug.cgi?id=16061

- the tracee reports a signal SIG

- the tracer does ptrace(SINGLESTEP, SIG), this approves the signal
and also sets TF

- the tracee dequeues the signal, changes its IP to sig_handler().

then it notices TIF_SINGLESTEP and notifies the tracer without
return to user-space _and_ without clearing TF or TIF_SINGLESTEP

- the tracer does ptrace(SINGLESTEP) again, but now enable_single_step()
loses TIF_FORCED_TF.

- the tracer does ptrace(CONT), but user_disable_single_step() doesn't
clear TF since TIF_FORCED_TF is not set

- the tracee returns to user-space with X86_EFLAGS_TF in eflags

Somehow we forgot about this bug...

Oleg.
Al Viro
2012-04-20 03:15:09 UTC
Permalink
Post by Al Viro
and filp_close() would, if that turns out to be possible, call fput_nodefer()
instead of fput(). If we *do* have places where we need deferral in
filp_close() (and I'm fairly sure that any such place is a deadlock right
now), well, we'll need a variant of filp_close() sans the call of fput...()
and those places would call that, followed by full (deferring) fput().
BTW, some of those filp_close() are simply wrong - e.g. a big pile in
drivers/media/video/cx25821/*.c should be fput().

And yes, we have at least some under mutex - binder_lock held by
binder_ioctl(), which ends up doing binder_transaction() with its
failure cleanup hitting close_filp(). On an arbitrary struct file.
It *probably* doesn't deadlock on the spot, since binder_release()
itself contains a deferral mechanism (perhaps they'd spotted the
deadlock, perhaps there are other reasons to have it).
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hugh Dickins
2012-04-20 18:54:01 UTC
Permalink
Post by Al Viro
Has the discussion here moved from deferring the __fput() for the
mmap_sem/i_mutex lockdep side case, to taking the i_mutex in __fput() in
general? Lockdep has not reported any problems, other than for the
mmap_sem/i_mutex scenario.
Post by Al Viro
This
is not a way to go; that kind of kludges leads to locking code that is
impossible to reason about.
Are you referring to deferring the __fput() or taking the i_mutex in
__fput() in general?
I'm referring to having "are we holding ->mmap_sem" logics anywhere in fput().
Note that your "lockdep has not reported..." is a symptom of the same
problem - it's *NOT* enough to show the lack of deadlocks from the change
of locking rules. And after that change we'll get even worse situation,
since proving the safety will become harder (and even more so if we end
up adding other cases when we need to defer).
Summary for those who'd been spared the earlier fun: IMA stuff wants to grab
->i_mutex on the final fput() in some cases. That violates the locking
order in at least one way - final fput() may come under ->mmap_sem, in
which case grabbing ->i_mutex would be a Bloody Bad Idea(tm). "Solution"
proposed: a bit of spectaculary ugly logics checking in final fput() whether
we have ->mmap_sem locked. If we do, said final fput() becomes non-final
and is done asynchronously. This is fundamentally flawed, of course,
since "is ->mmap_sem unlocked" is used as "is it safe to have ->i_mutex
grabbed", and it's both unproven and brittle as hell even if it happens
to be true right now (and I wouldn't bet on that, TBH).
I can see that the discussion has since moved on quite a way from here.

But it looks to me fairly easy for mm to stop doing fput() under mmap_sem.

That's already the case when exiting (no mmap_sem held), and shouldn't add
observable cost when unmapping (we already work on a chain of vmas to be
freed, and when unmapping that chain will usually just be of one: shouldn't
matter to defer a final pass until after mmap_sem is dropped). Unless I'm
mistaken, the fput() buried in vma_adjust() can never be a final fput.

Is it worth my trying to implement that? Or do you see an immediate
gotcha that I'm missing? Or are you happy enough with your deferred
fput() ideas, that it would be a waste of time to rearrange the mm end?

Hugh
Post by Al Viro
If it had been IMA alone, I would've cheerfully told them where to stuff
the whole thing. Unfortunately, fput() *is* lock-heavy even without them.
Note that it can easily end up triggering such things as final
deactivate_super(). Now, ->mmap_sem is irrelevant here - after all,
any inodes involved won't be mmapped, or deactivate_super() wouldn't
be final. However, fput() is called e.g. under rtnl_lock() and I'm
not at all sure that something like NFS won't try to grab it from its
->kill_sb().
So the question is, do we have any reasonable way to get to the situation
where the __fput() would only be done without any locks held? It might
* defer all __fput() (i.e. always do final fput() async). Obviously
no-go, for performance reasons alone.
* check some predicate about the set of locks held and defer if it's
false. That's what IMA folks have tried to do; no-go for the reasons described
above.
* add a new helper (fput_carefully() or something like that) that would
defer final __fput(), leaving fput() with the current behaviour. And convert
the potentially unsafe callers to it (starting with everything that holds
->mmap_sem). No-go since it's not maintainable - a change pretty far away
from a function that does (currently safe) fput() can end up requiring the
conversion to fput_carefully(). Too easy to miss, so this will decay and it
won't be easy to verify correctness several years down the road.
However, there's an approach that might be feasible. Most of the time
the final fput() *is* done without any locks held and there's a very
large subclass of those call sites - those that come via fput_light().
* prohibit fput_light() with locks held. Right now we are very
close to that (or already there - I haven't finished checking).
* convert low-hanging fget/fput in syscalls to fget_light/fput_light.
Makes sense anyway.
* split off fput_nodefer() (identical to what fput() is right now),
make fput_light() call it instead of fput(). Switch path_lookupat() and
path_openat() to fput_nodefer() as well. Ditto for sys_socketpair() and
sys_accept4().
* make fput() (now almost never leading to __fput()) defer the sucker
in very rare cases when it ends up dropping the last reference.
At that point we would have __fput() always done without any locks held,
which would remove all restrictions on the locks that can be taken from it.
Comments?
Disclaimer: I have not finished going through the call sites (note that
there are wrappers - e.g. sockfd_put() *and* sockfd_lookup()), so there might
be obstacles. In particular, I'm still not sure about SCM_RIGHTS datagrams
handling - IIRC, we had an interesting kludge in there, with a kinda-sorta
deferral/batching. BTW, I wonder about deadlocks around that one -
drivers/vhost/net.c is doing ->recvmsg() while holding vq->mutex and if
an SCM_RIGHTS datagram comes in, we'll get a bunch of fput() done. Which
can lead to final umount of a network filesystem, etc. If that thing can
lead to handle_rx() on the same sucker, we have a deadlock...
Al Viro
2012-04-20 19:04:18 UTC
Permalink
Post by Hugh Dickins
I can see that the discussion has since moved on quite a way from here.
But it looks to me fairly easy for mm to stop doing fput() under mmap_sem.
That's already the case when exiting (no mmap_sem held), and shouldn't add
observable cost when unmapping (we already work on a chain of vmas to be
freed, and when unmapping that chain will usually just be of one: shouldn't
matter to defer a final pass until after mmap_sem is dropped). Unless I'm
mistaken, the fput() buried in vma_adjust() can never be a final fput.
Is it worth my trying to implement that? Or do you see an immediate
gotcha that I'm missing? Or are you happy enough with your deferred
fput() ideas, that it would be a waste of time to rearrange the mm end?
Deferring the final pass after dropping ->mmap_sem is going to be
interesting; what would protect ->vm_next on those suckers? Locking
rules are already bloody complicated in mm/*; this will only add more
subtle fun. Note that right now both the dissolution of ->vm_next
list and freeing of VMAs happen under ->mmap_sem; changing that might
cost us a lot of headache...
Linus Torvalds
2012-04-20 19:18:43 UTC
Permalink
Post by Al Viro
Deferring the final pass after dropping ->mmap_sem is going to be
interesting; what would protect ->vm_next on those suckers?
Just remove them from the active list, and keep them linked to each
other using vm_next.

After all, the only case we care about is the case where the vma gets
removed entirely, so we just put them on their own list.

In fact, that's what we already do for other reasons. See
detach_vmas_to_be_unmapped().

So vm_next is actually entirely *private* by this time, and needs no
locking at all.

As far as I can tell, we could just make do_munmap() return that
private list, and then do the fput's and freeing of the list outside
the mmap_sem lock.

That actually looks pretty simple. There are a fair number of callers,
which looks to be the main annoyance. But fixing it up with some
pattern of "do finish_munmap after dropping the mmap_sem" doesn't look
*complicated*, just slightly annoying.

The *bigger* annoyance is actually "do_mmap()", which does a
do_munmap() as part of it, so it needs the same cleanup too.

There might be other cases that do munmap as part of their operation
(and that have the mmap_sem held outside of the caller), but
do_munmap() and do_mmap() seem to be the two obvious ones.

Many of the callers seem to do the mmap_sem() right around the call,
though (see binfmt_elf.c for example), so it really would be a rather
local fixup.

Linus
Hugh Dickins
2012-04-20 19:32:45 UTC
Permalink
Post by Linus Torvalds
Post by Al Viro
Deferring the final pass after dropping ->mmap_sem is going to be
interesting; what would protect ->vm_next on those suckers?
Just remove them from the active list, and keep them linked to each
other using vm_next.
After all, the only case we care about is the case where the vma gets
removed entirely, so we just put them on their own list.
In fact, that's what we already do for other reasons. See
detach_vmas_to_be_unmapped().
So vm_next is actually entirely *private* by this time, and needs no
locking at all.
As far as I can tell, we could just make do_munmap() return that
private list, and then do the fput's and freeing of the list outside
the mmap_sem lock.
That actually looks pretty simple. There are a fair number of callers,
which looks to be the main annoyance. But fixing it up with some
pattern of "do finish_munmap after dropping the mmap_sem" doesn't look
*complicated*, just slightly annoying.
The *bigger* annoyance is actually "do_mmap()", which does a
do_munmap() as part of it, so it needs the same cleanup too.
There might be other cases that do munmap as part of their operation
(and that have the mmap_sem held outside of the caller), but
do_munmap() and do_mmap() seem to be the two obvious ones.
Many of the callers seem to do the mmap_sem() right around the call,
though (see binfmt_elf.c for example), so it really would be a rather
local fixup.
Yes, that's exactly how I was thinking of it.

Hugh
Al Viro
2012-04-20 19:58:33 UTC
Permalink
Post by Linus Torvalds
The *bigger* annoyance is actually "do_mmap()", which does a
do_munmap() as part of it, so it needs the same cleanup too.
So does a bunch of other places. Let me dig out the call graph circa
3.3.0... Here is the relevant part:
do_munmap() -
<- pfm_do_munmap() <- pfm_remove_smpl_mapping() which grabs mmap_sem excl
<- 64_munmap(2) which grabs mmap_sem excl
<- kvm_arch_commit_memory_region() which grabs mmap_sem excl
<- i810_unmap_buffer() which grabs mmap_sem excl
<- aio_free_ring() which grabs mmap_sem excl
<- elf_map() which grabs mmap_sem excl
<- [flat] load_flat_file() --- BUG HERE
<- shmdt(2) which grabs mmap_sem excl
<- brk(2) which grabs mmap_sem excl
<- mmap_region() [see below]
<- munmap(2) which grabs mmap_sem excl
<- do_brk() [see below]
<- move_vma()
<- mremap_to() <- do_mremap() [see below]
<- do_mremap() [see below]
<- mremap_to() <- do_mremap() [see below]
<- do_mremap() [see below]
do_brk() -
<- brk(2) which grabs mmap_sem excl
<- [ia32_aout] set_brk() which grabs mmap_sem excl
<- [ia32_aout] load_aout_binary() which grabs mmap_sem excl
<- [ia32_aout] load_aout_library() which grabs mmap_sem excl
<- [aout] set_brk() which grabs mmap_sem excl
<- [aout] load_aout_binary() which grabs mmap_sem excl
<- [aout] load_aout_library() which grabs mmap_sem excl
<- [elf] set_brk() which grabs mmap_sem excl
<- [elf] load_elf_interp() which grabs mmap_sem excl
<- [elf] load_elf_library() which grabs mmap_sem excl
mmap_region() -
<- remap_file_pages(2) which grabs mmap_sem excl
<- do_mmap_pgoff() [see below]
<- [tile] arch_setup_additional_pages() which grabs mmap_sem excl
(a bit too late, BTW, but not for this one)
do_mmap_pgoff() -
<- do_mmap() [see below]
<- mmap_pgoff(2) which grabs mmap_sem excl
do_mmap() -
<- shmat(2) which grabs mmap_sem excl
<- aio_setup_ring() which grabs mmap_sem excl [NB: only because ctx->mm == current->mm]
<- kvm_arch_prepare_memory_region() which grabs mmap_sem excl
<- drm_mapbufs() which grabs mmap_sem excl
<- exynos_drm_gem_mmap_ioctl() which grabs mmap_sem excl
<- i810_map_buffer() which grabs mmap_sem excl [NB: racy changes of ->f_op]
<- i915_gem_mmap_ioctl() which grabs mmap_sem excl
<- [tile] single_step_once() which grabs mmap_sem excl
<- [elf] elf_map() which grabs mmap_sem excl
<- [elf] load_elf_binary() which grabs mmap_sem excl
<- [elf_fdpic] load_elf_fdpic_binary() which grabs mmap_sem excl
<- [elf_fdpic] elf_fdpic_map_file_constdisp_on_uclinux() which grabs mmap_sem excl
<- [elf_fdpic] elf_fdpic_map_file_by_direct_mmap() which grabs mmap_sem excl
<- [aout] load_aout_binary() which grabs mmap_sem excl
<- [aout] load_aout_library() which grabs mmap_sem excl
<- [ia32_aout] load_aout_binary() which grabs mmap_sem excl
<- [ia32_aout] load_aout_library() which grabs mmap_sem excl
<- [flat] load_flat_file() which grabs mmap_sem excl
<- [som] map_som_binary() which grabs mmap_sem excl
do_mremap() -
<- mremap(2) which grabs mmap_sem excl

(bug mentioned re load_flat_file() is still there, but it's irrelevant
for our purposes - no ->mmap_sem held by caller of do_munmap()). That's
a metric arseload of sites to propagate that thing to...
Linus Torvalds
2012-04-20 21:12:55 UTC
Permalink
So does a bunch of other places. Let me dig out the call graph circa
Yes, but a lot of those would actually be helped by a helper function
that does all of:
- grab mmap_sem
- call do_m[un]map()
- release mmap_sem

and that would actually clean them up even in the current case.

And then we could do the cleanup in just the helper function.

Not all, no. But a preparatory patch that just creates the helper
functions for doing brk/mmap/munmap would get rid of a fairly big
chunk of them.

You can visualize how many of them do that by just doing

git grep -5 do_m[un]*map

and then high-lghting '_write(' (to visually show the
down_write/up_write pairs that surround most of them) by searching for
it.

Are they all like that? No. But most of the ones outside of mm/ do fit
that simple pattern and should probably be fixed up just to have them
not contain VM locking details in them *anyway*.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-securit=
y-module" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Al Viro
2012-04-20 22:13:15 UTC
Permalink
Post by Linus Torvalds
Are they all like that? No. But most of the ones outside of mm/ do fit
that simple pattern and should probably be fixed up just to have them
not contain VM locking details in them *anyway*.
Kinda-sorta. I agree that such helpers make sense, but you are too
optimistic about the number of such places. And clusterfuck around
mremap() is fairly deep, so propagating all the way back to up_write()
won't be fun.

sys_shmdt() is misusing do_munmap(), AFAICS. And there we call it many
times in a loop, unmapping only a subset, so it's not like we could
blindly pick a string of vmas and handle them all separately afterwards.
It could be handled if we passed an initially empty list, collecting
vmas in it as we go, but... ouch

BTW, looks like I've just found a bug in oprofile - take a look at
sys_munmap() and you'll see that it's almost exactly your proposed
helper. The only difference is that it starts with calling
profile_munmap(addr);
(mind you, *before* grabbing ->mmap_sem), which is to say
blocking_notifier_call_chain(&munmap_notifier, 0, (void *)addr);
i.e. a really fancy way to arrange a call of munmap_notify() from
drivers/oprofile/buffer_sync.c. Which does the following:
{
unsigned long addr = (unsigned long)data;
struct mm_struct *mm = current->mm;
struct vm_area_struct *mpnt;

down_read(&mm->mmap_sem);

mpnt = find_vma(mm, addr);
if (mpnt && mpnt->vm_file && (mpnt->vm_flags & VM_EXEC)) {
up_read(&mm->mmap_sem);
/* To avoid latency problems, we only process the current CPU,
* hoping that most samples for the task are on this CPU
*/
sync_buffer(raw_smp_processor_id());
return 0;
}

up_read(&mm->mmap_sem);
return 0;
}

Leaving aside the convoluted way it's done, it's obviously both racy and
incomplete. The worst part is, sync_buffer() *can't* be called with
any ->mmap_sem held; it goes through a bunch of task_struct, grabbing
->mmap_sem shared. Fun...
Linus Torvalds
2012-04-20 22:35:29 UTC
Permalink
Kinda-sorta. I agree that such helpers make sense, but you are too
optimistic about the number of such places. And clusterfuck around
mremap() is fairly deep, so propagating all the way back to up_write()
won't be fun.
Fair enough.

I'll do the helpers and see how much they get rid of, just because
looking at all the callers, those helpers seem to be obviously the
right thing anyway. So even if we don't do anything else, we can
improve things regardless.

For do_brk(), for example, it looks like do_brk() itself should
actually be entirely static to mm/mmap.c, because every single caller
from the outside actually wants the self-locking version.

So plan right now: do "vm_xyzzy()" helper functions that do
"do_xyzzy()" and take the lock (and do not take the "mm" argument,
because it had better always be the current one - keep the calling
convention as simple as possible).

Linus
Kasatkin, Dmitry
2012-04-27 07:35:25 UTC
Permalink
On Sat, Apr 21, 2012 at 1:35 AM, Linus Torvalds
Kinda-sorta.  I agree that such helpers make sense, but you are too
optimistic about the number of such places.  And clusterfuck around
mremap() is fairly deep, so propagating all way back to up_write()
won't be fun.
Fair enough.
I'll do the helpers and see how much they get rid of, just because
looking at all the callers, those helpers seem to be obviously the
right thing anyway. So even if we don't do anything else, we can
improve things regardless.
For do_brk(), for example, it looks like do_brk() itself should
actually be entirely static to mm/mmap.c, because every single caller
from the outside actually wants the self-locking version.
So plan right now: do "vm_xyzzy()" helper functions that do
"do_xyzzy()" and take the lock (and do not take the "mm" argument,
because it had better always be the current one - keep the calling
convention as simple as possible).
			Linus
Hi Linus,

Taking the list of files to fput out of do_munmap() and doing it after
unlocking mmap_sem was one of the possible solutions we discussed with
Mimi. But we were looking for a solution with fewer modifications to the
existing kernel.

In fact, IMA already does some work before taking mmap_sem:
---
	ret = ima_file_mmap(filep, prot);
	if (ret < 0)
		return ret;
	down_write(&current->mm->mmap_sem);
---

But have you seen the proposed patch for __fput()?
[PATCH v4 10/12] ima: defer calling __fput()

Of course, it defers only the last fput, AND only when mmap_sem is locked
AND the file is open for write:

	if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
		if (ima_defer_fput(file) == 0)
			return;
	}

Just 5 out of ~100,000 fput() calls made with mmap_sem held were deferred.

t.
- Dmitry
Al Viro
2012-04-27 17:34:18 UTC
Permalink
Post by Kasatkin, Dmitry
But have you seen the proposed patch for __fput()?
[PATCH v4 10/12] ima: defer calling __fput()
It defers only of course the last AND mmap_sem is locked AND open for write.
if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
if (ima_defer_fput(file) == 0)
return;
}
Just 5 out of ~100,000 mmap_sem held fput() calls were deferred.
Let me get it straight.
a) You still ignore all the problems with that described in the
posting right in the beginning of this thread.
b) You ignore the problems with semantics changes from user-visible
delays of fput() past the return from syscall (described in Linus' posting
upthread - they apply to this "solution" as well).
c) You seem to consider the fact that this path will be exercised
very rarely, thus making any races on it damn hard to reproduce and debug
as a good thing.

And as for the sentiment expressed in the beginning of your posting (that
smaller patch size is worth more than clean locking rules, maintainability
of resulting kernel, etc.)... I'm sorry, but you guys need to decide
what IMA is. If it's a first-class part of the kernel, you have your
priorities backwards...
Kasatkin, Dmitry
2012-04-27 18:52:09 UTC
Permalink
Post by Al Viro
Post by Kasatkin, Dmitry
But have you seen the proposed patch for __fput()?
[PATCH v4 10/12] ima: defer calling __fput()
It defers only of course the last AND mmap_sem is locked AND open for write.
	if (current->mm && rwsem_is_locked(&current->mm->mmap_sem)) {
		if (ima_defer_fput(file) == 0)
			return;
	}
Just 5 out of ~100,000 mmap_sem held fput() calls were deferred.
Let me get it straight.
	a) You still ignore all the problems with that described in the
posting right in the beginning of this thread.
	b) You ignore the problems with semantics changes from user-visible
delays of fput() past the return from syscall (described in Linus' posting
upthread - they apply to this "solution" as well).
	c) You seem to consider the fact that this path will be exercised
very rarely, thus making any races on it damn hard to reproduce and debug,
as a good thing.
And as for the sentiment expressed in the beginning of your posting (that
smaller patch size is worth more than clean locking rules, maintainability
of resulting kernel, etc.)...  I'm sorry, but you guys need to decide
what IMA is.  If it's a first-class part of the kernel, you have your
priorities backwards...
Hello,

I do not ignore anything.
I said that we were thinking about a solution that collects the list of
files and fputs them after the mmap_sem unlock.
And I do understand the issues discussed.
I just wanted to get more opinions on the proposed patch.

- Dmitry
Kasatkin, Dmitry
2012-04-27 19:15:10 UTC
Permalink
On Fri, Apr 27, 2012 at 9:52 PM, Kasatkin, Dmitry
Post by Kasatkin, Dmitry
Hello,
I do not ignore anything.
I said that we were thinking about a solution that collects the list of
files and fputs them after the mmap_sem unlock.
And I do understand the issues discussed.
I just wanted to get more opinions on the proposed patch.
- Dmitry
For sure we are happy to see the discussion about the problem. Thanks
for your attention to it.
Our target is to solve it in the best possible way.
So let's agree on a solution, and we will be happy to implement the patch.

- Dmitry
James Morris
2012-05-03 04:23:48 UTC
Permalink
Jon, thank you for summarizing the discussion - article
http://lwn.net/Articles/494158/. As Jake Edge's previous LWN article
http://lwn.net/Articles/488906/ said, we've been working on upstreaming
the different integrity components for quite a while. Although
IMA-appraisal isn't the last component, it is a major one, and we were
hoping that it would be upstreamed in the near future. Is 3.5 still a
possibility?
Not sure why you're asking Jon, but the technical issues raised by Al need
to be resolved before the code can be considered for upstreaming.


- James
--
James Morris
<***@namei.org>
Al Viro
2012-04-20 19:37:19 UTC
Permalink
Post by Al Viro
Deferring the final pass after dropping ->mmap_sem is going to be
interesting; what would protect ->vm_next on those suckers? Locking
rules are already bloody complicated in mm/*; this will only add more
subtle fun. Note that right now both the dissolution of ->vm_next
list and freeing of VMAs happen under ->mmap_sem; changing that might
cost us a lot of headache...
BTW, even leaving that aside, we have at least one definite deadlock
(mainline, not on ->mmap_sem - see the mess in drivers/video/msm/mdp.c)
and a possibility of other fun (e.g. any network filesystem that
ends up triggering rtnl_lock() during umount is going to deadlock if
you combine it with sch_atm.c mess where we do fget()/check/fput()
in netlink message handling, close() from another thread racing with
it and descriptor being the only thing that keeps that network fs
alive past lazy umount or namespace dissolution). And then there's
SCM_RIGHTS vs. drivers/vhost/net.c, which might or might not be
deadlockable - I'm not familiar enough with that driver to tell
right now.

_If_ it had been ->mmap_sem alone, the things would be much simpler.
There wouldn't be a problem, to start with - IMA stuff is not in the
mainline and that's the only thing that really wants ->i_mutex in
fput(); the few existing places that grab it in ->release() are racy
anyway and need to be fixed with proper locking.