Discussion:
[PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
David Drysdale
2014-06-05 13:40:32 UTC
Permalink
Resending, adding cc:linux-api.

Also, it may help to add a little more background -- this patch is
needed as a (small) part of implementing Capsicum in the Linux kernel.

Capsicum is a security framework that has been present in FreeBSD since
version 9.0 (Jan 2012), and is based on concepts from object-capability
security [1].

One of the features of Capsicum is capability mode, which locks down
access to global namespaces such as the filesystem hierarchy. In
capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
work -- hence the need for a kernel-space alternative.

[1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf

------

This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem. The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.

Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.

Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns. The current implementation just
defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
it's not possible to fexecve() a file descriptor for a script with
close-on-exec set (which is possible with the implementation here).
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.


Changes since Meredydd's v3 patch:
- Added a selftest.
- Added a man page.
- Left open_exec() signature untouched to reduce patch impact
elsewhere (as suggested by Al Viro).
- Filled in bprm->filename with d_path() into a buffer, to avoid use
of potentially-ephemeral dentry->d_name.
- Patch against v3.14 (455c6fdbd21916).


David Drysdale (2):
syscalls,x86: implement execveat() system call
syscalls,x86: add selftest for execveat(2)

arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 ++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 153 ++++++++++++++++---
include/linux/compat.h | 3 +
include/linux/sched.h | 4 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/exec/.gitignore | 6 +
tools/testing/selftests/exec/Makefile | 32 ++++
tools/testing/selftests/exec/execveat.c | 251 ++++++++++++++++++++++++++++++++
18 files changed, 476 insertions(+), 23 deletions(-)
create mode 100644 tools/testing/selftests/exec/.gitignore
create mode 100644 tools/testing/selftests/exec/Makefile
create mode 100644 tools/testing/selftests/exec/execveat.c

--
1.9.1.423.g4596e3a
David Drysdale
2014-06-05 13:40:34 UTC
Permalink
Signed-off-by: David Drysdale <***@google.com>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/exec/.gitignore | 6 +
tools/testing/selftests/exec/Makefile | 32 ++++
tools/testing/selftests/exec/execveat.c | 251 ++++++++++++++++++++++++++++++++
4 files changed, 290 insertions(+)
create mode 100644 tools/testing/selftests/exec/.gitignore
create mode 100644 tools/testing/selftests/exec/Makefile
create mode 100644 tools/testing/selftests/exec/execveat.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32487ed18354..26b7e6410fcd 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -10,6 +10,7 @@ TARGETS += timers
TARGETS += vm
TARGETS += powerpc
TARGETS += user
+TARGETS += exec

all:
for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
new file mode 100644
index 000000000000..349a786899a7
--- /dev/null
+++ b/tools/testing/selftests/exec/.gitignore
@@ -0,0 +1,6 @@
+subdir*
+script*
+execveat
+execveat.symlink
+execveat.moved
+execveat.ephemeral
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
new file mode 100644
index 000000000000..55771ed05dda
--- /dev/null
+++ b/tools/testing/selftests/exec/Makefile
@@ -0,0 +1,32 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = execveat
+DEPS = execveat.symlink script execveat.ephemeral script.ephemeral subdir subdir.ephemeral subdir.ephemeral/script
+all: $(BINARIES) $(DEPS)
+
+subdir:
+ mkdir -p $@
+subdir.ephemeral:
+ mkdir -p $@
+subdir.ephemeral/script: | subdir.ephemeral
+ echo '#!/bin/sh' > $@
+ echo 'exit $$*' >> $@
+ chmod +x $@
+script:
+ echo '#!/bin/sh' > $@
+ echo 'exit $$*' >> $@
+ chmod +x $@
+script.ephemeral: script
+ cp $< $@
+execveat.symlink: execveat
+ ln -s -f $< $@
+execveat.ephemeral: execveat
+ cp $< $@
+%: %.c
+ $(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+ ./execveat
+
+clean:
+ rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved
diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
new file mode 100644
index 000000000000..f67862fa3f38
--- /dev/null
+++ b/tools/testing/selftests/exec/execveat.c
@@ -0,0 +1,251 @@
+/*
+ * Copyright (c) 2014 Google, Inc.
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Selftests for execveat(2).
+ */
+
+#define _GNU_SOURCE /* to get O_PATH */
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <limits.h>
+#include <stdlib.h>
+
+static char *envp[] = { "IN_TEST=yes", NULL };
+static char *argv[] = { "execveat", "99", NULL };
+
+static int execveat_(int fd, const char *path, char **argv, char **envp,
+ int flags)
+{
+#ifdef __NR_execveat
+ return syscall(__NR_execveat, fd, path, argv, envp, flags);
+#else
+ errno = -ENOSYS;
+ return -1;
+#endif
+}
+
+#define check_execveat_fail(fd, path, flags, errno) \
+ _check_execveat_fail(fd, path, flags, errno, #errno)
+static int _check_execveat_fail(int fd, const char *path, int flags,
+ int expected_errno, const char *errno_str)
+{
+ errno = 0;
+ printf("Check failure of execveat(%d, '%s', %d) with %s... ",
+ fd, path?:"(null)", flags, errno_str);
+ int rc = execveat_(fd, path, argv, envp, flags);
+ if (rc > 0) {
+ printf("[FAIL] (unexpected success from execveat(2))\n");
+ return 1;
+ }
+ if (errno != expected_errno) {
+ printf("[FAIL] (expected errno %d (%s) not %d (%s)\n",
+ expected_errno, strerror(expected_errno),
+ errno, strerror(errno));
+ return 1;
+ }
+ printf("[OK]\n");
+ return 0;
+}
+
+static int check_execveat_invoked_rc(int fd, const char *path, int flags,
+ int expected_rc)
+{
+ int status;
+ int rc;
+ pid_t child;
+ printf("Check success of execveat(%d, '%s', %d)... ",
+ fd, path?:"(null)", flags);
+ child = fork();
+ if (child < 0) {
+ printf("[FAIL] (fork() failed)\n");
+ return 1;
+ }
+ if (child == 0) {
+ /* Child: do execveat(). */
+ rc = execveat_(fd, path, argv, envp, flags);
+ printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n",
+ rc, errno, strerror(errno));
+ exit(1); /* should not reach here */
+ }
+ /* Parent: wait for & check child's exit status. */
+ rc = waitpid(child, &status, 0);
+ if (rc != child) {
+ printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc);
+ return 1;
+ }
+ if (!WIFEXITED(status)) {
+ printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n",
+ child, status);
+ return 1;
+ }
+ if (WEXITSTATUS(status) != expected_rc) {
+ printf("[FAIL] (child %d exited with %d not %d)\n",
+ child, WEXITSTATUS(status), expected_rc);
+ return 1;
+ }
+ printf("[OK]\n");
+ return 0;
+}
+
+static int check_execveat(int fd, const char *path, int flags)
+{
+ return check_execveat_invoked_rc(fd, path, flags, 99);
+}
+
+static char *concat(const char *left, const char *right)
+{
+ char *result = malloc(strlen(left) + strlen(right) + 1);
+ strcpy(result, left);
+ strcat(result, right);
+ return result;
+}
+
+static int open_or_die(const char *filename, int flags)
+{
+ int fd = open(filename, flags);
+ if (fd < 0) {
+ printf("Failed to open '%s'; "
+ "check prerequisites are available\n", filename);
+ exit(1);
+ }
+ return fd;
+}
+
+static int run_tests(void)
+{
+ int fail = 0;
+ char *fullname = realpath("execveat", NULL);
+ char *fullname_script = realpath("script", NULL);
+ char *fullname_symlink = concat(fullname, ".symlink");
+ int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY);
+ int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
+ O_DIRECTORY|O_RDONLY);
+ int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+ int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
+ int fd = open_or_die("execveat", O_RDONLY);
+ int fd_path = open_or_die("execveat", O_RDONLY|O_PATH);
+ int fd_symlink = open_or_die("execveat.symlink", O_RDONLY);
+ int fd_script = open_or_die("script", O_RDONLY);
+ int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY);
+ int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY);
+
+ /* Normal executable file: */
+ /* dfd + path */
+ fail |= check_execveat(subdir_dfd, "../execveat", 0);
+ fail |= check_execveat(dot_dfd, "execveat", 0);
+ fail |= check_execveat(dot_dfd_path, "execveat", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname, 0);
+ /* absolute path with nonsense dfd */
+ fail |= check_execveat(99, fullname, 0);
+ /* fd + no path */
+ fail |= check_execveat(fd, NULL, 0);
+
+ /* Mess with executable file that's already open: */
+ /* fd + no path to a file that's been renamed */
+ rename("execveat.ephemeral", "execveat.moved");
+ fail |= check_execveat(fd_ephemeral, NULL, 0);
+ /* fd + no path to a file that's been deleted */
+ unlink("execveat.moved"); /* remove the file now fd open */
+ fail |= check_execveat(fd_ephemeral, NULL, 0);
+
+ /* Symlink to executable file: */
+ /* dfd + path */
+ fail |= check_execveat(dot_dfd, "execveat.symlink", 0);
+ fail |= check_execveat(dot_dfd_path, "execveat.symlink", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname_symlink, 0);
+ /* fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */
+ fail |= check_execveat(fd_symlink, NULL, 0);
+ fail |= check_execveat(fd_symlink, NULL, AT_SYMLINK_NOFOLLOW);
+
+ /* Symlink fails when AT_SYMLINK_NOFOLLOW set: */
+ /* dfd + path */
+ fail |= check_execveat_fail(dot_dfd, "execveat.symlink",
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+ fail |= check_execveat_fail(dot_dfd_path, "execveat.symlink",
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+ /* absolute path */
+ fail |= check_execveat_fail(AT_FDCWD, fullname_symlink,
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+
+ /* Shell script wrapping executable file: */
+ /* dfd + path */
+ fail |= check_execveat(subdir_dfd, "../script", 0);
+ fail |= check_execveat(dot_dfd, "script", 0);
+ fail |= check_execveat(dot_dfd_path, "script", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname_script, 0);
+ /* fd + no path */
+ fail |= check_execveat(fd_script, NULL, 0);
+ fail |= check_execveat(fd_script, NULL, AT_SYMLINK_NOFOLLOW);
+
+ /* Mess with script file that's already open: */
+ /* fd + no path to a file that's been renamed */
+ rename("script.ephemeral", "script.moved");
+ fail |= check_execveat(fd_script_ephemeral, NULL, 0);
+ /* fd + no path to a file that's been deleted */
+ unlink("script.moved"); /* remove the file while fd open */
+ /* Shell attempts to load the deleted file but fails => rc=127 */
+ fail |= check_execveat_invoked_rc(fd_script_ephemeral, NULL, 0, 127);
+
+ /* Rename a subdirectory in the path: */
+ rename("subdir.ephemeral", "subdir.moved");
+ fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+ fail |= check_execveat(subdir_dfd_ephemeral, "script", 0);
+ /* Remove the subdir and its contents */
+ unlink("subdir.moved/script");
+ unlink("subdir.moved");
+ /* Shell loads via deleted subdir OK because name starts with .. */
+ fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+ fail |= check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT);
+
+ /* Flag values other than AT_SYMLINK_NOFOLLOW => EINVAL */
+ fail |= check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL);
+ /* Invalid path => ENOENT */
+ fail |= check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT);
+ fail |= check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT);
+ fail |= check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT);
+ /* Attempt to execute directory => EACCES */
+ fail |= check_execveat_fail(dot_dfd, NULL, 0, EACCES);
+ /* Attempt to execute non-executable => EACCES */
+ fail |= check_execveat_fail(dot_dfd, "Makefile", 0, EACCES);
+ /* Attempt to execute file opened with O_PATH => EBADF */
+ fail |= check_execveat_fail(dot_dfd_path, NULL, 0, EBADF);
+ fail |= check_execveat_fail(fd_path, NULL, 0, EBADF);
+ /* Attempt to execute nonsense FD => EBADF */
+ fail |= check_execveat_fail(99, NULL, 0, EBADF);
+ fail |= check_execveat_fail(99, "execveat", 0, EBADF);
+ /* Attempt to execute relative to non-directory => ENOTDIR */
+ fail |= check_execveat_fail(fd, "execveat", 0, ENOTDIR);
+
+ return fail ? -1 : 0;
+}
+
+int main(int argc, char **argv)
+{
+ if (argc >= 2) {
+ /* If we are invoked with an argument, exit immediately. */
+ /* Check expected environment transferred. */
+ if (strcmp(getenv("IN_TEST"), "yes") != 0) {
+ printf("[FAIL] (no IN_TEST=yes in env)\n");
+ return 1;
+ }
+
+ /* Use the argument as an exit code. */
+ int rc = atoi(argv[1]);
+ fflush(stdout);
+ return rc;
+ } else {
+ return run_tests();
+ }
+}
--
1.9.1.423.g4596e3a
David Drysdale
2014-06-05 13:40:35 UTC
Permalink
Signed-off-by: David Drysdale <***@google.com>
---
man2/execveat.2 | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 132 insertions(+)
create mode 100644 man2/execveat.2

diff --git a/man2/execveat.2 b/man2/execveat.2
new file mode 100644
index 000000000000..fbbb9f824290
--- /dev/null
+++ b/man2/execveat.2
@@ -0,0 +1,132 @@
+.\" Copyright (c) 2014 Google, Inc.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
+.SH NAME
+execveat \- execute program relative to a directory file descriptor
+.SH SYNOPSIS
+.B #include <unistd.h>
+.sp
+.BI "int execve(int " fd ", const char *" pathname ","
+.br
+.BI " char *const " argv "[], char *const " envp "[],"
+.br
+.BI " int " flags);
+.SH DESCRIPTION
+The
+.BR execveat ()
+system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
+The
+.BR execveat ()
+system call operates in exactly the same way as
+.BR execve (2),
+except for the differences described in this manual page.
+
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I fd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR execve (2)
+for a relative pathname).
+
+If
+.I pathname
+is relative and
+.I fd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR execve (2)).
+
+If
+.I pathname
+is absolute, then
+.I fd
+is ignored.
+
+If
+.I pathname
+is NULL, then the file descriptor
+.I fd
+specifies the file to be executed.
+
+.I flags
+can either be 0, or include the following flag:
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If the file identified by
+.I fd
+and a non-NULL
+.I pathname
+is a symbolic link, then the call fails with the error
+.BR EINVAL .
+.SH "RETURN VALUE"
+On success,
+.BR execveat ()
+does not return. On error \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The same errors that occur for
+.BR execve (2)
+can also occur for
+.BR execveat ().
+The following additional errors can occur for
+.BR execveat ():
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ENOTDIR
+.I pathname
+is relative and
+.I fd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR execveat ()
+was added to Linux in kernel 3.???.
+.SH NOTES
+In addition to the reasons explained in
+.BR openat (2),
+the
+.BR execveat ()
+system call is also needed to allow
+.BR fexecve (3)
+to be implemented on systems that do not have the
+.I /proc
+filesystem mounted.
+.SH SEE ALSO
+.BR execve (2),
+.BR fexecve (3)
--
1.9.1.423.g4596e3a
David Drysdale
2014-06-05 13:40:33 UTC
Permalink
This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem. The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.

Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.

Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns. The current implementation just
defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
it's not possible to fexecve() a file descriptor for a script with
close-on-exec set (which is possible with the implementation here).
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.

This patch in particular (of 3 = 2 kernel + 1 man-pages):

Adds the new system execveat(2) syscall. execveat() is to execve() as
openat() is to open(): it takes a file descriptor that refers to a
directory, and resolves the filename relative to that.

In addition, if the filename is NULL, execveat() executes the file
to which the file descriptor refers. This replicates the functionality
of fexecve(), which is a system call in other UNIXen, but in Linux
glibc it depends on /proc being mounted.

Only x86-64, i386 and x32 ABIs are supported in this patch.

Based on patches by Meredydd Luff <***@senatehouse.org>

Signed-off-by: David Drysdale <***@google.com>
---
arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 +++++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 153 ++++++++++++++++++++++++++++++++------
include/linux/compat.h | 3 +
include/linux/sched.h | 4 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
14 files changed, 186 insertions(+), 23 deletions(-)

diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
index 5d7b381da692..2eccc8932ae6 100644
--- a/arch/x86/ia32/audit.c
+++ b/arch/x86/ia32/audit.c
@@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
case __NR_socketcall:
return 4;
case __NR_execve:
+ case __NR_execveat:
return 5;
default:
return 1;
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb05023c..2516c09743e0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -464,6 +464,7 @@ GLOBAL(\label)
PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
PTREGSCALL stub32_sigreturn, sys32_sigreturn
PTREGSCALL stub32_execve, compat_sys_execve
+ PTREGSCALL stub32_execveat, compat_sys_execveat
PTREGSCALL stub32_fork, sys_fork
PTREGSCALL stub32_vfork, sys_vfork

diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
index 06d3e5a14d9d..f3672508b249 100644
--- a/arch/x86/kernel/audit_64.c
+++ b/arch/x86/kernel/audit_64.c
@@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
case __NR_openat:
return 3;
case __NR_execve:
+ case __NR_execveat:
return 5;
default:
return 0;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..f9a6c2fdda15 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -872,6 +872,20 @@ ENTRY(stub_execve)
CFI_ENDPROC
END(stub_execve)

+ENTRY(stub_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_execveat)
+
/*
* sigreturn is special because it needs to restore all registers on return.
* This cannot be done with SYSRET, so use the IRET return path instead.
@@ -917,6 +931,20 @@ ENTRY(stub_x32_execve)
CFI_ENDPROC
END(stub_x32_execve)

+ENTRY(stub_x32_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call compat_sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_x32_execveat)
+
#endif

/*
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506ac6de..2ab0712a0e7c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@
350 i386 finit_module sys_finit_module
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
+353 i386 execveat sys_execveat stub32_execveat
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc7ccea..2e4058c14b4f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
313 common finit_module sys_finit_module
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
+316 64 execveat stub_execveat

#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -358,3 +359,4 @@
540 x32 process_vm_writev compat_sys_process_vm_writev
541 x32 setsockopt compat_sys_setsockopt
542 x32 getsockopt compat_sys_getsockopt
+543 x32 execveat stub_x32_execveat
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723070ca..20c3649d0691 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -31,6 +31,7 @@
#define stub_fork sys_fork
#define stub_vfork sys_vfork
#define stub_execve sys_execve
+#define stub_execveat sys_execveat
#define stub_rt_sigreturn sys_rt_sigreturn

#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
diff --git a/fs/exec.c b/fs/exec.c
index 3d78fccdd723..a8676ce571ce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -748,7 +748,22 @@ EXPORT_SYMBOL(setup_arg_pages);

#endif /* CONFIG_MMU */

-static struct file *do_open_exec(struct filename *name)
+/*
+ * Perform the extra checks that open_exec() needs over and above a normal
+ * open.
+ */
+static int check_exec_and_deny_write(struct file *file)
+{
+ if (!S_ISREG(file_inode(file)->i_mode))
+ return -EACCES;
+
+ if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+ return -EACCES;
+
+ return deny_write_access(file);
+}
+
+static struct file *do_open_execat(int fd, struct filename *name, int flags)
{
struct file *file;
int err;
@@ -758,24 +773,42 @@ static struct file *do_open_exec(struct filename *name)
.intent = LOOKUP_OPEN,
.lookup_flags = LOOKUP_FOLLOW,
};
+ static const struct open_flags open_exec_nofollow_flags = {
+ .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+ .acc_mode = MAY_EXEC | MAY_OPEN,
+ .intent = LOOKUP_OPEN,
+ .lookup_flags = 0,
+ };

- file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
- if (IS_ERR(file))
- goto out;
-
- err = -EACCES;
- if (!S_ISREG(file_inode(file)->i_mode))
- goto exit;
+ if (flags & ~AT_SYMLINK_NOFOLLOW)
+ return ERR_PTR(-EINVAL);

- if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
- goto exit;
+ if (name) {
+ const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
+ ? &open_exec_nofollow_flags
+ : &open_exec_flags);

- fsnotify_open(file);
+ file = do_filp_open(fd, name, oflags);
+ if (IS_ERR(file))
+ goto out;
+ } else {
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+
+ err = inode_permission(file->f_path.dentry->d_inode,
+ open_exec_flags.acc_mode);
+ if (err)
+ goto exit;
+ }

- err = deny_write_access(file);
+ err = check_exec_and_deny_write(file);
if (err)
goto exit;

+ if (name)
+ fsnotify_open(file);
+
out:
return file;

@@ -787,7 +820,7 @@ exit:
struct file *open_exec(const char *name)
{
struct filename tmp = { .name = name };
- return do_open_exec(&tmp);
+ return do_open_execat(AT_FDCWD, &tmp, 0);
}
EXPORT_SYMBOL(open_exec);

@@ -1437,10 +1470,12 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int do_execve_common(struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp)
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
{
+ char *pathbuf = NULL;
struct linux_binprm *bprm;
struct file *file;
struct files_struct *displaced;
@@ -1481,7 +1516,7 @@ static int do_execve_common(struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;

- file = do_open_exec(filename);
+ file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
sched_exec();

bprm->file = file;
- bprm->filename = bprm->interp = filename->name;
+ if (filename && fd == AT_FDCWD) {
+ bprm->filename = filename->name;
+ } else {
+ pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
+ if (!pathbuf) {
+ retval = -ENOMEM;
+ goto out_unmark;
+ }
+ bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
+ if (IS_ERR(bprm->filename)) {
+ retval = PTR_ERR(bprm->filename);
+ goto out_unmark;
+ }
+ }
+ bprm->interp = bprm->filename;

retval = bprm_mm_init(bprm);
if (retval)
@@ -1530,7 +1579,8 @@ static int do_execve_common(struct filename *filename,
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
- putname(filename);
+ if (filename)
+ putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1547,12 +1597,14 @@ out_unmark:

out_free:
free_bprm(bprm);
+ kfree(pathbuf);

out_files:
if (displaced)
reset_files_struct(displaced);
out_ret:
- putname(filename);
+ if (filename)
+ putname(filename);
return retval;
}

@@ -1562,7 +1614,17 @@ int do_execve(struct filename *filename,
{
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+int do_execveat(int fd, struct filename *filename,
+ const char __user *const __user *__argv,
+ const char __user *const __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = { .ptr.native = __argv };
+ struct user_arg_ptr envp = { .ptr.native = __envp };
+ return do_execveat_common(fd, filename, argv, envp, flags);
}

#ifdef CONFIG_COMPAT
@@ -1578,7 +1640,23 @@ static int compat_do_execve(struct filename *filename,
.is_compat = true,
.ptr.compat = __envp,
};
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+static int compat_do_execveat(int fd, struct filename *filename,
+ const compat_uptr_t __user *__argv,
+ const compat_uptr_t __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = {
+ .is_compat = true,
+ .ptr.compat = __argv,
+ };
+ struct user_arg_ptr envp = {
+ .is_compat = true,
+ .ptr.compat = __envp,
+ };
+ return do_execveat_common(fd, filename, argv, envp, flags);
}
#endif

@@ -1618,6 +1696,22 @@ SYSCALL_DEFINE3(execve,
{
return do_execve(getname(filename), argv, envp);
}
+
+SYSCALL_DEFINE5(execveat,
+ int, fd, const char __user *, filename,
+ const char __user *const __user *, argv,
+ const char __user *const __user *, envp,
+ int, flags)
+{
+ struct filename *path = NULL;
+ if (filename) {
+ path = getname(filename);
+ if (IS_ERR(path))
+ return PTR_ERR(path);
+ }
+ return do_execveat(fd, path, argv, envp, flags);
+}
+
#ifdef CONFIG_COMPAT
asmlinkage long compat_sys_execve(const char __user * filename,
const compat_uptr_t __user * argv,
@@ -1625,4 +1719,19 @@ asmlinkage long compat_sys_execve(const char __user * filename,
{
return compat_do_execve(getname(filename), argv, envp);
}
+
+asmlinkage long compat_sys_execveat(int fd,
+ const char __user *filename,
+ const compat_uptr_t __user *argv,
+ const compat_uptr_t __user *envp,
+ int flags)
+{
+ struct filename *path = NULL;
+ if (filename) {
+ path = getname(filename);
+ if (IS_ERR(path))
+ return PTR_ERR(path);
+ }
+ return compat_do_execveat(fd, path, argv, envp, flags);
+}
#endif
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 3f448c65511b..e875a5d97e08 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -341,6 +341,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);

asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+ const compat_uptr_t __user *argv,
+ const compat_uptr_t __user *envp, int flags);

asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
compat_ulong_t __user *outp, compat_ulong_t __user *exp,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec1cd0b..92601818b4fb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2315,6 +2315,10 @@ extern int disallow_signal(int);
extern int do_execve(struct filename *,
const char __user * const __user *,
const char __user * const __user *);
+extern int do_execveat(int, struct filename *,
+ const char __user * const __user *,
+ const char __user * const __user *,
+ int);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
struct task_struct *fork_idle(int);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a747a77ea584..309a810cf39d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -855,4 +855,8 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_execveat(int dfd, const char __user *filename,
+ const char __user *const __user *argv,
+ const char __user *const __user *envp, int flags);
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dde8041f40d2..4231bca3f95e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -696,9 +696,11 @@ __SYSCALL(__NR_finit_module, sys_finit_module)
__SYSCALL(__NR_sched_setattr, sys_sched_setattr)
#define __NR_sched_getattr 275
__SYSCALL(__NR_sched_getattr, sys_sched_getattr)
+#define __NR_execveat 276
+__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)

#undef __NR_syscalls
-#define __NR_syscalls 276
+#define __NR_syscalls 277

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052284fd..5ea7b8ab9e63 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -209,3 +209,6 @@ cond_syscall(compat_sys_open_by_handle_at);

/* compare kernel pointers */
cond_syscall(sys_kcmp);
+
+/* execveat */
+cond_syscall(sys_execveat);
diff --git a/lib/audit.c b/lib/audit.c
index 76bbed4a20e5..712456ed5960 100644
--- a/lib/audit.c
+++ b/lib/audit.c
@@ -48,6 +48,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
case __NR_socketcall:
return 4;
#endif
+#ifdef __NR_execveat
+ case __NR_execveat:
+#endif
case __NR_execve:
return 5;
default:
--
1.9.1.423.g4596e3a
Kees Cook
2014-06-23 18:39:28 UTC
Permalink
Post by David Drysdale
This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem. The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.
Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.
Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns. The current implementation just
defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
it's not possible to fexecve() a file descriptor for a script with
close-on-exec set (which is possible with the implementation here).
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
Adds the new system execveat(2) syscall. execveat() is to execve() as
openat() is to open(): it takes a file descriptor that refers to a
directory, and resolves the filename relative to that.
In addition, if the filename is NULL, execveat() executes the file
to which the file descriptor refers. This replicates the functionality
of fexecve(), which is a system call in other UNIXen, but in Linux
glibc it depends on /proc being mounted.
Only x86-64, i386 and x32 ABIs are supported in this patch.
Hi Al,

Any thoughts on this? I think it would be quite handy.

-Kees
Post by David Drysdale
---
arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 +++++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 153 ++++++++++++++++++++++++++++++++------
include/linux/compat.h | 3 +
include/linux/sched.h | 4 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
14 files changed, 186 insertions(+), 23 deletions(-)
diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
index 5d7b381da692..2eccc8932ae6 100644
--- a/arch/x86/ia32/audit.c
+++ b/arch/x86/ia32/audit.c
@@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
return 4;
return 5;
return 1;
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb05023c..2516c09743e0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -464,6 +464,7 @@ GLOBAL(\label)
PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
PTREGSCALL stub32_sigreturn, sys32_sigreturn
PTREGSCALL stub32_execve, compat_sys_execve
+ PTREGSCALL stub32_execveat, compat_sys_execveat
PTREGSCALL stub32_fork, sys_fork
PTREGSCALL stub32_vfork, sys_vfork
diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
index 06d3e5a14d9d..f3672508b249 100644
--- a/arch/x86/kernel/audit_64.c
+++ b/arch/x86/kernel/audit_64.c
@@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
return 3;
return 5;
return 0;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..f9a6c2fdda15 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -872,6 +872,20 @@ ENTRY(stub_execve)
CFI_ENDPROC
END(stub_execve)
+ENTRY(stub_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_execveat)
+
/*
* sigreturn is special because it needs to restore all registers on return.
* This cannot be done with SYSRET, so use the IRET return path instead.
@@ -917,6 +931,20 @@ ENTRY(stub_x32_execve)
CFI_ENDPROC
END(stub_x32_execve)
+ENTRY(stub_x32_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call compat_sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_x32_execveat)
+
#endif
/*
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506ac6de..2ab0712a0e7c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@
350 i386 finit_module sys_finit_module
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
+353 i386 execveat sys_execveat stub32_execveat
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc7ccea..2e4058c14b4f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
313 common finit_module sys_finit_module
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
+316 64 execveat stub_execveat
#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -358,3 +359,4 @@
540 x32 process_vm_writev compat_sys_process_vm_writev
541 x32 setsockopt compat_sys_setsockopt
542 x32 getsockopt compat_sys_getsockopt
+543 x32 execveat stub_x32_execveat
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723070ca..20c3649d0691 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -31,6 +31,7 @@
#define stub_fork sys_fork
#define stub_vfork sys_vfork
#define stub_execve sys_execve
+#define stub_execveat sys_execveat
#define stub_rt_sigreturn sys_rt_sigreturn
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
diff --git a/fs/exec.c b/fs/exec.c
index 3d78fccdd723..a8676ce571ce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -748,7 +748,22 @@ EXPORT_SYMBOL(setup_arg_pages);
#endif /* CONFIG_MMU */
-static struct file *do_open_exec(struct filename *name)
+/*
+ * Perform the extra checks that open_exec() needs over and above a normal
+ * open.
+ */
+static int check_exec_and_deny_write(struct file *file)
+{
+ if (!S_ISREG(file_inode(file)->i_mode))
+ return -EACCES;
+
+ if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+ return -EACCES;
+
+ return deny_write_access(file);
+}
+
+static struct file *do_open_execat(int fd, struct filename *name, int flags)
{
struct file *file;
int err;
@@ -758,24 +773,42 @@ static struct file *do_open_exec(struct filename *name)
.intent = LOOKUP_OPEN,
.lookup_flags = LOOKUP_FOLLOW,
};
+ static const struct open_flags open_exec_nofollow_flags = {
+ .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+ .acc_mode = MAY_EXEC | MAY_OPEN,
+ .intent = LOOKUP_OPEN,
+ .lookup_flags = 0,
+ };
- file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
- if (IS_ERR(file))
- goto out;
-
- err = -EACCES;
- if (!S_ISREG(file_inode(file)->i_mode))
- goto exit;
+ if (flags & ~AT_SYMLINK_NOFOLLOW)
+ return ERR_PTR(-EINVAL);
- if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
- goto exit;
+ if (name) {
+ const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
+ ? &open_exec_nofollow_flags
+ : &open_exec_flags);
- fsnotify_open(file);
+ file = do_filp_open(fd, name, oflags);
+ if (IS_ERR(file))
+ goto out;
+ } else {
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+
+ err = inode_permission(file->f_path.dentry->d_inode,
+ open_exec_flags.acc_mode);
+ if (err)
+ goto exit;
+ }
- err = deny_write_access(file);
+ err = check_exec_and_deny_write(file);
if (err)
goto exit;
+ if (name)
+ fsnotify_open(file);
+
return file;
struct file *open_exec(const char *name)
{
struct filename tmp = { .name = name };
- return do_open_exec(&tmp);
+ return do_open_execat(AT_FDCWD, &tmp, 0);
}
EXPORT_SYMBOL(open_exec);
@@ -1437,10 +1470,12 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int do_execve_common(struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp)
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
{
+ char *pathbuf = NULL;
struct linux_binprm *bprm;
struct file *file;
struct files_struct *displaced;
@@ -1481,7 +1516,7 @@ static int do_execve_common(struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
- file = do_open_exec(filename);
+ file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
sched_exec();
bprm->file = file;
- bprm->filename = bprm->interp = filename->name;
+ if (filename && fd == AT_FDCWD) {
+ bprm->filename = filename->name;
+ } else {
+ pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
+ if (!pathbuf) {
+ retval = -ENOMEM;
+ goto out_unmark;
+ }
+ bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
+ if (IS_ERR(bprm->filename)) {
+ retval = PTR_ERR(bprm->filename);
+ goto out_unmark;
+ }
+ }
+ bprm->interp = bprm->filename;
retval = bprm_mm_init(bprm);
if (retval)
@@ -1530,7 +1579,8 @@ static int do_execve_common(struct filename *filename,
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
- putname(filename);
+ if (filename)
+ putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
free_bprm(bprm);
+ kfree(pathbuf);
if (displaced)
reset_files_struct(displaced);
- putname(filename);
+ if (filename)
+ putname(filename);
return retval;
}
@@ -1562,7 +1614,17 @@ int do_execve(struct filename *filename,
{
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+int do_execveat(int fd, struct filename *filename,
+ const char __user *const __user *__argv,
+ const char __user *const __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = { .ptr.native = __argv };
+ struct user_arg_ptr envp = { .ptr.native = __envp };
+ return do_execveat_common(fd, filename, argv, envp, flags);
}
#ifdef CONFIG_COMPAT
@@ -1578,7 +1640,23 @@ static int compat_do_execve(struct filename *filename,
.is_compat = true,
.ptr.compat = __envp,
};
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+static int compat_do_execveat(int fd, struct filename *filename,
+ const compat_uptr_t __user *__argv,
+ const compat_uptr_t __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = {
+ .is_compat = true,
+ .ptr.compat = __argv,
+ };
+ struct user_arg_ptr envp = {
+ .is_compat = true,
+ .ptr.compat = __envp,
+ };
+ return do_execveat_common(fd, filename, argv, envp, flags);
}
#endif
@@ -1618,6 +1696,22 @@ SYSCALL_DEFINE3(execve,
{
return do_execve(getname(filename), argv, envp);
}
+
+SYSCALL_DEFINE5(execveat,
+ int, fd, const char __user *, filename,
+ const char __user *const __user *, argv,
+ const char __user *const __user *, envp,
+ int, flags)
+{
+ struct filename *path = NULL;
+ if (filename) {
+ path = getname(filename);
+ if (IS_ERR(path))
+ return PTR_ERR(path);
+ }
+ return do_execveat(fd, path, argv, envp, flags);
+}
+
#ifdef CONFIG_COMPAT
asmlinkage long compat_sys_execve(const char __user * filename,
const compat_uptr_t __user * argv,
@@ -1625,4 +1719,19 @@ asmlinkage long compat_sys_execve(const char __user * filename,
{
return compat_do_execve(getname(filename), argv, envp);
}
+
+asmlinkage long compat_sys_execveat(int fd,
+ const char __user *filename,
+ const compat_uptr_t __user *argv,
+ const compat_uptr_t __user *envp,
+ int flags)
+{
+ struct filename *path = NULL;
+ if (filename) {
+ path = getname(filename);
+ if (IS_ERR(path))
+ return PTR_ERR(path);
+ }
+ return compat_do_execveat(fd, path, argv, envp, flags);
+}
#endif
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 3f448c65511b..e875a5d97e08 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -341,6 +341,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+ const compat_uptr_t __user *argv,
+ const compat_uptr_t __user *envp, int flags);
asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
compat_ulong_t __user *outp, compat_ulong_t __user *exp,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec1cd0b..92601818b4fb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2315,6 +2315,10 @@ extern int disallow_signal(int);
extern int do_execve(struct filename *,
const char __user * const __user *,
const char __user * const __user *);
+extern int do_execveat(int, struct filename *,
+ const char __user * const __user *,
+ const char __user * const __user *,
+ int);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
struct task_struct *fork_idle(int);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a747a77ea584..309a810cf39d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -855,4 +855,8 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_execveat(int dfd, const char __user *filename,
+ const char __user *const __user *argv,
+ const char __user *const __user *envp, int flags);
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dde8041f40d2..4231bca3f95e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -696,9 +696,11 @@ __SYSCALL(__NR_finit_module, sys_finit_module)
__SYSCALL(__NR_sched_setattr, sys_sched_setattr)
#define __NR_sched_getattr 275
__SYSCALL(__NR_sched_getattr, sys_sched_getattr)
+#define __NR_execveat 276
+__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
#undef __NR_syscalls
-#define __NR_syscalls 276
+#define __NR_syscalls 277
/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052284fd..5ea7b8ab9e63 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -209,3 +209,6 @@ cond_syscall(compat_sys_open_by_handle_at);
/* compare kernel pointers */
cond_syscall(sys_kcmp);
+
+/* execveat */
+cond_syscall(sys_execveat);
diff --git a/lib/audit.c b/lib/audit.c
index 76bbed4a20e5..712456ed5960 100644
--- a/lib/audit.c
+++ b/lib/audit.c
@@ -48,6 +48,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
return 4;
#endif
+#ifdef __NR_execveat
+#endif
return 5;
--
1.9.1.423.g4596e3a
--
Kees Cook
Chrome OS Security
Kees Cook
2014-06-05 17:14:52 UTC
Permalink
Post by David Drysdale
Resending, adding cc:linux-api.
Also, it may help to add a little more background -- this patch is
needed as a (small) part of implementing Capsicum in the Linux kernel.
Capsicum is a security framework that has been present in FreeBSD since
version 9.0 (Jan 2012), and is based on concepts from object-capability
security [1].
One of the features of Capsicum is capability mode, which locks down
access to global namespaces such as the filesystem hierarchy. In
capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
work -- hence the need for a kernel-space alternative.
[1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
Thanks for reposting!

I think it'd be quite helpful to have this available for very tightly
confined sandboxes. And in a larger sense, Capsicum itself is an
interesting way to do programmatic isolation.

-Kees
--
Kees Cook
Chrome OS Security
Andy Lutomirski
2014-10-17 21:45:03 UTC
Permalink
[Added Eric Biederman, since I think your tree might be a reasonable
route forward for these patches.]
Post by David Drysdale
Resending, adding cc:linux-api.
Also, it may help to add a little more background -- this patch is
needed as a (small) part of implementing Capsicum in the Linux kernel.
Capsicum is a security framework that has been present in FreeBSD since
version 9.0 (Jan 2012), and is based on concepts from object-capability
security [1].
One of the features of Capsicum is capability mode, which locks down
access to global namespaces such as the filesystem hierarchy. In
capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
work -- hence the need for a kernel-space
I just found myself wanting this syscall for another reason: injecting
programs into sandboxes or otherwise heavily locked-down namespaces.

For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.

Is there any reason that these patches can't be merged more or less as
is for 3.19?

--Andy
Post by David Drysdale
[1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
------
This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem. The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.
Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.
Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns. The current implementation just
defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
it's not possible to fexecve() a file descriptor for a script with
close-on-exec set (which is possible with the implementation here).
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
- Added a selftest.
- Added a man page.
- Left open_exec() signature untouched to reduce patch impact
elsewhere (as suggested by Al Viro).
- Filled in bprm->filename with d_path() into a buffer, to avoid use
of potentially-ephemeral dentry->d_name.
- Patch against v3.14 (455c6fdbd21916).
syscalls,x86: implement execveat() system call
syscalls,x86: add selftest for execveat(2)
arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 ++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 153 ++++++++++++++++---
include/linux/compat.h | 3 +
include/linux/sched.h | 4 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/exec/.gitignore | 6 +
tools/testing/selftests/exec/Makefile | 32 ++++
tools/testing/selftests/exec/execveat.c | 251 ++++++++++++++++++++++++++++++++
18 files changed, 476 insertions(+), 23 deletions(-)
create mode 100644 tools/testing/selftests/exec/.gitignore
create mode 100644 tools/testing/selftests/exec/Makefile
create mode 100644 tools/testing/selftests/exec/execveat.c
--
1.9.1.423.g4596e3a
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Andy Lutomirski
AMA Capital Management, LLC
Eric W. Biederman
2014-10-19 00:20:29 UTC
Permalink
Post by Andy Lutomirski
[Added Eric Biederman, since I think your tree might be a reasonable
route forward for these patches.]
Post by David Drysdale
Resending, adding cc:linux-api.
Also, it may help to add a little more background -- this patch is
needed as a (small) part of implementing Capsicum in the Linux kernel.
Capsicum is a security framework that has been present in FreeBSD since
version 9.0 (Jan 2012), and is based on concepts from object-capability
security [1].
One of the features of Capsicum is capability mode, which locks down
access to global namespaces such as the filesystem hierarchy. In
capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
work -- hence the need for a kernel-space
I just found myself wanting this syscall for another reason: injecting
programs into sandboxes or otherwise heavily locked-down namespaces.
For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.
Is there any reason that these patches can't be merged more or less as
is for 3.19?
Yes. There is a silliness in how it implements fexecve. The fexecve
case should be use the empty string "" not a NULL pointer to indication
that. That change will then harmonize execveat with the other ...at
system calls and simplify the code and remove a special case. I believe
using the empty string "" requires implementing the AT_EMPTY_PATH flag.

For sandboxes execveat seems to make a great deal of sense. I can
get the same functionality by passing in a directory file descriptor
calling fchdir and execve so this should not introduce any new security
holes. And using the final file descriptor removes a race.

AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
for exec I don't know what problems it can solve.

Until I am done moving I won't have time to pick this up, and the code
clearly needs another revision but I will be happy to work to see that
we get a sane execveat implemented.

Eric

p.s. I don't believe there are any namespaces issues where doing
something with execveat flags make sense.
Andy Lutomirski
2014-10-19 19:11:52 UTC
Permalink
Post by Eric W. Biederman
Post by Andy Lutomirski
[Added Eric Biederman, since I think your tree might be a reasonable
route forward for these patches.]
Post by David Drysdale
Resending, adding cc:linux-api.
Also, it may help to add a little more background -- this patch is
needed as a (small) part of implementing Capsicum in the Linux kernel.
Capsicum is a security framework that has been present in FreeBSD since
version 9.0 (Jan 2012), and is based on concepts from object-capability
security [1].
One of the features of Capsicum is capability mode, which locks down
access to global namespaces such as the filesystem hierarchy. In
capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
work -- hence the need for a kernel-space
I just found myself wanting this syscall for another reason: injecting
programs into sandboxes or otherwise heavily locked-down namespaces.
For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.
Is there any reason that these patches can't be merged more or less as
is for 3.19?
Yes. There is a silliness in how it implements fexecve. The fexecve
case should be use the empty string "" not a NULL pointer to indication
that. That change will then harmonize execveat with the other ...at
system calls and simplify the code and remove a special case. I believe
using the empty string "" requires implementing the AT_EMPTY_PATH flag.
Sounds reasonable.
Post by Eric W. Biederman
For sandboxes execveat seems to make a great deal of sense. I can
get the same functionality by passing in a directory file descriptor
calling fchdir and execve so this should not introduce any new security
holes. And using the final file descriptor removes a race.
The problem with that approach is that the execed program now has its
current directory outside the sandbox, which could be problematic if
you don't trust that program.
Post by Eric W. Biederman
AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
for exec I don't know what problems it can solve.
It can always be added later
Post by Eric W. Biederman
Until I am done moving I won't have time to pick this up, and the code
clearly needs another revision but I will be happy to work to see that
we get a sane execveat implemented.
Do you have an ETA? If it's likely to miss 3.19, but if you'll have
time to review before then, I can try to do it.
Post by Eric W. Biederman
Eric
p.s. I don't believe there are any namespaces issues where doing
something with execveat flags make sense.
OK, I'll bite. How feasible would it be to have a flag that activated
pid_ns_for_children? That would reduce a lot of the ugliness in tools
like nsenter that need to fork to enter a pid ns.

I always assume that the reason for the active vs. for_children
distinction was because a lot of userspace libraries, including glibc,
would malfunction if getpid(2) started returning a different value.
But, for exec, this doesn't matter.

--Andy
David Drysdale
2014-10-20 13:48:45 UTC
Permalink
On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
Post by Eric W. Biederman
Post by Andy Lutomirski
[Added Eric Biederman, since I think your tree might be a reasonable
route forward for these patches.]
Post by David Drysdale
Resending, adding cc:linux-api.
Also, it may help to add a little more background -- this patch is
needed as a (small) part of implementing Capsicum in the Linux kernel.
Capsicum is a security framework that has been present in FreeBSD since
version 9.0 (Jan 2012), and is based on concepts from object-capability
security [1].
One of the features of Capsicum is capability mode, which locks down
access to global namespaces such as the filesystem hierarchy. In
capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
work -- hence the need for a kernel-space
I just found myself wanting this syscall for another reason: injecting
programs into sandboxes or otherwise heavily locked-down namespaces.
For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.
Is there any reason that these patches can't be merged more or less as
is for 3.19?
Yes. There is a silliness in how it implements fexecve. The fexecve
case should be use the empty string "" not a NULL pointer to indication
that. That change will then harmonize execveat with the other ...at
system calls and simplify the code and remove a special case. I believe
using the empty string "" requires implementing the AT_EMPTY_PATH flag.
Good point -- I'll shift to "" + AT_EMPTY_PATH.
Post by Eric W. Biederman
For sandboxes execveat seems to make a great deal of sense. I can
get the same functionality by passing in a directory file descriptor
calling fchdir and execve so this should not introduce any new security
holes. And using the final file descriptor removes a race.
AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
for exec I don't know what problems it can solve.
Until I am done moving I won't have time to pick this up, and the code
clearly needs another revision but I will be happy to work to see that
we get a sane execveat implemented.
If it helps, I can push out another revision in the next couple of days.
Post by Eric W. Biederman
Eric
p.s. I don't believe there are any namespaces issues where doing
something with execveat flags make sense.
Al Viro
2014-10-19 20:20:34 UTC
Permalink
Post by Andy Lutomirski
For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.
The question I hadn't seen really answered through all of that was how to
deal with #!... "Just use d_path()" isn't particulary appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.

Frankly, I wonder if it would make sense to provide something like
dupfs. We can't mount it by default on /dev/fd (more's the pity), but
it might be a good thing to have.

What it is, for those who are not familiar with Plan 9: a filesystem with
one directory and a bunch of files in it. Directory contents depends on
who's looking; for each opened descriptor in your descriptor table, you'll
see two files there. One series is 0, 1, ... - opening one of those gives
dup(). IOW, it's *not* giving you a new struct file; it gives you a new
reference to existing one, complete with sharing IO position, etc. Another
is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.

It's actually a better match for what one would expect at /dev/fd than what
we do. Example:

; echo 'read i; cat /dev/fd/0; echo "The first line was $i"' >a.sh
; (echo 'line 1';echo 'line 2') >a
; cat a|sh a.sh
line 2
The first line was line 1
; sh a.sh <a
line 1
line 2
The first line was line 1
;

See what's going on? Opening /dev/fd/0 (aka /dev/stdin) does a fresh open
of whatever your stdin is; if it's a pipe - fine, you've just added yourself
as additional reader. But if it's a regular file, you've got yourself
a brand-new opened file, with IO position of its own. Sitting at the
beginning of the file.

Moreover, try that with stdin being a socket and you'll see cat(1) failing
to open that sucker.

We _can't_ blindly replace /dev/fd with it - it has to be a sysadmin choice;
semantics is different. However, there's no reason why it can't be mounted
in environments where you want to avoid procfs - it's certainly exposing less
than procfs would.

And these days we can implement relatively cheaply. It's a window that will
close after a while, but right now we can change ->atomic_open() calling
conventions. Instead of having it return 0 or error, let's switch to returning
NULL, ERR_PTR(error) *or* an extra reference to preexisting struct file.
Same as we did for ->lookup(), and for similar reason.

Right now we have 8 instances of ->atomic_open() and one place calling that
method. Changing the API like that would be trivial (and it's a trivial
conversion - replace return ret; with return ERR_PTR(ret); through all
instances, so any out-of-tree filesystems could follow easily). We certainly
can't do anything of that sort with ->open() - there would be thousands
instances to convert. ->atomic_open(), OTOH, is still new enough for
that to be feasible.

What we get from that conversion is an ability to do dup-style semantics
easily.
* give root directory an ->atomic_open() instance that would be
handling opens.
* make lookups in there fail with ENOENT if you don't have such a
descriptor at the moment. Otherwise bind all of them to the same inode.
The only method it needs is ->getattr(), and that would look into your
descriptor table for descriptor with number derived from dentry (stashed
in ->d_fsdata at lookup time) and do what fstat() would.
* have those dentries always fail ->d_revalidate(), to force
everything towards ->atomic_open().
* for ...ctl names, ->atomic_open() would act in normal fashion;
again, only one inode is needed. ->read() would pick descriptor number
from ->d_fsdata and report on whatever you have with that number at the
time.

I'll try to put a prototype of that together; I think it's at least
interesting to try. And that ought to be safe to mount even in very
restricted environments, making arguments along the lines of "but we can't
get the path by opened file without the big bad wol^Wprocfs and we can't
have that in our environment" much weaker...

Comments?
Andy Lutomirski
2014-10-19 20:37:54 UTC
Permalink
Post by Al Viro
Post by Andy Lutomirski
For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.
The question I hadn't seen really answered through all of that was how to
deal with #!... "Just use d_path()" isn't particulary appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.
Does this matter for absolute paths after #! (or for absolute paths to
ELF interpreters)? Does anyone use relative paths there?

Does execve("/proc/self/fd/N", ...) not work correctly now?
Presumably relative paths should be relative to the execed program, or
maybe there should be a flag to execveat that disallows that behavior
entirely, or maybe it should never work, even through /proc. I don't
really like the idea that an fd pointing at a *file* should allow
access to its directory.
Post by Al Viro
Frankly, I wonder if it would make sense to provide something like
dupfs. We can't mount it by default on /dev/fd (more's the pity), but
it might be a good thing to have.
What it is, for those who are not familiar with Plan 9: a filesystem with
one directory and a bunch of files in it. Directory contents depends on
who's looking; for each opened descriptor in your descriptor table, you'll
see two files there. One series is 0, 1, ... - opening one of those gives
dup(). IOW, it's *not* giving you a new struct file; it gives you a new
reference to existing one, complete with sharing IO position, etc. Another
is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
It's actually a better match for what one would expect at /dev/fd than what
; echo 'read i; cat /dev/fd/0; echo "The first line was $i"' >a.sh
; (echo 'line 1';echo 'line 2') >a
; cat a|sh a.sh
line 2
The first line was line 1
; sh a.sh <a
line 1
line 2
The first line was line 1
;
See what's going on? Opening /dev/fd/0 (aka /dev/stdin) does a fresh open
of whatever your stdin is; if it's a pipe - fine, you've just added yourself
as additional reader. But if it's a regular file, you've got yourself
a brand-new opened file, with IO position of its own. Sitting at the
beginning of the file.
Moreover, try that with stdin being a socket and you'll see cat(1) failing
to open that sucker.
We _can't_ blindly replace /dev/fd with it - it has to be a sysadmin choice;
semantics is different. However, there's no reason why it can't be mounted
in environments where you want to avoid procfs - it's certainly exposing less
than procfs would.
And these days we can implement relatively cheaply. It's a window that will
close after a while, but right now we can change ->atomic_open() calling
conventions. Instead of having it return 0 or error, let's switch to returning
NULL, ERR_PTR(error) *or* an extra reference to preexisting struct file.
Same as we did for ->lookup(), and for similar reason.
Right now we have 8 instances of ->atomic_open() and one place calling that
method. Changing the API like that would be trivial (and it's a trivial
conversion - replace return ret; with return ERR_PTR(ret); through all
instances, so any out-of-tree filesystems could follow easily). We certainly
can't do anything of that sort with ->open() - there would be thousands
instances to convert. ->atomic_open(), OTOH, is still new enough for
that to be feasible.
What we get from that conversion is an ability to do dup-style semantics
easily.
* give root directory an ->atomic_open() instance that would be
handling opens.
* make lookups in there fail with ENOENT if you don't have such a
descriptor at the moment. Otherwise bind all of them to the same inode.
The only method it needs is ->getattr(), and that would look into your
descriptor table for descriptor with number derived from dentry (stashed
in ->d_fsdata at lookup time) and do what fstat() would.
* have those dentries always fail ->d_revalidate(), to force
everything towards ->atomic_open().
* for ...ctl names, ->atomic_open() would act in normal fashion;
again, only one inode is needed. ->read() would pick descriptor number
from ->d_fsdata and report on whatever you have with that number at the
time.
I'll try to put a prototype of that together; I think it's at least
interesting to try. And that ought to be safe to mount even in very
restricted environments, making arguments along the lines of "but we can't
get the path by opened file without the big bad wol^Wprocfs and we can't
have that in our environment" much weaker...
Comments?
I'm not convinced that these semantics are better than /proc/self/fd's
in many contexts. I don't really like the idea that catting some file
can *change* the position of one of my open file descriptors. Also,
for execveat in particular, I want to be able to setns into a
completely unknown namespace and exec something, so a new fs won't
help if it's not mounted.

An alternative solution to proc-lite would be to have a heavily
stripped-down variant of /proc. It could self as a real directory
(not a symlink), with the normal semantics for /proc/self/fd, and it
could have very little else (possibly nothing at all; possibly exe,
root, and cwd).

--Andy
\
Al Viro
2014-10-19 21:29:21 UTC
Permalink
Post by Andy Lutomirski
Post by Al Viro
The question I hadn't seen really answered through all of that was how to
deal with #!... "Just use d_path()" isn't particulary appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.
Does this matter for absolute paths after #! (or for absolute paths to
ELF interpreters)? Does anyone use relative paths there?
It's not about what's after #!; it's what we *append* to what's after #!
that is interesting. Recall how #! works - we turn execve() of something
that starts with e.g. "#!/usr/bin/make -f\n" into execve of /usr/bin/make,
with (/usr/bin/make, -f, name of that file, argv[1]..argv[argc]) as
arguments list. With make(1) doing opening and reading the file, as it
would for any makefile. Or /bin/sh opening and reading the script, etc.

Pathname of interpreter is a non-issue (and ELF ones don't go anywhere
near that path anyway).
Post by Andy Lutomirski
Does execve("/proc/self/fd/N", ...) not work correctly now?
Yes, it does. And if procfs is there, this syscall is completely pointless.
The main argument in favour of adding it to the kernel (rather than doing
in userland) has been "but what of the people who don't want procfs mounted
in there?".
Post by Andy Lutomirski
Presumably relative paths should be relative to the execed program, or
maybe there should be a flag to execveat that disallows that behavior
entirely, or maybe it should never work, even through /proc. I don't
really like the idea that an fd pointing at a *file* should allow
access to its directory.
Huh? What are you talking about? And who the hell cares whether it's
absolute or relative?
Post by Andy Lutomirski
I'm not convinced that these semantics are better than /proc/self/fd's
in many contexts. I don't really like the idea that catting some file
can *change* the position of one of my open file descriptors.
Er... You do realize that if descriptor refers to the same file,
cat <&<that descriptor> will change position of your other descriptor,
right? BTW, on *BSD /dev/stdin *does* have a dup()-style semantics.
They do it at the price of very convoluted code in their VFS, but that's
how it works there. And in Plan 9, and (AFAIK) in Solaris...
Andy Lutomirski
2014-10-19 22:16:03 UTC
Permalink
Post by Al Viro
Post by Andy Lutomirski
Post by Al Viro
The question I hadn't seen really answered through all of that was how to
deal with #!... "Just use d_path()" isn't particulary appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.
Does this matter for absolute paths after #! (or for absolute paths to
ELF interpreters)? Does anyone use relative paths there?
It's not about what's after #!; it's what we *append* to what's after #!
that is interesting. Recall how #! works - we turn execve() of something
that starts with e.g. "#!/usr/bin/make -f\n" into execve of /usr/bin/make,
with (/usr/bin/make, -f, name of that file, argv[1]..argv[argc]) as
arguments list. With make(1) doing opening and reading the file, as it
would for any makefile. Or /bin/sh opening and reading the script, etc.
Pathname of interpreter is a non-issue (and ELF ones don't go anywhere
near that path anyway).
Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
/dev/fd/3? That could be interesting, although I can imagine it
breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
early in boot.

Why does this use case need the dup-style semantics? Merely opening
the same file should work, right?

[deleted my questions based on my misunderstanding of what you were
talking about]
Post by Al Viro
Post by Andy Lutomirski
I'm not convinced that these semantics are better than /proc/self/fd's
in many contexts. I don't really like the idea that catting some file
can *change* the position of one of my open file descriptors.
Er... You do realize that if descriptor refers to the same file,
cat <&<that descriptor> will change position of your other descriptor,
right? BTW, on *BSD /dev/stdin *does* have a dup()-style semantics.
They do it at the price of very convoluted code in their VFS, but that's
how it works there. And in Plan 9, and (AFAIK) in Solaris...
But cat <&3 is shell magic; it isn't actually an invocation of
/bin/cat with some special argument that causes it to poke at some fd.
And bash already supports this kind of use case.

I still don't userstand why the BSD behavior is better here. Also,
this isn't just file *position* -- anything that opens dupfs/N and
uses fcntl with F_SETFL seems likely to cause failures, crashes, or
outright corruption. And there are file locks, which suddenly have
rather odd semantics. (Even with the new OFD locks or with BSD locks,
I wonder how many little programs open argv[1], lock the file, do
something, and then exit. Anything that does that is now leaking a
lock.)

Aside from the general scariness of allowing one process to actually
dup another process's fds, I feel like this is asking for trouble wrt
the various types of file locks.

--Andy
Al Viro
2014-10-19 22:42:51 UTC
Permalink
Post by Andy Lutomirski
Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
/dev/fd/3? That could be interesting, although I can imagine it
breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
early in boot.
Sigh... What I mean is that fexecve(fd, ...) would have to put _something_
into argv when it execs the interpreter of #! file. Simply because the
interpreter (which can be anything whatsoever) has no fscking idea what
to do with some descriptor it has before execve(). Hell, it doesn't have
any idea *which* descriptor had it been.

You need to put some pathname that would yield your script upon open(2).
If you bothered to read those patches, you'd see that they do supply
one, generating it with d_path(). Which isn't particulary reliable.

I'm not sure there's any point putting any of that in the kernel - if
you *do* have that pathname, you can just pass it.
Post by Andy Lutomirski
Aside from the general scariness of allowing one process to actually
dup another process's fds, I feel like this is asking for trouble wrt
the various types of file locks.
Who said anything about another process's fds? That, indeed, would be
a recipe for serious trouble. It's a filesystem with one directory,
not with one directory for each process...

FWIW, they (Plan 9) do have procfs and there they have /proc/<pid>/fd.
Which is a regular file, with contents consisting of \n-terminated
lines (one per descriptor in <pid>'s descriptor table>) in the same
format as in *ctl (they put descriptor number as the first field in
those).

Unlike our solution, they do not allow to get to any process' files via
procfs. They do allow /dev/stdin-style access to your own files via
dupfs. And yes, for /dev/stdin and friends dup-style semantics is better -
you get consistent behaviours for pipes and redirects from file that way.
See the example I've posted upthread. Besides, for things like sockets
our semantics simply fails - they really depend on having only one
struct file for given socket; it's dup or nothing there. The same goes
for a lot of things like eventfd, etc.
Andy Lutomirski
2014-10-19 23:35:05 UTC
Permalink
Post by Al Viro
Post by Andy Lutomirski
Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
/dev/fd/3? That could be interesting, although I can imagine it
breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
early in boot.
Sigh... What I mean is that fexecve(fd, ...) would have to put _something_
into argv when it execs the interpreter of #! file. Simply because the
interpreter (which can be anything whatsoever) has no fscking idea what
to do with some descriptor it has before execve(). Hell, it doesn't have
any idea *which* descriptor had it been.
You need to put some pathname that would yield your script upon open(2).
If you bothered to read those patches, you'd see that they do supply
one, generating it with d_path(). Which isn't particulary reliable.
I'm not sure there's any point putting any of that in the kernel - if
you *do* have that pathname, you can just pass it.
Hmm.

This issue certainly makes fexecve or execveat less attractive, at
least in cases where d_path won't work.

On the other hand, if you want to run a static binary on a cloexec fd
(or, for that matter, a dynamic binary if you trust the interpreter to
close the extra copy of the fd it gets) in a namespace or chroot where
the binary is invisible, then you need kernel help.

It's too bad that script interpreters don't have a mechanism to open
their scripts by fd.
Post by Al Viro
Post by Andy Lutomirski
Aside from the general scariness of allowing one process to actually
dup another process's fds, I feel like this is asking for trouble wrt
the various types of file locks.
Who said anything about another process's fds? That, indeed, would be
a recipe for serious trouble. It's a filesystem with one directory,
not with one directory for each process...
This still has issues with locks if you pass an fd to a child process,
but I guess that you get what you ask for if you do that.
Post by Al Viro
FWIW, they (Plan 9) do have procfs and there they have /proc/<pid>/fd.
Which is a regular file, with contents consisting of \n-terminated
lines (one per descriptor in <pid>'s descriptor table>) in the same
format as in *ctl (they put descriptor number as the first field in
those).
Unlike our solution, they do not allow to get to any process' files via
procfs. They do allow /dev/stdin-style access to your own files via
dupfs. And yes, for /dev/stdin and friends dup-style semantics is better -
you get consistent behaviours for pipes and redirects from file that way.
See the example I've posted upthread. Besides, for things like sockets
our semantics simply fails - they really depend on having only one
struct file for given socket; it's dup or nothing there. The same goes
for a lot of things like eventfd, etc.
Fair enough.

--Andy
H. Peter Anvin
2014-10-19 20:53:58 UTC
Permalink
Post by Al Viro
Post by Andy Lutomirski
For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.
The question I hadn't seen really answered through all of that was how to
deal with #!... "Just use d_path()" isn't particulary appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.
Frankly, I wonder if it would make sense to provide something like
dupfs. We can't mount it by default on /dev/fd (more's the pity), but
it might be a good thing to have.
What it is, for those who are not familiar with Plan 9: a filesystem with
one directory and a bunch of files in it. Directory contents depends on
who's looking; for each opened descriptor in your descriptor table, you'll
see two files there. One series is 0, 1, ... - opening one of those gives
dup(). IOW, it's *not* giving you a new struct file; it gives you a new
reference to existing one, complete with sharing IO position, etc. Another
is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
It's actually a better match for what one would expect at /dev/fd than what
Yes, it is really unfortunate that /proc/self/fd/* had the wrong
semantics for the start, due to then-existing implementation issues.

-hpa
Loading...