Discussion:
device namespaces
riya khanna
2014-09-24 04:29:17 UTC
Thanks for your feedback!

Letting the kernel know about what devices a container could access (based
on device cgroups) and having devtmpfs in the kernel create device nodes
for a container that map to corresponding CUSE nodes is what I thought of.
For example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual
framebuffer (based on real fb0 SCREENINFO properties) for this process
provided permissions allow this operation. To view the framebuffer, the
CUSE based virtual device would talk to the actual hardware. Since
namespaces would have a different view of the underlying devices, "sysfs" has
to be made aware of this as well.
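
As a rough illustration of the "29:0" notation used above: it is a major:minor pair, with 29 being the framebuffer major and 0 selecting fb0. The sketch below only parses such a string and packs it into a dev_t. The /proc/<pid>/devices file is the proposal in this mail, not an existing interface, and the helper names are made up for the example.

```python
import os


def parse_devspec(spec):
    """Split a 'major:minor' string such as '29:0' into two ints."""
    major_text, minor_text = spec.split(":")
    major, minor = int(major_text), int(minor_text)
    if major < 0 or minor < 0:
        raise ValueError(f"negative device number in {spec!r}")
    return major, minor


def devspec_to_dev_t(spec):
    """Pack major and minor into the single dev_t value used for device nodes."""
    major, minor = parse_devspec(spec)
    return os.makedev(major, minor)


# 29 is the framebuffer major, 0 selects fb0 (as in the example above).
dev = devspec_to_dev_t("29:0")
print(os.major(dev), os.minor(dev))
```

A real implementation of the proposed file would also have to check the request against the device cgroup before preparing anything.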

Please let me know your inputs. Thanks again!

-Riya
Hi,
I'm a newbie trying to come up with a fuse/cuse-based solution to
device namespace virtualization.
Fwiw I find the thought of allowing use of cuse from a container (well,
an unprivileged container at least) more than a little bit frightening
from a security perspective. If a process does an ioctl on a cuse-based
device then the process implementing the device can get a very broad
ability to read and write in the initiator's address space. If the
The cuse or fuse process would best run with the permissions of the
container. Even for an unprivileged container it could connect to
bind-mounts of say /dev/null etc for any passthrough access.
device were to show up automagically in devtmpfs and a process on the
host could be tricked into opening the device, then that sounds like a
great vector for an attack. Just something to keep in mind.
Yup. You'd like to think that having the devices be owned by uid 100000
would be a clue, but a script might not notice. The fs should only be
mounted in the container's fs, but that can of course be reached through
/proc/pid/root. Now an unpriv user shouldn't be able to chroot into
there without starting a new user namespace - leaving the victim no
longer privileged and so no more harmful than the user was to begin with.
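
To make the "a script might not notice" point concrete, here is a minimal sketch of the check a careful script could do before trusting a device node: confirm it is a character device with the expected major:minor and the expected owner. The function name and the sample stat values are assumptions for illustration. Real code would pass in the result of os.stat(path), or of os.fstat(fd) on an already opened descriptor.

```python
import os
import stat
from types import SimpleNamespace


def looks_like_expected_device(st, major, minor, owner_uid=0):
    """Return True only if st describes a character device with the expected
    device numbers and owner. st is an os.stat_result, or any object with the
    same st_mode, st_rdev and st_uid fields."""
    if not stat.S_ISCHR(st.st_mode):
        return False
    if (os.major(st.st_rdev), os.minor(st.st_rdev)) != (major, minor):
        return False
    return st.st_uid == owner_uid


# Illustrative stat results for /dev/null (1:3): one owned by root,
# one owned by uid 100000 as in the container case described above.
host_null = SimpleNamespace(
    st_mode=stat.S_IFCHR | 0o666, st_rdev=os.makedev(1, 3), st_uid=0)
container_null = SimpleNamespace(
    st_mode=stat.S_IFCHR | 0o666, st_rdev=os.makedev(1, 3), st_uid=100000)

print(looks_like_expected_device(host_null, 1, 3))
print(looks_like_expected_device(container_null, 1, 3))
```

This only covers the ownership clue. It says nothing about which driver or userspace process actually sits behind the node.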
I don't think it matters if the user is unprivileged if you're using
cuse to implement the devices. In order for it to work the unprivileged
user would need read/write access to /dev/cuse, and once it has that
there seems to be no restrictions on what cuse functionality it can
make use of.
When the user creates a device cuse calls device_add() for the new
device, which is going to create a node in devtmpfs which is owned by
global root. At that point I see nothing that would stop a process in
the host from opening the file and doing ioctls. It looks like it would
even be possible to use cuse to claim a well-known major/minor pair for
your device if it wasn't already claimed (e.g. the driver was a module
and not loaded).
I didn't spend a lot of time looking at the code, so it's possible I
missed something, but if I didn't then giving unprivileged users access
to /dev/cuse seems like a very bad idea.
Ok, agreed. The original author mainly mentioned fuse. I thought fuse
couldn't create device nodes though.
Yeah, but since he did mention cuse I thought I'd throw out a warning.
With fuse it is technically possible to have device nodes, but it's
usually prevented for unprivileged users by the suid helper (fusermount)
adding MS_NODEV to the mountflags. With my patches for fuse in user
namespaces the kernel will add nodev for any userns mount, and from a
security perspective I don't see any way around that.
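
One way to see whether a given mount carries the nodev flag discussed above is to look at its options in /proc/self/mounts. The sketch below parses a single line in that format. The function names and the sample lines are hypothetical, and it deliberately ignores escaped spaces in mount points.

```python
def mount_options(mounts_line):
    """Return the option set from one /proc/self/mounts line.
    Fields are: device, mount point, fs type, options, dump, pass."""
    fields = mounts_line.split()
    return set(fields[3].split(","))


def allows_device_nodes(mounts_line):
    """A mount honours device nodes only if 'nodev' is absent."""
    return "nodev" not in mount_options(mounts_line)


# Made-up sample lines in /proc/self/mounts format.
fuse_line = ("example /mnt/example fuse.example "
             "rw,nosuid,nodev,relatime,user_id=1000,group_id=1000 0 0")
dev_line = "devtmpfs /dev devtmpfs rw,nosuid,relatime 0 0"

print(allows_device_nodes(fuse_line))
print(allows_device_nodes(dev_line))
```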
Seth
_______________________________________________
lxc-devel mailing list
http://lists.linuxcontainers.org/listinfo/lxc-devel
riya khanna
2014-09-24 04:34:46 UTC
(Please pardon multiple emails, artifact of merging all separate
conversations)

Thanks for your feedback!

Letting the kernel know about what devices a container could access (based
on device cgroups) and having devtmpfs in the kernel create device nodes
for a container that map to corresponding CUSE nodes is what I thought of.
For example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual
framebuffer (based on real fb0 SCREENINFO properties) for this process
provided permissions allow this operation. To view the framebuffer, the
CUSE based virtual device would talk to the actual hardware. Since
namespaces would have a different view of the underlying devices, "sysfs" has
to be made aware of this as well.

Please let me know your inputs. Thanks again!

-Riya
Hi,
I'm a newbie trying to come up with a fuse/cuse-based solution to
device namespace virtualization.
Fwiw I find the thought of allowing use of cuse from a container (well,
an unprivileged container at least) more than a little bit frightening
from a security perspective. If a process does an ioctl on a cuse-based
device then the process implementing the device can get a very broad
ability to read and write in the initiator's address space. If the
The cuse or fuse process would best run with the permissions of the
container. Even for an unprivileged container it could connect to
bind-mounts of say /dev/null etc for any passthrough access.
device were to show up automagically in devtmpfs and a process on the
host could be tricked into opening the device, then that sounds like a
great vector for an attack. Just something to keep in mind.
Yup. You'd like to think that having the devices be owned by uid 100000
would be a clue, but a script might not notice. The fs should only be
mounted in the container's fs, but that can of course be reached through
/proc/pid/root. Now an unpriv user shouldn't be able to chroot into
there without starting a new user namespace - leaving the victim no
longer privileged and so no more harmful than the user was to begin with.
I don't think it matters if the user is unprivileged if you're using
cuse to implement the devices. In order for it to work the unprivileged
user would need read/write access to /dev/cuse, and once it has that
there seems to be no restrictions on what cuse functionality it can
make use of.
When the user creates a device cuse calls device_add() for the new
device, which is going to create a node in devtmpfs which is owned by
global root. At that point I see nothing that would stop a process in
the host from opening the file and doing ioctls. It looks like it would
even be possible to use cuse to claim a well-known major/minor pair for
your device if it wasn't already claimed (e.g. the driver was a module
and not loaded).
I didn't spend a lot of time looking at the code, so it's possible I
missed something, but if I didn't then giving unprivileged users access
to /dev/cuse seems like a very bad idea.
Ok, agreed. The original author mainly mentioned fuse. I thought fuse
couldn't create device nodes though.
Yeah, but since he did mention cuse I thought I'd throw out a warning.
With fuse it is technically possible to have device nodes, but it's
usually prevented for unprivileged users by the suid helper (fusermount)
adding MS_NODEV to the mountflags. With my patches for fuse in user
namespaces the kernel will add nodev for any userns mount, and from a
security perspective I don't see any way around that.
Seth
Eric W. Biederman
2014-09-24 05:04:30 UTC
(Please pardon multiple emails, artifact of merging all separate conversations)
Thanks for your feedback!
Letting the kernel know about what devices a container could access (based on
device cgroups) and having devtmpfs in the kernel create device nodes for a
container that map to corresponding CUSE nodes is what I thought of. For
example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual framebuffer
(based on real fb0 SCREENINFO properties) for this process provided permissions
allow this operation. To view the framebuffer, the CUSE based virtual device
would talk to the actual hardware. Since namespaces would have a different view of
the underlying devices, "sysfs" has to be made aware of this as well.
Please let me know your inputs. Thanks again!
The solution hugely depends on what you are trying to do with it.

The situation today is that device nodes are slowly fading out. In
another 20 years linux may not have any device nodes at all.

Therefore the question becomes what are you trying to support.

If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
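
A minimal sketch of that kind of filtering, assuming a container whose /dev directory lives under a host path: build one "mount --bind" command per device that should be visible and leave everything else out. Nothing is executed here. The function name and the rootfs path are made up for the example.

```python
import os


def bind_mount_plan(container_dev, device_names):
    """Build one `mount --bind` argv per device to expose in the container.
    The commands are only returned, not run."""
    plan = []
    for name in device_names:
        source = os.path.join("/dev", name)
        target = os.path.join(container_dev, name)
        plan.append(["mount", "--bind", source, target])
    return plan


# Hypothetical container rootfs; only these three nodes would be exposed.
plan = bind_mount_plan("/var/lib/example/rootfs/dev", ["null", "zero", "fb0"])
for argv in plan:
    print(" ".join(argv))
```

Each target has to exist as an empty file before the bind mount, and running the commands needs the privileges to mount inside that tree.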

If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.

There are a few cases where it is desirable to emulate what devpts
does for allowing arbitrary users to create virtual devices in the
kernel. Loop devices in particular.

Ultimately given the existence of device hotplug I don't see any call
for being able to create device nodes with well known device numbers
(fundamentally what a device namespace would be about).

The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire I think it is entirely reasonable to add support for multiplexing
to the kernel device type by device type, or potentially just use fuse
or cuse to implement your multiplexer in userspace, but that has the
potential to be unusably slow.

Eric
riya khanna
2014-09-24 05:32:27 UTC
My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)? Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility. I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.
Post by riya khanna
(Please pardon multiple emails, artifact of merging all separate
conversations)
Thanks for your feedback!
Letting the kernel know about what devices a container could access (based on
device cgroups) and having devtmpfs in the kernel create device nodes for a
container that map to corresponding CUSE nodes is what I thought of. For
example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual framebuffer
(based on real fb0 SCREENINFO properties) for this process provided permissions
allow this operation. To view the framebuffer, the CUSE based virtual device
would talk to the actual hardware. Since namespaces would have a different view of
the underlying devices, "sysfs" has to be made aware of this as well.
Please let me know your inputs. Thanks again!
The solution hugely depends on what you are trying to do with it.
The situation today is that device nodes are slowly fading out. In
another 20 years linux may not have any device nodes at all.
Therefore the question becomes what are you trying to support.
If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.
There are a few cases where it is desirable to emulate what devpts
does for allowing arbitrary users to create virtual devices in the
kernel. Loop devices in particular.
Ultimately given the existence of device hotplug I don't see any call
for being able to create device nodes with well known device numbers
(fundamentally what a device namespace would be about).
The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire I think it is entirely reasonable to add support for multiplexing
to the kernel device type by device type, or potentially just use fuse
or cuse to implement your multiplexer in userspace, but that has the
potential to be unusably slow.
Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
riya khanna
2014-09-25 15:40:10 UTC
Is there a plan or work-in-progress to add namespace tags to other
classes in sysfs similar to net? Does it make sense to add namespace
tags to kobjects?

-Riya
Isolation is provided by the devices cgroup. You want something more
than isolation.
My use case for having device namespaces is device isolation. Isn't that
what namespaces are there for (as I understand)?
Namespaces fundamentally provide for using the same ``global'' name
in different contexts. This allows them to be used for isolation
and process migration (because you can take the same name from
machine to machine).
Unless someone cares about device numbers at a namespace level
the work is done.
The mount namespace exists to deal with file names.
The devices cgroup will limit which devices you can access (although
I can't ever imagine a case where the mount namespace would be
insufficient).
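
For reference, the devices cgroup (v1) takes rules of the form "type major:minor access". A minimal sketch of building such a rule follows. The helper name is hypothetical, and writing the rule into a cgroup's devices.allow or devices.deny file is left out because the cgroup path depends on the setup.

```python
def devices_cgroup_rule(dev_type, major, minor, access):
    """Format a cgroup v1 devices rule such as 'c 29:0 rw'.

    dev_type: 'c' (char), 'b' (block) or 'a' (all)
    major, minor: numbers, or '*' to match any
    access: any combination of r (read), w (write), m (mknod)
    """
    if dev_type not in ("a", "b", "c"):
        raise ValueError(f"unknown device type {dev_type!r}")
    if not access or set(access) - set("rwm"):
        raise ValueError(f"invalid access string {access!r}")
    return f"{dev_type} {major}:{minor} {access}"


# Allow read/write (but not mknod) on the first framebuffer, 29:0.
print(devices_cgroup_rule("c", 29, 0, "rw"))
```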
Not everything should be accessible (or even visible) from a container
all the time (we have seen people come up with different use cases for
this). However, bind-mounting takes away this flexibility.
I don't see how. If they are mounts that propagate into the container
and are controlled from outside you can do whatever you want. (I am
imagining device by device bind mounts here). It should be trivial
to have a directory tree that propagates into a container and works.
Device-by-device bind mounts can grant/revoke access to real
individual devices as and when needed. However, revoking the access to
real devices could break the applications if there’s no transparent
mechanism to back up the propagated (but now revoked) device bind
mounts that could fool the apps into believing that they are working
with real devices. Frame buffer is one such example, where safe
multiplexing could be applied.
I agree that assigning fixed device numbers is clearly not a long-term
solution. Emulation for safe and flexible multiplexing, like you
suggested either using CUSE/FUSE or something like devpts, is what I'm
exploring.
Is the problem you actually care about multiplexing devices?
The problem I care about is access to real devices, such as input, fb,
loop, etc. as and when needed, thereby having native I/O performance -
either through secure multiplexing or exclusive ownership, whatever
makes sense according to the device type.
I guess policy-based multiplexing (or exclusive ownership) is the
usage. What kind of devices (loop, fb, etc.) this is needed for
depends on the usage. If there are multiple FBs, then each container
could potentially own one. One may want to provide exclusive ownership
of input devices to one container at a time to avoid information
leakage. Like we saw at LPC last year, this applies to sensors (gps,
accelerometer, etc.) on mobile devices as well.
Allowing multiplexing of those devices seems reasonable.
Where the discussion ran into problems last time was that people did not
want to use any of the existing linux solutions for multiplexing those
kinds of things and wanted to invent something new.
Inventing something new is fine if the extra code maintenance can be
justified, or if the invention is just a better solution for all users and
new code can just start using that in general.
The old solution to your problem of multiplexing devices is by
allocating a virtual terminal and sending signals to coordinate
cooperatively sharing those resources.
If you want some sort of preemptive multitasking that requires
a bit more effort, and work in the device abstractions.
You may be able to share concepts and library code but I don't believe
there is something you can just paint on top of devices and make it
happen. Certainly in the bad old days of X terminal switching the
cooperation was necessary so that when a video card was yanked from an
application writing directly to that video card the application would
need to restore the video card to a known state so the next application
would have a chance of making sense of it. Furthermore most devices
are not safe to let unprivileged users access their control registers
directly.
All of which boils down to the simple fact that for each type of device you
would like to share it is necessary to update the subsystem to support
arbitrary numbers of virtual devices that you can talk to.
The macvlan driver in the networking stack is a rough example of what I
expect you would like. Something that takes one real physical device
and turns it into N virtual devices each of which runs at effectively
full speed. Along with some kind of new master interface for
controlling when the multiplexing takes place.
I think we do most of this in software today and arguably for a lot of
devices the overhead is small enough that a software solution is fine.
So perhaps all you need is a fuse interface to the existing software
multiplexers so that weird legacy code can be made to run.
What kind of existing multiplexers could be used? Is there one for fb? We
have evdev abstractions for input in place already.
Now I suspect part of doing this right will be getting proper video
drivers on Android. I assume that Android is the platform you care
about.
Eric
Eric W. Biederman
2014-09-25 18:09:43 UTC
Post by riya khanna
Is there a plan or work-in-progress to add namespace tags to other
classes in sysfs similar to net? Does it make sense to add namespace
tags to kobjects?
Currently there is a general nack from gregkh on such work.

Given that sysfs is almost never a fast path I suspect it makes most
sense to filter sysfs in some way (aka bind mounts or fuse) and present
the results to the container.

At the point where this is something that we are using a lot, have
demonstrated the usefulness of, and it appears a kernel level
solution would be better, it would be worth reopening the discussion.

Eric
Eric W. Biederman
2014-09-25 18:21:50 UTC
What kind of existing multiplexers could be used? Is there one for fb? We have
evdev abstractions for input in place already.
We have X and Wayland/Weston and pulse audio and doubtless more that I
am not aware of.

For video a lot of work is going into compositing and handling
multiple contexts in the hardware so there may already be support in the
kernel.

Fundamentally these are all pieces of hardware whose information we
allow multiple userspace applications to access or modify.
Therefore there is existing multiplexing somewhere.

I won't claim all of the existing multiplexing methods are good and
should be used as is, but they definitely should be used as a starting
point.


From another perspective there is how kvm tackles this today. If you
really want to emulate the hardware and make it appear that your
instance of userspace has direct hardware access building upon the
infrastructure that is used for kvm may be worth exploring.

Eric
