Discussion:
cgroup: status-quo and userland efforts
Tejun Heo
2013-04-06 01:21:59 UTC
Hello, guys.

Status-quo
==========

It's been about a year since I wrote up a summary on cgroup status quo
and future plans. We're not there yet but much closer than we were
before. At least the locking and object life-time management aren't
crazy anymore and most controllers now support proper hierarchy
although not all of them agree on how to treat inheritance.

IIRC, the yet-to-be-converted ones are blk-throttle and perf. cpu
needs to be updated so that it at least supports a similar mechanism
as cfq-iosched for configuring ratio between tasks on an internal
cgroup and its children. Also, we really should update how cpuset
handles a cgroup becoming empty (no cpus or memory node left due to
hot-unplug). It currently transfers all its tasks to the nearest
ancestor with executing resources, which is an irreversible process
which would affect all other co-mounted controllers. We probably want
it to just take on the masks of the ancestor until its own executing
resources become online again, and the new behavior should be gated
behind a switch (Li, can you please look into this?).

While we still have a ways to go, I feel relatively confident saying
that we aren't too far out now, well, except for the writeback mess
that still needs to be tackled. Anyways, once the remaining bits are
settled, we can proceed to implement the unified hierarchy mode I've
been talking about forever. I can't think of any fundamental
roadblocks at the moment but who knows? The devil usually is in the
details. Let's hope it goes okay.

So, while we aren't moving as fast as we wish we were, the kernel side
of things is falling into place. At least, that's how I see it.
From now on, I think how to make it actually useable to userland
deserves a bit more focus, and by "useable to userland", I don't mean
some group hacking up an elaborate, manual configuration which is
tailored to the point of being eccentric to suit the needs of the said
group. There's nothing wrong with that and they can continue to do
so, but it just isn't generically useable or useful. It should be
possible to generically and automatically split resources among, say,
several servers and a couple users sharing a system without resorting
to indecipherable ad-hoc shell script running off rc.local.


Userland efforts
================

There are currently a few userland efforts trying to make interfacing
with cgroup less painful.

* libcg: Make cgroup interface accessible from programming languages
with support for configuration persistency, which also brings its
own config files to remember what to do on the next boot. Sans the
persistence part, it just seems to directly translate the filesystem
interface to function interface.

http://libcg.sourceforge.net/

* Workman: It's a rather young project but as its name (workload
management) implies, its aims are higher level than that of libcg.
It aims to provide high-level resource allocation and management and
introduces new concepts like resource partitions to represent its
view of resource hierarchy. Like libcg, this one is implemented as
a library but provides bindings for more languages.

https://gitorious.org/workman/pages/Home

* Pax Controla Groupiana: A document on how not to step on others'
toes while using cgroup. It's not a software project but tries to
define precautions that software or a user can take to avoid
breaking or confusing other users of the cgroup filesystem.

http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

All try to play nice with other possible users of the cgroup
filesystem - be it libvirt cgroup, applications doing their own cgroup
tricks, or hand-crafted custom scripts. While the approach is
understandable given that those usages already exist, I don't think
it's a workable solution in the long term. There are several reasons
for that.

* The configurations aren't independent. e.g. for weight-based
controllers, your weight is only meaningful in relation to other
weights at that level. Distributing configuration to whatever
entities which may write to cgroupfs simply cannot work. It's
fundamentally flawed.

* It's fragile like hell. There's no accountability. Nobody really
knows what's going on. Is this subdirectory still there due to a
bug in this program, or something or someone else created it and
crashed / forgot to remove it, or what? Oh, the cgroup I wanted to
create already exists. Maybe the previous instance created it and
then crashed or maybe some other program just happened to choose the
same name. Who owns config knobs in that directory? This way lies
madness. I understand why the Pax doc exists but I'm not sure its
long-term effect would be positive - best practices which ultimately
lead to utter confusion and fragility.

* In many cases, resource distribution is a system-wide policy decision
and determining what to do often requires system-wide knowledge.
You can't provision memory limits without knowing what's available
in the system and what else is going on in the system, and you want
to be able to adjust them as the situation and configuration change.
Without anybody having full picture of how resources are
provisioned, how would any of that be possible?
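
The first point above, that a weight only means something relative to the other weights at the same level, can be sketched as follows. This is a minimal illustration and not code from any of the projects listed; the group names and weight values are made up.

```python
def shares(weights):
    """Return each sibling's fraction of the parent's resource.

    `weights` maps a (hypothetical) cgroup name to its configured weight.
    """
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# One writer configures two groups and expects a 2:1 split.
before = shares({"web": 200, "batch": 100})

# A second, unrelated writer adds its own group at the same level.
# Nobody touched "web", yet its share of the resource shrinks.
after = shares({"web": 200, "batch": 100, "backup": 700})

print(f"web before: {before['web']:.2f}")
print(f"web after:  {after['web']:.2f}")
```

The weight written for "web" is identical in both cases; only the siblings changed. That is why independent writers cannot each reason about their own weights in isolation.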

I think this anything-goes approach is prevalent largely because the
cgroup filesystem interface encourages such usage. From the looks of
it, the filesystem permissions combined with hierarchy should be able
to handle delegation perfectly. Well, as it currently stands, it's
anything but and the interface is just misleading. Hierarchy support
was an utter mess, configuration schemes aren't uniform across
controllers, and, more fundamentally, hierarchy itself is expensive -
we can't delegate hierarchy creation to unprivileged users or
programs safely.

It is in the realm of possibility to make all cgroup operations and
controllers do all that; however, it's a very tall order. Just
think about how much effort it has been to achieve and maintain proper
delegation in the core elements of the kernel - processes and
filesystems. There will be security implications with cgroup,
likely involving a lot of gotchas and extensions of security
infrastructures, and, even then, I'm pretty sure it's gonna require
help from userland to effect proper policy decisions and config
changes. We have things like polkit for a reason and are likely to
need finer-grained, domain-aware access control than is possible with
tweaking directory permissions.

Given the above and how relatively marginal cgroup is, I'm extremely
skeptical that implementing full delegation in kernel is the right
course of action and am likely to scream like a banshee at any attempt
at driving things that way.

I think the only logical thing to do is creating a centralized
userland authority which takes full ownership of the cgroup filesystem
interface, gives it a sane structure, represents available resources
in a sane form, and makes policy decisions based on configuration and
requests. I don't have a concrete idea what that authority should be
like, but I think there already are pretty similar facilities in our
userland, and don't see why this should be much different.

Another reason why this could be helpful is that we're gonna be
morphing towards unified hierarchy and it'd be very nice to have
something which can match impedance between the old and new ways and
not require each individual consumer of cgroup to handle such changes.
As for the unified hierarchy, we just have to. It's currently
fundamentally broken in that it's impossible to tell which cgroup a
resource belongs to independent of which task is looking at it. It's
like this damn thing is designed to honor Heisenberg and Einstein. No
disrespect for the great minds, but it just doesn't look like the
proper place.

Even apart from the unified hierarchy thing, I think it generally is a
good idea to have a buffer layer between the kernel interface and
individual consumers for cgroup, which is still very immature and
kinda tightly coupled with internal implementation details.

So, umm, that's what I want. When I first heard of WorkMan, I was
excited thinking maybe the universe is being really nice and making
things happen to my wishes without me actually doing anything. :) Oh
well, one can dream, but everything is still early, so hopefully we
have enough time to figure things out.

What do you guys think?

Thanks.

--
tejun
Glauber Costa
2013-04-08 13:46:09 UTC
Post by Tejun Heo
Hello, guys.
Hello Tejun, how are you?
Post by Tejun Heo
Status-quo
==========
tl;did read;

This is mostly sensible. There is still one problem that we hadn't yet
had the bandwidth to tackle that should be added to your official TODO list.

The cpu cgroup needs a real-time timeslice to accept real time tasks. It
defaults to 0, meaning that a newly created cpu cgroup cannot accept
tasks (rt tasks) without the user having to manually configure it.
As far as I know, this problem hasn't yet been fixed.
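
As a rough sketch of what the manual configuration step looks like today: the thread does not name the knob, so this assumes it is the cgroup v1 file cpu.rt_runtime_us, which reads 0 in a freshly created child group. The helper names and the example value are hypothetical.

```python
import os


def rt_runtime_us(cgroup_dir):
    """Read the group's real-time runtime budget in microseconds."""
    with open(os.path.join(cgroup_dir, "cpu.rt_runtime_us")) as f:
        return int(f.read().strip())


def ensure_rt_runtime(cgroup_dir, runtime_us):
    """Give the group an RT budget if it still has the zero default.

    Returns True if the value was changed, False if a non-zero budget
    was already configured and left alone.
    """
    if rt_runtime_us(cgroup_dir) != 0:
        return False
    with open(os.path.join(cgroup_dir, "cpu.rt_runtime_us"), "w") as f:
        f.write(str(runtime_us))
    return True
```

A caller would run something like ensure_rt_runtime("/sys/fs/cgroup/cpu/mygroup", 10000) before moving an rt task into the group; both the path and the 10000 are placeholders, and picking a sensible number is exactly the open question.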

The fix, of course, is as trivial as setting a new value instead of 0 as
a default. The complication lies in determining which value that should be.

There are many things that we should ask from a controller to implement
in order to be able to handle fully joint hierarchies. One of them,
IMHO, is that if you drop a task into a newly created cgroup it should
run without the user having to do anything for it.
Vivek Goyal
2013-04-08 18:00:56 UTC
On Mon, Apr 08, 2013 at 05:46:09PM +0400, Glauber Costa wrote:

[..]
Post by Glauber Costa
The cpu cgroup needs a real-time timeslice to accept real time tasks. It
defaults to 0, meaning that a newly created cpu cgroup cannot accept
tasks (rt tasks) without the user having to manually configure it.
As far as I know, this problem hasn't yet been fixed.
Yes, systemd folks wanted this to be fixed so that out of the box they
could put individual user sessions in a cgroup and still expect that
any RT applications of the user are not broken.

Thanks
Vivek
Tejun Heo
2013-04-08 18:26:49 UTC
Hey, Glauber.
Post by Glauber Costa
Post by Tejun Heo
Hello, guys.
Hello Tejun, how are you?
I'm doing okay. :)
Post by Glauber Costa
Post by Tejun Heo
Status-quo
==========
tl;did read;
This is mostly sensible. There is still one problem that we hadn't yet
had the bandwidth to tackle that should be added to your official TODO list.
The cpu cgroup needs a real-time timeslice to accept real time tasks. It
defaults to 0, meaning that a newly created cpu cgroup cannot accept
tasks (rt tasks) without the user having to manually configure it.
As far as I know, this problem hasn't yet been fixed.
The fix, of course, is as trivial as setting a new value instead of 0 as
a default. The complication lies in determining which value that should be.
There are many things that we should ask from a controller to implement
in order to be able to handle fully joint hierarchies. One of them,
IMHO, is that if you drop a task into a newly created cgroup it should
run without the user having to do anything for it.
Yeap, definitely. cpuset has similar problems (Li, help us!). For
the controllers whose behaviors don't allow sharing a single
hierarchy, I think the solution is to implement an alternate behavior
which can be switched on at mount time, and to force that switch on
when mounting the unified hierarchy, so that we don't disturb the
existing users while pushing for more consistent behavior.

Thanks.
--
tejun
Lennart Poettering
2013-04-08 23:32:01 UTC
Heya,
Post by Glauber Costa
Post by Tejun Heo
Hello, guys.
Hello Tejun, how are you?
Post by Tejun Heo
Status-quo
==========
tl;did read;
This is mostly sensible. There is still one problem that we hadn't yet
had the bandwidth to tackle that should be added to your official TODO list.
The cpu cgroup needs a real-time timeslice to accept real time tasks. It
defaults to 0, meaning that a newly created cpu cgroup cannot accept
tasks (rt tasks) without the user having to manually configure it.
As far as I know, this problem hasn't yet been fixed.
The fix, of course, is as trivial as setting a new value instead of 0 as
a default. The complication lies in determining which value that should be.
There are many things that we should ask from a controller to implement
in order to be able to handle fully joint hierarchies. One of them,
IMHO, is that if you drop a task into a newly created cgroup it should
run without the user having to do anything for it.
The other big thing we want from the systemd side is saner notifications
when cgroups run empty. i.e. currently we don't get these at all in
containers (since the agent can be only installed once, for the host).
And the way we get this is awful, via kernel-spawned processes. I am
looking for a way to establish a watch on a certain subtree (not
just one directory) and get simple notifications in a race-free way
whenever a cgroup runs empty.

Lennart
Glauber Costa
2013-04-09 07:37:32 UTC
Post by Lennart Poettering
The other big thing we want from the systemd side is saner notifications
when cgroups run empty. i.e. currently we don't get these at all in
containers (since the agent can be only installed once, for the host).
And the way we get this is awful, via kernel-spawned processes. I am
looking for a way to establish a watch on a certain subtree (not
just one directory) and get simple notifications in a race-free way
whenever a cgroup runs empty.
Well, as I am trying to port our tools to upstream Linux (aka cgroups),
I have a pet peeve about this one as well. The notification system is
global and done at the root level. IOW, notify_on_release is local, but
release_agent is global.
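
The asymmetry can be sketched like this, assuming the cgroup v1 layout where every cgroup directory has its own notify_on_release file while release_agent exists only at the root of the mounted hierarchy. The function name is made up for illustration.

```python
import os


def arm_release_notification(hierarchy_root, cgroup_relpath, agent=None):
    """Turn on notify_on_release for one cgroup; optionally set the agent.

    Returns the directory of the cgroup that was armed.
    """
    cgroup_dir = os.path.join(hierarchy_root, cgroup_relpath)
    # Per-cgroup switch: each directory carries its own notify_on_release.
    with open(os.path.join(cgroup_dir, "notify_on_release"), "w") as f:
        f.write("1")
    if agent is not None:
        # One agent per hierarchy: release_agent lives only in the root
        # directory, so writing it replaces whatever was set before.
        with open(os.path.join(hierarchy_root, "release_agent"), "w") as f:
            f.write(agent)
    return cgroup_dir
```

Any number of cgroups can be armed independently, but they all end up invoking the same single agent program, which is why a container cannot install one of its own.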

We use our management tool to enter containers and call something like
init 0, which will shut the container down. But if the admin does it
directly, the cgroup directory will stay there. We would like such
directories to disappear automatically.

Maybe that is not something that needs to be done in the kernel. It
would be enough if systemd had some very easy and well documented way
for 3rd party software to register a notification to be called upon a
certain cgroup release. (If it exists already, sorry Lennart, but I
haven't found anything of the kind. Just enlighten me.)
Tejun Heo
2013-04-09 19:11:45 UTC
Hello,
Post by Lennart Poettering
The other big thing we want from the systemd side is saner
notifications when cgroups run empty. i.e. currently we don't get
these at all in containers (since the agent can be only installed
once, for the host). And the way we get this is awful, via
kernel-spawned processes. I am looking for a way to establish
a watch on a certain subtree (not just one directory) and get simple
notifications in a race-free way whenever a cgroup runs empty.
Oh yeah, it's horrifying. There was something going on a while ago
but I couldn't get hold of Eric Paris. We probably should resurrect
that patch.

As for delegating to namespaces, I'm not exactly sure what to do. At
least for now, it could be an acceptable trade-off to delegate the
subdirectory with some limits on the number of cgroups / depth of
hierarchy / whatever. That said, I'm not really fond of the idea. It
isn't likely to work seamlessly. The root cgroup is special anyway
and I don't really like the idea of putting NS related stuff directly
into cgroupfs.

Thanks.
--
tejun
Vivek Goyal
2013-04-08 17:59:26 UTC
On Fri, Apr 05, 2013 at 06:21:59PM -0700, Tejun Heo wrote:

[..]
Post by Tejun Heo
Userland efforts
================
There are currently a few userland efforts trying to make interfacing
with cgroup less painful.
* libcg: Make cgroup interface accessible from programming languages
with support for configuration persistency, which also brings its
own config files to remember what to do on the next boot. Sans the
persistence part, it just seems to directly translate the filesystem
interface to function interface.
http://libcg.sourceforge.net/
* Workman: It's a rather young project but as its name (workload
management) implies, its aims are higher level than that of libcg.
It aims to provide high-level resource allocation and management and
introduces new concepts like resource partitions to represent its
view of resource hierarchy. Like libcg, this one is implemented as
a library but provides bindings for more languages.
https://gitorious.org/workman/pages/Home
* Pax Controla Groupiana: A document on how not to step on others'
toes while using cgroup. It's not a software project but tries to
define precautions that software or a user can take to avoid
breaking or confusing other users of the cgroup filesystem.
http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
All try to play nice with other possible users of the cgroup
filesystem - be it libvirt cgroup, applications doing their own cgroup
tricks, or hand-crafted custom scripts. While the approach is
understandable given that those usages already exist, I don't think
it's a workable solution in the long term. There are several reasons
for that.
* The configurations aren't independent. e.g. for weight-based
controllers, your weight is only meaningful in relation to other
weights at that level. Distributing configuration to whatever
entities which may write to cgroupfs simply cannot work. It's
fundamentally flawed.
Hi Tejun,

I thought in workman, "partition" configuration was still centralized
while individual "consumer" configuration was with the consumer manager
(systemd, libvirt, .. etc). IOW, the library can tell a consumer manager
which partition to associate a consumer with at startup time. (Consumer
managers can assume their own defaults if nothing has been told.)

Agreed, that weight is meaningful only if one has the full hierarchy view
and then one should be able to calculate the effective % share of resources
of a group.

But using the library, an admin application should be able to query the
full "partition" hierarchy and their weights and calculate % system
resources. I think one problem there is the cpu controller, where the %
resource of a cgroup depends on task entities which are peers of the group.
But that's a kernel issue and not a user space thing.
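
A minimal sketch of that calculation, assuming an application has already collected the whole hierarchy into a plain dictionary. The data layout, group names, and weights are invented for illustration and are not the workman API. It also ignores the cpu controller caveat just mentioned: tasks sitting directly in a parent group are not counted as competitors here.

```python
def effective_share(tree, name):
    """Fraction of the whole system that group `name` is entitled to.

    `tree` maps a group name to (parent name or None, weight). The share
    is the product, over the group and each ancestor, of that node's
    weight divided by the total weight of its siblings (itself included).
    """
    share = 1.0
    while name is not None:
        parent, weight = tree[name]
        sibling_total = sum(w for p, w in tree.values() if p == parent)
        share *= weight / sibling_total
        name = parent
    return share


tree = {
    "root": (None, 100),
    "partition-a": ("root", 300),
    "partition-b": ("root", 100),
    "virt1": ("partition-a", 100),
    "virt2": ("partition-a", 100),
}

# virt1 gets half of partition-a, which gets three quarters of the system.
print(effective_share(tree, "virt1"))
```

The result for any one group depends on every weight along its path and on all the siblings at each level, so the calculation only works for a caller that can see the complete tree.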

So I am not sure what the potential problems are with the proposed model of
configuration in workman. All the consumer managers still follow what
the library has told them to do.
Post by Tejun Heo
* It's fragile like hell. There's no accountability. Nobody really
knows what's going on. Is this subdirectory still there due to a
bug in this program, or something or someone else created it and
crashed / forgot to remove it, or what?
I thought any directory under a consumer manager is managed by that
manager and nobody else is supposed to dynamically create resource
partition/cgroup there. So that takes away a bit of confusion.
Post by Tejun Heo
Oh, the cgroup I wanted to
create already exists. Maybe the previous instance created it and
then crashed
This should be the case as long as we stick to the notion of a manager
managing its own sub-hierarchy.
Post by Tejun Heo
or maybe some other program just happened to choose the
same name.
Two programs ideally would have their own sub-hierarchies. And if not, one
of the programs should get a conflict when trying to create the cgroup and
should back off or fail or give a warning...
Post by Tejun Heo
Who owns config knobs in that directory?
IIUC, workman was looking at two types of cgroups. One called
"partitions" which will be created by library at startup time and
library manages the configuration (something like cgconfig.conf).

And individual managers create their own children groups for various
services under that partition and control the config knobs for those
services.

user-defined-partition
/ | \
virt1 virt2 virt3

So the user should be able to define a partition and control the configuration
using workman lib. And if multiple virtual machines are being run in
the partition, then they create their own cgroups and libvirt controls
the properties of virt1, virt2, virt3 cgroups. I thought that was
the understanding when we discussed ownership of config knobs last time.
But things might have changed since last time. Workman folks should
be able to shed light on this.
Post by Tejun Heo
This way lies
madness. I understand why the Pax doc exists but I'm not sure its
long-term effect would be positive - best practices which ultimately
lead to utter confusion and fragility.
* In many cases, resource distribution is a system-wide policy decision
and determining what to do often requires system-wide knowledge.
You can't provision memory limits without knowing what's available
in the system and what else is going on in the system, and you want
to be able to adjust them as the situation and configuration change.
Without anybody having full picture of how resources are
provisioned, how would any of that be possible?
I thought workman library will provide interfaces so that one can query
and be able to construct the full system view.

Their doc says.

GList *workmanager_partition_get_children(WorkmanPartition *partition,
GError **error);


So I am assuming this can be used to construct the full partition
hierarchy and associated resource allocation.
Post by Tejun Heo
I think this anything-goes approach is prevalent largely because the
cgroup filesystem interface encourages such usage. From the looks of
it, the filesystem permissions combined with hierarchy should be able
to handle delegation perfectly. Well, as it currently stands, it's
anything but and the interface is just misleading. Hierarchy support
was an utter mess, configuration schemes aren't uniform across
controllers, and, more fundamentally, hierarchy itself is expensive -
we can't delegate hierarchy creation to unprivileged users or
programs safely.
It is in the realm of possibility to make all cgroup operations and
controllers do all that; however, it's a very tall order. Just
think about how much effort it has been to achieve and maintain proper
delegation in the core elements of the kernel - processes and
filesystems. There will be security implications with cgroup,
likely involving a lot of gotchas and extensions of security
infrastructures, and, even then, I'm pretty sure it's gonna require
help from userland to effect proper policy decisions and config
changes. We have things like polkit for a reason and are likely to
need finer-grained, domain-aware access control than is possible with
tweaking directory permissions.
Given the above and how relatively marginal cgroup is, I'm extremely
skeptical that implementing full delegation in kernel is the right
course of action and am likely to scream like a banshee at any attempt
at driving things that way.
[..]
Post by Tejun Heo
I think the only logical thing to do is creating a centralized
userland authority which takes full ownership of the cgroup filesystem
interface, gives it a sane structure,
Right now systemd seems to be giving initial structure. I guess we will
require some changes where systemd itself runs in a cgroup and that
allows one to create peer groups. Something like.

root
/ \
systemd other-groups

So currently no central authority is enforcing it. It seems to be just
a matter of right defaults in systemd.
Post by Tejun Heo
represents available resources
in a sane form, and makes policy decisions based on configuration and
requests.
Given the fact that the library has a view of full system resources (both
persistent view and active view), shouldn't we just be able to extend
the API to meet additional configuration or resource needs?
Post by Tejun Heo
I don't have a concrete idea what that authority should be
like, but I think there already are pretty similar facilities in our
userland, and don't see why this should be much different.
Thanks
Vivek
Tejun Heo
2013-04-08 18:16:07 UTC
Hey, Vivek.
Post by Vivek Goyal
But using the library, an admin application should be able to query the
full "partition" hierarchy and their weights and calculate % system
resources. I think one problem there is the cpu controller, where the %
resource of a cgroup depends on task entities which are peers of the group.
But that's a kernel issue and not a user space thing.
Yeah, we're gonna have to implement a different operation mode.
Post by Vivek Goyal
So I am not sure what the potential problems are with the proposed model of
configuration in workman. All the consumer managers still follow what
the library has told them to do.
Sure, if we assume everyone follows the rules and behaves nicely.
It's more about the general approach. Allowing / encouraging sharing
or distributing control of cgroup hierarchy without forcing structure
and rigid control over it is likely to lead to confusion and
fragility.
Post by Vivek Goyal
Post by Tejun Heo
or maybe some other program just happened to choose the
same name.
Two programs ideally would have their own sub-hierarchies. And if not, one
of the programs should get a conflict when trying to create the cgroup and
should back off or fail or give a warning...
And who's responsible for deleting it? What if the program crashes?
Post by Vivek Goyal
Post by Tejun Heo
Who owns config knobs in that directory?
IIUC, workman was looking at two types of cgroups. One called
"partitions" which will be created by library at startup time and
library manages the configuration (something like cgconfig.conf).
And individual managers create their own children groups for various
services under that partition and control the config knobs for those
services.
user-defined-partition
/ | \
virt1 virt2 virt3
So the user should be able to define a partition and control the configuration
using workman lib. And if multiple virtual machines are being run in
the partition, then they create their own cgroups and libvirt controls
the properties of virt1, virt2, virt3 cgroups. I thought that was
the understanding when we discussed ownership of config knobs last time.
But things might have changed since last time. Workman folks should
be able to shed light on this.
I just read the introduction doc and haven't delved into the API or
code so I could be off but why should there be multiple managers?
What's the benefit of that? Wouldn't it make more sense to just have
a central arbitrator that everyone talks to? What's the benefit of
distributing the responsibilities here? It's not like we can put them
in different security domains.
Post by Vivek Goyal
Post by Tejun Heo
* In many cases, resource distribution is a system-wide policy decision
and determining what to do often requires system-wide knowledge.
You can't provision memory limits without knowing what's available
in the system and what else is going on in the system, and you want
to be able to adjust them as the situation and configuration change.
Without anybody having full picture of how resources are
provisioned, how would any of that be possible?
I thought workman library will provide interfaces so that one can query
and be able to construct the full system view.
Their doc says.
GList *workmanager_partition_get_children(WorkmanPartition *partition,
GError **error);
So I am assuming this can be used to construct the full partition
hierarchy and associated resource allocation.
Sure, maybe it can be used as a building block.
Post by Vivek Goyal
[..]
Post by Tejun Heo
I think the only logical thing to do is creating a centralized
userland authority which takes full ownership of the cgroup filesystem
interface, gives it a sane structure,
Right now systemd seems to be giving initial structure. I guess we will
require some changes where systemd itself runs in a cgroup and that
allows one to create peer groups. Something like.
root
/ \
systemd other-groups
No, we need a single structured hierarchy which everyone uses
*including* systemd.
Post by Vivek Goyal
Post by Tejun Heo
represents available resources
in a sane form, and makes policy decisions based on configuration and
requests.
Given the fact that the library has a view of full system resources (both
persistent view and active view), shouldn't we just be able to extend
the API to meet additional configuration or resource needs?
Maybe, I don't know. It just looks like a weird approach to me.
Wouldn't it make more sense to implement it as a dbus service that
everyone talks to? That's how our base system is structured these
days. Why should this be any different?

Thanks.
--
tejun
Tejun Heo
2013-04-08 18:49:51 UTC
Post by Tejun Heo
Post by Vivek Goyal
Given the fact that the library has a view of full system resources (both
persistent view and active view), shouldn't we just be able to extend
the API to meet additional configuration or resource needs?
Maybe, I don't know. It just looks like a weird approach to me.
Wouldn't it make more sense to implement it as a dbus service that
everyone talks to? That's how our base system is structured these
days. Why should this be any different?
To expand a bit, the base system being composed that way makes a lot
of sense. It becomes clear who's responsible for what and there's a
reliable way to recover when things go awry on the clients' sides.
Also, it pretty much *forces* you to design an interface which fits
the problem domain properly rather than exposing all the control knobs
there are without thinking how they'd be actually useful. The
language binding issue is much easier too - it's already solved.

It seems like the only logical thing to do, well, at least to me.
Am I missing something?

Thanks.
--
tejun
Vivek Goyal
2013-04-08 19:11:05 UTC
Post by Tejun Heo
Hey, Vivek.
Post by Vivek Goyal
But using the library, an admin application should be able to query the
full "partition" hierarchy and their weights and calculate % system
resources. I think one problem there is the cpu controller, where the %
resource of a cgroup depends on task entities which are peers of the group.
But that's a kernel issue and not a user space thing.
Yeah, we're gonna have to implement a different operation mode.
Post by Vivek Goyal
So I am not sure what the potential problems are with the proposed model of
configuration in workman. All the consumer managers still follow what
the library has told them to do.
Sure, if we assume everyone follows the rules and behaves nicely.
It's more about the general approach. Allowing / encouraging sharing
or distributing control of cgroup hierarchy without forcing structure
and rigid control over it is likely to lead to confusion and
fragility.
Post by Vivek Goyal
Post by Tejun Heo
or maybe some other program just happened to choose the
same name.
Two programs ideally would have their own sub-hierarchies. And if not, one
of the programs should get a conflict when trying to create the cgroup and
should back off or fail or give a warning...
And who's responsible for deleting it?
I think "consumer" manager should delete its own cgroup directories when
associated consumer[s] stop running.

And partitions created by workman will just remain there unless the
user wants to delete them explicitly.
Post by Tejun Heo
What if the program crashes?
I am not sure about this. I guess when the application comes back after a crash,
it can go through all the children cgroups and reclaim empty cgroups.
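
A rough sketch of that recovery pass, assuming the cgroup v1 layout where each cgroup directory contains a tasks file listing the PIDs in it. The function name is hypothetical, and the sketch only reports candidates rather than removing them.

```python
import os


def empty_cgroups(subtree_root):
    """List cgroups under subtree_root with no tasks and no child cgroups.

    Walks bottom-up so leaves are seen before their parents. A parent
    whose children are all empty only becomes a candidate on a later
    pass, after those children have been removed.
    """
    found = []
    for dirpath, dirnames, _ in os.walk(subtree_root, topdown=False):
        if dirpath == subtree_root:
            continue  # never reclaim the manager's own top-level group
        tasks = os.path.join(dirpath, "tasks")
        if dirnames or not os.path.exists(tasks):
            continue
        with open(tasks) as f:
            if not f.read().strip():
                found.append(dirpath)
    return found
```

On a real cgroup mount the manager would then call os.rmdir() on each returned path. Note that this does not settle the ownership question raised earlier: an empty directory looks the same whether this manager or some other program created it.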
Post by Tejun Heo
Post by Vivek Goyal
Post by Tejun Heo
Who owns config knobs in that directory?
IIUC, workman was looking at two types of cgroups. One called
"partitions" which will be created by library at startup time and
library manages the configuration (something like cgconfig.conf).
And individual managers create their own children groups for various
services under that partition and control the config knobs for those
services.
user-defined-partition
/ | \
virt1 virt2 virt3
So the user should be able to define a partition and control the configuration
using workman lib. And if multiple virtual machines are being run in
the partition, then they create their own cgroups and libvirt controls
the properties of virt1, virt2, virt3 cgroups. I thought that was
the understanding when we discussed ownership of config knobs last time.
But things might have changed since last time. Workman folks should
be able to shed light on this.
I just read the introduction doc and haven't delved into the API or
code so I could be off but why should there be multiple managers?
What's the benefit of that?
A centralized authority does not know about all the managed objects.
Only the respective manager knows which objects it is managing and
what the controllable attributes of each object are.

systemd is managing services and libvirt is managing virtual machines,
containers etc. Some people view associated resource group as just one
additional attribute of the managed service. These managers already
maintain multiple attributes of a service and can store one additional
attribute easily.
Post by Tejun Heo
Wouldn't it make more sense to just have
a central arbitrator that everyone talks to?
Maybe. Just that in the past folks have not liked the idea of talking
to a central authority to figure out the resource group of an object they are
managing.
Post by Tejun Heo
What's the benefit of
distributing the responsibilities here? It's not like we can put them
in different security domains.
To me it makes sense in a way, as the resources associated with a
service are just another property, and there does not seem to be
anything special about this property that requires it to be managed by
a single centralized authority.

For example, one might want to say that maximum IO bandwidth for
virtual machine virt1 on disk sda should be 10MB/s. Now libvirt
should be able to save it in virtual machine specific configuration
easily and, whenever the virtual machine is started, create a child
cgroup and set the limits as specified.
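
As a rough illustration of that flow, the sketch below creates a child cgroup
directory and writes a 10MB/s write limit for one disk. It assumes the cgroup
v1 blkio throttle file blkio.throttle.write_bps_device, which takes
"major:minor bytes_per_second"; the 8:0 device number for sda, the directory
layout, and the function names are assumptions for this example, not how
libvirt actually does it.

```python
import os


def throttle_rule(major, minor, bytes_per_sec):
    """Format one rule in the 'major:minor bytes_per_second' form."""
    return f"{major}:{minor} {bytes_per_sec}"


def apply_write_limit(cgroup_dir, major, minor, bytes_per_sec):
    """Create the child cgroup directory and write the throttle rule.

    cgroup_dir is wherever the manager keeps its per-VM cgroup
    (hypothetical, e.g. <blkio mount>/<partition>/virt1).
    """
    os.makedirs(cgroup_dir, exist_ok=True)  # mkdir creates the cgroup
    knob = os.path.join(cgroup_dir, "blkio.throttle.write_bps_device")
    with open(knob, "w") as f:
        f.write(throttle_rule(major, minor, bytes_per_sec))
    return knob


# 10MB/s expressed in bytes per second; 8:0 is assumed to be sda.
TEN_MB_PER_SEC = 10 * 1024 * 1024
```

The manager would call apply_write_limit() with the stored per-VM value each
time the virtual machine starts.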

If a central authority keeps track of all this, I am not sure what
it would look like, and it might get messy.

[..]
Post by Tejun Heo
Post by Vivek Goyal
Post by Tejun Heo
I think the only logical thing to do is creating a centralized
userland authority which takes full ownership of the cgroup filesystem
interface, gives it a sane structure,
Right now systemd seems to be giving initial structure. I guess we will
require some changes where systemd itself runs in a cgroup and that
allows one to create peer groups. Something like.
root
/ \
systemd other-groups
No, we need a single structured hierarchy which everyone uses
*including* systemd.
That would make sense. systemd had this conflict with cgconfig
too. The problem is that systemd starts first and sets up everything. Now
if there is a service which sets up cgroups after systemd startup,
it is already too late.

Thanks
Vivek
Tejun Heo
2013-04-08 19:20:24 UTC
Hey,
Post by Vivek Goyal
Post by Tejun Heo
What if the program crashes?
I am not sure about this. I guess when an application comes back after a crash,
it can go through all its child cgroups and reclaim the empty ones.
Fragile, right? What are you arguing here?
Post by Vivek Goyal
Post by Tejun Heo
Wouldn't it make more sense to just have
a central arbitrator that everyone talks to?
Maybe. Just that in the past folks have not liked the idea of talking
to a central authority to figure out the resource group of an object they are
managing.
What we've been doing seems tragically broken to me, so I'm not sure
"people didn't use to do it that way" is a good point.
Post by Vivek Goyal
Post by Tejun Heo
What's the benefit of
distributing the responsibilities here? It's not like we can put them
in different security domains.
To me it makes sense in a way, as the resources associated with a
service are just another property, and there does not seem to be
anything special about this property that requires it to be managed by
a single centralized authority.
For example, one might want to say that maximum IO bandwidth for
virtual machine virt1 on disk sda should be 10MB/s. Now libvirt
should be able to save it in virtual machine specific configuration
easily and, whenever the virtual machine is started, create a child
cgroup and set the limits as specified.
Yes, sure, libvirt can *request* whatever it deems appropriate from the
central authority, which will decide whether it'll be able to honor
the request and grant it if possible and allowed by policies in
effect.
Post by Vivek Goyal
That would make sense. systemd had this conflict with cgconfig
too. Problem is that systemd starts first and sets up everything. Now
if there is a service which sets up cgroups, after systemd startup,
it is already late.
Come on, that's not a difficult or fundamental problem. Whatever the
central authority may be, systemd can use it to set up the initial
hierarchy or set up a bare-bones hierarchy in a compatible manner. This
isn't that different from udev.

Thanks.
--
tejun
Vivek Goyal
2013-04-08 19:46:31 UTC
On Mon, Apr 08, 2013 at 12:20:24PM -0700, Tejun Heo wrote:

[..]
Post by Tejun Heo
Post by Vivek Goyal
For example, one might want to say that maximum IO bandwidth for
virtual machine virt1 on disk sda should be 10MB/s. Now libvirt
should be able to save it in virtual machine specific configuration
easily and, whenever the virtual machine is started, create a child
cgroup and set the limits as specified.
Yes, sure, libvirt can *request* whatever it deems appropriate from the
central authority, which will decide whether it'll be able to honor
the request and grant it if possible and allowed by policies in
effect.
10MB/s is an absolute limit. So I guess there is nothing to be requested
from a central authority here in terms of resources.

Even in the case of IO weight or cpu shares, there is nothing to be asked
from a central authority. Well, there is: creation of new cgroups changes
the effective %share of peer groups. More below.

Where it makes sense though is if one says give a particular service
25% cpu. Then suddenly all the peer and parent entities become important.
IIUC, initial draft of workman does not address this issue.

It would be good to think more about it: how can a user ensure minimum
resources for a partition/service? In that case, at every level
somebody needs to keep track of how much of the resources has been committed
as minimum requirements, and more consumers can't be allowed at the same level
once everything is committed.
(This sounds like cpu RT time division among various cgroups).
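
To make the two points above concrete, here is a minimal sketch, assuming
weights are divided proportionally among sibling groups at one level and that
minimum guarantees are tracked as percentages. The group names, numbers, and
function names are illustrative only.

```python
def effective_shares(weights):
    """Map each sibling group to its fraction of the parent's resource.

    weights: dict of group name -> weight (e.g. cpu shares or IO weight).
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def can_admit(committed_minimums, new_minimum, capacity=100):
    """Check whether another minimum guarantee still fits at this level.

    committed_minimums: dict of group name -> guaranteed percentage.
    """
    return sum(committed_minimums.values()) + new_minimum <= capacity


# Two peers with equal weight each get half of the parent's resource.
before = effective_shares({"virt1": 500, "virt2": 500})
# A third peer appears; virt1's own weight is unchanged but its share shrinks.
after = effective_shares({"virt1": 500, "virt2": 500, "virt3": 1000})
```

Here virt1 drops from half to a quarter without its own configuration being
touched, which is why whoever creates a new peer needs the sibling picture,
and why minimum guarantees need someone doing the bookkeeping per level.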

Thanks
Vivek
Tejun Heo
2013-04-08 20:02:32 UTC
Post by Vivek Goyal
It would be good to think more about it: how can a user ensure minimum
resources for a partition/service? In that case, at every level
somebody needs to keep track of how much of the resources has been committed
as minimum requirements, and more consumers can't be allowed at the same level
once everything is committed.
(This sounds like cpu RT time division among various cgroups).
Yes, please take a step back from what we have right now because it
isn't very good. It's a general policy decision / enforcement problem
and even the policies may change dynamically. Having a central
authority doesn't automatically solve any of that and it'd be most
likely as limited as existing solutions at the beginning but it allows
for future improvements unlike scattering the solution all over the
place which just digs the hole deeper.

Thanks.
--
tejun
Daniel P. Berrange
2013-04-09 09:50:25 UTC
Post by Tejun Heo
Userland efforts
================
There are currently a few userland efforts trying to make interfacing
with cgroup less painful.
* libcg: Make cgroup interface accessible from programming languages
with support for configuration persistency, which also brings its
own config files to remember what to do on the next boot. Sans the
persistence part, it just seems to directly translate the filesystem
interface to function interface.
http://libcg.sourceforge.net/
* Workman: It's a rather young project but as its name (workload
management) implies, its aims are higher level than that of libcg.
It aims to provide high-level resource allocation and management and
introduces new concepts like resource partitions to represent its
view of resource hierarchy. Like libcg, this one is implemented as
a library but provides bindings for more languages.
https://gitorious.org/workman/pages/Home
* Pax Controla Groupiana: A document on how not to step on other's
toes while using cgroup. It's not a software project but tries to
define precautions that a software or user can take to avoid
breaking or confusing other users of the cgroup filesystem.
http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
All try to play nice with other possible users of the cgroup
filesystem - be it libvirt cgroup, applications doing their own cgroup
tricks, or hand-crafted custom scripts. While the approach is
understandable given that those usages already exist, I don't think
it's a workable solution in the long term. There are several reasons
for that.
Actually libcg doesn't really try to play nice with anything - being
just a direct representation of the cgroups filesystem, it allows for
absolutely anything to be done with no regard for best practice or
co-operation.

The PaxControlGroups document is the key piece to making distributed
management work. This document does need updating, since some of what
it describes doesn't really work, but its goal is sound IMHO.

The Workman library is presuming that apps will follow the PaxControlGroups
guidelines for use of cgroups, and from there aims to provide system
administrators with a "single world view" and tools to then configure
this. It does not, however, attempt to force itself underneath the
apps like systemd / libvirt, since there is no need to do that. It
just aggregates information from systemd/libvirt/etc so that the admin has
the complete picture of what the cgroups are being used for.
Post by Tejun Heo
* The configurations aren't independent. e.g. for weight-based
controllers, your weight is only meaningful in relation to other
weights at that level. Distributing configuration to whatever
entities which may write to cgroupfs simply cannot work. It's
fundamentally flawed.
I agree that whatever is setting weight values needs to be aware of
what other weight values are set at the same point in the hierarchy.
This doesn't imply we have to have a single authority setting these
values though, just that anything that wants to set them, needs to
be aware of the bigger picture.
Post by Tejun Heo
* It's fragile like hell. There's no accountability. Nobody really
knows what's going on. Is this subdirectory still there due to a
bug in this program, or something or someone else created it and
crashed / forgot to remove it, or what? Oh, the cgroup I wanted to
create already exists. Maybe the previous instance created it and
then crashed or maybe some other program just happened to choose the
same name. Who owns config knobs in that directory? This way lies
madness. I understand why the Pax doc exists but I'm not sure its
long-term effect would be positive - best practices which ultimately
lead to utter confusion and fragility.
I don't see that creating a "single authority" magically solves any
of the problems you describe. For example, such an authority can't
know whether it should delete a cgroup just because an application
exits. It is quite possible an application would want the cgroup to
continue to exist, so that it is still there when it restarts.
Post by Tejun Heo
* In many cases, resource distribution is system-wide policy decisions
and determining what to do often requires system-wide knowledge.
You can't provision memory limits without knowing what's available
in the system and what else is going on in the system, and you want
to be able to adjust them as situation and configuration changes.
Without anybody having full picture of how resources are
provisioned, how would any of that be possible?
Ultimately it is the end admin or top level management tool that has
the whole picture. The Workman library / cli is aiming to provide
admins / apps with the complete picture of everything that is using
resources on the system, so they can adjust policies dynamically.
Post by Tejun Heo
I think this anything-goes approach is prevalent largely because the
cgroup filesystem interface encourages such usage. From the looks of
it, the filesystem permissions combined with hierarchy should be able
to handle delegation perfectly. Well, as it currently stands, it's
anything but and the interface is just misleading. Hierarchy support
was an utter mess, configuration schemes aren't uniform across
controllers, and, more fundamentally, hierarchy itself is expensive -
we can't delegate hierarchy creation to unprivileged users or
programs safely.
You seem to be implying that 'distributed == anything goes', which is
certainly not what I consider to be the case. Indeed the main point
of having the PaxControlGroups guidelines is explicitly because we do
*not* want an "anything goes" approach.

We ultimately do need the ability to delegate hierarchy creation to
unprivileged users / programs, in order to allow containerized OS to
have the ability to use cgroups. Requiring any applications inside a
container to talk to a cgroups "authority" existing on the host OS is
not a satisfactory architecture. We need to allow for a container to
be self-contained in its usage of cgroups.

At the same time, we don't need/want to give them unrestricted ability
to create arbitrarily complex hierarchies - we need some limits on it
to avoid them exposing pathologically bad kernel behaviour.

This could be as simple as saying that each cgroup controller directory
has a tunable "cgroups.max_children" and/or "cgroups.max_depth" which
allow limits to be placed when delegating administration of part of a
cgroups tree to an unprivileged user.
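
The semantics of such limits can be sketched in userspace as below. Note that
"cgroups.max_children" and "cgroups.max_depth" are only the tunables proposed
here, not existing kernel files, and a real implementation would presumably
refuse the mkdir in the kernel rather than audit the tree afterwards. The
function name and arguments are made up for this example.

```python
import os


def within_limits(subtree_root, max_children, max_depth):
    """Check a delegated subtree against per-directory and depth limits.

    max_children: most child directories any single cgroup may have.
    max_depth: deepest allowed level below subtree_root (root is depth 0).
    """
    for dirpath, dirnames, _filenames in os.walk(subtree_root):
        rel = os.path.relpath(dirpath, subtree_root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth > max_depth:
            return False
        if len(dirnames) > max_children:
            return False
    return True
```

An administrator delegating a subtree could run such a check periodically, but
that only detects a violation after the fact, which is the argument for
putting the limit next to the mkdir path itself.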
Post by Tejun Heo
I think the only logical thing to do is creating a centralized
userland authority which takes full ownership of the cgroup filesystem
interface, gives it a sane structure, represents available resources
in a sane form, and makes policy decisions based on configuration and
requests. I don't have a concrete idea what that authority should be
like, but I think there already are pretty similar facilities in our
userland, and don't see why this should be much different.
I don't think that requiring a single userspace authority is
satisfactory. We need to be able to delegate this to containers,
without them needing to talk to some authority back in the
host OS, so that they remain 100% isolated from processes in
the host OS.
Post by Tejun Heo
Another reason why this could be helpful is that we're gonna be
morphing towards unified hierarchy and it'd very nice to have
something which can match impedance between the old and new ways and
not require each individual consumer of cgroup to handle such changes.
As for the unified hierarchy, we just have to. It's currently
fundamentally broken in that it's impossible to tell which cgroup a
resource belongs to independent of which task is looking at it. It's
like this damn thing is designed to honor Heisenberg and Einstein. No
disrespect for the great minds, but it just doesn't look like the
proper place.
I've no disagreement that we need a unified hierarchy. The workman
app explicitly does /not/ expose the concept of differing hierarchies
per controller. Likewise libvirt will not allow the user to configure
non-unified hierarchies.
Post by Tejun Heo
So, umm, that's what I want. When I first heard of WorkMan, I was
excited thinking maybe the universe is being really nice and making
things happen to my wishes without me actually doing anything. :) Oh
well, one can dream, but everything is still early, so hopefully we
have enough time to figure things out.
What do you guys think?
We need to make the distributed approach work in order to support
containers, without requiring them to have a back-channel open to
the host userspace. If we can do that, then we've solved the problem
of delegation to unprivileged users in non-container environments too.
IMHO with a sufficiently specified PaxControlGroups the distributed
approach is just fine. If applications are badly behaved and don't
follow the rules, then so be it, file bugs against those apps. Both
libvirt & systemd are committed to following rules for co-operating
in usage of cgroups & Workman can provide a "single unified view"
for the administrator without requiring a single authority too.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Tejun Heo
2013-04-09 19:38:51 UTC
Hello, Daniel.
Post by Daniel P. Berrange
The PaxControlGroups document is the key piece to making distributed
management work. This document does need updating, since some of what
it describes doesn't really work, but its goal is sound IMHO.
I think we should add a comment to the doc saying "this is how to keep
things from falling apart completely but in no way is a long term
solution."
Post by Daniel P. Berrange
The Workman library is presuming that apps will follow the PaxControlGroups
guidelines for use of cgroups, and from there aims to provide system
administrators with a "single world view" and tools to then configure
this. It does not, however, attempt to force itself underneath the
apps like systemd / libvirt, since there is no need to do that. It
just aggregates information from systemd/libvirt/etc so that the admin has
the complete picture of what the cgroups are being used for.
I suppose that can be useful for now but pretty strongly disagree it
would be acceptable as long term solution.
Post by Daniel P. Berrange
I don't see that creating a "single authority" magically solves any
of the problems you describe. For example, such an authority can't
know whether it should delete a cgroup just because an application
exits. It is quite possible an application would want the cgroup to
continue to exist, so that it is still there when it restarts.
Sure, then make it request the persistence explicitly. The debate is
not whether trusting each individual player can produce a similar result.
Sure, that's in the realm of possibility. If you push it as far as
"everyone" should and would behave properly even on edge cases, I
would have to add "theoretical" there tho. The debate is which is the
better way to achieve the desired goals and up until now I don't see
any pros for the distributed approach other than "this is what we've
been doing till now".
Post by Daniel P. Berrange
Ultimately it is the end admin or top level management tool that has
the whole picture. The Workman library / cli is aiming to provide
admins / apps with the complete picture of everything that is using
resources on the system, so they can adjust policies dynamically.
Again, I don't know. It can be useful for now I suppose. I just
can't see it being the long term solution.
Post by Daniel P. Berrange
You seem to be implying that 'distributed == anything goes', which is
certainly not what I consider to be the case. Indeed the main point
of having the PaxControlGroups guidelines is explicitly because we do
*not* want an "anything goes" approach.
Yeah, by asking for cooperation from individual players without any way
to monitor or police them.
Post by Daniel P. Berrange
We ultimately do need the ability to delegate hierarchy creation to
unprivileged users / programs, in order to allow containerized OS to
have the ability to use cgroups. Requiring any applications inside a
container to talk to a cgroups "authority" existing on the host OS is
not a satisfactory architecture. We need to allow for a container to
be self-contained in its usage of cgroups.
I'm not sure about this one. Yeah, we might need delegation there at
least for now. That said, it's not gonna be completely consistent.
Root cgroup is special for several controllers and we even have
controllers which propagate config changes down the hierarchy. It
just isn't built for proper delegation.
Post by Daniel P. Berrange
I don't think that requiring a single userspace authority is
satisfactory. We need to be able to delegate this to containers,
without them needing to talk to some authority back in the
host OS, so that they remain 100% isolated from processes in
the host OS.
It's unlikely to work that well. I think a good mental image to have
for cgroup is that of sysctl rather than a generic file system. You
can't go delegate sysctl control knobs to containers or !root users.
You need an extra layer of control to do that. It's true that such
policing could happen in the kernel, but something in the kernel being
exposed to untrusted entities has a lot of implications as the kernel
now becomes heavily involved in *policy* decisions as to what can be
allowed and what can't be and the kernel has a lot less latitude in
making those decisions compared to userland base system.

There are also security implications. memcg control knobs directly
regulate the operation of memory reclaim and writeback. I wouldn't be
surprised if there are pretty easy ways to make them go bonkers while
staying inside the limits from the parent. Again, think of sysctl.
You don't wanna hand these out to untrusted entities.
Post by Daniel P. Berrange
We need to make the distributed approach work in order to support
containers, without requiring them to have a back-channel open to
the host userspace. If we can do that, then we've solved the problem
of delegation to unprivileged users in non-container environments too.
IMHO with a sufficiently specified PaxControlGroups the distributed
approach is just fine. If applications are badly behaved and don't
follow the rules, then so be it, file bugs against those apps. Both
libvirt & systemd are committed to following rules for co-operating
in usage of cgroups & Workman can provide a "single unified view"
for the administrator without requiring a single authority too.
Well, you guys can try I guess. Maybe I'm wrong and workman turns out
to be awesome. I'll be happy to switch my position then, but for now,
the kernel isn't moving in that direction.

Thanks.
--
tejun
Tejun Heo
2013-04-09 19:46:40 UTC
A bit of addition.
Post by Daniel P. Berrange
We need to make the distributed approach work in order to support
containers, without requiring them to have a back-channel open to
the host userspace. If we can do that, then we've solved the problem
Why is back-channel such a bad thing? Even fully virtualized
environments do special things to communicate with the host (the whole
stack of virt drivers). It is sub-optimal and pointless to make
everything completely transparent. There's nothing wrong with the
base system knowing that they're inside a container or a virtualized
environment, so I don't understand why a back-channel is such a big
problem.

Thanks.
--
tejun
Serge Hallyn
2013-04-09 21:04:22 UTC
Post by Tejun Heo
A bit of addition.
Post by Daniel P. Berrange
We need to make the distributed approach work in order to support
containers, without requiring them to have a back-channel open to
the host userspace. If we can do that, then we've solved the problem
Why is back-channel such a bad thing? Even fully virtualized
environments do special things to communicate with the host (the whole
stack of virt drivers). It is sub-optimal and pointless to make
everything completely transparent. There's nothing wrong with the
base system knowing that they're inside a container or a virtualized
environment, so I don't understand why a back-channel is such a big
problem.
Agreed, that's fine so long as it will be a consistent interface.
Ideally, we could do it in a way that the container monitor can
transparently proxy between userspace inside the container and the
library on the host - so that userspace can 'use cgroups' the same
way no matter where it is.

So for instance if there is a dbus call saying "please create cgroup
/x with (some constraints) and put $$ into it", "something" in the
container can convert that into "please create cgroup /lxc/c1/x
and put (host_uid($$)) into it" and pass that to the host's (or
parent container's) "something".

So perhaps it is best if the container monitor, living in the parent
namespaces, opens a socket '@cgroup_monitor' in the container
namespace (through setns), listens for container-userspace requests
there, and passes them on to the host's monitor (which hopefully
also listens on '@cgroup_monitor', @ being '\0'). Note that my
mention of converting pids requires a new kernel feature which we
don't currently have (but have wanted for a long time).
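
The request rewriting step could be sketched as follows. This is only an
illustration under stated assumptions: the request is a plain dict, the
container's prefix ("/lxc/c1" in the example above) is known to the monitor,
and pid_map is a hypothetical lookup table standing in for the pid conversion
feature that the kernel does not provide yet.

```python
import posixpath


def to_host_request(container_prefix, request, pid_map):
    """Rewrite a container's cgroup request into the host's view.

    request: {"path": cgroup path as seen in the container, "pid": pid}.
    pid_map: hypothetical dict of container pid -> host pid.
    """
    # Normalize as an absolute path so ".." cannot climb above the root
    # of the container's own view.
    requested = posixpath.normpath("/" + request["path"].lstrip("/"))
    host_path = posixpath.normpath(container_prefix.rstrip("/") + requested)
    return {"path": host_path, "pid": pid_map[request["pid"]]}
```

For example, a request for "/x" from container c1 becomes a request for
"/lxc/c1/x" on the host, and a nested container's monitor could apply the
same rewrite again before passing the request further up.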

-serge
Tejun Heo
2013-04-09 21:11:52 UTC
Hey, Serge.
Post by Serge Hallyn
So for instance if there is a dbus call saying "please create cgroup
/x with (some constraints) and put $$ into it", "something" in the
container can convert that into "please create cgroup /lxc/c1/x
and put (host_uid($$)) into it" and pass that to the host's (or
parent container's) "something".
Yeap, definitely. It shouldn't be difficult to make it transparent to
individual consumers. It would actually be far easier to achieve that
with userland agent which knows what's going on in the middle.
Post by Serge Hallyn
So perhaps it is best if the container monitor, living in the parent
namespaces, opens a socket '@cgroup_monitor' in the container
namespace (through setns), listens for container-userspace requests
there, and passes them on to the host's monitor (which hopefully
also listens on '@cgroup_monitor', @ being '\0'). Note that my
mention of converting pids requires a new kernel feature which we
don't currently have (but have wanted for a long time).
Yeah, details may change but in principle something like that.

Thanks.
--
tejun
Li Zefan
2013-04-16 11:17:17 UTC
Post by Tejun Heo
Hello, guys.
Status-quo
==========
It's been about a year since I wrote up a summary on cgroup status quo
and future plans. We're not there yet but much closer than we were
before. At least the locking and object life-time management aren't
crazy anymore and most controllers now support proper hierarchy
although not all of them agree on how to treat inheritance.
IIRC, the yet-to-be-converted ones are blk-throttle and perf. cpu
needs to be updated so that it at least supports a similar mechanism
as cfq-iosched for configuring ratio between tasks on an internal
cgroup and its children. Also, we really should update how cpuset
handles a cgroup becoming empty (no cpus or memory node left due to
hot-unplug). It currently transfers all its tasks to the nearest
ancestor with executing resources, which is an irreversible process
which would affect all other co-mounted controllers. We probably want
it to just take on the masks of the ancestor until its own executing
resources become online again, and the new behavior should be gated
behind a switch (Li, can you please look into this?).
Sure, I'll be working on sane hierarchy behavior for cpuset.
Tejun Heo
2013-04-16 17:10:56 UTC
Hello, Li.

On Tue, Apr 16, 2013 at 07:17:17PM +0800, Li Zefan wrote:
...
Post by Li Zefan
Post by Tejun Heo
hot-unplug). It currently transfers all its tasks to the nearest
ancestor with executing resources, which is an irreversible process
which would affect all other co-mounted controllers. We probably want
it to just take on the masks of the ancestor until its own executing
resources become online again, and the new behavior should be gated
behind a switch (Li, can you please look into this?).
Sure, I'll be working on sane hierarchy behavior for cpuset.
Great, it'd be great if you can share how it's gonna be done once the
basic design gets settled before full implementation.

Thanks a lot!
--
tejun
Li Zefan
2013-04-17 01:29:32 UTC
Post by Tejun Heo
Hello, Li.
...
Post by Li Zefan
Post by Tejun Heo
hot-unplug). It currently transfers all its tasks to the nearest
ancestor with executing resources, which is an irreversible process
which would affect all other co-mounted controllers. We probably want
it to just take on the masks of the ancestor until its own executing
resources become online again, and the new behavior should be gated
behind a switch (Li, can you please look into this?).
Sure, I'll be working on sane hierarchy behavior for cpuset.
Great, it'd be great if you can share how it's gonna be done once the
basic design gets settled before full implementation.
The basic idea is, when a cpuset becomes empty due to hotplug, we don't
move the tasks in it, but instead we update tasks' cpumask/nodemask using
the nearest non-empty ancestor cpuset's cpus_allowed and mems_allowed.

- then it's allowed to move those tasks from the empty cpuset to another
cpuset

- when this ancestor cpuset's cpumask/nodemask is changed (either by writing
cpuset.cpus/mems or hotplug), not only the tasks in it but also tasks in
the empty cpuset will be updated.

- it's allowed to move a task to an empty cpuset, and the task's cpumask/nodemask
will be updated according to the nearest non-empty ancestor, no matter if this
empty cpuset is exclusive or not.

- if a previously offlined cpu becomes online again, the empty cpuset won't
get this cpu resource automatically, which is the current behavior.
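
A toy model of these rules, just to check the semantics (this is not kernel
code; the class and function names are invented, and only the cpu mask is
modelled, the node mask would follow the same pattern):

```python
class Cpuset:
    """Minimal stand-in for a cpuset: a configured cpu set and a parent."""

    def __init__(self, name, cpus, parent=None):
        self.name = name
        self.cpus = set(cpus)  # configured mask; may become empty
        self.parent = parent


def effective_cpus(cpuset):
    """Mask that tasks in this cpuset would run with.

    An empty cpuset borrows from its nearest non-empty ancestor;
    its tasks are not moved anywhere.
    """
    node = cpuset
    while node is not None and not node.cpus:
        node = node.parent
    return set(node.cpus) if node is not None else set()


root = Cpuset("root", {0, 1, 2, 3})
a = Cpuset("a", {2, 3}, parent=root)
b = Cpuset("b", {3}, parent=a)

# cpu 3 is hot-unplugged: b becomes empty and follows a's mask.
a.cpus.discard(3)
b.cpus.discard(3)
```

After the unplug, effective_cpus(b) follows whatever a is configured with, and
later changes to a show up in b as well. If cpu 3 comes back, b's own
configured mask stays empty until someone writes to it again.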

How does this sound?
Tim Hockin
2013-04-22 21:26:48 UTC
Hi Tejun,

This email worries me. A lot. It sounds very much like retrograde
motion from our (Google's) point of view.

We absolutely depend on the ability to split cgroup hierarchies. It
pretty much saved our fleet from imploding, in a way that a unified
hierarchy just could not do. A mandated unified hierarchy is madness.
Please step away from the ledge.

More, going towards a unified hierarchy really limits what we can
delegate, and that is the word of the day. We've got a central
authority agent running which manages cgroups, and we want out of this
business. At least, we want to be able to grant users a set of
constraints, and then let them run wild within those constraints.
Forcing all such work to go through a daemon has proven to be very
problematic, and it has been great now that users can have DIY
sub-cgroups.
Post by Daniel P. Berrange
We ultimately do need the ability to delegate hierarchy creation to
unprivileged users / programs, in order to allow containerized OS to
have the ability to use cgroups. Requiring any applications inside a
container to talk to a cgroups "authority" existing on the host OS is
not a satisfactory architecture. We need to allow for a container to
be self-contained in its usage of cgroups.
This! A thousand times, this!
Post by Daniel P. Berrange
At the same time, we don't need/want to give them unrestricted ability
to create arbitrarily complex hierarchies - we need some limits on it
to avoid them exposing pathologically bad kernel behaviour.
This could be as simple as saying that each cgroup controller directory
has a tunable "cgroups.max_children" and/or "cgroups.max_depth" which
allow limits to be placed when delegating administration of part of a
cgroups tree to an unprivileged user.
We've been bitten by this, and more limitations would be great. We've
got some less-than-perfect patches that impose limits for us now.
Post by Daniel P. Berrange
I've no disagreement that we need a unified hierarchy. The workman
app explicitly does /not/ expose the concept of differing hierarchies
per controller. Likewise libvirt will not allow the user to configure
non-unified hierarchies.
Strong disagreement, here. We use split hierarchies to great effect.
Containment should be composable. If your users or abstractions can't
handle it, please feel free to co-mount the universe, but please
PLEASE don't force us to.

I'm happy to talk more about what we do and why.

Tim
Tejun Heo
2013-04-22 21:41:59 UTC
Hello, Tim.
Post by Tim Hockin
We absolutely depend on the ability to split cgroup hierarchies. It
pretty much saved our fleet from imploding, in a way that a unified
hierarchy just could not do. A mandated unified hierarchy is madness.
Please step away from the ledge.
You need to be a lot more specific about why unified hierarchy can't
be implemented. The last time I asked around blk/memcg people in
google, while they said that they'll need different levels of
granularities for different controllers, google's use of cgroup
doesn't require multiple orthogonal classifications of the same group
of tasks.

Also, cgroup isn't dropping multiple hierarchy support over-night.
What has been working till now will continue to work for a very long
time. If there is no fundamental conflict with the future changes,
there should be enough time to migrate gradually as desired.
Post by Tim Hockin
More, going towards a unified hierarchy really limits what we can
delegate, and that is the word of the day. We've got a central
authority agent running which manages cgroups, and we want out of this
business. At least, we want to be able to grant users a set of
constraints, and then let them run wild within those constraints.
Forcing all such work to go through a daemon has proven to be very
problematic, and it has been great now that users can have DIY
sub-cgroups.
Sorry, but that doesn't work properly now. It gives you the illusion
of proper delegation but it's inherently dangerous. If that sort of
illusion has been / is good enough for your setup, fine. Delegate at
your own risks, but cgroup in itself doesn't support delegation to
lesser security domains and it won't in the foreseeable future.
Post by Tim Hockin
Strong disagreement, here. We use split hierarchies to great effect.
Containment should be composable. If your users or abstractions can't
handle it, please feel free to co-mount the universe, but please
PLEASE don't force us to.
I'm happy to talk more about what we do and why.
Please do so. Why do you need multiple orthogonal hierarchies?

Thanks.
--
tejun
Tim Hockin
2013-04-22 22:33:04 UTC
Post by Tejun Heo
Hello, Tim.
Post by Tim Hockin
We absolutely depend on the ability to split cgroup hierarchies. It
pretty much saved our fleet from imploding, in a way that a unified
hierarchy just could not do. A mandated unified hierarchy is madness.
Please step away from the ledge.
You need to be a lot more specific about why unified hierarchy can't
be implemented. The last time I asked around blk/memcg people in
google, while they said that they'll need different levels of
granularities for different controllers, google's use of cgroup
doesn't require multiple orthogonal classifications of the same group
of tasks.
I'll pull some concrete examples together. I don't have them on hand,
and I am out of the country this week. I have looped in the gang at work
(though some are here with me).
Post by Tejun Heo
Also, cgroup isn't dropping multiple hierarchy support over-night.
What has been working till now will continue to work for a very long
time. If there is no fundamental conflict with the future changes,
there should be enough time to migrate gradually as desired.
Post by Tim Hockin
More, going towards a unified hierarchy really limits what we can
delegate, and that is the word of the day. We've got a central
authority agent running which manages cgroups, and we want out of this
business. At least, we want to be able to grant users a set of
constraints, and then let them run wild within those constraints.
Forcing all such work to go through a daemon has proven to be very
problematic, and it has been great now that users can have DIY
sub-cgroups.
Sorry, but that doesn't work properly now. It gives you the illusion
of proper delegation but it's inherently dangerous. If that sort of
illusion has been / is good enough for your setup, fine. Delegate at
your own risks, but cgroup in itself doesn't support delegation to
lesser security domains and it won't in the foreseeable future.
We've had great success letting users create sub-cgroups in a few
specific controller types (cpu, cpuacct, memory). This is, of course,
with some restrictions. We do not just give them blanket access to
all knobs. We don't need ALL cgroups, just the important ones.
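
A rough sketch of what restricted access to selected knobs could look like using plain file permissions, assuming a hypothetical allow-list. `cpu.shares`, `cpu.rt_runtime_us` and `tasks` are cgroup v1 file names, but the allow-list, the modes and the `delegate` helper are illustrative only and are not the setup described in this message. The demonstration runs on a scratch directory, not a real cgroup mount.

```python
import os
import stat
import tempfile

# Hypothetical allow-list: the only knobs a delegated user may write.
DELEGATED_KNOBS = {"cpu.shares", "tasks"}


def delegate(cgroup_dir, knobs=DELEGATED_KNOBS):
    """Make allow-listed files group-writable and everything else read-only."""
    for name in os.listdir(cgroup_dir):
        path = os.path.join(cgroup_dir, name)
        if not os.path.isfile(path):
            continue
        mode = 0o664 if name in knobs else 0o444
        os.chmod(path, mode)


def mode_of(path):
    """Return only the permission bits of a file."""
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstration on a scratch directory standing in for a cgroup directory.
scratch = tempfile.mkdtemp()
for name in ("cpu.shares", "cpu.rt_runtime_us", "tasks"):
    open(os.path.join(scratch, name), "w").close()
delegate(scratch)
```

Permissions of this kind only restrict which files can be written; they do not bound how many sub-groups are created or how deep they go, which is the concern raised elsewhere in the thread.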

For a simple example, letting users create sub-groups in freezer or
job (we have a job group that we've been carrying) lets them launch
sub-tasks and manage them in a very clean way.

We've been doing a LOT of development internally to make user-defined
sub-memcgs work in our cluster scheduling system, and it's made some
of our biggest, more insane users very happy.

And for some cgroups, like cpuset, hierarchy just doesn't really make
sense to me. I just don't care if that never works, though I have no
problem with others wanting it. :) Aside: if the last CPU in your
cpuset goes offline, you should go into a state akin to freezer.
Running on any other CPU is an overt violation of policy that the
user, or worse - the admin, set up. Just my 2cents.
Post by Tejun Heo
Post by Tim Hockin
Strong disagreement, here. We use split hierarchies to great effect.
Containment should be composable. If your users or abstractions can't
handle it, please feel free to co-mount the universe, but please
PLEASE don't force us to.
I'm happy to talk more about what we do and why.
Please do so. Why do you need multiple orthogonal hierarchies?
Look for this in the next few days/weeks. From our point of view,
cgroups are the ideal match for how we want to manage things (no
surprise, really, since Mr. Menage worked on both).

Tim
Tim Hockin
2013-06-22 23:13:41 UTC
I'm very sorry I let this fall off my plate. I was pointed at a
systemd-devel message indicating that this is done. Is it so? It
seems so completely ass-backwards to me. Below is one of our use-cases
that I just don't see how we can reproduce in a single hierarchy.
We're also long into the model that users can control their own
sub-cgroups (moderated by permissions decided by admin SW up front).

We have classes of jobs which can run together on shared machines. This is
VERY important to us, and is a key part of how we run things. Over the years
we have evolved from very little isolation to fairly strong isolation, and
cgroups are a large part of that.

We have experienced and adapted to a number of problems around isolation over
time. I won't go into the history of all of these, because it's not so
relevant, but here is how we set things up today.
From a CPU perspective, we have two classes of jobs: production and batch.
Production jobs can (but don't always) ask for exclusive cores, which ensures
that no batch work runs on those CPUs. We manage this with the cpuset cgroup.
Batch jobs are relegated to the set of CPUs that are "left-over" after
exclusivity rules are applied. This is implemented with a shared subdirectory
of the cpuset cgroup called "batch". Production jobs get their own
subdirectories under cpuset.
From an IO perspective we also have two classes of jobs: normal and
DTF-approved. Normal jobs do not get strong isolation for IO, whereas
DTF-enabled jobs do. The vast majority of jobs are NOT DTF-enabled, and they
share a nominal amount of IO bandwidth. This is implemented with a shared
subdirectory of the io cgroup called "default". Jobs that are DTF-enabled get
their own subdirectories under IO.

This gives us 4 combinations:
1) { production, DTF }
2) { production, non-DTF }
3) { batch, DTF }
4) { batch, non-DTF }

Of these, (3) is sort of nonsense, but the others are actually used and
needed. This is only possible because of split hierarchies. In fact, we
undertook a very painful
process to move from a unified cgroup hierarchy to split hierarchies in large
part _because of_ these examples.
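
The placement rules above can be sketched as a small mapping from a job to one directory per hierarchy. The function name, the argument names and the `cpuset/` and `io/` path prefixes are assumptions made for illustration; only the "batch" and "default" shared groups come from the description.

```python
def cgroup_paths(job, cpu_class, dtf):
    """Map a job to its directory in each independently mounted hierarchy."""
    # cpuset hierarchy: batch jobs share one group, production jobs get their own.
    cpuset = "cpuset/batch" if cpu_class == "batch" else f"cpuset/{job}"
    # io hierarchy: non-DTF jobs share "default", DTF jobs get their own.
    io = f"io/{job}" if dtf else "io/default"
    return {"cpuset": cpuset, "io": io}
```

A production non-DTF job and a batch non-DTF job land in different cpuset groups but in the same `io/default` group. In this sketch the two classifications are computed independently of each other, which is the property the split hierarchies provide.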

And for more fun, I am simplifying this all. Batch jobs are actually bound to
NUMA-node specific cpuset cgroups when possible. And we have a similar
concept for the cpu cgroup as for cpuset. And we have a third tier of IO
jobs. We don't do all of this for fun - it is in direct response to REAL
problems we have experienced.

Making cgroups composable allows us to build a higher level abstraction that
is very powerful and flexible. Moving back to unified hierarchies goes
against everything that we're doing here, and will cause us REAL pain.
Tejun Heo
2013-06-26 21:20:47 UTC
Hello, Tim.
I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility? I mean, isn't Linux supposed to be the
OS with the stable kernel interface? I've seen Linus rant time and
time again about this - why is it OK now?
What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
one will stay there for the foreseeable future. It will be phased out
eventually but that's gonna take a long time and it will have to be
something hardly noticeable. Of course new features will only be
available with the new interface and there will be efforts to nudge
people away from the old one but the existing interface will keep
working as it does.
Examples? We obviously don't grant full access, but our kernel gang
and security gang seem to trust the bits we're enabling well enough...
Then the security gang doesn't have any clue what's going on, or at
least operating on very different assumptions (ie. the workloads are
trusted by default). You can OOM the whole kernel by creating many
cgroups, completely mess up controllers by creating deep hierarchies,
affect your siblings by adjusting your weight and so on. It's really
easy to DoS the whole system if you have write access to a cgroup
directory.
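
The sibling effect of weights can be illustrated with simple arithmetic, assuming a controller that divides a contended resource among siblings in proportion to their weights. The group names and weight values are made up.

```python
def shares(weights):
    """Fraction of a contended resource each sibling gets from its weight."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Three siblings start out equal.
before = shares({"a": 100, "b": 100, "c": 100})
# Group "a" raises only its own weight; "b" and "c" change nothing.
after = shares({"a": 800, "b": 100, "c": 100})
```

In this model "b" drops from a third of the resource to a tenth without any of its own settings changing, which is why write access to one's own weight is not a purely local matter.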
The non-DTF jobs have a combined share that is small but non-trivial.
If we cut that share in half, giving one slice to prod and one slice
to batch, we get bad sharing under contention. We tried this. We
Why is that tho? It *should* work fine and I can't think of a reason
why that would behave particularly badly off the top of my head.
Maybe I forgot too much of the iosched modification used in google.
Anyways, if there's a problem, that should be fixable, right? And
controller-specific issues like that shouldn't really dictate the
architectural design too much.
could add control loops in userspace code which try to balance the
shares in proportion to the load. We did that with CPU, and it's sort
Yeah, that is horrible.
of horrible. We're moving AWAY from all this craziness in favor of
well-defined hierarchical behaviors.
But I don't follow the conclusion here. For short term workaround,
sure, but having that dictate the whole architecture decision seems
completely backwards to me.
It's a bit naive to think that this is some absolute truth, don't you
think? It just isn't so. You should know better than most what
craziness our users do, and what (legit) rationales they can produce.
I have $large_number of machines running $huge_number of jobs from
thousands of developers running for years upon years backing up my
worldview.
If so, you aren't communicating it very well. I've talked with quite
a few people about multiple orthogonal hierarchies including people
inside google. Sure, some are using it as it is there but I couldn't
find strong enough rationale to continue that way given the amount of
crazys it implies / encourages. On the other hand, most people agreed
that having a unified hierarchy with differing level of granularities
would serve their cases well enough while not being crazy.

Really, "I have $huge_number of machines configured a certain way" isn't
much of an argument when unified hierarchy isn't gonna break them and
many people involved in cgroup both on kernel and userland sides share
the view that the whole thing is a hellish mess which can only be used
by crafting very specialized configurations for each setup.
I'm not sure I really grok that statement. I'm OK with defining new
That's about google's blkcg modifications to support blkcg on
writeback IOs. It works but can't be upstreamed as it requires
tagging each page both with memcg and blkcg tags.
rules that bring some order to the chaos. Give us new rules to live
by. All-or-nothing would be fine. What if mounting cgroupfs gives me
N sub-dirs, one for each compiled-in controller? You could make THAT
the mount option - you can have either a unified hierarchy of all
controllers or fully disjoint hierarchies. Or some other rule.
Now I'm lost as to what you're talking about. But the summary is, in the
future, use a single unified hierarchy with differing granularities.
It's still being worked on, so, for now, try not to depend on creating
completely orthogonal hierarchies for different controllers.
The time frame you talk about IS reason for panic. If I know that
What time frame are you referring to?
you're going to completely screw me in a year and a half, I have to
How the hell am I gonna screw you in a year and a half? What are you
talking about? Where is this coming from?
start moving NOW to find new ways to hack around the mess you're
making, make my userspace mesh with it, test those things with
critical customers, find a way to deploy it safely to a bajillion
machines, handle inevitable rollback issues, and so on and so on.
Moving from single hierarchy to split hierarchy LITERALLY took 2
years.
So yeah, I'm in a bit of a panic. You're making a huge amount of work
for us. You're breaking binary compatibility of the (probably)
largest single installation of Linux in the world. And you're being
kind of flip about the reality of it, which is so weird to me,
considering you have first-hand experience with it all.
I frankly have no idea what you're talking about. Calm down and try
to understand what's actually going on.

Thanks.
--
tejun
Tim Hockin
2013-06-27 00:06:02 UTC
Post by Tejun Heo
Hello, Tim.
I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility? I mean, isn't Linux supposed to be the
OS with the stable kernel interface? I've seen Linus rant time and
time again about this - why is it OK now?
What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
The first assertion, as I understood, was that (eventually) cgroupfs
will not allow split hierarchies - that unified hierarchy would be the
only mode. Is that not the case?

The second assertion, as I understood, was that (eventually) cgroupfs
would not support granting access to some cgroup control files to
users (through chown/chmod). Is that not the case?
Post by Tejun Heo
one will stay there for the foreseeable future. It will be phased out
eventually but that's gonna take a long time and it will have to be
something hardly noticeable. Of course new features will only be
available with the new interface and there will be efforts to nudge
people away from the old one but the existing interface will keep
working as it does.
Hmm, so what exactly is changing then? If, as you say here, the
existing interfaces will keep working - what is changing?
Post by Tejun Heo
Examples? We obviously don't grant full access, but our kernel gang
and security gang seem to trust the bits we're enabling well enough...
Then the security gang doesn't have any clue what's going on, or at
least operating on very different assumptions (ie. the workloads are
trusted by default). You can OOM the whole kernel by creating many
cgroups, completely mess up controllers by creating deep hierarchies,
affect your siblings by adjusting your weight and so on. It's really
easy to DoS the whole system if you have write access to a cgroup
directory.
As I said, it's controlled delegated access. And we have some patches
that we carry to prevent some of these DoS situations.
Post by Tejun Heo
The non-DTF jobs have a combined share that is small but non-trivial.
If we cut that share in half, giving one slice to prod and one slice
to batch, we get bad sharing under contention. We tried this. We
Why is that tho? It *should* work fine and I can't think of a reason
why that would behave particularly badly off the top of my head.
Maybe I forgot too much of the iosched modification used in google.
Anyways, if there's a problem, that should be fixable, right? And
controller-specific issues like that shouldn't really dictate the
architectural design too much.
I actually can not speak to the details of the default IO problem, as
it happened before I really got involved. But just think through it.
If one half of the split has 5 processes running and the other half
has 200, the processes in the 200 set each get FAR less spindle time
than those in the 5 set. That is NOT the semantic we need. We're
trying to offer ~equal access for users of the non-DTF class of jobs.
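
The arithmetic behind this, as a small sketch. It assumes the non-DTF share is normalized to 1.0 and that each group divides its share evenly among its processes, which is a simplification of how a real IO scheduler behaves.

```python
def per_process_share(group_share, nprocs):
    """Share each process gets if a group splits its share evenly."""
    return group_share / nprocs


pool = 1.0  # the whole non-DTF share, normalized

# Split in half: one half for the 5-process side, one for the 200-process side.
small = per_process_share(pool / 2, 5)
large = per_process_share(pool / 2, 200)

# One shared group holding all 205 processes.
shared = per_process_share(pool, 205)
```

Under these assumptions a process on the 5-process side gets 40 times the share of a process on the 200-process side, whereas a single shared group gives every process the same 1/205.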

This is not the tail doing the wagging. This is your assertion that
something should work, when it just doesn't. We have two, totally
orthogonal classes of applications on two totally disjoint sets of
resources. Conjoining them is the wrong answer.
Post by Tejun Heo
could add control loops in userspace code which try to balance the
shares in proportion to the load. We did that with CPU, and it's sort
Yeah, that is horrible.
Yeah, I would love to explain some of the really nasty things we have
done and are moving away from. I am not sure I am allowed to, though
:)
Post by Tejun Heo
of horrible. We're moving AWAY from all this craziness in favor of
well-defined hierarchical behaviors.
But I don't follow the conclusion here. For short term workaround,
sure, but having that dictate the whole architecture decision seems
completely backwards to me.
My point is that the orthogonality of resources is intrinsic. Letting
"it's hard to make it work" dictate the architecture is what's
backwards.
Post by Tejun Heo
It's a bit naive to think that this is some absolute truth, don't you
think? It just isn't so. You should know better than most what
craziness our users do, and what (legit) rationales they can produce.
I have $large_number of machines running $huge_number of jobs from
thousands of developers running for years upon years backing up my
worldview.
If so, you aren't communicating it very well. I've talked with quite
a few people about multiple orthogonal hierarchies including people
inside google. Sure, some are using it as it is there but I couldn't
find strong enough rationale to continue that way given the amount of
crazys it implies / encourages. On the other hand, most people agreed
that having a unified hierarchy with differing level of granularities
would serve their cases well enough while not being crazy.
I'm not sure what "differing level of granularities" means? But that
aside, who have you spoken to here? On our internal discussions I
have not heard a SINGLE member of our prod-kernel team nor our cluster
management team who think this is a good idea. Not one.
Post by Tejun Heo
Really, "I have $huge_number of machines configured a certain way" isn't
much of an argument when unified hierarchy isn't gonna break them and
many people involved in cgroup both on kernel and userland sides share
the view that the whole thing is a hellish mess which can only be used
by crafting very specialized configurations for each setup.
I still don't really get what the hellish mess is, and why it can't be
solved another way. Your statement of "unified hierarchy isn't gonna
break them" is patently false, though. If we did this it would a)
cause a large amount of work to happen and b) cause a major regression
for our users.

If 99.99% of users in the world don't need orthogonality, then
co-mounting the controllers is a great solution for them. But for the
remainder, we need to find a solution that continues to let us do what
we are doing now, which is indeed "very specialized". That's not a
bad thing.
Post by Tejun Heo
I'm not sure I really grok that statement. I'm OK with defining new
That's about google's blkcg modifications to support blkcg on
writeback IOs. It works but can't be upstreamed as it requires
tagging each page both with memcg and blkcg tags.
rules that bring some order to the chaos. Give us new rules to live
by. All-or-nothing would be fine. What if mounting cgroupfs gives me
N sub-dirs, one for each compiled-in controller? You could make THAT
the mount option - you can have either a unified hierarchy of all
controllers or fully disjoint hierarchies. Or some other rule.
Now I'm lost what you're talking about. But the summary is, in the
future, use a single unified hierarchy with differing granularities.
It's still being worked on, so, for now, try not to depend on creating
completely orthogonal hierarchies for different controllers.
I'm trying to understand your root problem so that I can try to find
other solutions. "Just do what I say" is not a great way to defend
your position in the face of evidence to the contrary. I'm presenting
you real life cases of situations that simply do not work, neither
philosophically nor in practice, and you continue to assert that it's
fine. It's not fine.
Post by Tejun Heo
The time frame you talk about IS reason for panic. If I know that
What time frame are you referring to?
Somewhere I picked up the notion that you were talking about making
these changes in O(1.5 years). Perhaps I got that wrong. What *is*
the timeframe? At what point will everything we depend on today no
longer be supported?
Post by Tejun Heo
you're going to completely screw me in a year and a half, I have to
How the hell am I gonna screw you in a year and half? What are you
talking about? Where is this coming from?
start moving NOW to find new ways to hack around the mess you're
making, make my userspace mesh with it, test those things with
critical customers, find a way to deploy it safely to a bajillion
machines, handle inevitable rollback issues, and so on and so on.
Moving from single hierarchy to split hierarchy LITERALLY took 2
years.
So yeah, I'm in a bit of a panic. You're making a huge amount of work
for us. You're breaking binary compatibility of the (probably)
largest single installation of Linux in the world. And you're being
kind of flip about the reality of it, which is so weird to me,
considering you have first-hand experience with it all.
I frankly have no idea what you're talking about. Calm down and try
to understand what's actually going on.
OK. So please shed some light? Will split-hierarchies continue to
work for the indefinite future? Or will they be disabled at some
point? Or will they become so crippled or bit-rotted that they are
effectively removed, without having to actually say that?

I need to know what's happening here both so I can try to help nudge
the ship and so that I can make plans. As I said, it takes literally
O(year) for us to make a change like this safely.

Tim
David Lang
2013-06-26 23:14:07 UTC
Post by Tim Hockin
Post by Tejun Heo
Hello, Tim.
I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility? I mean, isn't Linux supposed to be the
OS with the stable kernel interface? I've seen Linus rant time and
time again about this - why is it OK now?
What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
The first assertion, as I understood, was that (eventually) cgroupfs
will not allow split hierarchies - that unified hierarchy would be the
only mode. Is that not the case?
The second assertion, as I understood, was that (eventually) cgroupfs
would not support granting access to some cgroup control files to
users (through chown/chmod). Is that not the case?
As a bystander, what I understand to be happening is:

1. The kernel developers are saying that multiple hierarchies are causing lots of
problems, and so they are starting the migration to a unified hierarchy. In the
near term this will be optional, at a later (unspecified) point, it will no
longer be optional.

It is recognized that this is an API break, but the problem is bad enough (too
much undefined behavior) that it looks like they are going to do this anyway.


2. Independently of this, the systemd people have declared that systemd is
going to take control of this unified hierarchy and all applications had better
use DBUS calls to systemd to make any cgroup changes or else. (i.e. systemd may
break whatever you are doing)


I don't think the kernel developers are talking about changing ways to control
cgroups, just eliminating having multiple hierarchies.



Now, I could be completely misunderstanding this (and I expect to hear about it
if I am :-)

David Lang
Tejun Heo
2013-06-27 01:04:27 UTC
Hello,
Post by Tim Hockin
The first assertion, as I understood, was that (eventually) cgroupfs
will not allow split hierarchies - that unified hierarchy would be the
only mode. Is that not the case?
No, unified hierarchy would be an optional thing for quite a while.
Post by Tim Hockin
The second assertion, as I understood, was that (eventually) cgroupfs
would not support granting access to some cgroup control files to
users (through chown/chmod). Is that not the case?
Again, it'll be an opt-in thing. The hierarchy controller would be
able to notice that and issue warnings if it wants to.
Post by Tim Hockin
Hmm, so what exactly is changing then? If, as you say here, the
existing interfaces will keep working - what is changing?
New interface is being added and new features will be added only for
the new interface. The old one will eventually be deprecated and
removed, but that's *years* away.
Post by Tim Hockin
As I said, it's controlled delegated access. And we have some patches
that we carry to prevent some of these DoS situations.
I don't know. You can probably hack around some of the most serious
problems but the whole thing isn't built for proper delegation and
that's not the direction the upstream kernel is headed at the moment.
Post by Tim Hockin
I actually can not speak to the details of the default IO problem, as
it happened before I really got involved. But just think through it.
If one half of the split has 5 processes running and the other half
has 200, the processes in the 200 set each get FAR less spindle time
than those in the 5 set. That is NOT the semantic we need. We're
trying to offer ~equal access for users of the non-DTF class of jobs.
This is not the tail doing the wagging. This is your assertion that
something should work, when it just doesn't. We have two, totally
orthogonal classes of applications on two totally disjoint sets of
resources. Conjoining them is the wrong answer.
As I've said multiple times, there sure are things that you cannot
achieve without orthogonal multiple hierarchies, but given the options
we have at hands, compromising inside a unified hierarchy seems like
the best trade-off. Please take a step back from the immediate detail
and think of the general hierarchical organization of workloads. If
DTF / non-DTF is a fundamental part of your workload classification,
that should go above.

I don't really understand your example anyway because you can classify
by DTF / non-DTF first and then just propagate cpuset settings along.
You won't lose anything that way, right?

Again, in general, you might not be able to achieve *exactly* what
you've been doing, but, an acceptable compromise should be possible
and not doing so leads to complete mess.
Post by Tim Hockin
Post by Tejun Heo
But I don't follow the conclusion here. For short term workaround,
sure, but having that dictate the whole architecture decision seems
completely backwards to me.
My point is that the orthogonality of resources is intrinsic. Letting
"it's hard to make it work" dictate the architecture is what's
backwards.
No, it's not "it's hard to make it work". It's more "it's
fundamentally broken". You can't identify a resource to be belonging
to a cgroup independent of who's looking at the resource.
Post by Tim Hockin
I'm not sure what "differing level of granularities" means? But that
It means that you'll be able to ignore subtrees depending on
controllers.
Post by Tim Hockin
aside, who have you spoken to here? On our internal discussions I
have not heard a SINGLE member of our prod-kernel team nor our cluster
management team who think this is a good idea. Not one.
Some of memcg and blkcg people in infra kernel team.
Post by Tim Hockin
I still don't really get what the hellish mess is, and why it can't be
solved another way. Your statement of "unified hierarchy isn't gonna
break them" is patently false, though. If we did this it would a)
cause a large amount of work to happen and b) cause a major regression
for our users.
No, what I meant was that unified hierarchy won't break the multiple
hierarchy support immediately.
Post by Tim Hockin
I'm trying to understand your root problem so that I can try to find
other solutions. "Just do what I say" is not a great way to defend
your position in the face of evidence to the contrary. I'm presenting
you real life cases of situations that simply do not work, neither
philosophically nor in practice, and you continue to assert that it's
fine. It's not fine.
I wrote about that many times, but here are two of the problems.

* There's no way to designate a cgroup to a resource, because cgroup
is only defined by the combination of who's looking at it for which
controller. That's how you end up with tagging the same resource
multiple times for different controllers and even then it's broken
as when you move resources from one cgroup to another, you can't
tell what to do with other tags.

While allowing obscene level of flexibility, multiple hierarchies
destroy a very fundamental concept that it *should* provide - that
of a resource container. It can't because a "cgroup" is undefined
under multiple hierarchies.

* The level of flexibility makes it very difficult to scope the common
usage models. It's a problem for both the kernel and userland. The
kernel has to be prepared to cope with anything - e.g. with unified
hierarchy, we can assume things like either all tasks in a cgroup
are frozen or not, with multiple, any combination is possible - and
the userland is generally lost on what to do and has been in a
complete disarray, and it's not really userland's fault because
enforcing any rule would mean hindering some crazy setup that
someone is using.

cgroup as it currently stands invites pretty insane usages which we
can't back out of later on. Well, it's already painful to back out
but the sooner the better. And all that for what? Allowing exotic
specialized configurations which in all likelihood will be served
acceptably with unified hierarchy anyway?
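To make the first point concrete, here is a minimal sketch of why "the cgroup of a task" has no single answer under multiple hierarchies. It parses text in the "hierarchy-id:controller-list:path" format that /proc/<pid>/cgroup uses on multi-hierarchy systems. The sample paths and the helper name cgroups_by_controller are made up for illustration and do not come from any real machine.

```python
# Sample text in the format of /proc/<pid>/cgroup on a multi-hierarchy
# system: "hierarchy-id:controller-list:path".
# The paths are hypothetical and exist only for this illustration.
SAMPLE = """\
4:cpuset:/
3:blkio:/default/foo/bar
2:memory:/bar
1:cpu:/foo
"""


def cgroups_by_controller(text):
    """Map each controller name to the cgroup path the task is in."""
    result = {}
    for line in text.splitlines():
        if not line:
            continue
        # Split at most twice so a path containing ':' stays intact.
        _hier_id, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            if name:
                result[name] = path
    return result


membership = cgroups_by_controller(SAMPLE)
# One task, several unrelated cgroup paths: which one "owns" a page that
# both memory and blkio need to account for depends on who is asking.
distinct_paths = set(membership.values())
print(membership)
print(len(distinct_paths))
```

With co-mounted controllers or a unified hierarchy, every controller would report the same path and the set above would collapse to one element.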
Post by Tim Hockin
Somewhere I picked up the notion that you were talking about making
these changes in O(1.5 years). Perhaps I got that wrong. what *is*
the timeframe? At what point will everything we depend on today no
longer be supported?
I'm making the changes as soon as possible. There of course are two
steps involved here - implementing the new thing and then removing the
old thing. Implementing the new thing is gonna happen, hopefully, in
a year's timeframe. The latter, I don't know for sure, but probably
over five years.
Post by Tim Hockin
OK. So please shed some light? Will split-hierarchies continue to
work for the indefinite future? Or will they be disabled at some
point? Or will they become so crippled or bit-rotted that they are
effectively removed, without having to actually say that?
It's gonna be properly maintained but new features in general will
only be implemented for the unified hierarchy. In time, hopefully,
the difference in capabilities between the new and old interfaces
combined with other efforts will drive users towards the new
interface. After the old interface's usage has sufficiently dwindled,
it will be deprecated.

Thanks.
--
tejun
Tim Hockin
2013-06-27 03:42:21 UTC
Post by Tejun Heo
Hello,
Post by Tim Hockin
The first assertion, as I understood, was that (eventually) cgroupfs
will not allow split hierarchies - that unified hierarchy would be the
only mode. Is that not the case?
No, unified hierarchy would be an optional thing for quite a while.
Post by Tim Hockin
The second assertion, as I understood, was that (eventually) cgroupfs
would not support granting access to some cgroup control files to
users (through chown/chmod). Is that not the case?
Again, it'll be an opt-in thing. The hierarchy controller would be
able to notice that and issue warnings if it wants to.
Post by Tim Hockin
Hmm, so what exactly is changing then? If, as you say here, the
existing interfaces will keep working - what is changing?
New interface is being added and new features will be added only for
the new interface. The old one will eventually be deprecated and
removed, but that's *years* away.
OK, then what I don't know is what is the new interface? A new cgroupfs?
Post by Tejun Heo
Post by Tim Hockin
As I said, it's controlled delegated access. And we have some patches
that we carry to prevent some of these DoS situations.
I don't know. You can probably hack around some of the most serious
problems but the whole thing isn't built for proper delegation and
that's not the direction the upstream kernel is headed at the moment.
Post by Tim Hockin
I actually can not speak to the details of the default IO problem, as
it happened before I really got involved. But just think through it.
If one half of the split has 5 processes running and the other half
has 200, the processes in the 200 set each get FAR less spindle time
than those in the 5 set. That is NOT the semantic we need. We're
trying to offer ~equal access for users of the non-DTF class of jobs.
This is not the tail doing the wagging. This is your assertion that
something should work, when it just doesn't. We have two, totally
orthogonal classes of applications on two totally disjoint sets of
resources. Conjoining them is the wrong answer.
As I've said multiple times, there sure are things that you cannot
achieve without orthogonal multiple hierarchies, but given the options
we have at hands, compromising inside a unified hierarchy seems like
the best trade-off. Please take a step back from the immediate detail
and think of the general hierarchical organization of workloads. If
DTF / non-DTF is a fundamental part of your workload classification,
that should go above.
DTF and CPU and cpuset all have "default" groups for some tasks (and
not others) in our world today. DTF actually has default, prio, and
"normal". I was simplifying before. I really wish it were as simple
as you think it is. But if it were, do you think I'd still be
arguing?
Post by Tejun Heo
I don't really understand your example anyway because you can classify
by DTF / non-DTF first and then just propagate cpuset settings along.
You won't lose anything that way, right?
This really doesn't scale when I have thousands of jobs running.
Being able to disable at some levels on some controllers probably
helps some, but I can't say for sure without knowing the new interface
Post by Tejun Heo
Again, in general, you might not be able to achieve *exactly* what
you've been doing, but, an acceptable compromise should be possible
and not doing so leads to complete mess.
We tried it in unified hierarchy. We had our Top People on the
problem. The best we could get was bad enough that we embarked on a
LITERAL 2 year transition to make it better.
Post by Tejun Heo
Post by Tim Hockin
Post by Tejun Heo
But I don't follow the conclusion here. For short term workaround,
sure, but having that dictate the whole architecture decision seems
completely backwards to me.
My point is that the orthogonality of resources is intrinsic. Letting
"it's hard to make it work" dictate the architecture is what's
backwards.
No, it's not "it's hard to make it work". It's more "it's
fundamentally broken". You can't identify a resource to be belonging
to a cgroup independent of who's looking at the resource.
What if you could ensure that for a given TID (or PID if required) in
dir X of controller C, all of the other TIDs in that cgroup were in
the same group, but maybe not the same sub-path, under every
controller? This gives you what it sounds like you wanted elsewhere -
a container abstraction.

In other words, define a container as a set of cgroups, one under each
active controller type. A TID enters the container atomically,
joining all of the cgroups or none of the cgroups.

container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar,
/cgroup/io/default/foo/bar, /cgroup/cpuset/

This is an abstraction that we maintain in userspace (more or less)
and we do actually have headaches from split hierarchies here
(handling partial failures, non-atomic joins, etc)
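A rough sketch of what such an all-or-nothing join looks like when it has to be emulated in userspace. This is not our actual code: the function name join_container and the move callback are hypothetical, and the callback stands in for whatever really moves a TID into a cgroup. The cpuset entry is left out because the path above is incomplete.

```python
def join_container(tid, container, current, move):
    """Move `tid` into every cgroup of `container`, or into none of them.

    container: dict mapping controller name -> target cgroup path
    current:   dict mapping controller name -> path the task is in now
    move:      callable(controller, path, tid) that raises OSError on
               failure (hypothetical stand-in for the real operation)
    """
    done = []
    try:
        for controller, path in container.items():
            move(controller, path, tid)
            done.append(controller)
    except OSError:
        # Partial failure: undo the moves that succeeded, newest first.
        # The rollback can fail too, which is exactly the headache with
        # doing this outside the kernel.
        for controller in reversed(done):
            move(controller, current[controller], tid)
        raise


# Container C1 from the example above, minus the incomplete cpuset path.
C1 = {
    "cpu": "/cgroup/cpu/foo",
    "memory": "/cgroup/memory/bar",
    "io": "/cgroup/io/default/foo/bar",
}
```

Between the first move and the last one, the task is visibly in a mixed state, so this is only best-effort and not a true atomic join.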
Post by Tejun Heo
Post by Tim Hockin
I'm not sure what "differing level of granularities" means? But that
It means that you'll be able to ignore subtrees depending on
controllers.
I'm still a bit fuzzy - is all of this written somewhere?
Post by Tejun Heo
Post by Tim Hockin
aside, who have you spoken to here? On our internal discussions I
have not heard a SINGLE member of our prod-kernel team nor our cluster
management team who think this is a good idea. Not one.
Some of memcg and blkcg people in infra kernel team.
Well, if anyone there feels like we should be moving in this
direction, I hope they will come and talk to me and enlighten me.
Post by Tejun Heo
Post by Tim Hockin
I still don't really get what the hellish mess is, and why it can't be
solved another way. Your statement of "unified hierarchy isn't gonna
break them" is patently false, though. If we did this it would a)
cause a large amount of work to happen and b) cause a major regression
for our users.
No, what I meant was that unified hierarchy won't break the multiple
hierarchy support immediately.
I did not realize you were building a parallel <thing>. This at least
makes me believe I have time to adapt better (or have our teams hack
some more), if I can't bring you to your senses.
Post by Tejun Heo
Post by Tim Hockin
I'm trying to understand your root problem so that I can try to find
other solutions. "Just do what I say" is not a great way to defend
your position in the face of evidence to the contrary. I'm presenting
you real life cases of situations that simply do not work, neither
philosophically nor in practice, and you continue to assert that it's
fine. It's not fine.
I wrote about that many times, but here are two of the problems.
* There's no way to designate a cgroup to a resource, because cgroup
is only defined by the combination of who's looking at it for which
controller. That's how you end up with tagging the same resource
multiple times for different controllers and even then it's broken
as when you move resources from one cgroup to another, you can't
tell what to do with other tags.
While allowing obscene level of flexibility, multiple hierarchies
destroy a very fundamental concept that it *should* provide - that
of a resource container. It can't because a "cgroup" is undefined
under multiple hierarchies.
It sounds like you're missing a layer of abstraction. Why not add the
abstraction you want to expose on top of powerful primitives, instead
of dumbing down the primitives?
Post by Tejun Heo
* The level of flexibility makes it very difficult to scope the common
usage models. It's a problem for both the kernel and userland. The
kernel has to be prepared to cope with anything - e.g. with unified
hierarchy, we can assume things like either all tasks in a cgroup
are frozen or not, with multiple, any combination is possible - and
the userland is generally lost on what to do and has been in a
complete disarray, and it's not really userland's fault because
enforcing any rule would mean hindering some crazy setup that
someone is using.
cgroup as it currently stands invites pretty insane usages which we
can't back out of later on. Well, it's already painful to back out
but the sooner the better. And all that for what? Allowing exotic
specialized configurations which in all likelihood will be served
acceptably with unified hierarchy anyway?
Again, not served acceptably. Saying it over and over does not make
it true. Please believe me when I say that I understand how hard it
is to back out of overly-flexible APIs that are hard to support.

But it seems vastly better to define a next-gen API that retains the
important flexibility but adds structure where it was lacking
previously.
Post by Tejun Heo
Post by Tim Hockin
Somewhere I picked up the notion that you were talking about making
these changes in O(1.5 years). Perhaps I got that wrong. what *is*
the timeframe? At what point will everything we depend on today no
longer be supported?
I'm making the changes as soon as possible. There of course are two
steps involved here - implementing the new thing and then removing the
old thing. Implementing the new thing is gonna happen, hopefully, in
a year's timeframe. The latter, I don't know for sure, but probably
over five years.
Post by Tim Hockin
OK. So please shed some light? Will split-hierarchies continue to
work for the indefinite future? Or will they be disabled at some
point? Or will they become so crippled or bit-rotted that they are
effectively removed, without having to actually say that?
It's gonna be properly maintained but new features in general will
only be implemented for the unified hierarchy. In time, hopefully,
the difference in capabilities between the new and old interfaces
combined with other efforts will drive users towards the new
interface. After the old interface's usage has sufficiently dwindled,
it will be deprecated.
Thanks.
--
tejun
Tejun Heo
2013-06-27 17:38:09 UTC
Hello, Tim.
Post by Tim Hockin
OK, then what I don't know is what is the new interface? A new cgroupfs?
It's gonna be a new mount option for cgroupfs.
Post by Tim Hockin
DTF and CPU and cpuset all have "default" groups for some tasks (and
not others) in our world today. DTF actually has default, prio, and
"normal". I was simplifying before. I really wish it were as simple
as you think it is. But if it were, do you think I'd still be
arguing?
How am I supposed to know when you don't communicate it but just wave
your hands saying it's all very complicated? The cpuset / blkcg
example is pretty bad because you can enforce any cpuset rules at the
leaves.
Post by Tim Hockin
This really doesn't scale when I have thousands of jobs running.
Being able to disable at some levels on some controllers probably
helps some, but I can't say for sure without knowing the new interface
How does the number of jobs affect it? Does each job create a new
cgroup?
Post by Tim Hockin
We tried it in unified hierarchy. We had our Top People on the
problem. The best we could get was bad enough that we embarked on a
LITERAL 2 year transition to make it better.
What didn't work? What part was so bad? I find it pretty difficult
to believe that multiple orthogonal hierarchies is the only possible
solution, so please elaborate the issues that you guys have
experienced.

The hierarchy is for organization and enforcement of dynamic
hierarchical resource distribution and that's it. If its expressive
power is lacking, take compromise or tune the configuration according
to the workloads. The latter is necessary in workloads which have
clear distinction of foreground and background anyway - anything which
interacts with human beings including androids.
Post by Tim Hockin
In other words, define a container as a set of cgroups, one under each
active controller type. A TID enters the container atomically,
joining all of the cgroups or none of the cgroups.
container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar,
/cgroup/io/default/foo/bar, /cgroup/cpuset/
This is an abstraction that we maintain in userspace (more or less)
and we do actually have headaches from split hierarchies here
(handling partial failures, non-atomic joins, etc)
That'd separate out task organization from controller config
hierarchies. Kay had a similar idea some time ago. I think it makes
things even more complex than it is right now. I'll continue on this
below.
Post by Tim Hockin
I'm still a bit fuzzy - is all of this written somewhere?
If you dig through cgroup ML, most are there. There'll be
"cgroup.controllers" file with which you can enable / disable
controllers. Enabling a controller in a cgroup implies that the
controller is enabled in all ancestors.
Post by Tim Hockin
It sounds like you're missing a layer of abstraction. Why not add the
abstraction you want to expose on top of powerful primitives, instead
of dumbing down the primitives?
It sure would be possible to build more and try to address the issues
we're seeing now; however, after looking at cgroups for some time now,
the underlying theme is failure to take reasonable trade-offs and
going for maximum flexibility in making each choice - the choice of
interface, multiple hierarchies, no restriction on hierarchical
behavior, splitting threads of the same process into separate cgroups,
semi-encouraging delegation through file permission without actually
pondering the consequences and so on. And each choice probably made
sense trying to serve each immediate requirement at the time but added
up it's a giant pile of mess which developed without direction.

So, at this point, I'm very skeptical about adding more flexibility.
Once the basics are settled, we sure can look into the missing pieces
but I don't think that's what we should be doing right now. Another
thing is that the unified hierarchy can be implemented by using most
of the constructs cgroup core already has in a more controlled way.
Given that we're gonna have to maintain both interfaces for quite some
time, the deviation should be kept as minimal as possible.
Post by Tim Hockin
But it seems vastly better to define a next-gen API that retains the
important flexibility but adds structure where it was lacking
previously.
I suppose that's where we disagree. I think a lot of cgroup's
problems stem from too much flexibility. The problem with such level
of flexibility is that, in addition to breaking fundamental constructs
and adding significantly to maintenance overhead, it blocks reasonable
trade-offs to be made at the right places, in turn requiring more
"flexibility" to address the introduced deficiencies.

Thanks.
--
tejun
Tim Hockin
2013-06-27 20:46:18 UTC
Post by Tejun Heo
Hello, Tim.
Post by Tim Hockin
OK, then what I don't know is what is the new interface? A new cgroupfs?
It's gonna be a new mount option for cgroupfs.
Post by Tim Hockin
DTF and CPU and cpuset all have "default" groups for some tasks (and
not others) in our world today. DTF actually has default, prio, and
"normal". I was simplifying before. I really wish it were as simple
as you think it is. But if it were, do you think I'd still be
arguing?
How am I supposed to know when you don't communicate it but just wave
your hands saying it's all very complicated? The cpuset / blkcg
example is pretty bad because you can enforce any cpuset rules at the
leaves.
Modifying hundreds of cgroups is really painful, and yes, we do it
often enough to be able to see it.
Post by Tejun Heo
Post by Tim Hockin
This really doesn't scale when I have thousands of jobs running.
Being able to disable at some levels on some controllers probably
helps some, but I can't say for sure without knowing the new interface
How does the number of jobs affect it? Does each job create a new
cgroup?
Well, in your model it does...
Post by Tejun Heo
Post by Tim Hockin
We tried it in unified hierarchy. We had our Top People on the
problem. The best we could get was bad enough that we embarked on a
LITERAL 2 year transition to make it better.
What didn't work? What part was so bad? I find it pretty difficult
to believe that multiple orthogonal hierarchies is the only possible
solution, so please elaborate the issues that you guys have
experienced.
I'm looping in more Google people.
Post by Tejun Heo
The hierarchy is for organization and enforcement of dynamic
hierarchical resource distribution and that's it. If its expressive
power is lacking, take compromise or tune the configuration according
to the workloads. The latter is necessary in workloads which have
clear distinction of foreground and background anyway - anything which
interacts with human beings including androids.
So what you're saying is that you don't care that this new thing is
less capable than the old thing, despite it having real impact.
Post by Tejun Heo
Post by Tim Hockin
In other words, define a container as a set of cgroups, one under each
active controller type. A TID enters the container atomically,
joining all of the cgroups or none of the cgroups.
container C1 = { /cgroup/cpu/foo, /cgroup/memory/bar,
/cgroup/io/default/foo/bar, /cgroup/cpuset/
This is an abstraction that we maintain in userspace (more or less)
and we do actually have headaches from split hierarchies here
(handling partial failures, non-atomic joins, etc)
That'd separate out task organization from controller config
hierarchies. Kay had a similar idea some time ago. I think it makes
things even more complex than it is right now. I'll continue on this
below.
Post by Tim Hockin
I'm still a bit fuzzy - is all of this written somewhere?
If you dig through cgroup ML, most are there. There'll be
"cgroup.controllers" file with which you can enable / disable
controllers. Enabling a controller in a cgroup implies that the
controller is enabled in all ancestors.
Implies or requires? Cause or predicate?

If controller C is enabled at level X but disabled at level X/Y, does
that mean that X/Y uses the limits set in X? How about X/Y/Z?

This will get rid of the bulk of the cpuset scaling problem, but not
all of it. I think we still have the same problems with cpu as we do
with io. Perhaps that should have been the example.
Post by Tejun Heo
Post by Tim Hockin
It sounds like you're missing a layer of abstraction. Why not add the
abstraction you want to expose on top of powerful primitives, instead
of dumbing down the primitives?
It sure would be possible to build more and try to address the issues
we're seeing now; however, after looking at cgroups for some time now,
the underlying theme is failure to take reasonable trade-offs and
going for maximum flexibility in making each choice - the choice of
interface, multiple hierarchies, no restriction on hierarchical
behavior, splitting threads of the same process into separate cgroups,
semi-encouraging delegation through file permission without actually
pondering the consequences and so on. And each choice probably made
sense trying to serve each immediate requirement at the time but added
up it's a giant pile of mess which developed without direction.
I am very sympathetic to this problem. You could have just described
some of our internal problems too. The difference is that we are
trying to make changes that provide more structure and boundaries in
ways that retain the fundamental power, without tossing out the baby
with the bathwater.
Post by Tejun Heo
So, at this point, I'm very skeptical about adding more flexibility.
Once the basics are settled, we sure can look into the missing pieces
but I don't think that's what we should be doing right now. Another
thing is that the unified hierarchy can be implemented by using most
of the constructs cgroup core already has in a more controlled way.
Given that we're gonna have to maintain both interfaces for quite some
time, the deviation should be kept as minimal as possible.
Post by Tim Hockin
But it seems vastly better to define a next-gen API that retains the
important flexibility but adds structure where it was lacking
previously.
I suppose that's where we disagree. I think a lot of cgroup's
problems stem from too much flexibility. The problem with such level
of flexibility is that, in addition to breaking fundamental constructs
and adding significantly to maintenance overhead, it blocks reasonable
trade-offs to be made at the right places, in turn requiring more
"flexibility" to address the introduced deficiencies.
So take away some of the flexibility that has minimal impact and
maximum return. Splitting threads across cgroups - we use it, but we
could get off that. Force all-or-nothing joining of an aggregate
construct (a container vs N cgroups).

But perform surgery with a scalpel, not a hatchet.
Tejun Heo
2013-06-27 21:04:45 UTC
Hello,
Post by Tim Hockin
So what you're saying is that you don't care that this new thing is
less capable than the old thing, despite it having real impact.
Sort of. I'm saying, at least up until now, moving away from
orthogonal hierarchy support seems to be the right trade-off. It all
depends on how you measure how much things are simplified and how
heavy the "real impacts" are. It's not like these things can be
determined white and black. Given the current situation, I think it's
the right call.
Post by Tim Hockin
If controller C is enabled at level X but disabled at level X/Y, does
that mean that X/Y uses the limits set in X? How about X/Y/Z?
Y and Y/Z wouldn't make any difference. Tasks belonging to them would
behave as if they belong to X as far as C is concerned.
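
To illustrate that rule, here is a minimal Python sketch (not kernel code). It assumes a hypothetical representation where the cgroups with controller C enabled are given as a set of paths; the helper name effective_cgroup is invented for this example. A task's limits for C come from the nearest ancestor-or-self cgroup that has C enabled:

```python
def effective_cgroup(path, enabled):
    """Return the nearest ancestor-or-self of `path` that has the controller enabled.

    `enabled` is a set of cgroup paths (an assumed representation);
    the root "/" is the fallback when no ancestor has it enabled.
    """
    parts = [p for p in path.split("/") if p]
    while parts:
        candidate = "/" + "/".join(parts)
        if candidate in enabled:
            return candidate
        parts.pop()  # walk one level up the hierarchy
    return "/"


# Controller C is enabled at /X only; /X/Y and /X/Y/Z resolve to /X.
enabled_for_c = {"/X"}
for cgroup in ("/X", "/X/Y", "/X/Y/Z"):
    print(cgroup, "->", effective_cgroup(cgroup, enabled_for_c))
```

With C enabled only at X, all three lookups resolve to /X, which is the "behave as if they belong to X" case.
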
Post by Tim Hockin
So take away some of the flexibility that has minimal impact and
maximum return. Splitting threads across cgroups - we use it, but we
could get off that. Force all-or-nothing joining of an aggregate
Please do so.
Post by Tim Hockin
construct (a container vs N cgroups).
But perform surgery with a scalpel, not a hatchet.
As anything else, it's drawing a line in a continuous spectrum of
grey. Right now, given that maintaining multiple orthogonal
hierarchies while introducing a proper concept of resource container
involves addition of completely new constructs and complexity, I don't
think that's a good option. If there are problems which can't be
resolved / worked around in a reasonable manner, please bring them up
along with their contexts. Let's examine them and see whether there
are other ways to accommodate them.

Thanks.
--
tejun
Tim Hockin
2013-06-28 18:44:23 UTC
Post by Tejun Heo
Hello,
Post by Tim Hockin
So what you're saying is that you don't care that this new thing is
less capable than the old thing, despite it having real impact.
Sort of. I'm saying, at least up until now, moving away from
orthogonal hierarchy support seems to be the right trade-off. It all
depends on how you measure how much things are simplified and how
heavy the "real impacts" are. It's not like these things can be
determined white and black. Given the current situation, I think it's
the right call.
I totally understand where you're coming from - trying to get back to
a stable feature set. But it sucks to be on the losing end of that
battle - you're cutting things that REALLY matter to us, and without a
really viable alternative. So we'll keep fighting.
Post by Tejun Heo
Post by Tim Hockin
If controller C is enabled at level X but disabled at level X/Y, does
that mean that X/Y uses the limits set in X? How about X/Y/Z?
Y and Y/Z wouldn't make any difference. Tasks belonging to them would
behave as if they belong to X as far as C is concerned.
OK, that *sounds* sane. It doesn't solve all our problems, but it
alleviates some of them.
Post by Tejun Heo
Post by Tim Hockin
So take away some of the flexibility that has minimal impact and
maximum return. Splitting threads across cgroups - we use it, but we
could get off that. Force all-or-nothing joining of an aggregate
Please do so.
Splitting threads is sort of important for some cgroups, like CPU. I
wonder if pjt is paying attention to this thread.
Post by Tejun Heo
Post by Tim Hockin
construct (a container vs N cgroups).
But perform surgery with a scalpel, not a hatchet.
As anything else, it's drawing a line in a continuous spectrum of
grey. Right now, given that maintaining multiple orthogonal
hierarchies while introducing a proper concept of resource container
involves addition of completely new constructs and complexity, I don't
think that's a good option. If there are problems which can't be
resolved / worked around in a reasonable manner, please bring them up
along with their contexts. Let's examine them and see whether there
are other ways to accommodate them.
You're arguing that the abstraction you want is that of a "container"
but that it's easier to remove options than to actually build a better
API.

I think this is wrong. Take the opportunity to define the RIGHT
interface that you WANT - a container. Implement it in terms of
cgroups (and maybe other stuff!). Make that API so compelling that
people want to use it, and your war of attrition on direct cgroup
madness will be won, but with net progress rather than regress.
Tejun Heo
2013-06-29 16:40:51 UTC
Hello, Tim.
Post by Tim Hockin
I totally understand where you're coming from - trying to get back to
a stable feature set. But it sucks to be on the losing end of that
Oh, it has been sucking and will continue to suck like hell for me too
for the foreseeable future. Trust me, this side ain't any greener.
Post by Tim Hockin
battle - you're cutting things that REALLY matter to us, and without a
really viable alternative. So we'll keep fighting.
Yeah, that's understandable. More on this later.
Post by Tim Hockin
Splitting threads is sort of important for some cgroups, like CPU. I
wonder if pjt is paying attention to this thread.
Paul?
Post by Tim Hockin
I think this is wrong. Take the opportunity to define the RIGHT
interface that you WANT - a container. Implement it in terms of
cgroups (and maybe other stuff!). Make that API so compelling that
people want to use it, and your war of attrition on direct cgroup
madness will be won, but with net progress rather than regress.
The goal is to reach sane and widely useable / useful state with
minimum amount of complexity. Maintaining backward compatibility for
some period - likely quite a few years - while still allowing future
development is a pretty important consideration. Another factor is
that the general situation has been more or less atrocious and cgroup
as a whole has been failing in the very basic places, which also
reinforces the drive for simplicity.

I probably am forgetting some, but anyways, from my POV, there are
fairly strong by-default factors which push for simplicity even if
that means some loss of functionalities as long as those aren't
something catastrophic. I've been going over the decisions for the past few
days and unified hierarchy still seems the best, or rather, most
acceptable solution.

That said, I still don't know very well the scope and severity of the
problems you guys might face from the loss of multiple orthogonal
hierarchies. The cpuset one wasn't very convincing especially given
that most of the expressibility problems can be mitigated if you presume
a central managing facility which can adapt the configurations as
the workload changes. Dynamic execution of configuration is of course
the job of cgroup proper, but larger-cadence changes don't have to be
statically encoded in the hierarchy itself and, as I wrote before, some
just can't be, whether with multiple hierarchies or not.

While the bar to overcome is pretty high, I do want to learn about the
problems you guys are foreseeing, so that I can at least evaluate the
graveness properly and hopefully compromises which can mitigate the
most sore ones can be made wherever necessary.

So, can you please explain the issues that you've experienced and are
foreseeing in detail with their contexts? ie. if you have certain
requirement, please give at least brief explanation on where such
requirement is coming from and how important the requirement is.

Thanks.
--
tejun
Mike Galbraith
2013-06-27 05:45:07 UTC
Post by Tejun Heo
Hello, Tim.
I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility? I mean, isn't Linux supposed to be the
OS with the stable kernel interface? I've seen Linus rant time and
time again about this - why is it OK now?
What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
one will stay there for the foreseeable future. It will be phased out
eventually but that's gonna take a long time and it will have to be
something hardly noticeable. Of course new features will only be
available with the new interface and there will be efforts to nudge
people away from the old one but the existing interface will keep
working as it does.
I can understand some alarm. When I saw the below I started frothing at
the face and howling at the moon, and I don't even use the things much.

http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html

Hierarchy layout aside, that "private property" bit says that the folks
who currently own and use the cgroups interface will lose direct access
to it. I can imagine folks who have become dependent upon on-the-fly
management agents of their own design becoming a tad alarmed.

-Mike
Serge Hallyn
2013-06-27 13:22:06 UTC
Post by Mike Galbraith
Post by Tejun Heo
Hello, Tim.
I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility? I mean, isn't Linux supposed to be the
OS with the stable kernel interface? I've seen Linus rant time and
time again about this - why is it OK now?
What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
one will stay there for the foreseeable future. It will be phased out
eventually but that's gonna take a long time and it will have to be
something hardly noticeable. Of course new features will only be
available with the new interface and there will be efforts to nudge
people away from the old one but the existing interface will keep
working as it does.
I can understand some alarm. When I saw the below I started frothing at
the face and howling at the moon, and I don't even use the things much.
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
Hierarchy layout aside, that "private property" bit says that the folks
who currently own and use the cgroups interface will lose direct access
to it. I can imagine folks who have become dependent upon on-the-fly
management agents of their own design becoming a tad alarmed.
FWIW, the code is too embarrassing yet to see daylight, but I'm playing
with a very lowlevel cgroup manager which supports nesting itself.
Access in this POC is low-level ("set freezer.state to THAWED for cgroup
/c1/c2", "Create /c3"), but the key feature is that it can run in two
modes - native mode in which it uses cgroupfs, and child mode where it
talks to a parent manager to make the changes.

So then the idea would be that userspace (like libvirt and lxc) would
talk over /dev/cgroup to its manager. Userspace inside a container
(which can't actually mount cgroups itself) would talk to its own
manager which is talking over a passed-in socket to the host manager,
which in turn runs natively (uses cgroupfs, and nests "create /c1" under
the requestor's cgroup).
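
A rough Python sketch of the two modes, for illustration only. The names NativeManager, ChildManager and nest are hypothetical, a temporary directory stands in for cgroupfs, and a direct object reference stands in for the passed-in socket. The point it shows is that a child-mode manager never touches the filesystem and that the native manager nests every request under the requestor's cgroup:

```python
import os
import posixpath
import tempfile


def nest(requestor, path):
    """Re-root `path` under `requestor`; ".." cannot climb above the requestor."""
    inner = posixpath.normpath("/" + path).lstrip("/")
    return posixpath.join("/", requestor.strip("/"), inner)


class NativeManager:
    """Native mode: applies requests directly to a cgroupfs-like tree."""

    def __init__(self, root):
        self.root = root

    def create(self, path, requestor="/"):
        target = os.path.join(self.root, nest(requestor, path).lstrip("/"))
        os.makedirs(target, exist_ok=True)
        return target


class ChildManager:
    """Child mode: no cgroupfs access, forwards requests to a parent manager."""

    def __init__(self, parent, own_cgroup):
        self.parent = parent          # stands in for the passed-in socket
        self.own_cgroup = own_cgroup  # cgroup the host placed this container in

    def create(self, path, requestor="/"):
        return self.parent.create(nest(requestor, path), requestor=self.own_cgroup)


with tempfile.TemporaryDirectory() as fake_cgroupfs:
    host = NativeManager(fake_cgroupfs)
    guest = ChildManager(host, own_cgroup="/lxc/web")
    created = guest.create("/c1")  # the container asks for "/c1"
    print(os.path.relpath(created, fake_cgroupfs))  # lxc/web/c1
```

Because ChildManager exposes the same create() call as NativeManager, managers can be stacked to any depth, and a client does not need to know which mode it is talking to.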

At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.

-serge
Tim Hockin
2013-06-27 15:29:21 UTC
Post by Serge Hallyn
Post by Mike Galbraith
Post by Tejun Heo
Hello, Tim.
I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility? I mean, isn't Linux supposed to be the
OS with the stable kernel interface? I've seen Linus rant time and
time again about this - why is it OK now?
What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
one will stay there for the foreseeable future. It will be phased out
eventually but that's gonna take a long time and it will have to be
something hardly noticeable. Of course new features will only be
available with the new interface and there will be efforts to nudge
people away from the old one but the existing interface will keep
working as it does.
I can understand some alarm. When I saw the below I started frothing at
the face and howling at the moon, and I don't even use the things much.
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
Hierarchy layout aside, that "private property" bit says that the folks
who currently own and use the cgroups interface will lose direct access
to it. I can imagine folks who have become dependent upon on-the-fly
management agents of their own design becoming a tad alarmed.
FWIW, the code is too embarrassing yet to see daylight, but I'm playing
with a very lowlevel cgroup manager which supports nesting itself.
Access in this POC is low-level ("set freezer.state to THAWED for cgroup
/c1/c2", "Create /c3"), but the key feature is that it can run in two
modes - native mode in which it uses cgroupfs, and child mode where it
talks to a parent manager to make the changes.
In this world, are users able to read cgroup files, or do they have to
go through a central agent, too?
Post by Serge Hallyn
So then the idea would be that userspace (like libvirt and lxc) would
talk over /dev/cgroup to its manager. Userspace inside a container
(which can't actually mount cgroups itself) would talk to its own
manager which is talking over a passed-in socket to the host manager,
which in turn runs natively (uses cgroupfs, and nests "create /c1" under
the requestor's cgroup).
How do you handle updates of this agent? Suppose I have hundreds of
running containers, and I want to release a new version of the cgroupd?

(note: inquiries about the implementation do not denote acceptance of
the model :)
Post by Serge Hallyn
At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.
-serge
Serge Hallyn
2013-06-27 16:18:09 UTC
Post by Tim Hockin
Post by Serge Hallyn
Post by Mike Galbraith
Post by Tejun Heo
Hello, Tim.
I really want to understand why this is SO IMPORTANT that you have to
break userspace compatibility? I mean, isn't Linux supposed to be the
OS with the stable kernel interface? I've seen Linus rant time and
time again about this - why is it OK now?
What the hell are you talking about? Nobody is breaking userland
interface. A new version of interface is being phased in and the old
one will stay there for the foreseeable future. It will be phased out
eventually but that's gonna take a long time and it will have to be
something hardly noticeable. Of course new features will only be
available with the new interface and there will be efforts to nudge
people away from the old one but the existing interface will keep
working as it does.
I can understand some alarm. When I saw the below I started frothing at
the face and howling at the moon, and I don't even use the things much.
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
Hierarchy layout aside, that "private property" bit says that the folks
who currently own and use the cgroups interface will lose direct access
to it. I can imagine folks who have become dependent upon on-the-fly
management agents of their own design becoming a tad alarmed.
FWIW, the code is too embarrassing yet to see daylight, but I'm playing
with a very lowlevel cgroup manager which supports nesting itself.
Access in this POC is low-level ("set freezer.state to THAWED for cgroup
/c1/c2", "Create /c3"), but the key feature is that it can run in two
modes - native mode in which it uses cgroupfs, and child mode where it
talks to a parent manager to make the changes.
In this world, are users able to read cgroup files, or do they have to
go through a central agent, too?
The agent won't itself do anything to stop access through cgroupfs, but
the idea would be that cgroupfs would only be mounted in the agent's
mntns. My hope would be that the libcgroup commands (like cgexec,
cgcreate, etc) would know to talk to the agent when possible, and users
would use those.
Post by Tim Hockin
Post by Serge Hallyn
So then the idea would be that userspace (like libvirt and lxc) would
talk over /dev/cgroup to its manager. Userspace inside a container
(which can't actually mount cgroups itself) would talk to its own
manager which is talking over a passed-in socket to the host manager,
which in turn runs natively (uses cgroupfs, and nests "create /c1" under
the requestor's cgroup).
How do you handle updates of this agent? Suppose I have hundreds of
running containers, and I want to release a new version of the cgroupd?
This may change (which is part of what I want to investigate with some
POC), but right now I'm not building any controller-aware smarts into it. I
think that's what you're asking about? The agent doesn't do "slices"
etc. This may turn out to be insufficient, we'll see.

So the only state which the agent stores is a list of cgroup mounts (if
in native mode) or an open socket to the parent (if in child mode), and a
list of connected children sockets.

HUPping the agent will cause it to reload the cgroupfs mounts (in case
you've mounted a new controller, living in "the old world" :). If you
just kill it and start a new one, it shouldn't matter.
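
A minimal Python sketch of that "reload on HUP" behaviour, as an illustration rather than the actual agent. The Agent class, parse_cgroup_mounts and the injectable read_mounts callable are hypothetical names; the only real interfaces assumed are the /proc/mounts text format and Python's signal module:

```python
import signal


def parse_cgroup_mounts(mounts_text):
    """Map each cgroup mount point to its option list, from /proc/mounts-style text."""
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "cgroup":
            mounts[fields[1]] = fields[3].split(",")
    return mounts


def read_proc_mounts():
    with open("/proc/mounts") as f:
        return f.read()


class Agent:
    """Keeps only the list of cgroup mounts; everything else is stateless."""

    def __init__(self, read_mounts=read_proc_mounts):
        self.read_mounts = read_mounts
        self.mounts = {}
        self.reload()

    def reload(self, signum=None, frame=None):
        # Usable both as a plain method and as a signal handler.
        self.mounts = parse_cgroup_mounts(self.read_mounts())

    def install_hup_handler(self):
        signal.signal(signal.SIGHUP, self.reload)


SAMPLE = """\
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
proc /proc proc rw,relatime 0 0
"""
agent = Agent(read_mounts=lambda: SAMPLE)
print(sorted(agent.mounts))  # ['/sys/fs/cgroup/cpu', '/sys/fs/cgroup/freezer']
```

Since the mount list is rebuilt from scratch on every reload, killing the agent and starting a fresh one ends up in the same state as HUPping it.
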
Post by Tim Hockin
(note: inquiries about the implementation do not denote acceptance of
the model :)
To put it another way, the problem I'm solving (for now) is not the "I
want a daemon to ensure that requested guarantees are correctly
implemented." In that sense I'm maintaining the status quo, i.e. the
admin needs to architect the layout correctly.

The problem I'm solving is really that I want containers to be able to
handle cgroups even if they can't mount cgroupfs, and I want all
userspace to be able to behave the same whether they are in a container
or not.

This isn't meant as a poke in the eye of anyone who wants to address the
other problem. If it turns out that we (meaning "the community of
cgroup users") really want such an agent, then we can add that. I'm not
convinced.

What would probably be a better design, then, would be that the agent
I'm working on can plug into a resource allocation agent. Or, I
suppose, the other way around.
Post by Tim Hockin
Post by Serge Hallyn
At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.
-serge
Tejun Heo
2013-06-27 17:48:50 UTC
Hello, Serge.
Post by Serge Hallyn
At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.
Yeah, eventually, I think we'll have a standardized way to configure
resource distribution in the system. Maybe we'll agree on a
standardized dbus protocol or there will be a library, I don't know;
however, whatever form it may be in, its abstraction level should be
way higher than that of direct cgroupfs access. It's way too low
level and very easy to end up in a complete nonsense configuration.

e.g. enabling "cpu" on a cgroup while leaving other cgroups alone
wouldn't enable fair scheduling on that cgroup but drastically reduce
the amount of cpu share it gets as it now gets treated as single
entity competing with all tasks at the parent level.
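
A back-of-the-envelope version of that effect in Python. The assumptions are labeled: every task is always runnable, all weights are equal, and a cgroup with "cpu" enabled carries the same weight as a single task (the default-shares case). The function name and the numbers are made up for illustration:

```python
from fractions import Fraction


def per_task_share(tasks_in_group, other_tasks, cpu_enabled):
    """CPU fraction each task inside the cgroup gets under the assumptions above."""
    if not cpu_enabled:
        # No grouping: every task competes individually at the parent level.
        return Fraction(1, tasks_in_group + other_tasks)
    # Grouped: the whole cgroup is one entity among the other tasks,
    # and its share is then split evenly between its own tasks.
    group_share = Fraction(1, 1 + other_tasks)
    return group_share / tasks_in_group


before = per_task_share(5, 5, cpu_enabled=False)
after = per_task_share(5, 5, cpu_enabled=True)
print(before, after)  # 1/10 1/30
```

With 5 tasks inside the cgroup and 5 outside, each inside task drops from 1/10 to 1/30 of the CPU, while each outside task rises from 1/10 to 1/6.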

At the moment, I'm not sure what the eventual abstraction would look
like. systemd is extending its basic constructs by adding slices and
scopes and it does make sense to integrate the general organization of
the system (services, user sessions, VMs and so on) with resource
management. Given some time, I'm hoping we'll be able to come up with
and agree on some common constructs so that each workload can indicate
its resource requirements in a unified way.

That said, I really think we should experiment for a while before
trying to settle down on things. We've now just started exploring how
system-wide resource management can be made widely available to systems
without requiring extremely specialized hand-crafted configurations
and I'm pretty sure we're getting and gonna get quite a few details
wrong, so I don't think it'd be a good idea to try to agree on things
right now. As far as such integration goes, I think it's time to play
with things and observe the results.

Thanks.
--
tejun
Serge Hallyn
2013-06-27 18:14:57 UTC
Post by Tejun Heo
Hello, Serge.
Post by Serge Hallyn
At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.
Yeah, eventually, I think we'll have a standardized way to configure
resource distribution in the system. Maybe we'll agree on a
standardized dbus protocol or there will be a library, I don't know;
however, whatever form it may be in, its abstraction level should be
way higher than that of direct cgroupfs access. It's way too low
level and very easy to end up in a complete nonsense configuration.
e.g. enabling "cpu" on a cgroup while leaving other cgroups alone
wouldn't enable fair scheduling on that cgroup but drastically reduce
the amount of cpu share it gets as it now gets treated as single
entity competing with all tasks at the parent level.
Right. I *think* this can be offered as a daemon which sits as the
sole consumer of my agent's API, and offers a higher level "do what I
want" API. But designing that API is going to be interesting.

I should find a good, up-to-date summary of the current behaviors of
each controller so I can talk more intelligently about it. (I'll
start by looking at the kernel Documentation/cgroups, but don't
feel too confident that they'll be uptodate :)
Post by Tejun Heo
At the moment, I'm not sure what the eventual abstraction would look
like. systemd is extending its basic constructs by adding slices and
scopes and it does make sense to integrate the general organization of
the system (services, user sessions, VMs and so on) with resource
management. Given some time, I'm hoping we'll be able to come up with
and agree on some common constructs so that each workload can indicate
its resource requirements in a unified way.
That said, I really think we should experiment for a while before
trying to settle down on things. We've now just started exploring how
system-wide resource management can be made widely available to systems
without requiring extremely specialized hand-crafted configurations
and I'm pretty sure we're getting and gonna get quite a few details
wrong, so I don't think it'd be a good idea to try to agree on things
right now. As far as such integration goes, I think it's time to play
with things and observe the results.
Right, I'm not attached to my toy implementation at all - except for
the ability, in some fashion, to have nested agents which don't have
cgroupfs access but talk to another agent to get the job done.

-serge
Tejun Heo
2013-06-27 18:45:41 UTC
Hello, Serge.
Post by Serge Hallyn
I should find a good, up-to-date summary of the current behaviors of
each controller so I can talk more intelligently about it. (I'll
start by looking at the kernel Documentation/cgroups, but don't
feel too confident that they'll be uptodate :)
Heh, it's hopelessly outdated. Sorry about that. I'll get around to
updating it eventually. Right now everything is in flux.
Post by Serge Hallyn
Right, I'm not attached to my toy implementation at all - except for
the ability, in some fashion, to have nested agents which don't have
cgroupfs access but talk to another agent to get the job done.
I think it probably would be better to allow organization and RO
access to knobs and stat files inside containers, for lower overhead,
if nothing else, and have comm channel for operations which need
supervision at a wider level.

Thanks.
--
tejun
Serge Hallyn
2013-06-27 18:51:04 UTC
Post by Tejun Heo
Hello, Serge.
Post by Serge Hallyn
I should find a good, up-to-date summary of the current behaviors of
each controller so I can talk more intelligently about it. (I'll
start by looking at the kernel Documentation/cgroups, but don't
feel too confident that they'll be uptodate :)
Heh, it's hopelessly outdated. Sorry about that. I'll get around to
updating it eventually. Right now everything is in flux.
Post by Serge Hallyn
Right, I'm not attached to my toy implementation at all - except for
the ability, in some fashion, to have nested agents which don't have
cgroupfs access but talk to another agent to get the job done.
I think it probably would be better to allow organization and RO
What do you mean by "organization"? Creating cgroups and moving tasks
between them, without setting other cgroup values?
Post by Tejun Heo
access to knobs and stat files inside containers, for lower overhead,
if nothing else, and have comm channel for operations which need
supervision at a wider level.
Thanks.
--
tejun
Tejun Heo
2013-06-27 18:52:32 UTC
Hello,
Post by Serge Hallyn
Post by Tejun Heo
I think it probably would be better to allow organization and RO
What do you mean by "organization"? Creating cgroups and moving tasks
between them, without setting other cgroup values?
Yeap, I also think that's how user sessions are gonna be handled.
We're gonna have limited amount of delegation for organization and
read accesses.

Thanks.

--
tejun
Tim Hockin
2013-06-27 20:52:01 UTC
Post by Serge Hallyn
Post by Tejun Heo
Hello, Serge.
Post by Serge Hallyn
At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.
Yeah, eventually, I think we'll have a standardized way to configure
resource distribution in the system. Maybe we'll agree on a
standardized dbus protocol or there will be a library, I don't know;
however, whatever form it may be in, its abstraction level should be
way higher than that of direct cgroupfs access. It's way too low
level and very easy to end up in a complete nonsense configuration.
e.g. enabling "cpu" on a cgroup while leaving other cgroups alone
wouldn't enable fair scheduling on that cgroup but drastically reduce
the amount of cpu share it gets as it now gets treated as single
entity competing with all tasks at the parent level.
Right. I *think* this can be offered as a daemon which sits as the
sole consumer of my agent's API, and offers a higher level "do what I
want" API. But designing that API is going to be interesting.
This is something we have, partially, and are working to be able to
open-source. We have a LOT of experience feeding into the semantics
that actually make users happy.

Today it leverages split-hierarchies, but that is not required in the
generic case (only if you want to offer the semantics we do). It
explicitly delegates some aspects of sub-cgroup control to users, but
that could go away if your lowest-level agency can handle it.
Post by Serge Hallyn
I should find a good, up-to-date summary of the current behaviors of
each controller so I can talk more intelligently about it. (I'll
start by looking at the kernel Documentation/cgroups, but don't
feel too confident that they'll be uptodate :)
Post by Tejun Heo
At the moment, I'm not sure what the eventual abstraction would look
like. systemd is extending its basic constructs by adding slices and
scopes and it does make sense to integrate the general organization of
the system (services, user sessions, VMs and so on) with resource
management. Given some time, I'm hoping we'll be able to come up with
and agree on some common constructs so that each workload can indicate
its resource requirements in a unified way.
That said, I really think we should experiment for a while before
trying to settle down on things. We've now just started exploring how
system-wide resource management can be made widely available to systems
without requiring extremely specialized hand-crafted configurations
and I'm pretty sure we're getting and gonna get quite a few details
wrong, so I don't think it'd be a good idea to try to agree on things
right now. As far as such integration goes, I think it's time to play
with things and observe the results.
Right, I'm not attached to my toy implementation at all - except for
the ability, in some fashion, to have nested agents which don't have
cgroupfs access but talk to another agent to get the job done.
-serge
Daniel P. Berrange
2013-06-28 09:09:10 UTC
Post by Serge Hallyn
FWIW, the code is too embarrassing yet to see daylight, but I'm playing
with a very lowlevel cgroup manager which supports nesting itself.
Access in this POC is low-level ("set freezer.state to THAWED for cgroup
/c1/c2", "Create /c3"), but the key feature is that it can run in two
modes - native mode in which it uses cgroupfs, and child mode where it
talks to a parent manager to make the changes.
So then the idea would be that userspace (like libvirt and lxc) would
talk over /dev/cgroup to its manager. Userspace inside a container
(which can't actually mount cgroups itself) would talk to its own
manager which is talking over a passed-in socket to the host manager,
which in turn runs natively (uses cgroupfs, and nests "create /c1" under
the requestor's cgroup).
At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.
Are you also planning to actually write a new cgroup parent manager
daemon too ? Currently my plan for libvirt is to just talk directly
to systemd's new DBus APIs for all management of cgroups, and then
fall back to writing to cgroupfs directly for cases where systemd
is not around. Having a library to abstract these two possible
alternatives isn't all that compelling unless we think there will
be multiple cgroups manager daemons. I've been somewhat assuming that
even Ubuntu will eventually see the benefits & switch to systemd,
then the issue of multiple manager daemons wouldn't really exist.
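
The fallback order described above could look roughly like this in Python. This is a sketch, not libvirt code: pick_backend and the returned labels are hypothetical, and the only real-world detail relied on is that a systemd-booted system has a /run/systemd/system directory (the check sd_booted() performs):

```python
import os
import tempfile


def pick_backend(systemd_dir="/run/systemd/system", cgroupfs_root="/sys/fs/cgroup"):
    """Prefer systemd's D-Bus API; fall back to writing cgroupfs directly."""
    if os.path.isdir(systemd_dir):
        return "systemd-dbus"
    if os.path.isdir(cgroupfs_root):
        return "cgroupfs"
    return None  # no cgroup management available


# Simulate a host without systemd but with a mounted cgroupfs.
with tempfile.TemporaryDirectory() as tmp:
    print(pick_backend(systemd_dir=os.path.join(tmp, "absent"), cgroupfs_root=tmp))  # cgroupfs
```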

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Serge Hallyn
2013-06-28 15:53:06 UTC
Post by Daniel P. Berrange
Post by Serge Hallyn
FWIW, the code is too embarrassing yet to see daylight, but I'm playing
with a very lowlevel cgroup manager which supports nesting itself.
Access in this POC is low-level ("set freezer.state to THAWED for cgroup
/c1/c2", "Create /c3"), but the key feature is that it can run in two
modes - native mode in which it uses cgroupfs, and child mode where it
talks to a parent manager to make the changes.
So then the idea would be that userspace (like libvirt and lxc) would
talk over /dev/cgroup to its manager. Userspace inside a container
(which can't actually mount cgroups itself) would talk to its own
manager which is talking over a passed-in socket to the host manager,
which in turn runs natively (uses cgroupfs, and nests "create /c1" under
the requestor's cgroup).
At some point (probably soon) we might want to talk about a standard API
for these things. However I think it will have to come in the form of
a standard library, which knows to either send requests over dbus to
systemd, or over /dev/cgroup sock to the manager.
Are you also planning to actually write a new cgroup parent manager
daemon too ? Currently my plan for libvirt is to just talk directly
I'm toying with the idea, yes. (Right now my toy runs in either native
mode, using cgroupfs, or child mode, talking to a parent manager) I'd
love if someone else does it, but it needs to be done.

As I've said elsewhere in the thread, I see 2 problems to be addressed:

1. The ability to nest the cgroup manager daemons, so that a daemon
running in a container can talk to a daemon running on the host. This
is the problem my current toy is aiming to address. But the API it
exports is just a thin layer over cgroupfs.

2. Abstract away the kernel/cgroupfs details so that userspace can
explain its cgroup needs generically. This is IIUC what systemd is
addressing with slices and scopes.
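
A minimal sketch of problem (1), the nesting of managers, under stated
assumptions: the class names NativeManager and ChildManager are
hypothetical, the socket between child and parent is replaced by a direct
method call, and the "cgroupfs" root is any directory passed in by the
caller. It only illustrates the two modes (native and child); it is not
the actual toy.

```python
import os


class NativeManager:
    """Native mode: applies requests directly to a cgroupfs-like tree."""

    def __init__(self, root):
        self.root = os.path.normpath(root)

    def _path(self, cgroup):
        # Normalise against "/" so that ".." can never climb above the root.
        clean = os.path.normpath("/" + cgroup.lstrip("/"))
        return os.path.join(self.root, clean.lstrip("/"))

    def create(self, cgroup):
        os.makedirs(self._path(cgroup), exist_ok=True)

    def set_value(self, cgroup, key, value):
        with open(os.path.join(self._path(cgroup), key), "w") as f:
            f.write(value)


class ChildManager:
    """Child mode: forwards every request to a parent manager, nested
    under the prefix (the requestor's own cgroup) chosen by the parent."""

    def __init__(self, parent, prefix):
        self.parent = parent
        self.prefix = prefix.strip("/")

    def _nest(self, cgroup):
        # Collapse ".." first so a request cannot escape the prefix.
        clean = os.path.normpath("/" + cgroup.lstrip("/"))
        return "/" + self.prefix + clean

    def create(self, cgroup):
        self.parent.create(self._nest(cgroup))

    def set_value(self, cgroup, key, value):
        self.parent.set_value(self._nest(cgroup), key, value)
```

With a host manager rooted at some directory and a container manager
created as ChildManager(host, "c1"), a container request to create "/c2"
ends up as "/c1/c2" on the host, which is the nesting described above.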

(2) is where I'd really like to have a well thought out, community
designed API that everyone can agree on, and it might be worth getting
together (with Tejun) at plumbers or something to lay something out.

In the end, something like libvirt or lxc should not need to care
what is running underneath it. It should be able to make its requests
the same way regardless of whether it is running in Fedora or Ubuntu,
and whether it is running on the host or in a tightly bound container.
That's my goal anyway :)
Post by Daniel P. Berrange
to systemd's new DBus APIs for all management of cgroups, and then
fall back to writing to cgroupfs directly for cases where systemd
is not around. Having a library to abstract these two possible
alternatives isn't all that compelling unless we think there will
be multiple cgroups manager daemons. I've been somewhat assuming that
even Ubuntu will eventually see the benefits & switch to systemd,
So far I've seen no indication of that :)

If the systemd code to manage slices could be made separately
compilable as a standalone library or daemon, then I'd advocate
using that. But I don't see a lot of incentive for systemd to do
that, so I'd feel like a heel even asking.
Post by Daniel P. Berrange
then the issue of multiple manager daemons wouldn't really exist.
True. But I'm running under the assumption that Ubuntu will stick with
upstart, and therefore yes I'll need a separate (perhaps pair of)
management daemons.

Even if we were to switch to systemd, I'd like the API for userspace
programs to configure and use cgroups to be as generic as possible,
so that anyone who wanted to write their own daemon could do so.

-serge
Tim Hockin
2013-06-28 18:58:10 UTC
Post by Serge Hallyn
Post by Daniel P. Berrange
Are you also planning to actually write a new cgroup parent manager
daemon too ? Currently my plan for libvirt is to just talk directly
I'm toying with the idea, yes. (Right now my toy runs in either native
mode, using cgroupfs, or child mode, talking to a parent manager) I'd
love if someone else does it, but it needs to be done.
1. The ability to nest the cgroup manager daemons, so that a daemon
running in a container can talk to a daemon running on the host. This
is the problem my current toy is aiming to address. But the API it
exports is just a thin layer over cgroupfs.
2. Abstract away the kernel/cgroupfs details so that userspace can
explain its cgroup needs generically. This is IIUC what systemd is
addressing with slices and scopes.
(2) is where I'd really like to have a well thought out, community
designed API that everyone can agree on, and it might be worth getting
together (with Tejun) at plumbers or something to lay something out.
We're also working on (2) (well, we HAVE it, but we're dis-integrating
it so we can hopefully publish more widely). But our (2) depends on
direct cgroupfs access. If that is to change, we need a really robust
(1). It's OK (desirable, in fact) that (1) be a very thin layer of
abstraction.
Post by Serge Hallyn
In the end, something like libvirt or lxc should not need to care
what is running underneath it. It should be able to make its requests
the same way regardless of whether it is running in Fedora or Ubuntu,
and whether it is running on the host or in a tightly bound container.
That's my goal anyway :)
Post by Daniel P. Berrange
to systemd's new DBus APIs for all management of cgroups, and then
fall back to writing to cgroupfs directly for cases where systemd
is not around. Having a library to abstract these two possible
alternatives isn't all that compelling unless we think there will
be multiple cgroups manager daemons. I've been somewhat assuming that
even Ubuntu will eventually see the benefits & switch to systemd,
So far I've seen no indication of that :)
If the systemd code to manage slices could be made separately
compilable as a standalone library or daemon, then I'd advocate
using that. But I don't see a lot of incentive for systemd to do
that, so I'd feel like a heel even asking.
I want to say "let the best API win", but I know that systemd is a
giant katamari ball, and it's absorbing subsystems so it may win by
default. That isn't going to stop us from trying to do what we do,
and share that with the world.
Post by Serge Hallyn
Post by Daniel P. Berrange
then the issue of multiple manager daemons wouldn't really exist.
True. But I'm running under the assumption that Ubuntu will stick with
upstart, and therefore yes I'll need a separate (perhaps pair of)
management daemons.
Even if we were to switch to systemd, I'd like the API for userspace
programs to configure and use cgroups to be as generic as possible,
so that anyone who wanted to write their own daemon could do so.
-serge
Tejun Heo
2013-06-27 18:01:43 UTC
Hello, Mike.
Post by Mike Galbraith
I can understand some alarm. When I saw the below I started frothing at
the face and howling at the moon, and I don't even use the things much.
Can I ask why? The reasons are not apparent to me.
Post by Mike Galbraith
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
Hierarchy layout aside, that "private property" bit says that the folks
who currently own and use the cgroups interface will lose direct access
to it. I can imagine folks who have become dependent upon on-the-fly
management agents of their own design becoming a tad alarmed.
They're gonna be able to do what they've been doing for the
foreseeable future if they choose not to use systemd's unified
resource management. That said, what we have today is pretty lousy
and a lot of hierarchical stuff was completely broken until a few
releases ago, and things *must* have been broken on the userland side
too. It could have worked for their specific setup but I strongly
doubt there is anything generic working well out in the wild. cgroup
hasn't been capable of supporting something like that.

AFAICS, having a userland agent which has overall knowledge of the
hierarchy and enforces structure and limitations is a requirement to
make cgroup generally useable and useful. For systemd based systems,
systemd serving that role isn't too crazy. It's sure gonna have
teething issues at the beginning but it has all the necessary
information to manage workloads on the system.

A valid issue is interoperability between systemd and non-systemd
systems. I don't have an immediately good answer for that. I wrote
in another reply but making cgroup generally available is a pretty new
effort and we're still in the process of figuring out what the right
constructs and abstractions are. Hopefully, we'll be able to reach a
common set of abstractions to base things on top of in time.

Thanks.
--
tejun
Mike Galbraith
2013-06-28 03:46:38 UTC
Post by Tejun Heo
Hello, Mike.
Post by Mike Galbraith
I can understand some alarm. When I saw the below I started frothing at
the face and howling at the moon, and I don't even use the things much.
Can I ask why? The reasons are not apparent to me.
Sure, because in "private property" and "mandatory agent", I see "gimme
yer wallet bitch", an incredibly arrogant and brutal mugging. That's
not the way it's meant, I know that, but that's how it comes across.
You asked, so you get the straight up answer.

Offering to manage cgroups is one thing, very generous, forcefully
placing itself between user and kernel quite another. Perhaps I
misread, but my interpretation was that the intent is to make systemd a
mandatory agent, even saw reference to it taking up residence in the
kernel tree (that bit made me chuckle, pull request would have to be
very cleverly worded methinks). I'm sure it will be quite capable, its
authors are. However, when I want to talk to my kernel, I expect to be
able to tell anyone else using the phone to hang up.. now.
Post by Tejun Heo
Post by Mike Galbraith
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011521.html
Hierarchy layout aside, that "private property" bit says that the folks
who currently own and use the cgroups interface will lose direct access
to it. I can imagine folks who have become dependent upon on-the-fly
management agents of their own design becoming a tad alarmed.
They're gonna be able to do what they've been doing for the
foreseeable future if they choose not to use systemd's unified
Those are the comforting words I wanted to hear, that the user chooses,
that the user will not find that this that or any other userspace agent
gains the right to insert itself between user and kernel.
Post by Tejun Heo
AFAICS, having a userland agent which has overall knowledge of the
hierarchy and enforces structure and limitations is a requirement to
make cgroup generally useable and useful.
It's useful now, usable to the point that enterprise users exist who
have integrated cgroups into their business model. But then you know
that. Sure, there are problems, things could and no doubt will get a
lot better.

However, wrt userspace agent, no agent is going to be the right answer
for all, so that agent needs to have a step aside button so another
agent can be tasked with the managerial duties, whether that be little
ole /me or Aunt Tilly piddling with this and that because we damn well
feel like it, or BigFoot company X going massively wild and crazy doing
their business thing.
Post by Tejun Heo
For systemd based systems,
systemd serving that role isn't too crazy. It's sure gonna have
teething issues at the beginning but it has all the necessary
information to manage workloads on the system.
No, it's not at all crazy, _offering_ the user a managerial service is
great, generous, way to go guys, pass out the white hats. Use force,
and those pretty white hats turn black as night, hero to villain.
Post by Tejun Heo
A valid issue is interoperability between systemd and non-systemd
systems. I don't have an immediately good answer for that. I wrote
in another reply but making cgroup generally available is a pretty new
effort and we're still in the process of figuring out what the right
constructs and abstractions are. Hopefully, we'll be able to reach a
common set of abstractions to base things on top of in time.
systemd and no systemd is also a valid issue. I'm sure it'll all get
worked out, but that link, and others like it make me see bright red.

-Mike
Tejun Heo
2013-06-28 04:09:30 UTC
Hello, Mike.
Post by Mike Galbraith
Sure, because in "private property" and "mandatory agent", I see "gimme
yer wallet bitch", an incredibly arrogant and brutal mugging. That's
not the way it's meant, I know that, but that's how it comes across.
You asked, so you get the straight up answer.
I don't know. It reads more like a tongue-in-cheek thing to me rather
than being actually arrogant, and some part of the brutality is
necessary at this point.
Post by Mike Galbraith
Offering to manage cgroups is one thing, very generous, forcefully
placing itself between user and kernel quite another. Perhaps I
misread, but my interpretation was that the intent is to make systemd a
mandatory agent, even saw reference to it taking up residence in the
kernel tree (that bit made me chuckle, pull request would have to be
very cleverly worded methinks). I'm sure it will be quite capable, its
authors are. However, when I want to talk to my kernel, I expect to be
able to tell anyone else using the phone to hang up.. now.
I don't know how to respond to this. It feels more emotional than
technical.
Post by Mike Galbraith
It's useful now, usable to the point that enterprise users exist who
have integrated cgroups into their business model. But then you know
that. Sure, there are problems, things could and no doubt will get a
lot better.
No, it's completely messed up. We're now starting to see users trying
to embed low level cgroup details into their binaries and cgroup is
exposing sysctl-level knobs which are directly tied to internal
implementation of core subsystems. cgroup successfully bypassed the
usual kernel API policing with the help of hierarchical filesystem
interface which allows delegation on the surface. We completely
fucked up. This is a full scale disaster unrolling.
Post by Mike Galbraith
However, wrt userspace agent, no agent is going to be the right answer
for all, so that agent needs to have a step aside button so another
agent can be tasked with the managerial duties, whether that be little
ole /me or Aunt Tilly piddling with this and that because we damn well
feel like it, or BigFoot company X going massively wild and crazy doing
their business thing.
*ANY* agent is better than now. We need to back the hell out of
direct usages as soon as possible. cgroup is leaking kernel
implementation details into individual binaries. The current
situation is dangerous and putting an agent in between is a good way of
gradually backing out of it.
Post by Mike Galbraith
No, it's not at all crazy, _offering_ the user a managerial service is
great, generous, way to go guys, pass out the white hats. Use force,
and those pretty white hats turn black as night, hero to villain.
No, it's completely crazy. Full psycho crazy. You just don't realize
it yet.
Post by Mike Galbraith
systemd and no systemd is also a valid issue. I'm sure it'll all get
worked out, but that link, and others like it make me see bright red.
That red is nothing compared to the kernel implementation detail leak
going on right now. The alarm for that has been blinking
psychedelically for some time now.

Thanks.
--
tejun
Mike Galbraith
2013-06-28 04:49:10 UTC
Post by Tejun Heo
No, it's completely messed up. We're now starting to see users trying
to embed low level cgroup details into their binaries and cgroup is
exposing sysctl-level knobs which are directly tied to internal
implementation of core subsystems. cgroup successfully bypassed the
usual kernel API policing with the help of hierarchical filesystem
interface which allows delegation on the surface. We completely
fucked up. This is a full scale disaster unrolling.
I always thought that was a very cool feature, mkdir+echo, poof done.
Now maybe that interface is suboptimal for serious usage, but it makes
the things usable via dirt simple scripts, very flexible, nice.
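
For reference, a minimal sketch of what "mkdir+echo" amounts to, written
as a small Python helper. The function name make_group is hypothetical,
and the root argument is assumed to be wherever a controller hierarchy is
mounted (any scratch directory works for trying it out);
memory.limit_in_bytes is the memcg hard-limit knob of the current
interface.

```python
import os


def make_group(root, name, settings):
    """Create a cgroup directory and write each knob, like mkdir + echo."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)            # mkdir <root>/<name>
    for knob, value in settings.items():
        with open(os.path.join(path, knob), "w") as f:
            f.write(str(value))                 # echo <value> > <knob>
    return path


if __name__ == "__main__":
    import tempfile

    # A scratch directory stands in for a mounted hierarchy here.
    with tempfile.TemporaryDirectory() as root:
        group = make_group(root, "backup", {"memory.limit_in_bytes": 512 << 20})
        print(sorted(os.listdir(group)))
```

Against a real hierarchy the caller needs write permission on the parent
directory, which is exactly the file-permission delegation under
discussion in this thread.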

But whatever, not my call, you know your business better than I. If
a mandatory agent happens, fine, but imho that will be a sad day.

-Mike
Tejun Heo
2013-06-28 05:01:38 UTC
Hello, Mike.
Post by Mike Galbraith
I always thought that was a very cool feature, mkdir+echo, poof done.
Now maybe that interface is suboptimal for serious usage, but it makes
the things usable via dirt simple scripts, very flexible, nice.
Oh, that in itself is not bad. I mean, if you're root, it's pretty
easy to play with and that part is fine. But combined with the
hierarchical nature of cgroup and file permissions, it encourages
people to "delegate" subdirectories to less privileged domains, which
in turn leads to normal binaries to manipulate them directly, which is
where the horror begins. We end up exposing control knobs which are
tightly coupled to kernel implementation details right into lay
binaries and scripts directly used by end users.

I think this is the first time this happened, which is probably why
nobody really noticed the mess earlier.

Anyways, if you're root, you can keep doing whatever you want. You
could be stepping on the centralized agent's toes a bit and vice-versa
but I don't think that's gonna be disastrous. What I'm trying to
stamp out is direct usages from !root domains and !system-management
binaries / scripts. They absolutely have to go. There's no question
about it and I'll take a totalitarian userland agent any day over the
current mess.

Eventually, I think we'll be able to reach an equilibrium where most
things are reasonable and we'll be exploring the acceptable limits of
flexibility again, but right now, please bear with the brutality.
We're way over the line and I can't see a way back which isn't gonna
sting a bit. I am, and will keep, trying to make it as painless as
possible.

Thanks!
--
tejun
Mike Galbraith
2013-06-28 06:00:04 UTC
Post by Tejun Heo
Anyways, if you're root, you can keep doing whatever you want. You
could be stepping on the centralized agent's toes a bit and vice-versa
Keep on truckn' sounds good, that vice-versa toe stomping not so good,
but yeah, until systemd or ilk grows the ability to shut me down, I
shouldn't feel any burning need to introduce it to my machete.
Post by Tejun Heo
but I don't think that's gonna be disastrous. What I'm trying to
stamp out is direct usages from !root domains and !system-management
binaries / scripts. They absolutely have to go. There's no question
about it and I'll take a totalitarian userland agent any day over the
current mess.
I get some of the why.. and yeah, it's the dirt simple usage that I care
about most, not the big hairy problem cases you're trying to address.
Post by Tejun Heo
Eventually, I think we'll be able to reach an equilibrium where most
things are reasonable and we'll be exploring the acceptable limits of
flexibility again, but right now, please bear with the brutality.
We're way over the line and I can't see a way back which isn't gonna
sting a bit. I am, and will keep, trying to make it as painless as
possible.
Keep on driving, and thanks for listening. Aaaooooo ;-)

-Mike
Michal Hocko
2013-06-28 15:05:13 UTC
Post by Tejun Heo
Hello, Mike.
Post by Mike Galbraith
I always thought that was a very cool feature, mkdir+echo, poof done.
Now maybe that interface is suboptimal for serious usage, but it makes
the things usable via dirt simple scripts, very flexible, nice.
Oh, that in itself is not bad. I mean, if you're root, it's pretty
easy to play with and that part is fine. But combined with the
hierarchical nature of cgroup and file permissions, it encourages
people to "delegate" subdirectories to less privileged domains,
OK, this really depends on what you expose to non-root users. I have
seen use cases where admin prepares top-level which is root-only but
it allows creating sub-groups which are under _full_ control of the
subdomain. This worked nicely for memcg for example because hard limit,
oom handling and other knobs are hierarchical so the subdomain cannot
overwrite what admin has said.
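
A minimal sketch of why a hierarchical hard limit is safe to delegate,
assuming the simplified model that usage is charged to every ancestor, so
the limit that actually binds is the smallest one on the path from the
root. The function name effective_limit and the dictionary layout are
hypothetical, for illustration only.

```python
def effective_limit(limits, cgroup):
    """Return the smallest limit set on `cgroup` or any of its ancestors.

    `limits` maps cgroup paths to byte limits; a missing entry means
    "no limit set at this level". Returns None if nothing is limited.
    """
    parts = [p for p in cgroup.split("/") if p]
    best = limits.get("/")
    for depth in range(1, len(parts) + 1):
        own = limits.get("/" + "/".join(parts[:depth]))
        if own is not None and (best is None or own < best):
            best = own
    return best


if __name__ == "__main__":
    GiB = 1 << 30
    # The admin caps /users at 1 GiB; the subdomain asks for 4 GiB.
    limits = {"/users": 1 * GiB, "/users/alice": 4 * GiB}
    print(effective_limit(limits, "/users/alice") == 1 * GiB)
```

Whatever the subdomain writes into its own group, the admin's limit
further up still applies; the subdomain can only make things tighter.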
Post by Tejun Heo
which
in turn leads to normal binaries to manipulate them directly, which is
where the horror begins. We end up exposing control knobs which are
tightly coupled to kernel implementation details right into lay
binaries and scripts directly used by end users.
I think this is the first time this happened, which is probably why
nobody really noticed the mess earlier.
Anyways, if you're root, you can keep doing whatever you want.
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?

This is not quite in line with the "libcgroup is dead and others have to
migrate to systemd as well" statements from the link posted earlier.
I really do not think that _any_ central agent will understand my
requirements and needs so I need a way to talk to cgroupfs somehow - I
have used libcgroup so far but touching cgroupfs is quite convenient
as well.

And systemd, with its history of eating projects and not caring much
about their previous users who are not willing to jump into the systemd
car, doesn't sound to me like a good place to put the new interface.

[...]
--
Michal Hocko
SUSE Labs
Vivek Goyal
2013-06-28 18:01:55 UTC
Post by Michal Hocko
Post by Tejun Heo
Hello, Mike.
Post by Mike Galbraith
I always thought that was a very cool feature, mkdir+echo, poof done.
Now maybe that interface is suboptimal for serious usage, but it makes
the things usable via dirt simple scripts, very flexible, nice.
Oh, that in itself is not bad. I mean, if you're root, it's pretty
easy to play with and that part is fine. But combined with the
hierarchical nature of cgroup and file permissions, it encourages
people to "delegate" subdirectories to less privileged domains,
OK, this really depends on what you expose to non-root users. I have
seen use cases where admin prepares top-level which is root-only but
it allows creating sub-groups which are under _full_ control of the
subdomain. This worked nicely for memcg for example because hard limit,
oom handling and other knobs are hierarchical so the subdomain cannot
overwrite what admin has said.
Post by Tejun Heo
which
in turn leads to normal binaries to manipulate them directly, which is
where the horror begins. We end up exposing control knobs which are
tightly coupled to kernel implementation details right into lay
binaries and scripts directly used by end users.
I think this is the first time this happened, which is probably why
nobody really noticed the mess earlier.
Anyways, if you're root, you can keep doing whatever you want.
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?
Do you use that daemon in practice? For user session logins, I think
systemd has plans to put user sessions in a cgroup (kind of making
pam_cgroup redundant).

The other functionality rulesengined provided was moving tasks automatically
into a cgroup based on executable name. I think that was racy and not
many people liked it.

IIUC, systemd can't disable access to cgroupfs from other utilities.
So most likely rulesengined should continue to work. But having both
systemd and libcgroup might not make much sense.

Thanks
Vivek
Daniel P. Berrange
2013-06-28 19:59:17 UTC
Post by Vivek Goyal
Post by Michal Hocko
Post by Tejun Heo
Hello, Mike.
Post by Mike Galbraith
I always thought that was a very cool feature, mkdir+echo, poof done.
Now maybe that interface is suboptimal for serious usage, but it makes
the things usable via dirt simple scripts, very flexible, nice.
Oh, that in itself is not bad. I mean, if you're root, it's pretty
easy to play with and that part is fine. But combined with the
hierarchical nature of cgroup and file permissions, it encourages
people to "delegate" subdirectories to less privileged domains,
OK, this really depends on what you expose to non-root users. I have
seen use cases where admin prepares top-level which is root-only but
it allows creating sub-groups which are under _full_ control of the
subdomain. This worked nicely for memcg for example because hard limit,
oom handling and other knobs are hierarchical so the subdomain cannot
overwrite what admin has said.
Post by Tejun Heo
which
in turn leads to normal binaries to manipulate them directly, which is
where the horror begins. We end up exposing control knobs which are
tightly coupled to kernel implementation details right into lay
binaries and scripts directly used by end users.
I think this is the first time this happened, which is probably why
nobody really noticed the mess earlier.
Anyways, if you're root, you can keep doing whatever you want.
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?
Do you use that daemon in practice? For user session logins, I think
systemd has plans to put user sessions in a cgroup (kind of making
pam_cgroup redundant).
The other functionality rulesengined provided was moving tasks automatically
into a cgroup based on executable name. I think that was racy and not
many people liked it.
Regardless of the changes being proposed, IMHO, the cgrulesd should
never be used. It is just outright dangerous for a daemon to be
arbitrarily re-arranging what cgroups a process is placed in without
the applications being aware of it. It can only be safely used in a
scenario where cgroups are exclusively used by the administrator,
and never used by applications for their own needs.
Post by Vivek Goyal
IIUC, systemd can't disable access to cgroupfs from other utilities.
The kernel could expose a knob that would allow systemd to lock that
down.
Post by Vivek Goyal
So most likely rulesengined should continue to work. But having both
systemd and libcgroup might not make much sense though.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Serge Hallyn
2013-06-28 22:40:53 UTC
Post by Daniel P. Berrange
Post by Vivek Goyal
Post by Michal Hocko
Post by Tejun Heo
Hello, Mike.
Post by Mike Galbraith
I always thought that was a very cool feature, mkdir+echo, poof done.
Now maybe that interface is suboptimal for serious usage, but it makes
the things usable via dirt simple scripts, very flexible, nice.
Oh, that in itself is not bad. I mean, if you're root, it's pretty
easy to play with and that part is fine. But combined with the
hierarchical nature of cgroup and file permissions, it encourages
people to "delegate" subdirectories to less privileged domains,
OK, this really depends on what you expose to non-root users. I have
seen use cases where admin prepares top-level which is root-only but
it allows creating sub-groups which are under _full_ control of the
subdomain. This worked nicely for memcg for example because hard limit,
oom handling and other knobs are hierarchical so the subdomain cannot
overwrite what admin has said.
Post by Tejun Heo
which
in turn leads to normal binaries to manipulate them directly, which is
where the horror begins. We end up exposing control knobs which are
tightly coupled to kernel implementation details right into lay
binaries and scripts directly used by end users.
I think this is the first time this happened, which is probably why
nobody really noticed the mess earlier.
Anyways, if you're root, you can keep doing whatever you want.
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?
Do you use that daemon in practice? For user session logins, I think
systemd has plans to put user sessions in a cgroup (kind of making
pam_cgroup redundant).
The other functionality rulesengined provided was moving tasks automatically
into a cgroup based on executable name. I think that was racy and not
many people liked it.
Regardless of the changes being proposed, IMHO, the cgrulesd should
never be used. It is just outright dangerous for a daemon to be
arbitrarily re-arranging what cgroups a process is placed in without
the applications being aware of it. It can only be safely used in a
scenario where cgroups are exclusively used by the administrator,
and never used by applications for their own needs.
Even then it's not safe, since if the program quickly forks or clones a
few times, you can end up with some of the tasks being reclassified
and some not.
Post by Daniel P. Berrange
Post by Vivek Goyal
IIUC, systemd can't disable access to cgroupfs from other utilities.
The kernel could expose a knob that would allow systemd to lock that
down.
Gah - why would you give him that idea? :)

But yes, I'd sort of assume that was coming, eventually.

-serge
Tejun Heo
2013-06-28 22:43:10 UTC
Post by Serge Hallyn
Post by Daniel P. Berrange
The kernel could expose a knob that would allow systemd to lock that
down.
Gah - why would you give him that idea? :)
That's one of the ideas I had from the beginning.
Post by Serge Hallyn
But yes, I'd sort of assume that was coming, eventually.
But I think we'll probably settle with a mechanism to find out whether
someone else is touching the hierarchy, which will be generally useful
for other consumers of cgroup too.

Thanks.
--
tejun
Michal Hocko
2013-06-30 18:38:38 UTC
[...]
Post by Vivek Goyal
Post by Michal Hocko
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?
Do you use that daemon in practice?
I do not, but my users do. And that is why I care.
Post by Vivek Goyal
For user session logins, I think systemd has plans to put user
sessions in a cgroup (kind of making pam_cgroup redundant).
The other functionality rulesengined provided was moving tasks automatically
into a cgroup based on executable name. I think that was racy and not
many people liked it.
It doesn't make sense for short lived processes, all right, but it can
be useful for those that live for a long time.
Post by Vivek Goyal
IIUC, systemd can't disable access to cgroupfs from other utilities.
The previous messages read otherwise. And that is why this raised a red
flag on many fronts.
Post by Vivek Goyal
So most likely rulesengined should continue to work. But having both
systemd and libcgroup might not make much sense though.
Thanks
Vivek
--
Michal Hocko
SUSE Labs
Vivek Goyal
2013-07-15 18:49:40 UTC
Post by Michal Hocko
[...]
Post by Vivek Goyal
Post by Michal Hocko
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?
Do you use that daemon in practice?
I do not, but my users do. And that is why I care.
Michal,

would you have more details of how exactly those users are using
the rules engine daemon?

To me rulesengined processed 3 kinds of rules.

- uid based
- gid based
- exec file path based
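
A minimal sketch of how these three rule kinds could be matched, purely
for illustration: the Rule type, the classify function and the
first-match-wins ordering are assumptions made here and do not reproduce
the daemon's actual configuration format or algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    cgroup: str                  # destination cgroup for matching tasks
    uid: Optional[int] = None    # match on user id, if set
    gid: Optional[int] = None    # match on group id, if set
    exe: Optional[str] = None    # match on executable path, if set


def classify(rules, uid, gid, exe):
    """Return the cgroup of the first rule whose set fields all match."""
    for rule in rules:
        if rule.uid is not None and rule.uid != uid:
            continue
        if rule.gid is not None and rule.gid != gid:
            continue
        if rule.exe is not None and rule.exe != exe:
            continue
        return rule.cgroup
    return None                  # no rule applies; leave the task alone


if __name__ == "__main__":
    rules = [
        Rule("/backup", exe="/usr/bin/rsync"),   # exec file path based
        Rule("/users/1000", uid=1000),           # uid based
        Rule("/staff", gid=50),                  # gid based
    ]
    print(classify(rules, 1000, 50, "/usr/bin/firefox"))
```

Note that the uid and gid are known at login time, which is why a PAM
module can handle them, whereas the executable path is only known after
exec and so has to be handled after the fact.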

uid/gid based rule execution can be taken care of by the pam_cgroup module too.
So I think one should not need cgrulesengined for that.

I am curious what kind of exec rules are useful. Any placement of
services can be done using systemd. So the only executables we are left
to manage are those which are not services.

In practice, is it really useful for an admin to say that if "firefox" is
launched by a user then it should run in cgroup xyz? And if the user cares
about firefox running in a sub-cgroup, they can always use cgexec to do
that.

Thanks
Vivek
Michal Hocko
2013-07-23 14:48:16 UTC
Post by Vivek Goyal
Post by Michal Hocko
[...]
Post by Vivek Goyal
Post by Michal Hocko
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?
Do you use that daemon in practice?
I do not, but my users do. And that is why I care.
Michal,
would you have more details of how exactly those users are using
the rules engine daemon?
The most common usage is uid and exec names.
Post by Vivek Goyal
To me rulesengined processed 3 kinds of rules.
- uid based
- gid based
- exec file path based
uid/gid based rule execution can be taken care of by the pam_cgroup module too.
So I think one should not need cgrulesengined for that.
I am not familiar with pam_cgroup much but it is a part of libcgroup
package, right?
Post by Vivek Goyal
I am curious what kind of exec rules are useful. Any placement of
services can be done using systemd. So the only executables we are left
to manage are those which are not services.
Yes, those are usually backup processes which should not disrupt the
regular server workload.

uid ones are used to keep a leash on local users of the machine but I do
not have many details as I usually do not have access to those machines.
All I see are complaints when something explodes ;)
--
Michal Hocko
SUSE Labs
Tejun Heo
2013-06-28 18:30:59 UTC
Hello, Michal.
Post by Michal Hocko
OK, this really depends on what you expose to non-root users. I have
seen use cases where admin prepares top-level which is root-only but
it allows creating sub-groups which are under _full_ control of the
subdomain. This worked nicely for memcg for example because hard limit,
oom handling and other knobs are hierarchical so the subdomain cannot
overwrite what admin has said.
Some knobs are safer than others and memcg probably has it easy as it
doesn't implement proportional control. But, even then, there's a
huge chasm between cgroup knobs and proper kernel API visible to
normal programs. Just imagine exposing memcg features by extending
rlimits. It'll take months if not a couple years ironing out the API
details and going through review process, and rightfully so, these
things, once published and made widely available, can't be taken back.
Now compare that to how we decide what knobs to expose in cgroup. I
mean, you even recently suggested flipping the default polarity of
soft limit knob.

cgroup's interface standard is very low. It's probably a notch higher
than boot params but about at the same level as sysctl knobs. It
isn't necessarily a bad thing as it allows us to rapidly explore
various options and expose useable things in a very agile manner, but
we should be very aware of how widely the interface is exposed;
otherwise, we'd be exposing features and leaking kernel implementation
details directly into userland programs without going through proper
review process or building consensus, which, in the long term, is
gonna be much worse than not having the feature exposed at all.

"It works for special cases XXX and YYY" is a very poor and extremely
short-sighted argument when the whole approach is breaching the very
fundamentals of kernel API conventions.

In addition, I really don't think cgroup is the right interface to
directly expose to individual programs. As a management thing, it
does make some sense but kernel API already has its, at times ancient
but, generally working hierarchy and inheritance rules and conventions
and primitive resource control constructs - nice, ionice, rlimits and
so on. If exposing cgroup-level resource control directly to
individual applications proves to be beneficial enough, what we should
do is extend those things. The backend sure can be supported by
cgroups but this mkdiring and echoing things with separate hierarchy
from the usual process hierarchy isn't something which should be
visible to individual applications.
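
For comparison, a minimal sketch of how one of the existing primitives mentioned above already follows the process hierarchy: an rlimit set between fork() and exec() is still in force in the program that gets executed. The helper name `child_core_limit` is made up for this example, and RLIMIT_CORE was picked only because lowering its soft limit to 0 is always permitted.

```python
import resource
import subprocess
import sys


def child_core_limit():
    """Start a child with its soft core-dump limit set to 0; return what it sees."""

    def lower_limit():
        # Runs in the child between fork() and exec(); the limit survives exec().
        _, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

    result = subprocess.run(
        [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"],
        preexec_fn=lower_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)
```

No separate directory tree is involved here: the limit travels with the process, which is the convention the paragraph above refers to.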

Currently, I'm not convinced that this is something which should be
exposed to individual applications, but I sure can be wrong. But,
right now, let's first get the existing part settled. We can worry
about the rest later.

Also, in light of the rather sneaky subversion that happened with the
cgroup filesystem interface, I wonder whether we need to add some sort
of generic warning mechanism which warns when permissions of pseudo
file systems like cgroupfs are delegated to lesser security domains.
In itself, it could be harmless but it can serve as a useful beacon.
Not sure to what extent or how tho.
Post by Michal Hocko
OK, so libcgroup's rules daemon will still work and place my tasks in
appropriate cgroups?
You have two competing managers of the same hierarchy. There are ways
to make them not interfere with each other too much but ultimately
it's gonna be something clunky. That said, libcgroup itself is pretty
clunky, so maybe you'll be okay with it. I don't know.
Post by Michal Hocko
This is not quite on par with "libcgroup is dead and others have to
migrate to systemd as well" statements from the link posted earlier.
I really do not think that _any_ central agent will understand my
requirements and needs so I need a way to talk to cgroupfs somehow - I
have used libcgroups so far but touching cgroupfs is quite convenient
as well.
As a developer who knows what's going on, I don't think it'd be too
difficult to meddle with things manually with or without the central
manager. It'll complain that someone else is meddling with the cgroup
hierarchy and some functionalities might not work as expected, but I
don't think it'll lock you out.

At the same time, while us, the developers, having the level of
latitude required to do our work is necessary, that shouldn't be the
overruling focal point of the design of the whole system. It's
something to be used and supporting the actual use cases should be the
priority. I'm not saying developer convenience is not important but
that it's not the only thing which matters. The way I see it, cgroup
has basically been a playground for devs going wild without too much,
if any, thought on how it'll actually be useable and useful to wider
audience, so let's please adjust our priorities a bit.

And, no, I don't believe that the use cases are so wildly different
that we can't have a capable enough central manager. That's usually a
symptom of not understanding the problem space well enough and how one
ends up with mess like e.g. grub2 configuration. There sure are and
will be outliers but it should be possible to come up with something
which can serve most of the use cases reasonably well, and right now,
I believe that should be the focus.
Post by Michal Hocko
And the systemd, with its history of eating projects and not caring much
about their previous users who are not willing to jump in to the systemd
car, doesn't sound like a good place where to place the new interface to
me.
That part I don't know. I really don't care whether it's systemd or
something else but it sure seems there are people who dislike it with
passion. To me, it seems rather silly but to each his/her own. Maybe
ubuntu will come up with their own manager paired with upstart and
people can use that one instead? Who knows.

Thanks.
--
tejun
Tim Hockin
2013-06-28 18:53:13 UTC
Permalink
Post by Michal Hocko
Post by Tejun Heo
Oh, that in itself is not bad. I mean, if you're root, it's pretty
easy to play with and that part is fine. But combined with the
hierarchical nature of cgroup and file permissions, it encourages
people to "delegate" subdirectories to less privileged domains,
OK, this really depends on what you expose to non-root users. I have
seen use cases where admin prepares top-level which is root-only but
it allows creating sub-groups which are under _full_ control of the
subdomain. This worked nicely for memcg for example because hard limit,
oom handling and other knobs are hierarchical so the subdomain cannot
overwrite what admin has said.
bingo
Post by Michal Hocko
And the systemd, with its history of eating projects and not caring much
about their previous users who are not willing to jump in to the systemd
car, doesn't sound like a good place where to place the new interface to
me.
+1

If systemd is the only upstream implementation of this single-agent
idea, we will have to invent our own, and continue to diverge rather
than converge. I think that, if we are going to pursue this model of
a single-agent, we should make a kick-ass implementation that is
flexible and scalable, and full-featured enough to not require
divergence at the lowest layer of the stack. Then build systemd on
top of that. Let systemd offer more features and policies and
"semantic" APIs.

We will build our own semantic APIs that are, necessarily, different
from systemd. But we can all use the same low-level mechanism.

Tim
Vrijendra (वृजेन्द्र) Gokhale
2013-06-28 19:01:56 UTC
Permalink
Post by Tim Hockin
Post by Michal Hocko
Post by Tejun Heo
Oh, that in itself is not bad. I mean, if you're root, it's pretty
easy to play with and that part is fine. But combined with the
hierarchical nature of cgroup and file permissions, it encourages
people to "delegate" subdirectories to less privileged domains,
OK, this really depends on what you expose to non-root users. I have
seen use cases where admin prepares top-level which is root-only but
it allows creating sub-groups which are under _full_ control of the
subdomain. This worked nicely for memcg for example because hard limit,
oom handling and other knobs are hierarchical so the subdomain cannot
overwrite what admin has said.
bingo
Note that we also use cpu and io hierarchies as user accessible hierarchies.

This makes delegation possible to google workloads for subset (sub-cgroups)
creation and monitoring.
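
The delegation argument quoted above rests on limits being hierarchical: whatever a delegated sub-group writes, the tighter ancestor limit still applies. A rough model of that rule follows, assuming the effective limit is simply the minimum along the path from the top. It is not the kernel's implementation, and the group names and byte values are invented.

```python
import math


def effective_limit(limits, path):
    """Smallest limit set on `path` or any of its ancestors (inf if none is set)."""
    parts = path.strip("/").split("/")
    best = math.inf
    for depth in range(1, len(parts) + 1):
        ancestor = "/".join(parts[:depth])
        best = min(best, limits.get(ancestor, math.inf))
    return best


# Hypothetical hierarchy: the admin owns "tenant", the user owns everything below it.
LIMITS = {
    "tenant": 512 * 2**20,            # set by the admin
    "tenant/job": 4 * 2**30,          # the user asks for more than the parent allows
    "tenant/job/small": 64 * 2**20,   # the user restricts a child further
}
```

Here "tenant/job" is still held to the admin's 512 MiB, while "tenant/job/small" gets its own smaller 64 MiB.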
Post by Tim Hockin
Post by Michal Hocko
And the systemd, with its history of eating projects and not caring much
about their previous users who are not willing to jump in to the systemd
car, doesn't sound like a good place where to place the new interface to
me.
+1
If systemd is the only upstream implementation of this single-agent
idea, we will have to invent our own, and continue to diverge rather
than converge. I think that, if we are going to pursue this model of
a single-agent, we should make a kick-ass implementation that is
flexible and scalable, and full-featured enough to not require
divergence at the lowest layer of the stack. Then build systemd on
top of that. Let systemd offer more features and policies and
"semantic" APIs.
We will build our own semantic APIs that are, necessarily, different
from systemd. But we can all use the same low-level mechanism.
Tim
Lennart Poettering
2013-06-29 01:48:16 UTC
Permalink
Post by Tim Hockin
a single-agent, we should make a kick-ass implementation that is
flexible and scalable, and full-featured enough to not require
divergence at the lowest layer of the stack. Then build systemd on
top of that. Let systemd offer more features and policies and
"semantic" APIs.
Well, what if systemd is already kick-ass? I mean, if you have a problem
with systemd, then that's your own problem, but I really don't see why
I should bother?

I for sure am not going to make the PID 1 a client of another daemon.
That's just wrong. If you have a daemon that is both conceptually the
manager of another service and the client of that other service, then
that's bad design and you will easily run into deadlocks and such. Just
think about it: if you have some external daemon for managing cgroups,
and you need cgroups for running external daemons, how are you going to
start the external daemon for managing cgroups? Sure, you can hack
around this, make that daemon special, and magic, and stuff -- or you
can just not do such nonsense. There's no reason to repeat the fuckup
that cgroup became in kernelspace a second time, but this time in
userspace, with multiple manager daemons all with different and slightly
incompatible definitions of what a unit to manage actually is...

We want to run fewer, simpler things on our systems, we want to reuse as
much of the code as we can. You don't achieve that by running yet
another daemon that does worse what systemd can anyway do simpler,
easier and better.

The least you could grant us is to have a look at the final APIs we will
have to offer before you already imply that systemd cannot be a valid
implementation of any API people could ever agree on.

Lennart
Tim Hockin
2013-06-29 03:05:43 UTC
Permalink
Come on, now, Lennart. You put a lot of words in my mouth.
Post by Lennart Poettering
Post by Tim Hockin
a single-agent, we should make a kick-ass implementation that is
flexible and scalable, and full-featured enough to not require
divergence at the lowest layer of the stack. Then build systemd on
top of that. Let systemd offer more features and policies and
"semantic" APIs.
Well, what if systemd is already kick-ass? I mean, if you have a problem
with systemd, then that's your own problem, but I really don't see why I
should bother?
I didn't say it wasn't. I said that we can build a common substrate
that systemd can build on *and* non-systemd systems can use *and*
Google can participate in.
Post by Lennart Poettering
I for sure am not going to make the PID 1 a client of another daemon. That's
just wrong. If you have a daemon that is both conceptually the manager of
another service and the client of that other service, then that's bad design
and you will easily run into deadlocks and such. Just think about it: if you
have some external daemon for managing cgroups, and you need cgroups for
running external daemons, how are you going to start the external daemon for
managing cgroups? Sure, you can hack around this, make that daemon special,
and magic, and stuff -- or you can just not do such nonsense. There's no
reason to repeat the fuckup that cgroup became in kernelspace a second time,
but this time in userspace, with multiple manager daemons all with different
and slightly incompatible definitions of what a unit to manage actually is...
I forgot about the tautology of systemd. systemd is monolithic.
Therefore it can not have any external dependencies. Therefore it
must absorb anything it depends on. Therefore systemd continues to
grow in size and scope. Up next: systemd manages your X sessions!

But that's not my point. It seems pretty easy to make this cgroup
management (in "native mode") a library that can have either a thin
veneer of a main() function, while also being usable by systemd. The
point is to solve all of the problems ONCE. I'm trying to make the
case that systemd itself should be focusing on features and policies
and awesome APIs.
Post by Lennart Poettering
We want to run fewer, simpler things on our systems, we want to reuse as
Fewer and simpler are not compatible, unless you are losing
functionality. Systemd is fewer, but NOT simpler.
Post by Lennart Poettering
much of the code as we can. You don't achieve that by running yet another
daemon that does worse what systemd can anyway do simpler, easier and
better.
Considering this is all hypothetical, I find this to be a funny
debate. My hypothetical idea is better than your hypothetical idea.
Post by Lennart Poettering
The least you could grant us is to have a look at the final APIs we will
have to offer before you already imply that systemd cannot be a valid
implementation of any API people could ever agree on.
Whoah, don't get defensive. I said nothing of the sort. The fact of
the matter is that we do not run systemd, at least in part because of
the monolithic nature. That's unlikely to change in this timescale.
What I said was that it would be a shame if we had to invent our own
low-level cgroup daemon just because the "upstream" daemon was too
tightly coupled with systemd.

I think we have a lot of experience to offer to this project, and a
vested interest in seeing it done well. But if it is purely
targeting systemd, we have little incentive to devote resources to
it.

Please note that I am strictly talking about the lowest layer of the
API. Just the thing that guards cgroupfs against mere mortals. The
higher layers - where abstractions exist, that are actually USEFUL to
end users - are not really in scope right now. We already have our
own higher level APIs.

This is supposed to be collaborative, not combative.

Tim
Lennart Poettering
2013-06-30 19:39:34 UTC
Permalink
Heya,
Post by Tim Hockin
Come on, now, Lennart. You put a lot of words in my mouth.
Post by Lennart Poettering
I for sure am not going to make the PID 1 a client of another daemon. That's
just wrong. If you have a daemon that is both conceptually the manager of
another service and the client of that other service, then that's bad design
and you will easily run into deadlocks and such. Just think about it: if you
have some external daemon for managing cgroups, and you need cgroups for
running external daemons, how are you going to start the external daemon for
managing cgroups? Sure, you can hack around this, make that daemon special,
and magic, and stuff -- or you can just not do such nonsense. There's no
reason to repeat the fuckup that cgroup became in kernelspace a second time,
but this time in userspace, with multiple manager daemons all with different
and slightly incompatible definitions of what a unit to manage actually is...
I forgot about the tautology of systemd. systemd is monolithic.
systemd is certainly not monolithic for almost any definition of that
term. I am not sure where you are taking that from, and I am not sure I
want to discuss on that level. This just sounds like FUD you picked up
somewhere and are repeating carelessly...
Post by Tim Hockin
But that's not my point. It seems pretty easy to make this cgroup
management (in "native mode") a library that can have either a thin
veneer of a main() function, while also being usable by systemd. The
point is to solve all of the problems ONCE. I'm trying to make the
case that systemd itself should be focusing on features and policies
and awesome APIs.
You know, getting this all right isn't easy. If you want to do things
properly, then you need to propagate attribute changes between the units
you manage. You also need something like a scheduler, since a number of
controllers can only be configured under certain external conditions
(for example: the blkio or devices controller use major/minor parameters
for configuring per-device limits. Since major/minor assignments are
pretty much unpredictable these days -- and users probably want to
configure things with friendly and stable /dev/disk/by-id/* symlinks
anyway -- this requires us to wait for devices to show up before we can
configure the parameters.) Soo... you need a graph of units, where you
can propagate things, and schedule things based on some execution/event
queue. And the propagation and scheduling are closely intermingled.
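
A small sketch of the problem described in the paragraph above: per-device limits are written as "major:minor value", so a setting given as a /dev/disk/by-id/* path has to wait until that path resolves. The class name `DeferredThrottle` and the polling-style `flush` are assumptions made for illustration (a real manager would more likely react to device events), and the line format is that of the cgroup v1 blkio.throttle.*_bps_device files.

```python
import os
import stat
from collections import defaultdict


def device_numbers(path):
    """Return (major, minor) for a block device path, or None if it is not there yet."""
    try:
        st = os.stat(path)  # follows symlinks such as /dev/disk/by-id/*
    except FileNotFoundError:
        return None
    if not stat.S_ISBLK(st.st_mode):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)


class DeferredThrottle:
    """Queue per-device byte-rate limits until the device has shown up."""

    def __init__(self):
        self.pending = defaultdict(list)  # device path -> [bytes_per_sec, ...]

    def request(self, path, bytes_per_sec):
        self.pending[path].append(bytes_per_sec)

    def flush(self, resolve=device_numbers):
        """Return 'major:minor value' lines for every device that resolves now."""
        lines = []
        for path in list(self.pending):
            numbers = resolve(path)
            if numbers is None:
                continue  # still missing: keep it queued
            major, minor = numbers
            for rate in self.pending.pop(path):
                lines.append(f"{major}:{minor} {rate}")
        return lines
```

The `resolve` parameter exists only so the queueing logic can be exercised without a real block device; by default it uses `device_numbers`.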

Now, that's pretty much exactly what systemd actually *is*. It
implements a graph of units with a scheduler. And if you rip that part
out of systemd to make this an "easy cgroup management library", then
you simply turn what systemd is into a library without leaving anything.
Which is just bogus.

So no, if you say "seems pretty easy to make this cgroup management a
library" then well, I have to disagree with you.
Post by Tim Hockin
Post by Lennart Poettering
We want to run fewer, simpler things on our systems, we want to reuse as
Fewer and simpler are not compatible, unless you are losing
functionality. Systemd is fewer, but NOT simpler.
Oh, certainly it is. If we'd split up the cgroup fs access into
separate daemon of some kind, then we'd need some kind of IPC for that,
and so you have more daemons and you have some complex IPC between the
processes. So yeah, the systemd approach is certainly both simpler and
uses fewer daemons than your hypothetical one.
Post by Tim Hockin
Post by Lennart Poettering
much of the code as we can. You don't achieve that by running yet another
daemon that does worse what systemd can anyway do simpler, easier and
better.
Considering this is all hypothetical, I find this to be a funny
debate. My hypothetical idea is better than your hypothetical idea.
Well, systemd is pretty real, and the code to do the unified cgroup
management within systemd is pretty complete. systemd is certainly not
hypothetical.
Post by Tim Hockin
Post by Lennart Poettering
The least you could grant us is to have a look at the final APIs we will
have to offer before you already imply that systemd cannot be a valid
implementation of any API people could ever agree on.
Whoah, don't get defensive. I said nothing of the sort. The fact of
the matter is that we do not run systemd, at least in part because of
the monolithic nature. That's unlikely to change in this timescale.
Oh, my. I am not sure what makes you think it is monolithic.
Post by Tim Hockin
What I said was that it would be a shame if we had to invent our own
low-level cgroup daemon just because the "upstream" daemon was too
tightly coupled with systemd.
I have no interest to reimplement systemd as a library, just to make you
happy... I am quite happy with what we already have....
Post by Tim Hockin
This is supposed to be collaborative, not combative.
It certainly sounds *very* different in what you are writing.

Lennart
Tim Hockin
2013-07-01 06:06:18 UTC
Permalink
On Sun, Jun 30, 2013 at 12:39 PM, Lennart Poettering
Heya,
Post by Tim Hockin
Come on, now, Lennart. You put a lot of words in my mouth.
Post by Lennart Poettering
I for sure am not going to make the PID 1 a client of another daemon. That's
just wrong. If you have a daemon that is both conceptually the manager of
another service and the client of that other service, then that's bad design
and you will easily run into deadlocks and such. Just think about it: if you
have some external daemon for managing cgroups, and you need cgroups for
running external daemons, how are you going to start the external daemon for
managing cgroups? Sure, you can hack around this, make that daemon special,
and magic, and stuff -- or you can just not do such nonsense. There's no
reason to repeat the fuckup that cgroup became in kernelspace a second time,
but this time in userspace, with multiple manager daemons all with different
and slightly incompatible definitions of what a unit to manage actually is...
I forgot about the tautology of systemd. systemd is monolithic.
systemd is certainly not monolithic for almost any definition of that term.
I am not sure where you are taking that from, and I am not sure I want to
discuss on that level. This just sounds like FUD you picked up somewhere and
are repeating carelessly...
It does a number of sort-of-related things. Maybe it does them better
by doing them together. I can't say, really. We don't use it at
work, and I am on Ubuntu elsewhere, for now.
Post by Tim Hockin
But that's not my point. It seems pretty easy to make this cgroup
management (in "native mode") a library that can have either a thin
veneer of a main() function, while also being usable by systemd. The
point is to solve all of the problems ONCE. I'm trying to make the
case that systemd itself should be focusing on features and policies
and awesome APIs.
You know, getting this all right isn't easy. If you want to do things
properly, then you need to propagate attribute changes between the units you
manage. You also need something like a scheduler, since a number of
controllers can only be configured under certain external conditions (for
example: the blkio or devices controller use major/minor parameters for
configuring per-device limits. Since major/minor assignments are pretty much
unpredictable these days -- and users probably want to configure things with
friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to
wait for devices to show up before we can configure the parameters.) Soo...
you need a graph of units, where you can propagate things, and schedule
things based on some execution/event queue. And the propagation and
scheduling are closely intermingled.
I'm really just talking about the most basic low-level substrate of
writing to cgroupfs. Again, we don't use udev (yet?) so we don't have
these problems. It seems to me that it's possible to formulate a
bottom layer that is usable by both systemd and non-systemd systems.
But, you know, maybe I am wrong and our internal universe is so much
simpler (and behind the times) than the rest of the world that
layering can work for us and not you.
Now, that's pretty much exactly what systemd actually *is*. It implements a
graph of units with a scheduler. And if you rip that part out of systemd to
make this an "easy cgroup management library", then you simply turn what
systemd is into a library without leaving anything. Which is just bogus.
So no, if you say "seems pretty easy to make this cgroup management a
library" then well, I have to disagree with you.
Post by Tim Hockin
Post by Lennart Poettering
We want to run fewer, simpler things on our systems, we want to reuse as
Fewer and simpler are not compatible, unless you are losing
functionality. Systemd is fewer, but NOT simpler.
Oh, certainly it is. If we'd split up the cgroup fs access into separate
daemon of some kind, then we'd need some kind of IPC for that, and so you
have more daemons and you have some complex IPC between the processes. So
yeah, the systemd approach is certainly both simpler and uses fewer daemons
than your hypothetical one.
Well, it SOUNDS like Serge is trying to develop this to demonstrate
that a standalone daemon works. That's what I am keen to help with
(or else we have to invent ourselves). I am not really afraid of IPC
or of "more daemons". I much prefer simple agents doing one thing and
interacting with each other in simple ways. But that's me.
Post by Tim Hockin
Post by Lennart Poettering
much of the code as we can. You don't achieve that by running yet another
daemon that does worse what systemd can anyway do simpler, easier and
better.
Considering this is all hypothetical, I find this to be a funny
debate. My hypothetical idea is better than your hypothetical idea.
Well, systemd is pretty real, and the code to do the unified cgroup
management within systemd is pretty complete. systemd is certainly not
hypothetical.
Fair enough - I did not realize you had already done all the work that
Serge is just starting out on.
Post by Tim Hockin
Post by Lennart Poettering
The least you could grant us is to have a look at the final APIs we will
have to offer before you already imply that systemd cannot be a valid
implementation of any API people could ever agree on.
Whoah, don't get defensive. I said nothing of the sort. The fact of
the matter is that we do not run systemd, at least in part because of
the monolithic nature. That's unlikely to change in this timescale.
Oh, my. I am not sure what makes you think it is monolithic.
It is not a replacement for any one thing. It is a replacement for a
handful of things that we are not keen to change all at once. That's
all. I have not personally looked at what subsystems are able to be
compiled-out so we could do an incremental changeover, though, so
maybe it can work in different modes? I don't know. I am not
pursuing this anyway, so I am not the person to convince, regardless.
Post by Tim Hockin
What I said was that it would be a shame if we had to invent our own
low-level cgroup daemon just because the "upstream" daemon was too
tightly coupled with systemd.
I have no interest to reimplement systemd as a library, just to make you
happy... I am quite happy with what we already have....
Post by Tim Hockin
This is supposed to be collaborative, not combative.
It certainly sounds *very* different in what you are writing.
Sorry, then. No offense intended. I'm just looking for opportunities
to not-replicate work, if this whole model is going to be thrust upon
me.

Tim
Thomas Gleixner
2013-07-02 23:57:05 UTC
Permalink
Lennart,
Post by Lennart Poettering
Post by Tim Hockin
But that's not my point. It seems pretty easy to make this cgroup
management (in "native mode") a library that can have either a thin
veneer of a main() function, while also being usable by systemd. The
point is to solve all of the problems ONCE. I'm trying to make the
case that systemd itself should be focusing on features and policies
and awesome APIs.
You know, getting this all right isn't easy. If you want to do things
properly, then you need to propagate attribute changes between the units you
manage. You also need something like a scheduler, since a number of
controllers can only be configured under certain external conditions (for
example: the blkio or devices controller use major/minor parameters for
configuring per-device limits. Since major/minor assignments are pretty much
unpredictable these days -- and users probably want to configure things with
friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to
wait for devices to show up before we can configure the parameters.) Soo...
you need a graph of units, where you can propagate things, and schedule things
based on some execution/event queue. And the propagation and scheduling are
closely intermingled.
you are confusing policy and mechanisms.

The access to cgroupfs is mechanism.

The propagation of changes, the scheduling of cgroupfs access and
the correlation to external conditions are policy.

What Tim is asking for is to have a common interface, i.e. a library
which implements the low level access to the cgroupfs mechanism
without imposing systemd defined policies to it (It might implement a
set of common useful policies, but that's a different discussion).

That's definitely not an unreasonable request, because he wants to
implement his own set of policies which are not necessarily the same
as those which are implemented by systemd.
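
To make the mechanism/policy split concrete, here is a sketch of what a mechanism-only layer might look like: it creates groups, reads and writes knobs, and attaches tasks, and decides nothing about when or why. The class `CgroupFS` and its method names are hypothetical and not an existing library's API. The "tasks" file follows the cgroup v1 layout, and on a real cgroupfs mount the knob files already exist and the kernel interprets the writes.

```python
import os


class CgroupFS:
    """Thin mechanism layer over a mounted cgroup (v1-style) hierarchy; no policy."""

    def __init__(self, root):
        self.root = root  # mount point of one hierarchy

    def _path(self, group, name=""):
        return os.path.join(self.root, group, name)

    def create(self, group):
        """Create the group directory (and any missing parents)."""
        os.makedirs(self._path(group), exist_ok=True)

    def write(self, group, knob, value):
        """Write one value to one control file."""
        with open(self._path(group, knob), "w") as f:
            f.write(str(value))

    def read(self, group, knob):
        """Read one control file, with surrounding whitespace stripped."""
        with open(self._path(group, knob)) as f:
            return f.read().strip()

    def attach(self, group, pid):
        """Move a task into the group by writing its pid to the tasks file."""
        self.write(group, "tasks", pid)
```

Propagation, scheduling, and waiting for devices would all sit above this layer, in systemd or in any other manager, each with its own policies.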

You are simply ignoring the fact, that Linux is used in other ways
than those which you are focussed on. That's true for Google's way to
manage its gazillion machines and that's equally true for the other
end of the spectrum which is deep embedded or any other specialized
use case. Just face it: running Linux on your laptop and on some RHT
lab machines is covering about 1% of the use cases.

Nevertheless you repeatedly claim, that systemd is the only way to
deal with system startup and system management, is covering _ALL_ use
cases and the interfaces you expose are sufficient.

Did you ever work on specialized embedded or big data use cases? I
really doubt that, but I might be wrong as usual.

So I invite you to prove that you can beat an existing setup for an
automotive use case with your magic systemd foo. I refund you fully,
if you can beat the mark of a functional system less than 800ms after
reset release on a 200MHz ARM machine. Functional is defined by the
use case requirements and means:

- Basic cgroups management working
- GUI up and running
- Main communication interface (CAN bus) up and running

The rest of the system is starting up after that including a more
complex cgroup management.

According to your claim that systemd is covering everything and some
more, this should take you a few hours. I grant you a full week to
work on that.

The use case Tim is talking about is different, but has similar
constraints which are completely driven by his particular use case
scenario. I'm sure that Tim can persuade his management to set up a
similar contest to prove your expertise on the other extreme of the
Linux world.

Before answering please think about the relevance of your statements
"getting this all right isn't easy", "something like a scheduler",
"users probably want ..." and "stable /dev/disk/by-id/* symlinks" in
those contexts.

Thanks,

tglx
Kay Sievers
2013-07-03 00:44:31 UTC
Permalink
Post by Thomas Gleixner
Post by Lennart Poettering
Post by Tim Hockin
But that's not my point. It seems pretty easy to make this cgroup
management (in "native mode") a library that can have either a thin
veneer of a main() function, while also being usable by systemd. The
point is to solve all of the problems ONCE. I'm trying to make the
case that systemd itself should be focusing on features and policies
and awesome APIs.
You know, getting this all right isn't easy. If you want to do things
properly, then you need to propagate attribute changes between the units you
manage. You also need something like a scheduler, since a number of
controllers can only be configured under certain external conditions (for
example: the blkio or devices controller use major/minor parameters for
configuring per-device limits. Since major/minor assignments are pretty much
unpredictable these days -- and users probably want to configure things with
friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to
wait for devices to show up before we can configure the parameters.) Soo...
you need a graph of units, where you can propagate things, and schedule things
based on some execution/event queue. And the propagation and scheduling are
closely intermingled.
you are confusing policy and mechanisms.
The access to cgroupfs is mechanism.
The propagation of changes, the scheduling of cgroupfs access and
the correlation to external conditions are policy.
What Tim is asking for is to have a common interface, i.e. a library
which implements the low level access to the cgroupfs mechanism
without imposing systemd defined policies to it (It might implement a
set of common useful policies, but that's a different discussion).
That's definitely not an unreasonable request, because he wants to
implement his own set of policies which are not necessarily the same
as those which are implemented by systemd.
You are simply ignoring the fact, that Linux is used in other ways
than those which you are focussed on. That's true for Google's way to
manage its gazillion machines and that's equally true for the other
end of the spectrum which is deep embedded or any other specialized
use case. Just face it: running Linux on your laptop and on some RHT
lab machines is covering about 1% of the use cases.
Nevertheless you repeatedly claim that systemd is the only way to
deal with system startup and system management, is covering _ALL_ use
cases and the interfaces you expose are sufficient.
Did you ever work on specialized embedded or big data use cases? I
really doubt that, but I might be wrong as usual.
So I invite you to prove that you can beat an existing setup for an
automotive use case with your magic systemd foo. I refund you fully,
if you can beat the mark of a functional system less than 800ms after
reset release on a 200MHz ARM machine. Functional is defined by the
- Basic cgroups management working
- GUI up and running
- Main communication interface (CAN bus) up and running
The rest of the system is starting up after that including a more
complex cgroup management.
According to your claim that systemd is covering everything and some
more, this should take you a few hours. I grant you a full week to
work on that.
The use case Tim is talking about is different, but has similar
constraints which are completely driven by his particular use case
scenario. I'm sure that Tim can persuade his management to set up a
similar contest to prove your expertise on the other extreme of the
Linux world.
Before answering please think about the relevance of your statements
"getting this all right isn't easy", "something like a scheduler",
"users probably want ..." and "stable /dev/disk/by-id/* symlinks" in
those contexts.
I don't think anybody needs your money.

But it's sure an improvement over last time when you wanted to use a
"Kantholz" to make your statement.

Thanks,
Kay
Jiri Kosina
2013-07-09 23:12:40 UTC
Post by Kay Sievers
Post by Thomas Gleixner
Post by Lennart Poettering
Post by Tim Hockin
But that's not my point. It seems pretty easy to make this cgroup
management (in "native mode") a library that can have either a thin
veneer of a main() function, while also being usable by systemd. The
point is to solve all of the problems ONCE. I'm trying to make the
case that systemd itself should be focusing on features and policies
and awesome APIs.
You know, getting this all right isn't easy. If you want to do things
properly, then you need to propagate attribute changes between the units you
manage. You also need something like a scheduler, since a number of
controllers can only be configured under certain external conditions (for
example: the blkio or devices controller use major/minor parameters for
configuring per-device limits. Since major/minor assignments are pretty much
unpredictable these days -- and users probably want to configure things with
friendly and stable /dev/disk/by-id/* symlinks anyway -- this requires us to
wait for devices to show up before we can configure the parameters.) Soo...
you need a graph of units, where you can propagate things, and schedule things
based on some execution/event queue. And the propagation and scheduling are
closely intermingled.
you are confusing policy and mechanisms.
The access to cgroupfs is mechanism.
The propagation of changes, the scheduling of cgroupfs access and
the correlation to external conditions are policy.
What Tim is asking for is to have a common interface, i.e. a library
which implements the low level access to the cgroupfs mechanism
without imposing systemd defined policies to it (It might implement a
set of common useful policies, but that's a different discussion).
That's definitely not an unreasonable request, because he wants to
implement his own set of policies which are not necessarily the same
as those which are implemented by systemd.
You are simply ignoring the fact that Linux is used in other ways
than those which you are focussed on. That's true for Google's way to
manage its gazillion machines and that's equally true for the other
end of the spectrum which is deep embedded or any other specialized
use case. Just face it: running Linux on your laptop and on some RHT
lab machines is covering about 1% of the use cases.
Nevertheless you repeatedly claim that systemd is the only way to
deal with system startup and system management, is covering _ALL_ use
cases and the interfaces you expose are sufficient.
Did you ever work on specialized embedded or big data use cases? I
really doubt that, but I might be wrong as usual.
So I invite you to prove that you can beat an existing setup for an
automotive use case with your magic systemd foo. I refund you fully,
if you can beat the mark of a functional system less than 800ms after
reset release on a 200MHz ARM machine. Functional is defined by the
- Basic cgroups management working
- GUI up and running
- Main communication interface (CAN bus) up and running
The rest of the system is starting up after that including a more
complex cgroup management.
According to your claim that systemd is covering everything and some
more, this should take you a few hours. I grant you a full week to
work on that.
The use case Tim is talking about is different, but has similar
constraints which are completely driven by his particular use case
scenario. I'm sure that Tim can persuade his management to set up a
similar contest to prove your expertise on the other extreme of the
Linux world.
Before answering please think about the relevance of your statements
"getting this all right isn't easy", "something like a scheduler",
"users probably want ..." and "stable /dev/disk/by-id/* symlinks" in
those contexts.
I don't think anybody needs your money.
But it's sure an improvement over last time when you wanted to use a
"Kantholz" to make your statement.
Now how about the policy vs. mechanisms part of Thomas' e-mail?
--
Jiri Kosina
SUSE Labs
Andy Lutomirski
2013-06-28 19:18:38 UTC
Post by Tejun Heo
AFAICS, having a userland agent which has overall knowledge of the
hierarchy and enforces structure and limitations is a requirement to
make cgroup generally useable and useful. For systemd based systems,
systemd serving that role isn't too crazy. It's sure gonna have
teething issues at the beginning but it has all the necessary
information to manage workloads on the system.
A valid issue is interoperability between systemd and non-systemd
in another reply but making cgroup generally available is a pretty new
effort and we're still in the process of figuring out what the right
constructs and abstractions are. Hopefully, we'll be able to reach a
common set of abstractions to base things on top in time.
The systemd stuff will break my code, too (although the single hierarchy
by itself won't, I think). I think that the kernel should make whatever
simple changes are needed so that systemd can function without using
cgroups at all. That way users of a different cgroup scheme can turn
off systemd's.

Here was my proposal, which hasn't gotten a clear reply:

http://article.gmane.org/gmane.comp.sysutils.systemd.devel/11424

I've already sent a patch to make /proc/<pid>/task/<tid>/children
available regardless of configuration.
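
For reference, that file holds a space-separated list of the thread's child PIDs, so a manager can walk the process tree without cgroups. A minimal sketch of reading it follows. The function names are hypothetical, and proc_root is a parameter only so the parser can be exercised against a scratch directory.

```python
def parse_children(text):
    """Turn the contents of a `children` file into a list of PIDs."""
    return [int(tok) for tok in text.split()]


def read_children(pid, tid=None, proc_root="/proc"):
    """Return the child PIDs recorded for thread `tid` of process `pid`.

    If `tid` is omitted, the main thread (tid == pid) is used.
    """
    tid = pid if tid is None else tid
    path = f"{proc_root}/{pid}/task/{tid}/children"
    with open(path) as f:
        return parse_children(f.read())
```

The list is only a snapshot: children can fork or exit between the read and any later use of the result.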

--Andy
Serge Hallyn
2013-06-28 19:36:08 UTC
Post by Andy Lutomirski
Post by Tejun Heo
AFAICS, having a userland agent which has overall knowledge of the
hierarchy and enforces structure and limitations is a requirement to
make cgroup generally useable and useful. For systemd based systems,
systemd serving that role isn't too crazy. It's sure gonna have
teething issues at the beginning but it has all the necessary
information to manage workloads on the system.
A valid issue is interoperability between systemd and non-systemd
in another reply but making cgroup generally available is a pretty new
effort and we're still in the process of figuring out what the right
constructs and abstractions are. Hopefully, we'll be able to reach a
common set of abstractions to base things on top in time.
The systemd stuff will break my code, too (although the single hierarchy
by itself won't, I think). I think that the kernel should make whatever
simple changes are needed so that systemd can function without using
cgroups at all. That way users of a different cgroup scheme can turn
off systemd's.
http://article.gmane.org/gmane.comp.sysutils.systemd.devel/11424
Neat. I like that proposal.
Post by Andy Lutomirski
I've already sent a patch to make /proc/<pid>/task/<tid>/children
available regardless of configuration.
-serge