[ANNOUNCE] Minneapolis Cluster Summit, July 29-30

Post by Daniel Phillips
Red Hat and (the former) Sistina Software are pleased to announce that
we will host a two day kickoff workshop on GFS and Cluster
Infrastructure in Minneapolis, July 29 and 30, not too long after OLS.
We call this the "Cluster Summit" because it goes well beyond GFS, and
is really about building a comprehensive cluster infrastructure for
Linux, which will hopefully be a reality by the time Linux 2.8 arrives.
If we want that, we have to start now, and we have to work like fiends,
time is short. We offer as a starting point, functional code for a
half-dozen major, generic cluster subsystems that Sistina has had under
development for several years.

Don't you think it's a little too short-term? I'd rather see the cluster
software that could be merged mid-term on KS (and that seems to be only OCFS2
so far)

Daniel Phillips

2004-07-05 18:42:27 UTC

Hi Christoph,

Post by Christoph Hellwig

Post by Daniel Phillips
Red Hat and (the former) Sistina Software are pleased to announce
that we will host a two day kickoff workshop on GFS and Cluster
Infrastructure in Minneapolis, July 29 and 30, not too long after
OLS. We call this the "Cluster Summit" because it goes well beyond
GFS, and is really about building a comprehensive cluster
infrastructure for Linux, which will hopefully be a reality by the
time Linux 2.8 arrives. If we want that, we have to start now, and
we have to work like fiends, time is short. We offer as a starting
point, functional code for a half-dozen major, generic cluster
subsystems that Sistina has had under development for several
years.

Don't you think it's a little too short-term?

Not really. It's several months later than it should have been if
anything.

Post by Christoph Hellwig
I'd rather see the
cluster software that could be merged mid-term on KS (and that seems
to be only OCFS2 so far)

Don't you think we ought to take a look at how OCFS and GFS might share
some of the same infrastructure, for example, the DLM and cluster
membership services?

"Think twice, merge once"

Regards,

Daniel

Chris Friesen

2004-07-05 19:08:18 UTC

Post by Daniel Phillips
Don't you think we ought to take a look at how OCFS and GFS might share
some of the same infrastructure, for example, the DLM and cluster
membership services?

For cluster membership, you might consider looking at the OpenAIS CLM portion.
It would be nice if this type of thing was unified across more than just
filesystems.

Chris

Daniel Phillips

2004-07-05 20:29:54 UTC

Post by Daniel Phillips
Don't you think we ought to take a look at how OCFS and GFS might
share some of the same infrastructure, for example, the DLM and
cluster membership services?

For cluster membership, you might consider looking at the OpenAIS CLM
portion. It would be nice if this type of thing was unified across
more than just filesystems.

My own project is a block driver, that's not a filesystem, right?
Cluster membership services as implemented by Sistina are generic,
symmetric and (hopefully) raceless. See:

http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/preslan/preslan.pdf

There is much overlap between the OpenAIS and Sistina's Symmetric
Cluster Architecture. You are right, we do need to get together.

By the way, how do I get your source code if I don't agree with the
BitKeeper license?

Regards,

Daniel

Steven Dake

2004-07-07 22:55:51 UTC

Post by Daniel Phillips
Don't you think we ought to take a look at how OCFS and GFS might
share some of the same infrastructure, for example, the DLM and
cluster membership services?

For cluster membership, you might consider looking at the OpenAIS CLM
portion. It would be nice if this type of thing was unified across
more than just filesystems.

My own project is a block driver, that's not a filesystem, right?
Cluster membership services as implemented by Sistina are generic,
http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/preslan/preslan.pdf
There is much overlap between the OpenAIS and Sistina's Symmetric
Cluster Architecture. You are right, we do need to get together.
By the way, how do I get your source code if I don't agree with the
BitKeeper license?

Daniel

If you mean how do you get source code to the openais project without
bk, it is available as a nightly tarball download from
developer.osdl.org:

http://developer.osdl.org/cherry/openais

If you want to contribute to openais, you can still contribute by using
diff by sending patches to:

***@lists.osdl.org

Regards
-steve

Post by Daniel Phillips
Regards,
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Daniel Phillips

2004-07-08 01:30:17 UTC

Post by Chris Friesen
For cluster membership, you might consider looking at the OpenAIS
CLM portion. It would be nice if this type of thing was unified
across more than just filesystems.

Whoops, I just noticed that that link is way wrong, I must have been
asleep when I posted it. This is the correct one:

http://people.redhat.com/~teigland/sca.pdf

and

http://sources.redhat.com/cluster/cman/

Not that the other isn't interesting, it's just a little dated and
GFS-specific.

Regards,

Daniel

Lars Marowsky-Bree

2004-07-05 19:12:04 UTC

On 2004-07-05T14:42:27,

Don't you think we ought to take a look at how OCFS and GFS might sha=

re=20

some of the same infrastructure, for example, the DLM and cluster=20
membership services?

Indeed. If your efforts in joining the infrastructure are more
successful than ours have been, more power to you ;-)

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Daniel Phillips

2004-07-05 20:27:51 UTC

Hi Lars,

Post by Lars Marowsky-Bree
On 2004-07-05T14:42:27,

Post by Daniel Phillips
Don't you think we ought to take a look at how OCFS and GFS might
share some of the same infrastructure, for example, the DLM and
cluster membership services?

Indeed. If your efforts in joining the infrastructure are more
successful than ours have been, more power to you ;-)

What problems did you run into?

On a quick read-through, it seems quite straightforward for quorum,
membership and distributed locking.

The idea of having more than one node fencing system running at the same
time seems deeply scary, we'd better make some effort to come up with
something common.

Regards,

Daniel

Lars Marowsky-Bree

2004-07-06 07:34:45 UTC

On 2004-07-05T16:27:51,

Post by Lars Marowsky-Bree
Indeed. If your efforts in joining the infrastructure are more
successful than ours have been, more power to you ;-)

=20
What problems did you run into?

The problems were mostly political. Maybe we tried to push too early,
but 1-3 years back, people weren't really interested in agreeing on som=
e
common components or APIs. In particular a certain Linux vendor didn't
even join the group ;-) And the "industry" was very reluctant too. Whic=
h
meant that everybody spend ages talking and not much happening.

However, times may have changed, and hopefully for the better. The push
to get one solution included into the Linux kernel may be enough to
convince people that this time its for real...

There still is the Open Clustering Framework group though, which is a
sub-group of the FSG and maybe the right umbrella to put this under, to
stay away from the impression that it's a single vendor pushing.

If we could revive that and make real progress, I'd be as happy as a
well fed penguin.

Now with OpenAIS on the table, the GFS stack, the work already done by
OCF in the past (which is, admittedly, depressingly little, but I quite
like the Resource Agent API for one) et cetera, there may be a good
chance.

I'll try to get travel approval to go to the meeting.=20

BTW, is the mailing list working? I tried subscribing when you first
announced it, but the subscription request hasn't been approved yet...
Maybe I shouldn't have subscribed with the suse.de address ;-)

Post by Daniel Phillips
On a quick read-through, it seems quite straightforward for quorum,=20
membership and distributed locking.

Believe me, you'd be amazed to find out how long you can argue on how t=
o
identify a node alone - node name, node number (sparse or continuous?),
UUID...? ;-)

And, how do you define quorum, and is it always needed? Some algorithms
don't need quorum (ie, election algorithms can do fine without), so a
membership service which only works with quorum isn't the right
component etc...

Post by Daniel Phillips
The idea of having more than one node fencing system running at the s=

ame=20

Post by Daniel Phillips
time seems deeply scary, we'd better make some effort to come up with=

=20

Post by Daniel Phillips
something common.

Yes. This is actually an important point, and fencing policies are also
reasonably complex. The GFS stack seems to tie fencing quite deeply int=
o
the system (which is understandable, since you always have shared
storage, otherwise a node wouldn't be part of the GFS domain in the
first place).

However, the new dependency based cluster resource manager we are
writing right now (which we simply call "Cluster Resource Manager" for
lack of creativity ;) decides whether or not it needs to fence a node
based on the resources in the cluster - if it isn't affecting the
resources we can run on the remaining nodes, or none of the resources
requires node-level fencing, no such operation will be done.=20

This has advantages in larger clusters (where, if split, each partition
could still continue to run resources which are unaffected by the split
even the other nodes cannot be fenced), in shared nothing clusters or
resources which are self-fencing and do not need STONITH etc.

The ties between membership, quorum and fencing are not as strong in
these scenarios, at least not mandatory. So a stack which enforced
fencing at these levels, and w/o coordinating with the CRM first, would
not work out.

And by pushing for inclusion into the main kernel, you'll also raise al=
l
sleeping zom^Wbeauties. I hope you have a long breath for the
discussions ;-)

There's lots of work there.

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Daniel Phillips

2004-07-06 21:34:51 UTC

Hi Lars,

Post by Lars Marowsky-Bree
On 2004-07-05T16:27:51,

Post by Lars Marowsky-Bree
Indeed. If your efforts in joining the infrastructure are more
successful than ours have been, more power to you ;-)

What problems did you run into?

The problems were mostly political. Maybe we tried to push too early,
but 1-3 years back, people weren't really interested in agreeing on
some common components or APIs. In particular a certain Linux vendor
didn't even join the group ;-)

*blush*

Post by Lars Marowsky-Bree
And the "industry" was very reluctant
too. Which meant that everybody spend ages talking and not much
happening.

We're showing up with loads of Sistina code this time. It's up to
everybody else to ante up, and yes, I see there's more code out there.
It's going to be quite a summer reading project.

Post by Lars Marowsky-Bree
However, times may have changed, and hopefully for the better. The
push to get one solution included into the Linux kernel may be enough
to convince people that this time its for real...

It's for real, no question. There are at least two viable GPL code
bases already, GFS and Lustre, with OCFS2 coming up fast. And there
are several commercial (binary/evil) cluster filesystems in service
already, not that Linus should care about them, but they do lend
credibility.

Post by Lars Marowsky-Bree
There still is the Open Clustering Framework group though, which is a
sub-group of the FSG and maybe the right umbrella to put this under,
to stay away from the impression that it's a single vendor pushing.

Oops, another code base to read ;-)

Post by Lars Marowsky-Bree
If we could revive that and make real progress, I'd be as happy as a
well fed penguin.

Red Hat is solidly behind this as a _community_ effort.

Post by Lars Marowsky-Bree
Now with OpenAIS on the table, the GFS stack, the work already done
by OCF in the past (which is, admittedly, depressingly little, but I
quite like the Resource Agent API for one) et cetera, there may be a
good chance.
I'll try to get travel approval to go to the meeting.

:-)

Post by Lars Marowsky-Bree
BTW, is the mailing list working? I tried subscribing when you first
announced it, but the subscription request hasn't been approved
yet... Maybe I shouldn't have subscribed with the suse.de address ;-)

Perhaps it has more to do with a cross-channel grudge? <grin>

Just poke Alasdair, you know where to find him.

Post by Daniel Phillips
On a quick read-through, it seems quite straightforward for quorum,
membership and distributed locking.

Believe me, you'd be amazed to find out how long you can argue on how
to identify a node alone - node name, node number (sparse or
continuous?), UUID...? ;-)

I can believe it. What I have just done with my cluster snapshot target
over the last couple of weeks is, removed _every_ dependency on cluster
infrastructure and moved the one remaining essential interface to user
space. In this way the infrastructure becomes pluggable from the
cluster block device's point of view and you can run the target without
any cluster infrastructure at all if you want (just dmsetup and a
utility for connecting a socket to the target). This is a general
technique that we're now applying to a second block driver. It's a
tiny amount of kernel and userspace code which I will post pretty soon.
With this refactoring, the cluster block driver shrank to less than
half its former size with no loss of functionality.

The nice thing is, I get to use the existing (SCA) infrastructure, but I
don't have any dependency on it.

Post by Lars Marowsky-Bree
And, how do you define quorum, and is it always needed? Some
algorithms don't need quorum (ie, election algorithms can do fine
without), so a membership service which only works with quorum isn't
the right component etc...

Oddly enough, there has been much discussion about quorum here as well.
This must be pluggable, and we must be able to handle multiple,
independent clusters, with a single node potentially belonging to more
than one at the same time. Please see this, for a formal writeup on
our 2.6 code base:

http://people.redhat.com/~teigland/sca.pdf

Is this the key to the grand, unified quorum system that will do every
job perfectly? Good question, however I do know how to make it
pluggable for my own component, at essentially zero cost. This makes
me optimistic that we can work out something sensible, and that perhaps
it's already a solved problem.

It looks like fencing is more of an issue, because having several node
fencing systems running at the same time in ignorance of each other is
deeply wrong. We can't just wave our hands at this by making it
pluggable, we need to settle on one that works and use it. I'll humbly
suggest that Sistina is furthest along in this regard.

Post by Daniel Phillips
The idea of having more than one node fencing system running at the
same time seems deeply scary, we'd better make some effort to come
up with something common.

Yes. This is actually an important point, and fencing policies are
also reasonably complex. The GFS stack seems to tie fencing quite
deeply into the system (which is understandable, since you always
have shared storage, otherwise a node wouldn't be part of the GFS
domain in the first place).

Oops, should have read ahead ;) The DLM is also tied deeply into the
GFS stack, but that factors out nicely, and in fact, GFS can currently
use two completely different fencing systems (GULM vs SCA-Fence). I
think we can sort this out.

Post by Lars Marowsky-Bree
However, the new dependency based cluster resource manager we are
writing right now (which we simply call "Cluster Resource Manager"
for lack of creativity ;) decides whether or not it needs to fence a
node based on the resources in the cluster - if it isn't affecting
the resources we can run on the remaining nodes, or none of the
resources requires node-level fencing, no such operation will be
done.

Cluster resource management is the least advanced of the components that
our Red Hat Sistina group has to offer, mainly because it is seen as a
matter of policy, and so the pressing need at this state is to provide
suitable hooks. Lon Hohberger is working on system that works with the
SCA framework (Magma). The preexisting Red Hat cluster team decided to
re-roll their whole cluster suite within the new framework. Perhaps
you would like to take a look, and tell us why this couldn't possibly
work for you? (Or maybe we need to get you drunk first...)

Post by Lars Marowsky-Bree
This has advantages in larger clusters (where, if split, each
partition could still continue to run resources which are unaffected
by the split even the other nodes cannot be fenced), in shared
nothing clusters or resources which are self-fencing and do not need
STONITH etc.

"STOMITH" :) Yes, exactly. Global load balancing is another big item,
i.e., which node gets assigned the job of running a particular service,
which means you need to know how much of each of several different
kinds of resources a particular service requires, and what the current
resource usage profile is for each node on the cluster. Rik van Riel
is taking a run at this.

It's a huge, scary problem. We _must_ be able to plug in different
solutions, all the way from completely manual to completely automagic,
and we have to be able to handle more than one at once.

Post by Lars Marowsky-Bree
The ties between membership, quorum and fencing are not as strong in
these scenarios, at least not mandatory. So a stack which enforced
fencing at these levels, and w/o coordinating with the CRM first,
would not work out.

Yes, again, fencing looks like the one we have to fret about. The
others will be a lot easier to mix and match.

Post by Lars Marowsky-Bree
And by pushing for inclusion into the main kernel, you'll also raise
all sleeping zom^Wbeauties. I hope you have a long breath for the
discussions ;-)

You know I do!

Post by Lars Marowsky-Bree
There's lots of work there.

Indeed, and I didn't do any work today yet, due to answering email.

Incidently, there is already a nice crosssection of the cluster
community on the way to sunny Minneapolis for the July meeting. We've
reached about 50% capacity, and we have quorum, I think :-)

Regards,

Daniel

Lars Marowsky-Bree

2004-07-07 18:16:50 UTC

On 2004-07-06T17:34:51,

And the "industry" was very reluctant=20
too. Which meant that everybody spend ages talking and not much
happening.

We're showing up with loads of Sistina code this time. It's up to=20
everybody else to ante up, and yes, I see there's more code out there=

=2E =20

It's going to be quite a summer reading project.

Yeah, I wish you the best. There's always been quite a bit of code to
show, but that alone didn't convince people ;-) I've certainly grown a
bit more experienced / cynical during that time. (Which, according to
Oscar Wilde, is the same anyway ;)

It's for real, no question. There are at least two viable GPL code=20
bases already, GFS and Lustre, with OCFS2 coming up fast.

Yes, they have some common requirements on the kernel VFS layer, though
Lustre certainly has the most extensive demands. I hope someone from CF=
S
Inc can make it to your summit.

I can believe it. What I have just done with my cluster snapshot tar=

get=20

over the last couple of weeks is, removed _every_ dependency on clust=

er=20

infrastructure and moved the one remaining essential interface to use=

r=20

space.

Is there a KS presentation on this? I didn't get invited to KS and will
just be allowed in for OLS, but I'll be around town already...

Oddly enough, there has been much discussion about quorum here as wel=

l. =20

This must be pluggable, and we must be able to handle multiple,=20
independent clusters, with a single node potentially belonging to mor=

e=20

than one at the same time. Please see this, for a formal writeup on=20
=20
http://people.redhat.com/~teigland/sca.pdf

Thanks for the pointer, this is a good read.

It looks like fencing is more of an issue, because having several nod=

e=20

fencing systems running at the same time in ignorance of each other i=

s=20

deeply wrong. We can't just wave our hands at this by making it=20
pluggable, we need to settle on one that works and use it. I'll humb=

ly=20

suggest that Sistina is furthest along in this regard.

Your fencing system is fine with me; based on the assumption that you
always have to fence a failed node, you are doing the right thing.
However, the issues are more subtle when this is no longer true, and in
a 1:1 how do you arbitate who is allowed to fence?

Cluster resource management is the least advanced of the components t=

hat=20

our Red Hat Sistina group has to offer, mainly because it is seen as =

a=20

matter of policy, and so the pressing need at this state is to provid=

e=20

suitable hooks.
"STOMITH" :) Yes, exactly. Global load balancing is another big ite=

m,=20

i.e., which node gets assigned the job of running a particular servic=

e,=20

which means you need to know how much of each of several different=20
kinds of resources a particular service requires, and what the curren=

t=20

resource usage profile is for each node on the cluster. Rik van Riel=

=20

is taking a run at this.

Right, cluster resource management is one of the things where I'm quite
happy with the approach the new heartbeat resource manager is heading
down (or up, I hope ;).

It's a huge, scary problem. We _must_ be able to plug in different=20
solutions, all the way from completely manual to completely automagic=

,=20

and we have to be able to handle more than one at once.

You can plug multiple ones as long as they are managing independent
resources, obviously. However, if the CRM is the one which ultimately
decides whether a node needs to be fenced or not - based on its
knowledge of which resources it owns or could own - this gets a lot mor=
e
scary still...

Yes, again, fencing looks like the one we have to fret about. The=20
others will be a lot easier to mix and match.

Mostly, yes. Unless you (like some) require quorum to report a cluster
membership, which some implementations do.

Incidently, there is already a nice crosssection of the cluster=20
community on the way to sunny Minneapolis for the July meeting. We'v=

e=20

reached about 50% capacity, and we have quorum, I think :-)

Uhm, do I have to be frightened of being fenced? ;)

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Daniel Phillips

2004-07-08 01:14:07 UTC

Post by Lars Marowsky-Bree
And the "industry" was very reluctant
too. Which meant that everybody spend ages talking and not much
happening.

We're showing up with loads of Sistina code this time. It's up to
everybody else to ante up, and yes, I see there's more code out
there. It's going to be quite a summer reading project.

Yeah, I wish you the best. There's always been quite a bit of code to
show, but that alone didn't convince people ;-) I've certainly grown
a bit more experienced / cynical during that time. (Which, according
to Oscar Wilde, is the same anyway ;)

OK, what I've learned from the discussion so far is, we need to avoid
getting stuck too much on the HA aspects and focus more on the
cluster/performance side for now. There are just too many entrenched
positions on failover. Even though every component of the cluster is
designed to fail over, that's just a small part of what we have to deal
with:

- Cluster Volume management
- Cluster configuration management
- Cluster membership/quorum
- Node Fencing
- Parallel cluster filesystems with local semantics
- Distributed Locking
- Cluster mirror block device
- Cluster snapshot block device
- Cluster administration interface, including volume managment
- Cluster resource balancing
- bits I forgot to mention

Out of that, we need to pick the three or four items we're prepared to
address immediately, that we can obviously share between at least two
known cluster filesystems, and get them onto lkml for peer review.
Trying to push the whole thing as one lump has never worked for
anybody, and won't work in this case either. For example, the DLM is
fairly non-controversial, and important in terms of performance and
reliability. Let's start with that.

Furthermore, nobody seems interested in arguing about the cluster block
devices either, so lets just discuss how they work and get them out of
the way.

Then let's tackle the low level infrastructure, such as CCS (Cluster
Configuration System) that does a simple job, that is, it distributes
configuration files racelessly.

I heard plenty of fascinating discussion of quorum strategies last
night, and have a number of papers to read as a result. But that's a
diversion: it can and must be pluggable. We just need to agree on how
the plugs work, a considerably less ambitious task.

In general, the principle is: the less important it is, the more
argument there will be about it. Defer that, make it pluggable, call
it policy, push it to user space, and move on. We need to agree on the
basics so that we can manage network volumes with cluster filesystems
on top of them.

Post by Daniel Phillips
I can believe it. What I have just done with my cluster snapshot
target over the last couple of weeks is, removed _every_ dependency
on cluster infrastructure and moved the one remaining essential
interface to user space.

Is there a KS presentation on this? I didn't get invited to KS and
will just be allowed in for OLS, but I'll be around town already...

There will be a BOF at OLS, "Cluster Infrastructure". Since I didn't
get a KS invite either and what remains is more properly lkml stuff
anyway, I will go canoing with Matt O'Keefe during KS as planned. We
already did the necessary VFS fixups over the last year (save the
non-critical flock patch, which is now in play) so there is nothing
much left to beg Linus for. There are additional VFS hooks that would
be nice to have for optimization, but they can wait, people will
appreciate them more that way ;)

The non-vfs cluster infrastructure just uses the normal module API,
except for a couple of places in the DM cluster block devices where
I've allowed myself some creative license, easily undone. Again, this
is lkml material, not KS stuff.

Post by Daniel Phillips
It looks like fencing is more of an issue, because having several
node fencing systems running at the same time in ignorance of each
other is deeply wrong. We can't just wave our hands at this by
making it pluggable, we need to settle on one that works and use
it. I'll humbly suggest that Sistina is furthest along in this
regard.

Your fencing system is fine with me; based on the assumption that you
always have to fence a failed node, you are doing the right thing.
However, the issues are more subtle when this is no longer true, and
in a 1:1 how do you arbitate who is allowed to fence?

Good question. Since two-node clusters are my primary interest at the
moment, I need some answers. I think the current plan is: they try to
fence each other, winner take all. Each node will introspect to decide
if it's in good enough shape to do the job itself, then go try to fence
the other one. Alternatively, they can be configured so that one has
more votes than the other, if somebody wants that broken arrangement.

This is my dim recollection, I'll have more to say when I've actually
hooked my stuff up to it. There are others with plenty of experience
in this, see below.

Post by Daniel Phillips
Cluster resource management is the least advanced of the components
that our Red Hat Sistina group has to offer, mainly because it is
seen as a matter of policy, and so the pressing need at this state
is to provide suitable hooks.
"STOMITH" :) Yes, exactly. Global load balancing is another big
item, i.e., which node gets assigned the job of running a
particular service, which means you need to know how much of each
of several different kinds of resources a particular service
requires, and what the current resource usage profile is for each
node on the cluster. Rik van Riel is taking a run at this.

Right, cluster resource management is one of the things where I'm
quite happy with the approach the new heartbeat resource manager is
heading down (or up, I hope ;).

Combining heartbeat and resource management sounds like a good idea.
Currently, we have them separate and since I have not tried it myself
yet, I'll reserve comment. Dave Teigland would be more than happy to
wax poetic, though.

Post by Daniel Phillips
It's a huge, scary problem. We _must_ be able to plug in different
solutions, all the way from completely manual to completely
automagic, and we have to be able to handle more than one at once.

We do not see the CRM as being involved in fencing at present, though I
can see why perhaps it ought to be. The resource manager that Lon
Hohberger is cooking up is scriptable and rule-driven. I'm sure we
could spend 100% of the available time on that alone. My strategy is,
I send my manually-configurable cluster bits to Lon and he hooks them
in so everything is automagic, then I look at how much the end result
sucks/doesn't suck.

There's some philosophy at work here: I feel that any cluster device
that requires elaborate infrastructure and configuration to run is
broken. If you can set the cluster devices up manually and they depend
only on existing kernel interfaces, they're more likely to get unit
testing. At the same time, these devices have to fit well into a
complex infrastructure, therefore the manual interface can be driven
equally well by a script or C program, and there is one tiny but
crucial additional hook to allow for automatic reconnection to the
cluster if something bad happens, or if the resource manager just feels
the need to reorganize things.

So while I'm rambling here, I'll mention that the resource manager (or
anybody else) can just summarily cut the block target's pipe and the
block target will politely go ask for a new one. No IOs will be
failed, nothing will break, no suspend needed, just one big breaker
switch to throw. This of course depends on the target using a pipe
(socket) to communicate with the cluster, but even if I do switch to
UDP, I'll still keep at least one pipe around, just because it makes
the target so easy to control.

It didn't start this way. The first prototype had a couple thousand
lines of glue code to work with various possible infrastructures. Now
that's all gone and there are just two pipes left, one to local user
space for cluster management and the other to somewhere out on the
cluster for synchronization. It's now down to 30% of the original size
and runs faster as a bonus. All cluster interfaces are "read/write",
except for one ioctl to reconnect a broken pipe.

Post by Daniel Phillips
Incidently, there is already a nice crosssection of the cluster
community on the way to sunny Minneapolis for the July meeting.
We've reached about 50% capacity, and we have quorum, I think :-)

Uhm, do I have to be frightened of being fenced? ;)

Only if you drink too much of that kluster Koolaid

Regards,

Daniel

Lars Marowsky-Bree

2004-07-08 09:10:43 UTC

On 2004-07-07T21:14:07,

OK, what I've learned from the discussion so far is, we need to avoid=

=20

getting stuck too much on the HA aspects and focus more on the=20
cluster/performance side for now. There are just too many entrenched=

=20

positions on failover.

Well, first, failover is not all of HA. But that's a different diversio=
n
again.

Out of that, we need to pick the three or four items we're prepared t=

o=20

address immediately, that we can obviously share between at least two=

=20

known cluster filesystems, and get them onto lkml for peer review. =20

Ok.

For example, the DLM is fairly non-controversial, and important in
terms of performance and reliability. Let's start with that.

I doubt that assessment, the DLM is going to be somewhat controversial
already and requires the dragging in of membership, inter-node
messaging, fencing and quorum. The problem is that you cannot easily
separate out the different pieces.

I'd humbly suggest to start with the changes in the VFS layers which th=
e
CFS's of the different kinds require, regardless of which infrastructur=
e
they use.

Of all the cluster-subsystems, the fencing system is likely the most
important. If the various implementations don't step on eachothers toes
there, the duplication of membership/messaging/etc is only inefficient,
but not actively harmful.

I heard plenty of fascinating discussion of quorum strategies last=20
night, and have a number of papers to read as a result. But that's a=

=20

diversion: it can and must be pluggable. We just need to agree on ho=

w=20

the plugs work, a considerably less ambitious task.

When you argue whether or not you can mandate quorum for a given cluste=
r
implementation, and which layers of the cluster are allowed to require
quorum (some will refuse to even tell you the membership without quorum=
;
some will require quorum before they fence, others will recover quorum
by fencing), this discussion is fairly complex.

Again, let's see what kernel hooks these require, and defer all the res=
t
of the discussions as far as possible.

it policy, push it to user space, and move on. We need to agree on t=

he=20

basics so that we can manage network volumes with cluster filesystems=

=20

on top of them.

Ah, that in itself is a very data-centric point of view and not exactly
applicable to the needs of shared-nothing clusters. (I'm not trying to
nitpick, just trying to make you aware of all the hidden assumptions yo=
u
may not be aware of yourself.) Of course, this is perfectly fine for
something such as GFS (which, being SAN based, of course requires
these), but a cluster infrastructure in the kernel may not be limitted
to this.

Post by Lars Marowsky-Bree
Is there a KS presentation on this? I didn't get invited to KS and
will just be allowed in for OLS, but I'll be around town already...

There will be a BOF at OLS, "Cluster Infrastructure". Since I didn't=

=20

get a KS invite either and what remains is more properly lkml stuff=20
anyway, I will go canoing with Matt O'Keefe during KS as planned.=20

Ah, okay.

Post by Lars Marowsky-Bree
Your fencing system is fine with me; based on the assumption that y=

Post by Lars Marowsky-Bree
always have to fence a failed node, you are doing the right thing.
However, the issues are more subtle when this is no longer true, an=

Post by Lars Marowsky-Bree
in a 1:1 how do you arbitate who is allowed to fence?

Good question. Since two-node clusters are my primary interest at th=

e=20

moment, I need some answers.=20

Two-node clusters are reasonably easy, true.

I think the current plan is: they try to fence each other, winner tak=

all. Each node will introspect to decide if it's in good enough shap=

to do the job itself, then go try to fence the other one.

Ok, this is essentially what heartbeat does, but it gets more complex
with >2 nodes. In which case your cluster block device is going to run
into interesting synchronization issues, too, I'd venture. (Or at least
drbd does, where we look at replicating across >2 nodes.)

Post by Lars Marowsky-Bree
resources, obviously. However, if the CRM is the one which ultimate=

Post by Lars Marowsky-Bree
decides whether a node needs to be fenced or not - based on its
knowledge of which resources it owns or could own - this gets a lot
more scary still...

We do not see the CRM as being involved in fencing at present, though=

I=20

can see why perhaps it ought to be. The resource manager that Lon=20
Hohberger is cooking up is scriptable and rule-driven.=20

=46rankly, I'm kind of disappointed; why are you cooking up your own on=
ce
more? When we set out to write a new dependency-based flexible resource
manager, we explicitly made it clear that it wasn't just meant to run o=
n
top of heartbeat, but in theory on top of any cluster infrastructure.

I know this is the course of Open Source development, and that
"community project" basically means "my wheel be better than your wheel=
,
and you are allowed to get behind it after we are done, but don't
interfere before that", but I'd have expected some discussions or at
least solicitation of them on the established public mailing lists, jus=
t
to keep up the pretense ;-)

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

David Teigland

2004-07-08 10:53:38 UTC

Post by Lars Marowsky-Bree
Of all the cluster-subsystems, the fencing system is likely the most
important. If the various implementations don't step on eachothers toes
there, the duplication of membership/messaging/etc is only inefficient,
but not actively harmful.

I'm afraid the fencing issue has been rather misrepresented. Here's what we're
doing (a lot of background is necessary I'm afraid.) We have a symmetric,
kernel-based, stand-alone cluster manager (CMAN) that has no ties to anything
else whatsoever. It'll simply run and answer the question "who's in the
cluster?" by providing a list of names/nodeids.

So, if that's all you want you can just run cman on all your nodes and it'll
tell you who's in the cluster (kernel and userland api's). CMAN will also do
generic callbacks to tell you when the membership has changed. Some people can
stop reading here.

In the event of network partitions you can obviously have two cman clusters
form independently (i.e. "split-brain"). Some people care about this. Quorum
is a trivial true/false property of the cluster. Every cluster member has a
number of votes and the cluster itself has a number of expected votes. Using
these simple values, cman does a quick computation to tell you if the cluster
has quorum. It's a very standard way of doing things -- we modelled it
directly off the VMS-cluster style. Whether you care about this quorum value
or what you do with it are beside the point. Some may be interested in
discussing how cman works and participating in further development; if so go
ahead and ask on linux-***@redhat.com. We've been developing and using
cman for 3-4 years. Are there other valid approaches? of course. Is cman
suitable for many people? yes. Suitable for everyone? no.

(see http://sources.redhat.com/cluster/ for patches and mailing list)

What about the DLM? The DLM we've developed is again modelled exactly after
that in VMS-clusters. It depends on cman for the necessary clustering input.
Note that it uses the same generic cman api's as any other system. Again, the
DLM is utterly symmetric; there is no server or master node involved. Is this
DLM suitable for many people? yes. For everyone? no. (Right now gfs and clvm
are the primary dlm users simply because those are the other projects our group
works on. DLM is in no way specific to either of those.)

What about Fencing? Fencing is not a part of the cluster manager, not a part
of the dlm and not a part of gfs. It's an entirely independent system that
runs on its own in userland. It depends on cman for cluster information just
like the dlm or gfs does. I'll repeat what I said on the linux-cluster mailing
list:

--
Fencing is a service that runs on its own in a CMAN cluster; it's entirely
independent from other services. GFS simply checks to verify fencing is
running before allowing a mount since it's especially dangerous for a mount to
succeed without it.

As soon as a node joins a fencing domain it will be fenced by another domain

Post by Lars Marowsky-Bree
cman_tool join (joins the cluster)
fence_tool join (starts fenced which joins the default fence domain)

it will be fenced by another fence domain member if it fails. So, you simply
need to configure your nodes to run fence_tool join after joining the cluster
if you want fencing to happen. You can add any checks later on that you think
are necessary to be sure that the node is in the fence domain.

Running fence_tool leave will remove a node cleanly from the fence domain (it
won't be fenced by other members.)
--

This fencing system is suitable for us in our gfs/clvm work. It's probably
suitable for others, too. For everyone? no. Can be improved with further
development? yes. A central or difficult issue? not really. Again, no need to
look at the dlm or gfs or clvm to work with this fencing system.

--
Dave Teigland <***@redhat.com>

Chris Friesen

2004-07-08 14:14:40 UTC

Post by David Teigland
I'm afraid the fencing issue has been rather misrepresented. Here's what we're
doing (a lot of background is necessary I'm afraid.) We have a symmetric,
kernel-based, stand-alone cluster manager (CMAN) that has no ties to anything
else whatsoever. It'll simply run and answer the question "who's in the
cluster?" by providing a list of names/nodeids.
So, if that's all you want you can just run cman on all your nodes and it'll
tell you who's in the cluster (kernel and userland api's). CMAN will also do
generic callbacks to tell you when the membership has changed. Some people can
stop reading here.

I'm curious--this seems to be exactly what the cluster membership portion of the
SAF spec provides. Would it make sense to contribute to that portion of
OpenAIS, then export the CMAN API on top of it for backwards compatibility?

It just seems like there are a bunch of different cluster messaging, membership,
etc. systems, and there is a lot of work being done in parallel with different
implementations of the same functionality. Now that there is a standard
emerging for clustering (good or bad, we've got people asking for it) would it
make sense to try and get behind that standard and try and make a reference
implementation?

You guys are more experienced than I, but it seems a bit of a waste to see all
these projects re-inventing the wheel.

Chris

David Teigland

2004-07-08 16:06:22 UTC

Post by David Teigland
I'm afraid the fencing issue has been rather misrepresented. Here's what
we're doing (a lot of background is necessary I'm afraid.) We have a
symmetric, kernel-based, stand-alone cluster manager (CMAN) that has no ties
to anything else whatsoever. It'll simply run and answer the question
"who's in the cluster?" by providing a list of names/nodeids.
So, if that's all you want you can just run cman on all your nodes and it'll
tell you who's in the cluster (kernel and userland api's). CMAN will also
do generic callbacks to tell you when the membership has changed. Some
people can stop reading here.

That's definately worth investigating. If the SAF API is only of interest in
userland, then perhaps a library can translate between the SAF api and the
existing interface cman exports to userland. We'd welcome efforts to make cman
itself more compatible with SAF, too. We're not very familiar with it, though.

Post by Chris Friesen
It just seems like there are a bunch of different cluster messaging,
membership, etc. systems, and there is a lot of work being done in parallel
with different implementations of the same functionality. Now that there is
a standard emerging for clustering (good or bad, we've got people asking for
it) would it make sense to try and get behind that standard and try and make
a reference implementation?
You guys are more experienced than I, but it seems a bit of a waste to see
all these projects re-inventing the wheel.

Sure, we're happy to help make this code more useful to others. We wrote this
for a very immediate and practical reason of course -- to support gfs, clvm,
dlm, etc, but always expected it would be used more broadly. We've not done a
lot of work with it lately since as I mentioned it was begun years ago.

--
Dave Teigland <***@redhat.com>

Daniel Phillips

2004-07-08 18:22:19 UTC

Hi Dave,

We have a symmetric, kernel-based, stand-alone cluster manager (CMAN)
that has no ties to anything else whatsoever. It'll simply run and
answer the question "who's in the cluster?" by providing a list of
names/nodeids.

While we're in here, could you please explain why CMAN needs to be
kernel-based? (Just thought I'd broach the question before Christoph
does.)

Regards,

Daniel

Steven Dake

2004-07-08 19:41:21 UTC

Post by Daniel Phillips
Hi Dave,

While we're in here, could you please explain why CMAN needs to be
kernel-based? (Just thought I'd broach the question before Christoph
does.)
Regards,
Daniel

Daniel,

I have that same question as well. I can think of several
disadvantages:

1) security faults in the protocol can crash the kernel or violate
system security
2) secure group communication is difficult to implement in kernel
- secure group key protocols can be implemented fairly easily in
userspace using packages like openssl. Implementing these
protocols in kernel will prove to be very complex.
3) live upgrades are much more difficult with kernel components
4) a standard interface (the SA Forum AIS) is not being used,
disallowing replaceability of components. This is a big deal for
people interested in clustering that dont want to be locked into
a partciular implementation.
5) dlm, fencing, cluster messaging (including membership) can be done
in userspace, so why not do it there.
6) cluster services for the kernel and cluster services for applications
will fork, because SA Forum AIS will be chosen for application
level services.
7) faults in the protocols can bring down all of Linux, instead of one
cluster service on one node.
8) kernel changes require much longer to get into the field and are
much more difficult to distribute. userspace applications are much
simpler to unit test, qualify, and release.

The advantages are:
interrupt driven timers
some possible reduction in latency related to the cost of executing a
system call when sending messages (including lock messages)

I would like to share with you the efforts of the industry standards
body Service Availability Forum (www.saforum.org). The Forum is
intersted in specifying interfaces for improving availability of a
system. One of the collections of APIs (called the application
interface specification) utilizes redundant software components using
clustering approaches to improve availability.

The AIS specification specifies APIs for cluster membership, application
failover, checkpointing, eventing, messaging, and distributed locks.
All of these services are designed to work with multiple nodes.

It would be beneficial to everyone to adopt these standard interfaces.
Alot of thought has gone into them. They are pretty solid. And there
are atleast two open source implementations under way (openais and
linux-ha) and more on the horizon.

One of these projects, the openais project which I maintain, implements
3 of these services (and the rest will be done in the timeframes we are
talking about) in user space without any kernel changes required. It
would be possible with kernel to userland communication for the cluster
applications (GFS, distributed block device, etc) to use this standard
interface and implementation. Then we could avoid all of the
unnecessary kernel maintenance and potential problems that come along
with it.

Are you interested in such an approach?

Thanks
-steve

Post by Daniel Phillips
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Daniel Phillips

2004-07-10 04:58:28 UTC

Hi Steven,

Post by Daniel Phillips
While we're in here, could you please explain why CMAN needs to be
kernel-based? (Just thought I'd broach the question before Christoph
does.)

Daniel,
I have that same question as well. I can think of several
1) security faults in the protocol can crash the kernel or violate
system security
2) secure group communication is difficult to implement in kernel
- secure group key protocols can be implemented fairly easily in
userspace using packages like openssl. Implementing these
protocols in kernel will prove to be very complex.
3) live upgrades are much more difficult with kernel components
4) a standard interface (the SA Forum AIS) is not being used,
disallowing replaceability of components. This is a big deal for
people interested in clustering that dont want to be locked into
a partciular implementation.
5) dlm, fencing, cluster messaging (including membership) can be done
in userspace, so why not do it there.
6) cluster services for the kernel and cluster services for applications
will fork, because SA Forum AIS will be chosen for application
level services.
7) faults in the protocols can bring down all of Linux, instead of one
cluster service on one node.
8) kernel changes require much longer to get into the field and are
much more difficult to distribute. userspace applications are much
simpler to unit test, qualify, and release.
interrupt driven timers
some possible reduction in latency related to the cost of executing a
system call when sending messages (including lock messages)

I'm not saying you're wrong, but I can think of an advantage you didn't
mention: a service living in kernel will inherit the PF_MEMALLOC state of the
process that called it, that is, a VM cache flushing task. A userspace
service will not. A cluster block device in kernel may need to invoke some
service in userspace at an inconvenient time.

For example, suppose somebody spills coffee into a network node while another
network node is in PF_MEMALLOC state, busily trying to write out dirty file
data to it. The kernel block device now needs to yell to the user space
service to go get it a new network connection. But the userspace service may
need to allocate some memory to do that, and, whoops, the kernel won't give
it any because it is in PF_MEMALLOC state. Now what?

Post by Steven Dake
One of these projects, the openais project which I maintain, implements
3 of these services (and the rest will be done in the timeframes we are
talking about) in user space without any kernel changes required. It
would be possible with kernel to userland communication for the cluster
applications (GFS, distributed block device, etc) to use this standard
interface and implementation. Then we could avoid all of the
unnecessary kernel maintenance and potential problems that come along
with it.
Are you interested in such an approach?

We'd be remiss not to be aware of it, and its advantages. It seems your
project is still in early stages. How about we take pains to ensure that
your cluster membership service is plugable into the CMAN infrastructure, as
a starting point.

Though I admit I haven't read through the whole code tree, there doesn't seem
to be a distributed lock manager there. Maybe that is because it's so
tightly coded I missed it?

Regards,

Daniel

Steven Dake

2004-07-10 17:59:18 UTC

Comments inline thanks
-steve

Post by Daniel Phillips
Hi Steven,

Post by Daniel Phillips
While we're in here, could you please explain why CMAN needs to be
kernel-based? (Just thought I'd broach the question before Christoph
does.)

I'm not saying you're wrong, but I can think of an advantage you didn't
mention: a service living in kernel will inherit the PF_MEMALLOC state of the
process that called it, that is, a VM cache flushing task. A userspace
service will not. A cluster block device in kernel may need to invoke some
service in userspace at an inconvenient time.
For example, suppose somebody spills coffee into a network node while another
network node is in PF_MEMALLOC state, busily trying to write out dirty file
data to it. The kernel block device now needs to yell to the user space
service to go get it a new network connection. But the userspace service may
need to allocate some memory to do that, and, whoops, the kernel won't give
it any because it is in PF_MEMALLOC state. Now what?

overload conditions that have caused the kernel to run low on memory are
a difficult problem, even for kernel components. Currently openais
includes "memory pools" which preallocate data structures. While that
work is not yet complete, the intent is to ensure every data area is
preallocated so the openais executive (the thing that does all of the
work) doesn't ever request extra memory once it becomes operational.

This of course, leads to problems in the following system calls which
openais uses extensively:
sys_poll
sys_recvmsg
sys_sendmsg

which require the allocations of memory with GFP_KERNEL, which can then
fail returning ENOMEM to userland. The openais protocol currently can
handle low memory failures in recvmsg and sendmsg. This is because it
uses a protocol designed to operate on lossy networks.

The poll system call problem will be rectified by utilizing
sys_epoll_wait which does not allocate any memory (the poll data is
preallocated).

I hope that helps atleast answer that some r&d is underway to solve this
particular overload problem in userspace.

sounds good

Post by Daniel Phillips
Though I admit I haven't read through the whole code tree, there doesn't seem
to be a distributed lock manager there. Maybe that is because it's so
tightly coded I missed it?

There is as of yet no implementation of the SAF AIS dlock API in
openais. The work requires about 4 weeks of development for someone
well-skilled. I'd expect a contribution for this API in the timeframes
that make GFS interesting.

I'd invite you, or others interested in these sorts of services, to
contribute that code, if interested. If interested in developing such a
service for openais, check out the developer's map (which describes
developing a service for openais) at:

http://developer.osdl.org/dev/openais/src/README.devmap

Thanks!
-steve

Post by Daniel Phillips
Regards,
Daniel

Daniel Phillips

2004-07-10 20:57:06 UTC

Post by Daniel Phillips
I'm not saying you're wrong, but I can think of an advantage you
didn't mention: a service living in kernel will inherit the
PF_MEMALLOC state of the process that called it, that is, a VM
cache flushing task. A userspace service will not. A cluster
block device in kernel may need to invoke some service in userspace
at an inconvenient time.
For example, suppose somebody spills coffee into a network node
while another network node is in PF_MEMALLOC state, busily trying
to write out dirty file data to it. The kernel block device now
needs to yell to the user space service to go get it a new network
connection. But the userspace service may need to allocate some
memory to do that, and, whoops, the kernel won't give it any
because it is in PF_MEMALLOC state. Now what?

overload conditions that have caused the kernel to run low on memory
are a difficult problem, even for kernel components. Currently
openais includes "memory pools" which preallocate data structures.
While that work is not yet complete, the intent is to ensure every
data area is preallocated so the openais executive (the thing that
does all of the work) doesn't ever request extra memory once it
becomes operational.
This of course, leads to problems in the following system calls which
sys_poll
sys_recvmsg
sys_sendmsg
which require the allocations of memory with GFP_KERNEL, which can
then fail returning ENOMEM to userland. The openais protocol
currently can handle low memory failures in recvmsg and sendmsg.
This is because it uses a protocol designed to operate on lossy
networks.
The poll system call problem will be rectified by utilizing
sys_epoll_wait which does not allocate any memory (the poll data is
preallocated).

But if the user space service is sitting in the kernel's dirty memory
writeout path, you have a real problem: the low memory condition may
never get resolved, rendering your userspace service autistic.
Meanwhile, whoever is generating the dirty memory just keeps spinning
and spinning, generating more of it, ensuring that if the system does
survive the first incident, there's another, worse traffic jam coming
down the pipe. To trigger this deadlock, a kernel filesystem or block
device module just has to lose its cluster connection(s) at the wrong
time.

Post by Steven Dake
I hope that helps atleast answer that some r&d is underway to solve
this particular overload problem in userspace.

I'm certain there's a solution, but until it is demonstrated and proved,
any userspace cluster services must be regarded with narrow squinty
eyes.

Post by Daniel Phillips
Though I admit I haven't read through the whole code tree, there
doesn't seem to be a distributed lock manager there. Maybe that is
because it's so tightly coded I missed it?

I suspect you have underestimated the amount of development time
required.

Post by Steven Dake
I'd invite you, or others interested in these sorts of services, to
contribute that code, if interested.

Humble suggestion: try grabbing the Red Hat (Sistina) DLM code and see
if you can hack it to do what you want. Just write a kernel module
that exports the DLM interface to userspace in the desired form.

http://sources.redhat.com/cluster/dlm/

Regards,

Daniel

Steven Dake

2004-07-10 23:24:51 UTC

some comments inline

overload conditions that have caused the kernel to run low on memory
are a difficult problem, even for kernel components. Currently
openais includes "memory pools" which preallocate data structures.
While that work is not yet complete, the intent is to ensure every
data area is preallocated so the openais executive (the thing that
does all of the work) doesn't ever request extra memory once it
becomes operational.
This of course, leads to problems in the following system calls which
sys_poll
sys_recvmsg
sys_sendmsg
which require the allocations of memory with GFP_KERNEL, which can
then fail returning ENOMEM to userland. The openais protocol
currently can handle low memory failures in recvmsg and sendmsg.
This is because it uses a protocol designed to operate on lossy
networks.
The poll system call problem will be rectified by utilizing
sys_epoll_wait which does not allocate any memory (the poll data is
preallocated).

Post by Steven Dake
I hope that helps atleast answer that some r&d is underway to solve
this particular overload problem in userspace.

I'm certain there's a solution, but until it is demonstrated and proved,
any userspace cluster services must be regarded with narrow squinty
eyes.

I agree that a solution must be demonstrated and proved.

There is another option, which I regularly recommend to anyone that
must deal with memory overload conditions. Don't size the applications
in such a way as to ever cause memory overload. This practical approach
requires just a little more thought on application deployment with the
benefit of avoiding the various and many problems with memory overload
that leads to application faults, OS faults, and other sorts of nasty
conditions.

Post by Daniel Phillips
Though I admit I haven't read through the whole code tree, there
doesn't seem to be a distributed lock manager there. Maybe that is
because it's so tightly coded I missed it?

I suspect you have underestimated the amount of development time
required.

The checkpointing api took approx 3 weeks to develop and has many more
functions to implement. Cluster membership took approx 1 week to
develop. The AMF which provides application failover, the most
complicated of the APIs, took approx 8 weeks to develop. The group
messaging protocol (which implements the virtual synchrony model) has
consumed 80% of the development time thus far.

So 4 weeks is reasonable for someone not familiar with the openais
architecture or SA Forum specification, since the virtual synchrony
group messaging protocol is complete enough to implement a lock service
with simple messaging without any race conditions even during network
partitions and merges.

Post by Steven Dake
I'd invite you, or others interested in these sorts of services, to
contribute that code, if interested.

I would rather avoid non-mainline kernel dependencies at this time as it
makes adoption difficult until kernel patches are merged into upstream
code. Who wants to patch their kernel to try out some APIs? I am
doubtful these sort of kernel patches will be merged without a strong
argument of why it absolutely must be implemented in the kernel vs all
of the counter arguments against a kernel implementation.

There is one more advantage to group messaging and distributed locking
implemented within the kernel, that I hadn't originally considered; it
sure is sexy.

Regards
-steve

Post by Daniel Phillips
Regards,
Daniel

Daniel Phillips

2004-07-11 19:44:25 UTC

Post by Steven Dake
overload conditions that have caused the kernel to run low on memory
are a difficult problem, even for kernel components...
...I hope that helps atleast answer that some r&d is underway to solve
this particular overload problem in userspace.

I'm certain there's a solution, but until it is demonstrated and proved,
any userspace cluster services must be regarded with narrow squinty
eyes.

That, and "just add more memory" are the two common mistakes people make when
thinking about this problem. The kernel _normally_ runs near the low-memory
barrier, on the theory that caching as much as possible is a good thing.

Unless you can prove that your userspace approach never deadlocks, the other
questions don't even move the needle. I am sure that one day somebody, maybe
you, will demonstrate a userspace approach that is provably correct. Until
then, if you want your cluster to stay up and fail over properly, there's
only one game in town.

We need to worry about ensuring that no API _depends_ on the cluster manager
being in-kernel, and we also need to seek out and excise any parts that could
possibly be moved out to user space without enabling the deadlock or grossly
messing up the kernel code.

Post by Steven Dake
I'd invite you, or others interested in these sorts of services, to
contribute that code, if interested.

Everybody working on clusters. It's a fact of life that you have to apply
patches to run cluster filesystems right now. Production will be a different
story, but (except for the stable GFS code on 2.4) nobody is close to that.

Post by Steven Dake
I am doubtful these sort of kernel patches will be merged without a strong
argument of why it absolutely must be implemented in the kernel vs all
of the counter arguments against a kernel implementation.

True. Do you agree that the PF_MEMALLOC argument is a strong one?

Post by Steven Dake
There is one more advantage to group messaging and distributed locking
implemented within the kernel, that I hadn't originally considered; it
sure is sexy.

I don't think it's sexy, I think it's ugly, to tell the truth. I am actively
researching how to move the slow-path cluster infrastructure out of kernel,
and I would be pleased to work together with anyone else who is interested in
this nasty problem.

Regards,

Daniel

Lars Marowsky-Bree

2004-07-11 21:06:24 UTC

On 2004-07-11T15:44:25,

Unless you can prove that your userspace approach never deadlocks, th=

e other=20

questions don't even move the needle. I am sure that one day somebod=

y, maybe=20

you, will demonstrate a userspace approach that is provably correct. =

=20

If you can _prove_ your kernel-space implementation to be correct, I'll
drop all and every single complaint ;)

Until then, if you want your cluster to stay up and fail over
properly, there's only one game in town. =20

This however is not true; clusters have managed just fine running in
user-space (realtime priority, mlocked into (pre-allocated) memory
etc).

I agree that for a cluster filesystem it's much lower latency to have
the infrastructure in the kernel. Going back and forth to user-land jus=
t
ain't as fast and also not very neat.

However, the memory argument is pretty weak; the memory for
heartbeating and core functionality must be pre-allocated if you care
that much. And if you cannot allocate it, maybe you ain't healthy enoug=
h
to join the cluster in the first place.

Otherwise, I don't much care about whether it's in-kernel or not.

My main argument against being in the kernel space has always been
portability and ease of integration, which makes this quite annoying fo=
r
ISVs, and the support issues which arise. But if it's however a common
component part of the 'kernel proper', then this argument no longer
holds.

If the infrastructure takes that jump, I'd be happy. Infrastructure is
boring and has been solved/reinvented so often there's hardly anything
new and exciting about heartbeating, membership, there's more fun work
higher up the stack.

Post by Steven Dake
There is one more advantage to group messaging and distributed
locking implemented within the kernel, that I hadn't originally
considered; it sure is sexy.

I don't think it's sexy, I think it's ugly, to tell the truth. I am
actively researching how to move the slow-path cluster infrastructure
out of kernel, and I would be pleased to work together with anyone
else who is interested in this nasty problem.

Messaging (which hopefully includes strong authentication if not
encryption, though I could see that being delegated to IPsec) and
locking is in the fast-path, though.

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Arjan van de Ven

2004-07-12 06:58:46 UTC

Post by Lars Marowsky-Bree
This however is not true; clusters have managed just fine running in
user-space (realtime priority, mlocked into (pre-allocated) memory
etc).

(ignoring the entire context and argument)

Running realtime and mlocked (prealloced) is most certainly not
sufficient for causes like this; any system call that internally
allocates memory (even if it's just for allocating the kernel side of
the filename you handle to open) can lead to this RT, mlocked process to
cause VM writeout elsewhere.

While I can't say how this affects your argument, everyone should be
really careful with the "just mlock it" argument because it just doesn't
help the worst case in scenarios like this. (It most obviously helps the
average case so for soft-realtime use it's a good approach)

Lars Marowsky-Bree

2004-07-12 10:05:47 UTC

On 2004-07-12T08:58:46,

Post by Arjan van de Ven
Running realtime and mlocked (prealloced) is most certainly not
sufficient for causes like this; any system call that internally
allocates memory (even if it's just for allocating the kernel side of
the filename you handle to open) can lead to this RT, mlocked process=

Post by Arjan van de Ven
cause VM writeout elsewhere.=20

Of course; appropriate safety measures - like not doing any syscall
which could potentially block, or isolating them from the main task via
double-buffering childs - need to be done. (heartbeat does this in
fact.)

Again, if we have "many" in kernel users requiring high performance &
low-latency, running in the kernel may not be as bad, but I still don't
entirely like it.

But user-space can also manage just fine, and instead continuing the "w=
e
need highperf, low-latency and non-blocking so it must be in the
kernel", we may want to consider how to have high-perf low-latency
kernel/user-space communication so that we can NOT move this into the
kernel.

Suffice to say that many user-space implementations exist which satisfy
these needs quite sufficiently; in the case of a CFS, this argument may
be different, but I'd like to see some hard data to back it up.

(On a practical note, a system which drops out of membership because
allocating a 256 byte buffer for a filename takes longer than the node
deadtime (due to high load) is reasonably unlikely to be a healthy
cluster member anyway and is on its road to eviction already.)

The main reason why I'd like to see cluster infrastructure in the kerne=
l
is not technical, but because it increases the pressure on unification
so much that people might actually get their act together this time ;-)

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Arjan van de Ven

2004-07-12 10:11:07 UTC

Post by Lars Marowsky-Bree
On 2004-07-12T08:58:46,

well the problem is that you cannot prevent a syscall from blocking really.
O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so
(in general), it doesn't impact the memory allocation strategies by
syscalls. And there's a whopping lot of that in the non-boring syscalls...
So while your heartbeat process won't block during getpid, it'll eventually
need to do real work too .... and I'm quite certain that will lead down to
GFP_KERNEL memory allocations.

Lars Marowsky-Bree

2004-07-12 10:21:24 UTC

On 2004-07-12T12:11:07,

well the problem is that you cannot prevent a syscall from blocking r=

eally.

O_NONBLOCK only impacts the waiting for IO/socket buffer space to not=

do so

(in general), it doesn't impact the memory allocation strategies by
syscalls. And there's a whopping lot of that in the non-boring syscal=

ls...

So while your heartbeat process won't block during getpid, it'll even=

tually

need to do real work too .... and I'm quite certain that will lead do=

wn to

GFP_KERNEL memory allocations.

Sure, but the network IO is isolated from the main process via a _very
careful_ non-blocking IO using sockets library, so that works out well.
The only scenario which could still impact this severely would be that
the kernel did not schedule the soft-rr tasks often enough or all NICs
being so overloaded that we can no longer send out the heartbeat
packets, and some more silly conditions. In either case I'd venture tha=
t
said node is so unhealthy that it is quite rightfully evicted from the
cluster. A node which is so overloaded should not be starting any new
resources whatsoever.

However, of course this is more difficult for the case where you are in
the write path needed to free some memory; alas, swapping to a GFS moun=
t
is probably a realllllly silly idea, too.

But again, I'd rather like to see this solved (memory pools for
userland, PF_ etc), because it's relevant for many scenarios requiring
near-hard-realtime properties, and the answer surely can't be to push i=
t
all into the kernel.

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Arjan van de Ven

2004-07-12 10:28:19 UTC

Post by Lars Marowsky-Bree
On 2004-07-12T12:11:07,

Post by Arjan van de Ven
well the problem is that you cannot prevent a syscall from blocking really.
O_NONBLOCK only impacts the waiting for IO/socket buffer space to not do so
(in general), it doesn't impact the memory allocation strategies by
syscalls. And there's a whopping lot of that in the non-boring syscalls...
So while your heartbeat process won't block during getpid, it'll eventually
need to do real work too .... and I'm quite certain that will lead down to
GFP_KERNEL memory allocations.

Sure, but the network IO is isolated from the main process via a _very
careful_ non-blocking IO using sockets library, so that works out well.

... which of course never allocates skb's ? ;)

Post by Lars Marowsky-Bree
However, of course this is more difficult for the case where you are in
the write path needed to free some memory; alas, swapping to a GFS mount
is probably a realllllly silly idea, too.

there is more than swap, there's dirty pagecache/mmaps as well

Post by Lars Marowsky-Bree
But again, I'd rather like to see this solved (memory pools for
userland, PF_ etc), because it's relevant for many scenarios requiring

PF_ is not enough really ;)
You need to force GFP_NOFS etc for several critical parts, and well, by
being in kernel you can avoid a bunch of these allocations for real, and/or
influence their GFP flags

Lars Marowsky-Bree

2004-07-12 11:50:03 UTC

On 2004-07-12T12:28:19,

Post by Arjan van de Ven

Sure, but the network IO is isolated from the main process via a _v=

ery

Post by Arjan van de Ven

careful_ non-blocking IO using sockets library, so that works out w=

ell.

Post by Arjan van de Ven
... which of course never allocates skb's ? ;)

No, the interprocess communication does not; it's local sockets. I thin=
k
Alan (Robertson) even has a paper on this. It's really quite well
engineered, with a non-blocking poll() implementation based on signals
and stuff. Oh well.

Post by Arjan van de Ven

But again, I'd rather like to see this solved (memory pools for
userland, PF_ etc), because it's relevant for many scenarios requir=

ing

Post by Arjan van de Ven
PF_ is not enough really ;)=20
You need to force GFP_NOFS etc for several critical parts, and well, =

Post by Arjan van de Ven
being in kernel you can avoid a bunch of these allocations for real, =

and/or

Post by Arjan van de Ven
influence their GFP flags

True enough, but I'm somewhat unhappy with this still. So whenever we
have something like that we need to move it into the kernel space?
(pvmove first, and now the clustering etc.) Can't we come up with a way
to export this flag to user-space?

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Arjan van de Ven

2004-07-12 12:01:27 UTC

Post by Lars Marowsky-Bree
True enough, but I'm somewhat unhappy with this still. So whenever we
have something like that we need to move it into the kernel space?
(pvmove first, and now the clustering etc.) Can't we come up with a way
to export this flag to user-space?

I'm not convinced that's a good idea, in that it exposes what is basically VM internals
to userspace, which then would become a set-in-stone interface....

Lars Marowsky-Bree

2004-07-12 13:13:12 UTC

On 2004-07-12T14:01:27,

Post by Arjan van de Ven
I'm not convinced that's a good idea, in that it exposes what is
basically VM internals to userspace, which then would become a
set-in-stone interface....

But I'm also not a big fan of moving all HA relevant infrastructure int=
o
the kernel. Membership and DLM are the first ones; then follows
messaging (and reliable and globally ordered messaging is somewhat
complex - but if one node is slow, it will hurt global communication
too, so...), next someone argues that a node always must be able to
report which resources it holds and fence other nodes even under memory
pressure, and there goes the cluster resource manager and fencing
subsystem into the kernel too etc...

Where's the border?=20

And what can we do to make critical user-space infrastructure run
reliably and with deterministic-enough & low latency instead of moving
it all into the kernel?

Yes, the kernel solves these problems right now, but is that really the
path we want to head down? Maybe it is, I'm not sure, afterall we also
have the entire regular network stack in the kernel, but maybe also it
is not.

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Nick Piggin

2004-07-12 13:40:34 UTC

Post by Lars Marowsky-Bree
On 2004-07-12T14:01:27,

Post by Arjan van de Ven
I'm not convinced that's a good idea, in that it exposes what is
basically VM internals to userspace, which then would become a
set-in-stone interface....

But I'm also not a big fan of moving all HA relevant infrastructure into
the kernel. Membership and DLM are the first ones; then follows
messaging (and reliable and globally ordered messaging is somewhat
complex - but if one node is slow, it will hurt global communication
too, so...), next someone argues that a node always must be able to
report which resources it holds and fence other nodes even under memory
pressure, and there goes the cluster resource manager and fencing
subsystem into the kernel too etc...
Where's the border?
And what can we do to make critical user-space infrastructure run
reliably and with deterministic-enough & low latency instead of moving
it all into the kernel?
Yes, the kernel solves these problems right now, but is that really the
path we want to head down? Maybe it is, I'm not sure, afterall we also
have the entire regular network stack in the kernel, but maybe also it
is not.

I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

There would probably be a few technical things to work out (like
GFP_NOFS), but I think it would be pretty trivial to implement.

Andrew Morton

2004-07-12 20:54:32 UTC

Post by Nick Piggin
I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.

Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave which
a VM hacker happened to have and isn't a thing which should be relied upon.

A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

Daniel Phillips

2004-07-13 02:19:17 UTC

Hi Andrew,

Post by Nick Piggin
I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.
Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave which
a VM hacker happened to have and isn't a thing which should be relied upon.

Do you have a pointer to the brainwave?

Post by Andrew Morton
A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

For now we can do it with an ioctl, and we pretty much have to do it for
pvmove. But that's when user space drives the kernel by syscalls; there is
also the nasty (and common) case where the kernel needs userspace to do
something for it while it's in PF_MEMALLOC. I'm playing with ideas there,
but nothing I'm proud of yet. For now I see the in-kernel approach as the
conservative one, for anything that could possibly find itself on the VM
writeout path.

Unfortunately, that may include some messy things like authentication. I'd
really like to solve this reliable-userspace problem. We'd still have lots
of arguments left to resolve about where things should be, but at least we'd
have the choice.

Regards,

Daniel

Nick Piggin

2004-07-13 02:31:20 UTC

Post by Daniel Phillips
Hi Andrew,

Post by Nick Piggin
I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.
Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave which
a VM hacker happened to have and isn't a thing which should be relied upon.

Do you have a pointer to the brainwave?

Search for rt_task in mm/page_alloc.c

Post by Andrew Morton
A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

You'd obviously want to make the PF_MEMALLOC task as tight as possible,
and running mlocked: I don't particularly see why such a task would be
any safer in-kernel.

PF_MEMALLOC tasks won't enter page reclaim at all. The only way they
will reach the writeout path is if you are write(2)ing stuff (you may
hit synch writeout).

Daniel Phillips

2004-07-27 03:31:43 UTC

Post by Nick Piggin
I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.
Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave
which a VM hacker happened to have and isn't a thing which should be
relied upon.

Do you have a pointer to the brainwave?

Search for rt_task in mm/page_alloc.c

Ah, interesting idea: realtime tasks get to dip into the PF_MEMALLOC reserve,
until it gets down to some threshold, then they have to give up and wait like
any other unwashed nobody of a process. _But_ if there's a user space
process sitting in the writeout path and some other realtime process eats the
entire realtime reserve, everything can still grind to a halt.

So it's interesting for realtime, but does not solve the userspace PF_MEMALLOC
inversion.

Post by Andrew Morton
A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

You'd obviously want to make the PF_MEMALLOC task as tight as possible,

Not just tight, but bounded. And tight too, of course.

Post by Nick Piggin
I don't particularly see why such a task would be any safer in-kernel.

The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or
similar IPC to user space.

Post by Nick Piggin
PF_MEMALLOC tasks won't enter page reclaim at all. The only way they
will reach the writeout path is if you are write(2)ing stuff (you may
hit synch writeout).

That's the problem.

Regards,

Daniel

Nick Piggin

2004-07-27 04:07:45 UTC

Post by Nick Piggin
Search for rt_task in mm/page_alloc.c

Ah, interesting idea: realtime tasks get to dip into the PF_MEMALLOC reserve,
until it gets down to some threshold, then they have to give up and wait like
any other unwashed nobody of a process. _But_ if there's a user space
process sitting in the writeout path and some other realtime process eats the
entire realtime reserve, everything can still grind to a halt.
So it's interesting for realtime, but does not solve the userspace PF_MEMALLOC
inversion.

Not the rt_task thing, because yes, you can have other RT tasks that aren't
small and bounded that screw up your reserves.

But a PF_MEMALLOC userspace task is still useful.

Post by Andrew Morton
A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

You'd obviously want to make the PF_MEMALLOC task as tight as possible,

Not just tight, but bounded. And tight too, of course.

Post by Nick Piggin
I don't particularly see why such a task would be any safer in-kernel.

The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or
similar IPC to user space.

This is no different in kernel of course. You would have to think about
which threads need the flag and which do not. Even better, you might
aquire and drop the flag only when required. I can't see any obvious
problems you would run into.

Post by Nick Piggin
PF_MEMALLOC tasks won't enter page reclaim at all. The only way they
will reach the writeout path is if you are write(2)ing stuff (you may
hit synch writeout).

That's the problem.

Well I don't think it would be a problem to get the write throttling path
to ignore PF_MEMALLOC tasks if that is what you need. Again, this shouldn't
be any different to in kernel code.

Daniel Phillips

2004-07-27 05:57:50 UTC

Post by Nick Piggin
But a PF_MEMALLOC userspace task is still useful.

Absolutely. This is the route I'm taking, and I just use an ioctl to flip the
task bit as I mentioned (much) earlier. It still needs to be beaten up in
practice. The cluster snapshot block device, which has a relatively complex
userspace server, should be a nice test case.

Post by Daniel Phillips
The PF_MEMALLOC flag is inherited down a call chain, not across a pipe or
similar IPC to user space.

This is no different in kernel of course.

I was talking about in-kernel. Once we let the PF_MEMALLOC state escape to
user space, things start looking brighter. But you still have to invoke that
userspace code somehow, and there is no direct way to do it, hence
PF_MEMALLOC isn't inherited. An easy solution is to have a userspace daemon
that's always in PF_MEMALLOC state, as Andrew mentioned, which we can control
via a pipe or similar.

Post by Nick Piggin
You would have to think about
which threads need the flag and which do not. Even better, you might
aquire and drop the flag only when required.

Yes, that's what the ioctl is about. However, this doesn't work for servicing
writeout.

Post by Nick Piggin
I can't see any obvious problems you would run into.

;-)

Regards,

Daniel

Pavel Machek

2004-07-14 12:19:20 UTC

Hi!

Post by Nick Piggin
I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.
Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave which
a VM hacker happened to have and isn't a thing which should be relied upon.
A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

Does it work?

I mean, in kernel, we have some memory cleaners (say 5), and they
need, say, 1MB total reserved memory.

Now, if you add another task with PF_MEMALLOC. But now you'd need
1.2MB reserved memory, and you only have 1MB. Things are obviously
going to break at some point.
Pavel

--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

Nick Piggin

2004-07-15 02:19:12 UTC

Post by Pavel Machek
Hi!

Post by Nick Piggin
I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.
Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave which
a VM hacker happened to have and isn't a thing which should be relied upon.
A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

Does it work?
I mean, in kernel, we have some memory cleaners (say 5), and they
need, say, 1MB total reserved memory.
Now, if you add another task with PF_MEMALLOC. But now you'd need
1.2MB reserved memory, and you only have 1MB. Things are obviously
going to break at some point.
Pavel

Well you'd have to be more careful than that. In particular
you wouldn't just be starting these things up, let alone
have them allocate 1MB in to free some memory.

This situation would still blow up whether you did it in
kernel or not.

Marcelo Tosatti

2004-07-15 12:03:47 UTC

Post by Pavel Machek
Hi!

Post by Nick Piggin
I don't see why it would be a problem to implement a "this task
facilitates page reclaim" flag for userspace tasks that would take
care of this as well as the kernel does.

Yes, that has been done before, and it works - userspace "block drivers"
which permanently mark themselves as PF_MEMALLOC to avoid the obvious
deadlocks.

Andrew, as curiosity, what userspace "block driver" sets PF_MEMALLOC for
normal operation?

Post by Pavel Machek

Post by Andrew Morton
Note that you can achieve a similar thing in current 2.6 by acquiring
realtime scheduling policy, but that's an artifact of some brainwave which
a VM hacker happened to have and isn't a thing which should be relied upon.
A privileged syscall which allows a task to mark itself as one which
cleans memory would make sense.

Does it work?
I mean, in kernel, we have some memory cleaners (say 5), and they
need, say, 1MB total reserved memory.
Now, if you add another task with PF_MEMALLOC. But now you'd need
1.2MB reserved memory, and you only have 1MB. Things are obviously
going to break at some point.
Pavel

Well you'd have to be more careful than that. In particular
you wouldn't just be starting these things up, let alone
have them allocate 1MB in to free some memory.
This situation would still blow up whether you did it in
kernel or not.

Indeed, such PF_MEMALLOC app can probably kill the system if it bugs
allocating lots of memory from the lower reservations. It needs
some limitation.

Pavel Machek

2004-07-14 08:32:18 UTC

Hi!

Swapping to GFS mount is *very* similar. If swapping to GFS can
not work, it is unlikely write support will be reliable.

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

Steven Dake

2004-07-12 04:08:12 UTC

I'm certain there's a solution, but until it is demonstrated and proved,
any userspace cluster services must be regarded with narrow squinty
eyes.

Running "near low memory conditions" and running in memory overload
which triggers the OOM killer and other bad behaviors are two totally
different conditions in the kernel.

Post by Daniel Phillips
Unless you can prove that your userspace approach never deadlocks, the other
questions don't even move the needle. I am sure that one day somebody, maybe
you, will demonstrate a userspace approach that is provably correct. Until
then, if you want your cluster to stay up and fail over properly, there's
only one game in town.

As soon as you have proved that cman's cluster protocol cannot be the
target of attacks which lead to kernel faults or security faults..

Byzantine failures are a fact of life. There are protocols to minimize
these sorts of attacks, but implementing them in the kernel is going to
prove very difficult (but possible). One approach is to get them
working in userspace correctly, and port them to the kernel.

Oom conditions are another fact of life for poorly sized systems. If a
cluster is within an OOM condition, it should be removed from the
cluster (because it is in overload, under which unknown and generally
bad behaviors occur).

The openais project does just this: If everything goes to hell in a
handbasket on the node running the cluster executive, it will be
rejected from the membership. This rejection is implemented with a
distributed state machine that ensures, even in low memory conditions,
every node (including the failed node) reaches the same conclusions
about the current membership and works today in the current code. If at
a later time the processor can reenter the membership because it has
freed up some memory, it will do so correctly.

Post by Daniel Phillips
We need to worry about ensuring that no API _depends_ on the cluster manager
being in-kernel, and we also need to seek out and excise any parts that could
possibly be moved out to user space without enabling the deadlock or grossly
messing up the kernel code.

Post by Steven Dake
I'd invite you, or others interested in these sorts of services, to
contribute that code, if interested.

Perhaps people skilled in running pre-alpha software would consider
patching a kernel to "give it a run". I have no doubts about that.

I would posit a guess people interested in implementing production
clusters are not too interested about applying kernel patches (and
causing their kernel to become unsupported) to achieve clustering
support any time soon.

True. Do you agree that the PF_MEMALLOC argument is a strong one?

out of memory overload is a sucky situation poorly handled by any
software, kernel, userland, embedded, whatever. The best solution is to
size the applications such that a memory overload doesn't occur. Then
if a memory overload condition does occur, that node should aleast
become suspected of a byzantine failure condition which should cause its
rejection from the current membership (in the case of a distributed
system such as a cluster).

Post by Steven Dake
There is one more advantage to group messaging and distributed locking
implemented within the kernel, that I hadn't originally considered; it
sure is sexy.

There can be some advantages to group messaging being implemented in the
kernel, if it is secure, done correctly (in my view, correctly means
implementing the virtual synchrony model) and has low risk of impact to
other systems.

There are no kernel implemented clustering protocols that come close to
these goals today.

There are userland implementations under way which will meet these
objectives.

Perhaps these protocols could be ported to the kernel if group messaging
absolutely must be available to kernel components without userland
intervention. But I'm still not convinced userland isn't the correct
place for these sorts of things.

Thanks
-steve

Post by Daniel Phillips
Regards,
Daniel

Daniel Phillips

2004-07-12 04:23:44 UTC

Post by Steven Dake
Oom conditions are another fact of life for poorly sized systems. If
a cluster is within an OOM condition, it should be removed from the
cluster (because it is in overload, under which unknown and generally
bad behaviors occur).

You missed the point. The memory deadlock I pointed out occurs in
_normal operation_. You have to find a way around it, or kernel
cluster services win, plain and simple.

Post by Steven Dake
The openais project does just this: If everything goes to hell in a
handbasket on the node running the cluster executive, it will be
rejected from the membership. This rejection is implemented with a
distributed state machine that ensures, even in low memory
conditions, every node (including the failed node) reaches the same
conclusions about the current membership and works today in the
current code. If at a later time the processor can reenter the
membership because it has freed up some memory, it will do so
correctly.

Think about it. Do you want nodes spontaneously falling over from time
to time, even though nothing is wrong with them? What does that do
your 5 nines?

Post by Steven Dake
I would rather avoid non-mainline kernel dependencies at this
time as it makes adoption difficult until kernel patches are
merged into upstream code. Who wants to patch their kernel to
try out some APIs?

Everybody working on clusters. It's a fact of life that you have
to apply patches to run cluster filesystems right now. Production
will be a different story, but (except for the stable GFS code on
2.4) nobody is close to that.

Perhaps people skilled in running pre-alpha software would consider
patching a kernel to "give it a run". I have no doubts about that.
I would posit a guess people interested in implementing production
clusters are not too interested about applying kernel patches (and
causing their kernel to become unsupported) to achieve clustering
support any time soon.

We are _far_ from production, at least on 2.6. At this point, we are
only interested in people who like to code, test, tinker, and be the
first kid on the block with a shiny new storage cluster in their rec
room. And by "we" I mean "you, me, and everybody else who hopes that
Linux will kick butt in clusters, in the 2.8 time frame."

Post by Steven Dake
I am doubtful these sort of kernel patches will be merged without
a strong argument of why it absolutely must be implemented in the
kernel vs all of the counter arguments against a kernel
implementation.

True. Do you agree that the PF_MEMALLOC argument is a strong one?

out of memory overload is a sucky situation poorly handled by any
software, kernel, userland, embedded, whatever.

In case you missed it above, please let me point out one more time that
I am not talking about OOM. I'm talking about a deadlock that may come
up even when a resource usage is well within limits, which is inherent
in the basic design of Linux. There is nothing Byzantine about it.

Regards,

Daniel

Steven Dake

2004-07-12 18:21:39 UTC

You missed the point. The memory deadlock I pointed out occurs in
_normal operation_. You have to find a way around it, or kernel
cluster services win, plain and simple.

The bottom line is that we just don't know if any such deadlock occurs,
under normal operations. The remaining objections to in-kernel cluster
services give us alot of reason to test out a userland approach.

I propose after a distributed lock service is implemented in user space,
to add support for such a project into the gfs and remaining redhat
storage cluster services trees. This will give us real data on
performance and reliability that we can't get by guessing.

Thanks
-steve

Post by Steven Dake
current code. If at a later time the processor can reenter the
membership because it has freed up some memory, it will do so
correctly.

Think about it. Do you want nodes spontaneously falling over from time
to time, even though nothing is wrong with them? What does that do
your 5 nines?

Everybody working on clusters. It's a fact of life that you have
to apply patches to run cluster filesystems right now. Production
will be a different story, but (except for the stable GFS code on
2.4) nobody is close to that.

Perhaps people skilled in running pre-alpha software would consider
patching a kernel to "give it a run". I have no doubts about that.
I would posit a guess people interested in implementing production
clusters are not too interested about applying kernel patches (and
causing their kernel to become unsupported) to achieve clustering
support any time soon.

Post by Steven Dake
I am doubtful these sort of kernel patches will be merged without
a strong argument of why it absolutely must be implemented in the
kernel vs all of the counter arguments against a kernel
implementation.

True. Do you agree that the PF_MEMALLOC argument is a strong one?

out of memory overload is a sucky situation poorly handled by any
software, kernel, userland, embedded, whatever.

Daniel Phillips

2004-07-12 19:54:12 UTC

Post by Steven Dake
Oom conditions are another fact of life for poorly sized systems.
If a cluster is within an OOM condition, it should be removed
from the cluster (because it is in overload, under which unknown
and generally bad behaviors occur).

You missed the point. The memory deadlock I pointed out occurs in
_normal operation_. You have to find a way around it, or kernel
cluster services win, plain and simple.

The bottom line is that we just don't know if any such deadlock
occurs, under normal operations.

I thought I demonstrated that, should I restate? You need to point out
the flaw in my argument (about the deadlock, not about philosophy).
If/when you succeed, I will be pleased. Until you do succeed, there's
a deadlock.

Regards,

Daniel

Pavel Machek

2004-07-13 20:06:24 UTC

Hi!

Post by Daniel Phillips
You missed the point. The memory deadlock I pointed out occurs in
_normal operation_. You have to find a way around it, or kernel
cluster services win, plain and simple.

The bottom line is that we just don't know if any such deadlock occurs,
under normal operations. The remaining objections to in-kernel cluster

I did some work on swapping-over-nbd, which has similar issues,
and yes, the deadlocks were seen under heavy load.

*Designing* something with "lets hope it does not deadlock",
while deadlock clearly can be triggered, looks like bad idea.

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

David Teigland

2004-07-10 04:58:24 UTC

Post by Daniel Phillips
Hi Dave,

While we're in here, could you please explain why CMAN needs to be
kernel-based? (Just thought I'd broach the question before Christoph
does.)

I have that same question as well.

Post by Steven Dake
1) security faults in the protocol can crash the kernel or violate
system security
2) secure group communication is difficult to implement in kernel
- secure group key protocols can be implemented fairly easily in
userspace using packages like openssl. Implementing these
protocols in kernel will prove to be very complex.
3) live upgrades are much more difficult with kernel components
4) a standard interface (the SA Forum AIS) is not being used,
disallowing replaceability of components. This is a big deal for
people interested in clustering that dont want to be locked into
a partciular implementation.
5) dlm, fencing, cluster messaging (including membership) can be done
in userspace, so why not do it there.
6) cluster services for the kernel and cluster services for applications
will fork, because SA Forum AIS will be chosen for application
level services.
7) faults in the protocols can bring down all of Linux, instead of one
cluster service on one node.
8) kernel changes require much longer to get into the field and are
much more difficult to distribute. userspace applications are much
simpler to unit test, qualify, and release.
interrupt driven timers
some possible reduction in latency related to the cost of executing a
system call when sending messages (including lock messages)

This view of advantages/disadvantages seems sensible when working with your
average userland clustering application. The SAF spec looks pretty nice in
that context. I think gfs and a kernel-based dlm for gfs are a different
story, though. They're different enough from other things that few of the same
considerations seem practical. This has been our experience so far, things
could possibly change for some next-generation (think time span of years).

You'll note that gfs uses external, interchangable locking/cluster systems
which makes it easy to look at alternatives. cman and dlm are what gfs/clvm
use today; if they prove useful to others that's great, we'd even be happy to
help make them more useful.

--
Dave Teigland <***@redhat.com>

Lars Marowsky-Bree

2004-07-12 10:14:39 UTC

On 2004-07-08T18:53:38,

Excuse my ignorance, but does this ensure that there's concensus among
the nodes about this membership?

Post by David Teigland
has quorum. It's a very standard way of doing things -- we modelled =

Post by David Teigland
directly off the VMS-cluster style. Whether you care about this quor=

um value

Post by David Teigland
or what you do with it are beside the point.=20

OK, I agree with this. As long as the CMAN itself doesn't care about
this either but just reports it to the cluster, that's fine.

Post by David Teigland
What about Fencing? Fencing is not a part of the cluster manager, no=

Post by David Teigland
a part of the dlm and not a part of gfs. It's an entirely independen=

Post by David Teigland
system that runs on its own in userland. It depends on cman for
cluster information just like the dlm or gfs does. I'll repeat what =

I
I doubt it can be entirely independent; or how do you implement lock
recovery without a fencing mechanism?

Post by David Teigland
This fencing system is suitable for us in our gfs/clvm work. It's
probably suitable for others, too. For everyone? no.=20

It sounds useful enough even for our work, given appropriate
notification of fencing events; instead of scheduling a fencing event,
we'd need to make sure that the node joins a fencing domain and later
block until receiving a notification. It's not as fine grained, but our
approach (based on the dependencies of the resources managed, basically=
)
might have been more fine grained than required in a typical
environment.

Yes, I can see how that could be made to work.

Sincerely,
Lars Marowsky-Br=E9e <***@suse.de>

--=20
High Availability & Clustering \ ever tried. ever failed. no matter=
=2E
SUSE Labs, Research and Development | try again. fail again. fail bette=
r.
SUSE LINUX AG - A Novell company \ -- Samuel Beckett

Aneesh Kumar K.V

2004-07-06 06:39:08 UTC

Post by Daniel Phillips
Don't you think we ought to take a look at how OCFS and GFS might share
some of the same infrastructure, for example, the DLM and cluster
membership services?

For cluster membership, you might consider looking at the OpenAIS CLM
portion. It would be nice if this type of thing was unified across more
than just filesystems.

How about looking Cluster Infrastructure ( http://ci-linux.sf.net ) and
OpenSSI ( http://www.openssi.org ) for cluster membership service.

-aneesh

James Bottomley

2004-07-10 14:58:02 UTC

gfs needs to run in the kernel. dlm should run in the kernel since gfs uses it
so heavily. cman is the clustering subsystem on top of which both of those are
built and on which both depend quite critically. It simply makes most sense to
put cman in the kernel for what we're doing with it. That's not a dogmatic
position, just a practical one based on our experience.

This isn't really acceptable. We've spent a long time throwing things
out of the kernel so you really need a good justification for putting
things in again. "it makes sense" and "its just practical" aren't
sufficient.

You also face two other additional hurdles:

1) GFS today uses a user space DLM. What critical problems does this
have that you suddenly need to move it all into the kernel?

2) We have numerous other clustering products for Linux, none of which
(well except the Veritas one) has any requirement at all on having
pieces in the kernel. If all the others operate in user space, why does
yours need to be in the kernel?

So do you have a justification for requiring these as kernel components?

James

David Teigland

2004-07-10 16:04:09 UTC

Post by David Teigland
gfs needs to run in the kernel. dlm should run in the kernel since gfs
uses it so heavily. cman is the clustering subsystem on top of which
both of those are built and on which both depend quite critically. It
simply makes most sense to put cman in the kernel for what we're doing
with it. That's not a dogmatic position, just a practical one based on
our experience.
This isn't really acceptable. We've spent a long time throwing things out of
the kernel so you really need a good justification for putting things in
again. "it makes sense" and "its just practical" aren't sufficient.

The "it" refers to gfs. This means gfs doesn't make a lot of sense and isn't
very practical without it. I'm not the one to speculate on what gfs would
become otherwise, others would do that better.

Post by David Teigland
1) GFS today uses a user space DLM. What critical problems does this have
that you suddenly need to move it all into the kernel?

GFS does not use a user space dlm today. GFS uses the client-server gulm lock
manager for which the client (gfs) side runs in the kernel and the gulm server
runs in userspace on some other node. People have naturally been averse to
using servers like this with gfs for a long time and we've finally created the
serverless dlm (a la VMS clusters). For many people this is the only option
that makes gfs interesting; it's also what the opengfs group was doing.

This is a revealing discussion. We've worked hard to make gfs's lock manager
independent from gfs itself so it could be useful to others and make gfs less
monolithic. We could have left it embedded within the file system itself --
that's what most other cluster file systems do. If we'd done that we would
have avoided this objection altogether but with an inferior design. The fact
that there's an independent lock manager to point at and question illustrates
our success. The same goes for the cluster manager. (We could, of course, do
some simple glueing together and make a monlithic system again :-)

Post by David Teigland
2) We have numerous other clustering products for Linux, none of which (well
except the Veritas one) has any requirement at all on having pieces in the
kernel. If all the others operate in user space, why does yours need to be
in the kernel?

If you want gfs in user space you don't want gfs; you want something different.

--
Dave Teigland <***@redhat.com>

James Bottomley

2004-07-10 16:26:44 UTC