Discussion:
[PATCH 00/14] GFS
Pekka Enberg
2005-08-02 10:16:53 UTC
Hi David,
Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel. The 14 patches total about 900K so I won't send
them to the list unless that's requested. Comments and suggestions are
welcome. Thanks
+#define kmalloc_nofail(size, flags) \
+ gmalloc_nofail((size), (flags), __FILE__, __LINE__)
[snip]
+void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
+ unsigned int line)
+{
+ void *x;
+ for (;;) {
+ x = kmalloc(size, flags);
+ if (x)
+ return x;
+ if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
+ printk("GFS2: out of memory: %s, %u\n",
+ __FILE__, __LINE__);
+ gfs2_malloc_warning = jiffies;
+ }
+ yield();
This does not belong in a filesystem. It also seems like a very bad
idea. What are you trying to do here? If you absolutely must not fail,
use __GFP_NOFAIL instead.
+ }
+}
+
+#if defined(GFS2_MEMORY_SIMPLE)
+
+atomic_t gfs2_memory_count;
+
+void gfs2_memory_add_i(void *data, char *file, unsigned int line)
+{
+ atomic_inc(&gfs2_memory_count);
+}
+
+void gfs2_memory_rm_i(void *data, char *file, unsigned int line)
+{
+ if (data)
+ atomic_dec(&gfs2_memory_count);
+}
+
+void *gmalloc(unsigned int size, int flags, char *file, unsigned int line)
+{
+ void *data = kmalloc(size, flags);
+ if (data)
+ atomic_inc(&gfs2_memory_count);
+ return data;
+}
+
+void *gmalloc_nofail(unsigned int size, int flags, char *file,
+ unsigned int line)
+{
+ atomic_inc(&gfs2_memory_count);
+ return gmalloc_nofail_real(size, flags, file, line);
+}
+
+void gfree(void *data, char *file, unsigned int line)
+{
+ if (data) {
+ atomic_dec(&gfs2_memory_count);
+ kfree(data);
+ }
+}
-mm has memory leak detection patches and there are others floating
around. Please do not introduce yet another subsystem-specific debug allocator.

Pekka
David Teigland
2005-08-03 03:56:18 UTC
* The on disk structures are defined in terms of uint32_t and friends,
which are NOT endian neutral. Why are they not le32/be32 and thus
endian-defined? Did you run bitwise-sparse on GFS yet?
GFS has had proper endian handling for many years; it's still correct as
far as we've been able to test. I ran bitwise-sparse yesterday and didn't
find anything alarming.
* None of your on disk structures are packed. Are you sure?
Quite; particular attention has been paid to aligning the structure
fields; you'll find "pad" fields throughout. We'll write a quick test to
verify that packing doesn't change anything.
+#define gfs2_16_to_cpu be16_to_cpu
+#define gfs2_32_to_cpu be32_to_cpu
+#define gfs2_64_to_cpu be64_to_cpu
why this pointless abstracting?
#ifdef GFS2_ENDIAN_BIG

#define gfs2_16_to_cpu be16_to_cpu
#define gfs2_32_to_cpu be32_to_cpu
#define gfs2_64_to_cpu be64_to_cpu

#define cpu_to_gfs2_16 cpu_to_be16
#define cpu_to_gfs2_32 cpu_to_be32
#define cpu_to_gfs2_64 cpu_to_be64

#else /* GFS2_ENDIAN_BIG */

#define gfs2_16_to_cpu le16_to_cpu
#define gfs2_32_to_cpu le32_to_cpu
#define gfs2_64_to_cpu le64_to_cpu

#define cpu_to_gfs2_16 cpu_to_le16
#define cpu_to_gfs2_32 cpu_to_le32
#define cpu_to_gfs2_64 cpu_to_le64

#endif /* GFS2_ENDIAN_BIG */

The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
on-disk instead of LE which is another useful way to verify endian
correctness.
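The compile-time switch above can be mimicked in a self-contained userspace sketch. The helper names here are invented for the example, and the conversions are hand-rolled on explicit bytes so the code is portable, rather than using the kernel's actual be32_to_cpu/le32_to_cpu:

```c
#include <stdint.h>

/* Byte-order helpers that operate on explicit bytes, so they behave the
 * same on any host.  Define GFS2_ENDIAN_BIG at compile time to treat
 * the on-disk format as big-endian instead of little-endian, mirroring
 * the macro scheme in the patch. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint32_t load_le32(const unsigned char *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

#ifdef GFS2_ENDIAN_BIG
#define gfs2_32_to_cpu load_be32
#else /* default: little-endian on disk, as in the patch */
#define gfs2_32_to_cpu load_le32
#endif
```

Building the same code once with and once without GFS2_ENDIAN_BIG, then checking that the filesystem image round-trips, is exactly the kind of endianness verification described above.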

You should be able to use gfs in mixed architecture and mixed endian
clusters. We don't have a mixed endian cluster to test, though.
* +static const uint32_t crc_32_tab[] = .....
why do you duplicate this? The kernel already has a perfectly good set of
generic crc32 tables/functions.
We'll try them, they'll probably do fine.
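For reference, the kernel's generic tables implement the standard IEEE 802.3 polynomial (crc32_le() in lib/crc32.c, which takes the seed as a parameter and leaves the inversions to the caller). A minimal bit-at-a-time userspace sketch of the same CRC, with the function name invented for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 (IEEE, reflected, polynomial 0xEDB88320).
 * The kernel's crc32_le() computes the same function table-driven,
 * which is why a private crc_32_tab[] copy is redundant. */
static uint32_t crc32_ieee(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}
```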
* Why use your own journalling layer and not say ... jbd ?
Here's an analysis of three approaches to cluster-fs journaling and their
pros/cons (including using jbd): http://tinyurl.com/7sbqq
* + while (!kthread_should_stop()) {
+ gfs2_scand_internal(sdp);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
+ }
you probably really want to check for signals if you do interruptible sleeps
I don't know why we'd be interested in signals here.
* why not use msleep() and friends instead of schedule_timeout(), you're
not using the complex variants anyway
When unmounting we really appreciate waking up sooner than the timeout;
otherwise the unmount sits and waits for the longest daemon's msleep to
complete. I converted this to msleep() recently, but it was too painful
and I had to go back.

We'll get to your other comments, thanks.
Dave
David Teigland
2005-08-03 10:08:49 UTC
Post by David Teigland
The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
on-disk instead of LE which is another useful way to verify endian
correctness.
that sounds wrong to be a compile option. If you really want to deal
with dual disk endianness it really ought to be a runtime one (see jffs2
for example).
We don't want BE to be an "option" per se; as developers we'd just like to
be able to compile it that way to verify gfs's endianness handling. If
you think that's unmaintainable or a bad idea we'll rip it out.
Post by David Teigland
* + while (!kthread_should_stop()) {
+ gfs2_scand_internal(sdp);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
you probably really want to check for signals if you do
interruptible sleeps
I don't know why we'd be interested in signals here.
well.. because if you don't, your schedule_timeout() becomes a no-op when
you get one, which makes your loop a busy-waiting one.
OK, it looks like we need to block/flush signals a la daemonize(); I guess
I mistakenly figured the kthread routines did everything daemonize did.
Thanks,
Dave
Lars Marowsky-Bree
2005-08-03 10:37:44 UTC
Post by David Teigland
* Why use your own journalling layer and not say ... jbd ?
Here's an analysis of three approaches to cluster-fs journaling and their
pros/cons (including using jbd): http://tinyurl.com/7sbqq
Very instructive read, thanks for the link.
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
-- Charles Darwin
Mark Fasheh
2005-08-03 18:54:01 UTC
Post by Lars Marowsky-Bree
Post by David Teigland
* Why use your own journalling layer and not say ... jbd ?
Here's an analysis of three approaches to cluster-fs journaling and their
pros/cons (including using jbd): http://tinyurl.com/7sbqq
Very instructive read, thanks for the link.
While it may be true that for a full log, flushing for a *single* lock may
be more expensive in OCFS2, Ken ignores the fact that in our one big flush
we've made all locks on journalled resources immediately releasable.
According to that description, GFS2 would have to do a separate transaction
flush (including the extra step of writing revoke records) for each lock
protecting a journalled resource. Assuming the same number of locks are
required to be dropped under both systems then for a number of locks > 1
OCFS2 will actually do less work - the actual metadata blocks would be the
same on either end, but JBD only has to write that the journal is now clean
to the journal superblock whereas GFS2 has to revoke the blocks for each
dropped lock.

Of course all of this talk completely avoids the fact that in any case these
things are expensive so a cluster file system has to take care to ping locks
as little as possible. OCFS2 takes great pains to make as many operations
node local (requiring no cluster locks) as possible - data allocation is
usually done from a node local pool which is refreshed from the main bitmap.
Deallocation happens similarly - we have a truncate log in which we record
deleted clusters. Each node has its own inode and metadata chain
allocators which another node will only lock for delete (a truncate log
style local metadata delete log could easily be added if that ever became a
problem).
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
***@oracle.com
Arjan van de Ven
2005-08-03 09:17:09 UTC
Post by David Teigland
The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
on-disk instead of LE which is another useful way to verify endian
correctness.
that sounds wrong to be a compile option. If you really want to deal
with dual disk endianness it really ought to be a runtime one (see jffs2
for example).
Post by David Teigland
* Why use your own journalling layer and not say ... jbd ?
Here's an analysis of three approaches to cluster-fs journaling and their
pros/cons (including using jbd): http://tinyurl.com/7sbqq
* + while (!kthread_should_stop()) {
+ gfs2_scand_internal(sdp);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
+ }
you probably really want to check for signals if you do interruptible sleeps
I don't know why we'd be interested in signals here.
well.. because if you don't, your schedule_timeout() becomes a no-op when
you get one, which makes your loop a busy-waiting one.
David Teigland
2005-08-03 06:36:44 UTC
Post by Pekka Enberg
+void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
+ unsigned int line)
+{
+ void *x;
+ for (;;) {
+ x = kmalloc(size, flags);
+ if (x)
+ return x;
+ if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
+ printk("GFS2: out of memory: %s, %u\n",
+ __FILE__, __LINE__);
+ gfs2_malloc_warning = jiffies;
+ }
+ yield();
This does not belong in a filesystem. It also seems like a very bad
idea. What are you trying to do here? If you absolutely must not fail,
use __GFP_NOFAIL instead.
will do, carried over from before NOFAIL existed
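The kernel-side fix is a one-liner, kmalloc(size, flags | __GFP_NOFAIL), which makes the page allocator itself retry until the allocation succeeds. As a userspace illustration of what the open-coded loop was doing (the function name here is invented for the example):

```c
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Userspace analog of the patch's gmalloc_nofail_real(): retry the
 * allocation forever, yielding the CPU between attempts.  With
 * __GFP_NOFAIL the allocator loops internally, so the caller-side
 * retry, the warning throttle, and the yield() all disappear. */
static void *xmalloc_retry(size_t size)
{
    for (;;) {
        void *x = malloc(size);
        if (x)
            return x;
        fprintf(stderr, "out of memory, retrying\n");
        sched_yield();          /* like yield() in the patch */
    }
}
```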
Post by Pekka Enberg
-mm has memory leak detection patches and there are others floating
around. Please do not introduce yet another subsystem-specific debug allocator.
ok, thanks
Dave
Pekka J Enberg
2005-08-08 14:14:45 UTC
+static ssize_t walk_vm_hard(struct file *file, char *buf, size_t size,
+ loff_t *offset, do_rw_t operation)
+{
+ struct gfs2_holder *ghs;
+ unsigned int num_gh = 0;
+ ssize_t count;
+
+ {
Can we please get rid of the extra braces everywhere?

[snip]
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
+ if (end <= vma->vm_start)
+ break;
+ if (vma->vm_file &&
+ vma->vm_file->f_dentry->d_inode->i_sb == sb) {
+ num_gh++;
+ }
+ }
+
+ ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder),
+ GFP_KERNEL);
+ if (!ghs) {
+ if (!dumping)
+ up_read(&mm->mmap_sem);
+ return -ENOMEM;
+ }
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
Sorry if this is an obvious question but what prevents another thread from
doing mmap() before we do the second walk and messing up num_gh?
+ if (end <= vma->vm_start)
+ break;
+ if (vma->vm_file) {
+ struct inode *inode;
+ inode = vma->vm_file->f_dentry->d_inode;
+ if (inode->i_sb == sb)
+ gfs2_holder_init(get_v2ip(inode)->i_gl,
+ vma2state(vma),
+ 0, &ghs[x++]);
+ }
+ }
Pekka
Zach Brown
2005-08-08 18:32:55 UTC
Post by Pekka J Enberg
Sorry if this is an obvious question but what prevents another thread
from doing mmap() before we do the second walk and messing up num_gh?
Nothing, I suspect. OCFS2 has a problem like this, too. It wants a way
for a file system to serialize mmap/munmap/mremap during file IO. Well,
more specifically, it wants to make sure that the locks it acquired at
the start of the IO really cover the buf regions that might fault during
the IO.. mapping activity during the IO can wreck that.

- z
Christoph Hellwig
2005-08-10 07:21:21 UTC
Post by Zach Brown
Post by Pekka J Enberg
Sorry if this is an obvious question but what prevents another thread
from doing mmap() before we do the second walk and messing up num_gh?
Nothing, I suspect. OCFS2 has a problem like this, too. It wants a way
for a file system to serialize mmap/munmap/mremap during file IO. Well,
more specifically, it wants to make sure that the locks it acquired at
the start of the IO really cover the buf regions that might fault during
the IO.. mapping activity during the IO can wreck that.
In addition, the vma walk will become an unmaintainable mess as soon as
someone introduces another mmap() capable fs that needs similar locking.
We already have OCFS2 in -mm that does similar things. I think we need
to solve this in common code before either of them can be merged.
Pekka J Enberg
2005-08-10 07:31:04 UTC
Post by Christoph Hellwig
In addition, the vma walk will become an unmaintainable mess as soon as
someone introduces another mmap() capable fs that needs similar locking.
We already have OCFS2 in -mm that does similar things. I think we need
to solve this in common code before either of them can be merged.
It seems to me that the distributed locks must be acquired in ->nopage
anyway to solve the problem with memcpy() between two mmap'd regions. One
possible solution would be for the lock manager to detect deadlocks and
break some locks accordingly. Don't know how well that would mix with
->nopage though...

Pekka
Pekka J Enberg
2005-08-10 16:57:43 UTC
Hi Mark,
This may sound naive, but so far OCFS2 has avoided the need for deadlock
detection... I'd hate to have to add it now -- better to try avoiding them
in the first place.
Surely avoiding them is preferred, but how do you do that when you have
two mmap'd regions where userspace does memcpy()? The kernel won't have
much say in it until ->nopage. We cannot grab all the required locks in the
proper order here because we don't know what size the buffer is. That's why
I think lock sorting won't work for all the cases and thus the problem
needs to be taken care of by the dlm.

Pekka
Pekka J Enberg
2005-08-10 20:18:48 UTC
Hmm, well today in OCFS2 if you're not coming from read or write, the lock
is held only for the duration of ->nopage so I don't think we could get into
any deadlocks for that usage.
Aah, I see GFS2 does that too, so no deadlocks here. Thanks. You, however,
don't maintain the same level of data consistency when reads and writes
come from other filesystems, as those rely on ->nopage.

Fixing this requires a generic vma walk in every write() and read(), no?
That doesn't seem such a hot idea, which brings us back to using ->nopage
for taking the locks (but now the deadlocks are back).

Pekka
Mark Fasheh
2005-08-10 22:07:44 UTC
Post by Pekka J Enberg
Aah, I see GFS2 does that too so no deadlocks here. Thanks.
Yep, no problem :)
Post by Pekka J Enberg
You, however, don't maintain the same level of data consistency when reads
and writes come from other filesystems, as those rely on ->nopage.
I'm not sure what you mean here...
Post by Pekka J Enberg
Fixing this requires a generic vma walk in every write() and read(), no?
That doesn't seem such a hot idea, which brings us back to using ->nopage
for taking the locks (but now the deadlocks are back).
Yeah if you look through mmap.c in ocfs2_fill_ctxt_from_buf() we do this...
Or am I misunderstanding what you mean?
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
***@oracle.com
Pekka J Enberg
2005-08-11 04:41:17 UTC
Hi,
Post by Mark Fasheh
Post by Pekka J Enberg
You, however, don't maintain the same level of data consistency when reads
and writes come from other filesystems, as those rely on ->nopage.
I'm not sure what you mean here...
Reading and writing from other filesystems to a GFS2 mmap'd file
does not walk the vmas. Therefore, data consistency guarantees
are different:

- A read() on a GFS2 file that writes into a GFS2 mmap'd
file -> we take all locks for the mmap'd buffer in order and
release them after read() is done.

- A read() on an ext3 file, for example, that writes into a
GFS2 mmap'd file -> we now take locks one page at a
time, releasing them before we exit ->nopage(). Other nodes
are now free to write to the same GFS2 mmap'd file.

Or am I missing something here?
Post by Mark Fasheh
Post by Pekka J Enberg
Fixing this requires a generic vma walk in every write() and read(), no?
That doesn't seem such a hot idea, which brings us back to using ->nopage
for taking the locks (but now the deadlocks are back).
Yeah if you look through mmap.c in ocfs2_fill_ctxt_from_buf() we do this...
Or am I misunderstanding what you mean?
If we are doing write() or read() from some other filesystem, we don't walk the
vmas but instead rely on ->nopage for locking, right?

Pekka
Mark Fasheh
2005-08-10 18:21:56 UTC
Post by Pekka J Enberg
Surely avoiding them is preferred, but how do you do that when you have
two mmap'd regions where userspace does memcpy()? The kernel won't have
much say in it until ->nopage. We cannot grab all the required locks in the
proper order here because we don't know what size the buffer is. That's why
I think lock sorting won't work for all the cases and thus the problem
needs to be taken care of by the dlm.
Hmm, well today in OCFS2 if you're not coming from read or write, the lock
is held only for the duration of ->nopage so I don't think we could get into
any deadlocks for that usage.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
***@oracle.com
Mark Fasheh
2005-08-10 16:26:18 UTC
Post by Pekka J Enberg
It seems to me that the distributed locks must be acquired in ->nopage
anyway to solve the problem with memcpy() between two mmap'd regions. One
possible solution would be for the lock manager to detect deadlocks and
break some locks accordingly. Don't know how well that would mix with
->nopage though...
Yeah, my experience with ->nopage so far has indicated to me that we should
avoid erroring out if at all possible, which I believe is what we'd have to
do if a deadlock were found.
Also, I'm not sure how multiple dlms would coordinate deadlock detection in
that case.
This may sound naive, but so far OCFS2 has avoided the need for deadlock
detection... I'd hate to have to add it now -- better to try avoiding them
in the first place.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
***@oracle.com
Pekka J Enberg
2005-08-10 04:48:29 UTC
But couldn't we use make_pages_present() to figure out which locks we need,
sort them, and then grab them?
Doh, obviously we can't as nopage() needs to bring the page in. Sorry about
that.

I also thought of another failure case for the vma walk: a thread using
userspace memcpy() between two clusterfs mmap'd regions instead of write()
or read().

Pekka
Pekka J Enberg
2005-08-09 18:35:58 UTC
Hi Zach,
I'll try, briefly.
Thanks for the excellent explanation.
And that's the problem. Because they're acquired in ->nopage they can
be acquired during a fault that is servicing the 'buf' argument to an
outer file->{read,write} operation which has grabbed a lock for the
target file. Acquiring multiple locks introduces the risk of ABBA
deadlocks. It's trivial to construct examples of mmap(), read(), and
write() on 2 nodes with 2 files that deadlock.
But couldn't we use make_pages_present() to figure out which locks we need, sort
them, and then grab them?
I brought this up with some people at the kernel summit but no one,
including myself, considers it a high priority. It wouldn't be too hard
to construct a patch if people want to take a look.
I guess it's not a problem as long as the kernel has zero or one cluster
filesystems that support mmap(). After we have two or more, we have a
problem.

The GFS2 vma walk needs fixing anyway, I think, as it can overrun the ghs
array (if more matching vmas show up during the second walk).

Pekka
Zach Brown
2005-08-09 17:17:10 UTC
In addition, the vma walk will become an unmaintainable mess as soon
as someone introduces another mmap() capable fs that needs similar
locking.
Yup, I suspect that if the core kernel ends up caring about this problem
then the VFS will be involved in helping file systems sort the locks
they'll acquire around IO.
I am not an expert so could someone please explain why this cannot be
done with a_ops->prepare_write and friends?
I'll try, briefly.

Usually clustered file systems in Linux maintain data consistency for
normal posix IO by holding DLM locks for the duration of their
file->{read,write} methods. A task on a node won't be able to read
until all tasks on other nodes have finished any conflicting writes they
might have been performing, etc, nothing surprising here.

Now say we want to extend consistency guarantees to mmap(). This boils
down to protecting mappings with DLM locks. Say a page is mapped for
reading; the continued presence of that mapping is protected by holding
a DLM lock. If another node goes to write to that page, the read lock
is revoked and the mapping is torn down. These locks are acquired in
a_ops->nopage as the task faults and tries to bring up the mapping.

And that's the problem. Because they're acquired in ->nopage they can
be acquired during a fault that is servicing the 'buf' argument to an
outer file->{read,write} operation which has grabbed a lock for the
target file. Acquiring multiple locks introduces the risk of ABBA
deadlocks. It's trivial to construct examples of mmap(), read(), and
write() on 2 nodes with 2 files that deadlock.

So clustered file systems in Linux (GFS, Lustre, OCFS2, (GPFS?)) all
walk vmas in their file->{read,write} to discover mappings that belong
to their files so that they can preemptively sort and acquire the locks
that will be needed to cover the mappings that might be established in
->nopage. As you point out, this both relies on the mappings not
changing and gets very exciting when you mix files and mappings between
file systems that are each sorting and acquiring their own DLM locks.
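The sort-then-acquire discipline Zach describes can be sketched in userspace with pthread mutexes. Ordering by lock address stands in for whatever stable key a DLM would sort on, and all names here are invented for the example:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Acquire two locks in a globally agreed order (here: by address).
 * If every path does this, the classic ABBA deadlock -- one thread
 * holds A and wants B while another holds B and wants A -- cannot
 * occur, because both threads really take them in the same order. */
static void lock_pair_sorted(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *t = a; a = b; b = t;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

static pthread_mutex_t file_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t map_lock  = PTHREAD_MUTEX_INITIALIZER;
static atomic_int done;

/* Two threads that request the locks in opposite orders -- the setup
 * that deadlocks with naive acquisition. */
static void *reader(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        lock_pair_sorted(&file_lock, &map_lock);
        unlock_pair(&file_lock, &map_lock);
    }
    atomic_fetch_add(&done, 1);
    return NULL;
}

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        lock_pair_sorted(&map_lock, &file_lock);
        unlock_pair(&map_lock, &file_lock);
    }
    atomic_fetch_add(&done, 1);
    return NULL;
}
```

The hard part in the filesystems, as the text notes, is not the sorting itself but knowing the full lock set up front, which the ->nopage path does not.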

I brought this up with some people at the kernel summit but no one,
including myself, considers it a high priority. It wouldn't be too hard
to construct a patch if people want to take a look.

- z
Pekka Enberg
2005-08-09 14:49:43 UTC
Post by Zach Brown
Post by Pekka J Enberg
Sorry if this is an obvious question but what prevents another thread
from doing mmap() before we do the second walk and messing up num_gh?
Nothing, I suspect. OCFS2 has a problem like this, too. It wants a way
for a file system to serialize mmap/munmap/mremap during file IO. Well,
more specifically, it wants to make sure that the locks it acquired at
the start of the IO really cover the buf regions that might fault during
the IO.. mapping activity during the IO can wreck that.
In addition, the vma walk will become an unmaintainable mess as soon as
someone introduces another mmap() capable fs that needs similar locking.

I am not an expert so could someone please explain why this cannot be
done with a_ops->prepare_write and friends?

Pekka
David Teigland
2005-08-10 05:59:45 UTC
On Mon, Aug 08, 2005 at 05:14:45PM +0300, Pekka J Enberg wrote:

if (!dumping)
down_read(&mm->mmap_sem);
Post by Pekka J Enberg
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
+ if (end <= vma->vm_start)
+ break;
+ if (vma->vm_file &&
+ vma->vm_file->f_dentry->d_inode->i_sb == sb) {
+ num_gh++;
+ }
+ }
+
+ ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder),
+ GFP_KERNEL);
+ if (!ghs) {
+ if (!dumping)
+ up_read(&mm->mmap_sem);
+ return -ENOMEM;
+ }
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
Sorry if this is an obvious question but what prevents another thread from
doing mmap() before we do the second walk and messing up num_gh?
mm->mmap_sem ?
Pekka J Enberg
2005-08-10 06:06:42 UTC
if (!dumping)
down_read(&mm->mmap_sem);
Post by Pekka J Enberg
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
+ if (end <= vma->vm_start)
+ break;
+ if (vma->vm_file &&
+ vma->vm_file->f_dentry->d_inode->i_sb == sb) {
+ num_gh++;
+ }
+ }
+
+ ghs = kmalloc((num_gh + 1) * sizeof(struct gfs2_holder),
+ GFP_KERNEL);
+ if (!ghs) {
+ if (!dumping)
+ up_read(&mm->mmap_sem);
+ return -ENOMEM;
+ }
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
Sorry if this is an obvious question but what prevents another thread from
doing mmap() before we do the second walk and messing up num_gh?
mm->mmap_sem ?
Aah, I read that !dumping expression the other way around. Sorry and thanks.

Pekka
Kyle Moffett
2005-08-03 04:07:38 UTC
because reiser got merged before jbd. Next question.
That is the wrong reason. We use our own journaling layer for the
reason that Vivaldi used his own melody.
I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant. Now, if
you got into details, and said jbd does X, Y and Z, and GFS does the
same X and Y, and does not do Z as well as jbd, that would be a more
serious comment. He might want to look at how reiser4 does wandering
logs instead of using jbd..... but I would never claim that for sure
some other author should be expected to use it..... and something like
changing one's journaling system is not something to do just before a
merge.....
I don't want to start another big reiser4 flamewar, but...

"I don't know anything about Reiser4, but expecting a filesystem author
to use a VFS layer he does not want to is a bit arrogant. Now, if you
got into details, and said the linux VFS does X, Y, and Z, and Reiser4
does..."

Do you see my point here? If every person who added new kernel code
just wrote their own thing without checking to see if it had already
been done before, then there would be a lot of poorly maintained code
in the kernel. If a journalling layer already exists, _new_ journaled
filesystems should either (A) use the layer as is, or (B) fix the layer
so it has sufficient functionality for them to use, and submit patches.
That way if somebody later says, "Ah, crap, there's a bug in the kernel
journalling layer", and fixes it, there are not eight other filesystems
with their own open-coded layers that need to be audited for similar
mistakes.

This is similar to why some kernel developers did not like the Reiser4
code, because it implemented some private layers that looked kinda like
stuff the VFS should be doing (Again, I don't want to get into that
argument again, I'm just bringing up the similarities to clarify _this_
particular point, as that one has been beaten to death enough already).
Now the question for GFS is still a valid one; there might be reasons
not to use it (which is fair enough) but if there's no real reason then
using jbd sounds a lot better given its maturity (and it is used by 2
filesystems in -mm already).
Personally, I am of the opinion that if GFS cannot use jbd, the developers
ought to clarify why it isn't usable, and possibly submit fixes to make
it useful, so that others can share the benefits.

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw knives
at people who weren't supposed to be in your machine room.
-- Anthony de Boer
Jan Engelhardt
2005-08-03 06:37:19 UTC
Post by Kyle Moffett
because reiser got merged before jbd. Next question.
That is the wrong reason. We use our own journaling layer for the
reason that Vivaldi used his own melody.
[...] He might want to look at how reiser4 does wandering
logs instead of using jbd..... but I would never claim that for sure
some other author should be expected to use it..... and something like
changing one's journaling system is not something to do just before a
merge.....
Do you see my point here? If every person who added new kernel code
just wrote their own thing without checking to see if it had already
been done before, then there would be a lot of poorly maintained code
in the kernel. If a journalling layer already exists, _new_ journaled
filesystems should either (A) use the layer as is, or (B) fix the layer
so it has sufficient functionality for them to use, and submit patches.
Maybe jbd 'sucks' for something 'cool' like reiser*, and modifying jbd to be
'eleet enough' for reiser* would overwhelm ext.

Lastly, there is the 'political' thing, when a <your-favorite-jbd-fs>-only
specific change to jbd is rejected by all the other jbd-using filesystems.
(Basically the situation that leads to software forks, in any area.)



Jan Engelhardt
--
Pekka Enberg
2005-08-03 06:44:06 UTC
Hi David,

Some more comments below.

Pekka
+/**
+ * inode_create - create a struct gfs2_inode
+ *
+ * Returns: errno
+ */
+
+static int inode_create(struct gfs2_glock *i_gl, struct gfs2_inum *inum,
+ struct gfs2_glock *io_gl, unsigned int io_state,
+ struct gfs2_inode **ipp)
+{
+ struct gfs2_sbd *sdp = i_gl->gl_sbd;
+ struct gfs2_inode *ip;
+ int error = 0;
+
+ RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL), ip);
Why do you want to do this? The callers can handle ENOMEM just fine.
+/**
+ * gfs2_random - Generate a random 32-bit number
+ *
+ * Generate a semi-crappy 32-bit pseudo-random number without using
+ * floating point.
+ *
+ * The PRNG is from "Numerical Recipes in C" (second edition), page 284.
+ *
+ * Returns: a 32-bit random number
+ */
+
+uint32_t gfs2_random(void)
+{
+ gfs2_random_number = 0x0019660D * gfs2_random_number + 0x3C6EF35F;
+ return gfs2_random_number;
+}
Please consider moving this into lib/random.c. This one already appears in
drivers/net/hamradio/dmascc.c.
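For reference, the quoted generator is the "quick and dirty" 32-bit linear congruential recurrence from Numerical Recipes: x = 1664525 * x + 1013904223 (mod 2^32). A self-contained userspace version (seed handling simplified; the kernel code seeds gfs2_random_number elsewhere):

```c
#include <stdint.h>

/* The quoted PRNG: a 32-bit LCG with the Numerical Recipes in C
 * (2nd ed., p. 284) constants.  Overflow wraps mod 2^32, which is
 * exactly the intended modulus, so unsigned arithmetic is all that
 * is needed -- no floating point, as the kerneldoc says. */
static uint32_t gfs2_random_number;

static uint32_t gfs2_random(void)
{
    gfs2_random_number = 0x0019660Du * gfs2_random_number + 0x3C6EF35Fu;
    return gfs2_random_number;
}
```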
+/**
+ * gfs2_hash - hash an array of data
+ *
+ * Take some data and convert it to a 32-bit hash.
+ *
+ * http://www.isthe.com/chongo/tech/comp/fnv/
+ *
+ * Returns: the hash
+ */
+
+uint32_t gfs2_hash(const void *data, unsigned int len)
+{
+ uint32_t h = 0x811C9DC5;
+ h = hash_more_internal(data, len, h);
+ return h;
+}
Is there a reason why you cannot use <linux/hash.h> or <linux/jhash.h>?
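The constant 0x811C9DC5 is the 32-bit FNV offset basis. A userspace sketch of the common FNV-1a variant follows; note this is an assumption for illustration, since the quoted code hides the per-byte step inside hash_more_internal() and may implement a slightly different FNV flavor:

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: start from the offset basis 0x811C9DC5 and, for each
 * byte, XOR it in and multiply by the FNV prime 16777619. */
static uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 0x811C9DC5u;

    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}
```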
+void gfs2_sort(void *base, unsigned int num_elem, unsigned int size,
+ int (*compar) (const void *, const void *))
+{
+ register char *pbase = (char *)base;
+ int i, j, k, h;
+ static int cols[16] = {1391376, 463792, 198768, 86961,
+ 33936, 13776, 4592, 1968,
+ 861, 336, 112, 48,
+ 21, 7, 3, 1};
+
+ for (k = 0; k < 16; k++) {
+ h = cols[k];
+ for (i = h; i < num_elem; i++) {
+ j = i;
+ while (j >= h &&
+ (*compar)((void *)(pbase + size * (j - h)),
+ (void *)(pbase + size * j)) > 0) {
+ SWAP(pbase + size * j,
+ pbase + size * (j - h),
+ size);
+ j = j - h;
+ }
+ }
+ }
+}
Please use sort() from lib/sort.c instead.
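In userspace terms the suggested change reduces the whole function to one library call. A sketch using qsort() (the kernel's sort() in lib/sort.c has a similar shape, with an extra optional swap callback; cmp_int is invented here for the test):

```c
#include <stdlib.h>

/* Drop-in replacement for the open-coded shellsort: delegate to the
 * library sort.  In the kernel this would be
 *     sort(base, num_elem, size, compar, NULL);
 * from lib/sort.c. */
static void gfs2_sort(void *base, unsigned int num_elem, unsigned int size,
                      int (*compar)(const void *, const void *))
{
    qsort(base, num_elem, size, compar);
}

/* Example comparator for ints, avoiding subtraction overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
```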
+/**
+ * gfs2_io_error_inode_i - Flag an inode I/O error and withdraw
Please drop empty kerneldoc tags. (Appears in various other places as well.)
+#define RETRY_MALLOC(do_this, until_this) \
+for (;;) { \
+ { do_this; } \
+ if (until_this) \
+ break; \
+ if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
+ printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
+ gfs2_malloc_warning = jiffies; \
+ } \
+ yield(); \
+}
Please drop this.
+int gfs2_acl_create(struct gfs2_inode *dip, struct gfs2_inode *ip)
+{
+ struct gfs2_sbd *sdp = dip->i_sbd;
+ struct posix_acl *acl = NULL;
+ struct gfs2_ea_request er;
+ mode_t mode = ip->i_di.di_mode;
+ int error;
+
+ if (!sdp->sd_args.ar_posix_acl)
+ return 0;
+ if (S_ISLNK(ip->i_di.di_mode))
+ return 0;
+
+ memset(&er, 0, sizeof(struct gfs2_ea_request));
+ er.er_type = GFS2_EATYPE_SYS;
+
+ error = acl_get(dip, ACL_DEFAULT, &acl, NULL,
+ &er.er_data, &er.er_data_len);
+ if (error)
+ return error;
+ if (!acl) {
+ mode &= ~current->fs->umask;
+ if (mode != ip->i_di.di_mode)
+ error = munge_mode(ip, mode);
+ return error;
+ }
+
+ {
+ struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
+ error = -ENOMEM;
+ if (!clone)
+ goto out;
+ gfs2_memory_add(clone);
+ gfs2_memory_rm(acl);
+ posix_acl_release(acl);
+ acl = clone;
+ }
Please make this a real function. It is duplicated below.
+ if (error > 0) {
+ er.er_name = GFS2_POSIX_ACL_ACCESS;
+ er.er_name_len = GFS2_POSIX_ACL_ACCESS_LEN;
+ posix_acl_to_xattr(acl, er.er_data, er.er_data_len);
+ er.er_mode = mode;
+ er.er_flags = GFS2_ERF_MODE;
+ error = gfs2_system_eaops.eo_set(ip, &er);
+ if (error)
+ goto out;
+ } else
+ munge_mode(ip, mode);
+
+ gfs2_memory_rm(acl);
+ posix_acl_release(acl);
+ kfree(er.er_data);
+
+ return error;
Whitespace damage.
+int gfs2_acl_chmod(struct gfs2_inode *ip, struct iattr *attr)
+{
+ struct posix_acl *acl = NULL;
+ struct gfs2_ea_location el;
+ char *data;
+ unsigned int len;
+ int error;
+
+ error = acl_get(ip, ACL_ACCESS, &acl, &el, &data, &len);
+ if (error)
+ return error;
+ if (!acl)
+ return gfs2_setattr_simple(ip, attr);
+
+ {
+ struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
+ error = -ENOMEM;
+ if (!clone)
+ goto out;
+ gfs2_memory_add(clone);
+ gfs2_memory_rm(acl);
+ posix_acl_release(acl);
+ acl = clone;
+ }
Duplicated above.
+static int ea_foreach(struct gfs2_inode *ip, ea_call_t ea_call, void *data)
+{
+ struct buffer_head *bh;
+ int error;
+
+ error = gfs2_meta_read(ip->i_gl, ip->i_di.di_eattr,
+ DIO_START | DIO_WAIT, &bh);
+ if (error)
+ return error;
+
+ if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
+ error = ea_foreach_i(ip, bh, ea_call, data);
goto out here so you can drop the else branch below.
+ else {
+ struct buffer_head *eabh;
+ uint64_t *eablk, *end;
+
+ if (gfs2_metatype_check(ip->i_sbd, bh, GFS2_METATYPE_IN)) {
+ error = -EIO;
+ goto out;
+ }
+
+ eablk = (uint64_t *)(bh->b_data +
+ sizeof(struct gfs2_meta_header));
+ end = eablk + ip->i_sbd->sd_inptrs;
+
+static int ea_find_i(struct gfs2_inode *ip, struct buffer_head *bh,
+ struct gfs2_ea_header *ea, struct gfs2_ea_header *prev,
+ void *private)
+{
+ struct ea_find *ef = (struct ea_find *)private;
+ struct gfs2_ea_request *er = ef->ef_er;
+
+ if (ea->ea_type == GFS2_EATYPE_UNUSED)
+ return 0;
+
+ if (ea->ea_type == er->er_type) {
+ if (ea->ea_name_len == er->er_name_len &&
+ !memcmp(GFS2_EA2NAME(ea), er->er_name, ea->ea_name_len)) {
+ struct gfs2_ea_location *el = ef->ef_el;
+ get_bh(bh);
+ el->el_bh = bh;
+ el->el_ea = ea;
+ el->el_prev = prev;
+ return 1;
+ }
+ }
+
+#if 0
+ else if ((ip->i_di.di_flags & GFS2_DIF_EA_PACKED) &&
+ er->er_type == GFS2_EATYPE_SYS)
+ return 1;
+#endif
Please drop commented out code.
+static int ea_list_i(struct gfs2_inode *ip, struct buffer_head *bh,
+ struct gfs2_ea_header *ea, struct gfs2_ea_header *prev,
+ void *private)
+{
+ struct ea_list *ei = (struct ea_list *)private;
Please drop redundant cast.
+static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
+ struct gfs2_ea_location *el)
+{
+ {
+ struct ea_set es;
+ int error;
+
+ memset(&es, 0, sizeof(struct ea_set));
+ es.es_er = er;
+ es.es_el = el;
+
+ error = ea_foreach(ip, ea_set_simple, &es);
+ if (error > 0)
+ return 0;
+ if (error)
+ return error;
+ }
+ {
+ unsigned int blks = 2;
+ if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
+ blks++;
+ if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
+ blks += DIV_RU(er->er_data_len,
+ ip->i_sbd->sd_jbsize);
+
+ return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
+ }
Please drop the extra braces.
David Teigland
2005-08-08 09:57:47 UTC
Permalink
Post by Pekka Enberg
+uint32_t gfs2_hash(const void *data, unsigned int len)
+{
+ uint32_t h = 0x811C9DC5;
+ h = hash_more_internal(data, len, h);
+ return h;
+}
Is there a reason why you cannot use <linux/hash.h> or <linux/jhash.h>?
See gfs2_hash_more() and comment; we hash discontiguous regions.
Post by Pekka Enberg
+#define RETRY_MALLOC(do_this, until_this) \
+for (;;) { \
+ { do_this; } \
+ if (until_this) \
+ break; \
+ if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
+ printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
+ gfs2_malloc_warning = jiffies; \
+ } \
+ yield(); \
+}
Please drop this.
Done in the spot that could deal with an error, but there are three other
places that still need it.
Post by Pekka Enberg
+static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
+ struct gfs2_ea_location *el)
+{
+ {
+ struct ea_set es;
+ int error;
+
+ memset(&es, 0, sizeof(struct ea_set));
+ es.es_er = er;
+ es.es_el = el;
+
+ error = ea_foreach(ip, ea_set_simple, &es);
+ if (error > 0)
+ return 0;
+ if (error)
+ return error;
+ }
+ {
+ unsigned int blks = 2;
+ if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
+ blks++;
+ if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
+ blks += DIV_RU(er->er_data_len,
+ ip->i_sbd->sd_jbsize);
+
+ return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
+ }
Please drop the extra braces.
Here and elsewhere we try to keep unused stuff off the stack. Are you
suggesting that we're being overly cautious, or do you just dislike the
way it looks?

Thanks,
Dave
Arjan van de Ven
2005-08-08 10:05:25 UTC
Permalink
Post by David Teigland
Post by Pekka Enberg
Please drop the extra braces.
Here and elsewhere we try to keep unused stuff off the stack. Are you
suggesting that we're being overly cautious, or do you just dislike the
way it looks?
nice theory. In practice gcc 3.x still adds up all the stack space
anyway and as long as gcc 3.x is a supported kernel compiler, you can't
depend on this. Also.. please favor readability. gcc is getting smarter
about stack use nowadays, and {}'s shouldn't be needed to help it, it
tracks liveness of variables already.
Jörn Engel
2005-08-08 10:20:22 UTC
Permalink
Post by Arjan van de Ven
Post by David Teigland
Post by Pekka Enberg
Please drop the extra braces.
Here and elsewhere we try to keep unused stuff off the stack. Are you
suggesting that we're being overly cautious, or do you just dislike the
way it looks?
nice theory. In practice gcc 3.x still adds up all the stack space
anyway and as long as gcc 3.x is a supported kernel compiler, you can't
depend on this. Also.. please favor readability. gcc is getting smarter
about stack use nowadays, and {}'s shouldn't be needed to help it, it
tracks liveness of variables already.
Plus, you don't have to guess about stack usage. Run "make
checkstack" or, better yet, run the objdump of fs/gfs/built-in.o
through the perl script.

Jörn
--
It's just what we asked for, but not what we want!
-- anonymous
David Teigland
2005-08-08 10:56:13 UTC
Permalink
gfs2-02.patch:+ RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep,
-> GFP_NOFAIL.
Already gone, inode_create() can return an error.

if (create) {
RETRY_MALLOC(page = grab_cache_page(aspace->i_mapping, index),
page);
} else {
page = find_lock_page(aspace->i_mapping, index);
if (!page)
return NULL;
}
I think you can set aspace->flags to GFP_NOFAIL
will try that
but why can't you return NULL here on failure like you do for
find_lock_page()?
because create is set
gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep,
GFP_KERNEL),
-> GFP_NOFAIL
It looks to me like NOFAIL does nothing for kmem_cache_alloc().
Am I seeing that wrong?
gfs2-10.patch:+ RETRY_MALLOC(new_gh = gfs2_holder_get(gl, state,
gfs2_holder_get uses kmalloc which can use GFP_NOFAIL.
Which means adding a new gfp_flags parameter... fine.

Dave
David Teigland
2005-08-08 11:39:10 UTC
Permalink
Post by David Teigland
but why can't you return NULL here on failure like you do for
find_lock_page()?
because create is set
Yes, but looking at (some of the) top-level callers, there's no real reason
why create must not fail. Am I missing something here?
I'll trace the callers back farther and see about dealing with errors.
Post by David Teigland
gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep,
It is passed to the page allocator just like with kmalloc() which uses
__cache_alloc() too.
Yes, I read it wrongly, looks like NOFAIL should work fine. I think we
can get rid of the RETRY macro entirely.
Thanks,
Dave
Pekka J Enberg
2005-08-08 10:57:55 UTC
Permalink
Post by David Teigland
but why can't you return NULL here on failure like you do for
find_lock_page()?
because create is set
Yes, but looking at (some of the) top-level callers, there's no real reason
why create must not fail. Am I missing something here?
Post by David Teigland
gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep,
GFP_KERNEL),
-> GFP_NOFAIL
It looks to me like NOFAIL does nothing for kmem_cache_alloc().
Am I seeing that wrong?
It is passed to the page allocator just like with kmalloc() which uses
__cache_alloc() too.

Pekka
Pekka J Enberg
2005-08-08 10:18:45 UTC
Permalink
Post by David Teigland
Post by Pekka Enberg
+#define RETRY_MALLOC(do_this, until_this) \
+for (;;) { \
+ { do_this; } \
+ if (until_this) \
+ break; \
+ if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
+ printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
+ gfs2_malloc_warning = jiffies; \
+ } \
+ yield(); \
+}
Please drop this.
Done in the spot that could deal with an error, but there are three other
places that still need it.
Which places are those? I only see these:

gfs2-02.patch:+ RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep,
GFP_KERNEL), ip);
gfs2-02.patch-+ gfs2_memory_add(ip);
gfs2-02.patch-+ memset(ip, 0, sizeof(struct gfs2_inode));
gfs2-02.patch-+
gfs2-02.patch-+ ip->i_num = *inum;
gfs2-02.patch-+

-> GFP_NOFAIL.

gfs2-02.patch:+ RETRY_MALLOC(page =
grab_cache_page(aspace->i_mapping, index),
gfs2-02.patch-+ page);
gfs2-02.patch-+ } else {
gfs2-02.patch-+ page = find_lock_page(aspace->i_mapping, index);
gfs2-02.patch-+ if (!page)
gfs2-02.patch-+ return NULL;

I think you can set aspace->flags to GFP_NOFAIL but why can't you return
NULL here on failure like you do for find_lock_page()?

gfs2-02.patch:+ RETRY_MALLOC(bd = kmem_cache_alloc(gfs2_bufdata_cachep,
GFP_KERNEL),
gfs2-02.patch-+ bd);
gfs2-02.patch-+ gfs2_memory_add(bd);
gfs2-02.patch-+ atomic_inc(&gl->gl_sbd->sd_bufdata_count);
gfs2-02.patch-+
gfs2-02.patch-+ memset(bd, 0, sizeof(struct gfs2_bufdata));

-> GFP_NOFAIL

gfs2-08.patch:+ RETRY_MALLOC(gm = kmalloc(sizeof(struct gfs2_memory),
GFP_KERNEL), gm);
gfs2-08.patch-+ gm->gm_data = data;
gfs2-08.patch-+ gm->gm_file = file;
gfs2-08.patch-+ gm->gm_line = line;
gfs2-08.patch-+
gfs2-08.patch-+ spin_lock(&memory_lock);

-> GFP_NOFAIL

gfs2-10.patch:+ RETRY_MALLOC(new_gh = gfs2_holder_get(gl, state,
gfs2-10.patch-+ LM_FLAG_TRY |
gfs2-10.patch-+
GL_NEVER_RECURSE),
gfs2-10.patch-+ new_gh);
gfs2-10.patch-+ set_bit(HIF_DEMOTE, &new_gh->gh_iflags);
gfs2-10.patch-+ set_bit(HIF_DEALLOC, &new_gh->gh_iflags);

gfs2_holder_get uses kmalloc which can use GFP_NOFAIL.

Pekka
Pekka J Enberg
2005-08-08 10:00:43 UTC
Permalink
Post by David Teigland
Post by Pekka Enberg
+static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
+ struct gfs2_ea_location *el)
+{
+ {
+ struct ea_set es;
+ int error;
+
+ memset(&es, 0, sizeof(struct ea_set));
+ es.es_er = er;
+ es.es_el = el;
+
+ error = ea_foreach(ip, ea_set_simple, &es);
+ if (error > 0)
+ return 0;
+ if (error)
+ return error;
+ }
+ {
+ unsigned int blks = 2;
+ if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
+ blks++;
+ if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
+ blks += DIV_RU(er->er_data_len,
+ ip->i_sbd->sd_jbsize);
+
+ return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
+ }
Please drop the extra braces.
Here and elsewhere we try to keep unused stuff off the stack. Are you
suggesting that we're being overly cautious, or do you just dislike the
way it looks?
The extra braces hurt readability. Please drop them or make them proper
functions instead.

And yes, I think you're hiding potential stack usage problems here. Small
unused stuff on the stack doesn't matter, and large allocations should
probably be kmalloc()'d anyway.

Pekka
Pekka J Enberg
2005-08-08 10:34:56 UTC
Permalink
Post by David Teigland
Post by Pekka Enberg
Is there a reason why you cannot use <linux/hash.h> or <linux/jhash.h>?
See gfs2_hash_more() and comment; we hash discontiguous regions.
jhash() takes an initial value. Isn't that sufficient?

Pekka
Pekka J Enberg
2005-08-09 14:55:41 UTC
Permalink
Hi David,

Here are some more comments.

Pekka

+/******************************************************************************
+*******************************************************************************
+**
+** Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
+** Copyright (C) 2004-2005 Red Hat, Inc. All rights reserved.
+**
+** This copyrighted material is made available to anyone wishing to use,
+** modify, copy, or redistribute it subject to the terms and conditions
+** of the GNU General Public License v.2.
+**
+*******************************************************************************
+******************************************************************************/
Do you really need this verbose banner?
+#define NO_CREATE 0
+#define CREATE 1
+
+#define NO_WAIT 0
+#define WAIT 1
+
+#define NO_FORCE 0
+#define FORCE 1
The code seems to interchangeably use FORCE and NO_FORCE together
with TRUE and FALSE. Perhaps they could be dropped?
+#define GLF_PLUG 0
+#define GLF_LOCK 1
+#define GLF_STICKY 2
+#define GLF_PREFETCH 3
+#define GLF_SYNC 4
+#define GLF_DIRTY 5
+#define GLF_SKIP_WAITERS2 6
+#define GLF_GREEDY 7
Would be nice if these were either enums or had a comment linking
them to the struct member they are used for.
+#define GIF_MIN_INIT 0
+#define GIF_QD_LOCKED 1
+#define GIF_PAGED 2
+#define GIF_SW_PAGED 3
Same here and in few other places too.
+#define LO_BEFORE_COMMIT(sdp) \
+do { \
+ int __lops_x; \
+ for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
+ if (gfs2_log_ops[__lops_x]->lo_before_commit) \
+ gfs2_log_ops[__lops_x]->lo_before_commit((sdp)); \
+} while (0)
+
+#define LO_AFTER_COMMIT(sdp, ai) \
+do { \
+ int __lops_x; \
+ for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
+ if (gfs2_log_ops[__lops_x]->lo_after_commit) \
+ gfs2_log_ops[__lops_x]->lo_after_commit((sdp), (ai)); \
+} while (0)
+
+#define LO_BEFORE_SCAN(jd, head, pass) \
+do \
+{ \
+ int __lops_x; \
+ for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
+ if (gfs2_log_ops[__lops_x]->lo_before_scan) \
+ gfs2_log_ops[__lops_x]->lo_before_scan((jd), (head), (pass)); \
+} \
+while (0)
static inline functions, please.
+static inline int LO_SCAN_ELEMENTS(struct gfs2_jdesc *jd, unsigned int start,
+ struct gfs2_log_descriptor *ld,
+ unsigned int pass)
Lower case name, please.
+{
+ unsigned int x;
+ int error;
+
+ for (x = 0; gfs2_log_ops[x]; x++)
+ if (gfs2_log_ops[x]->lo_scan_elements) {
+ error = gfs2_log_ops[x]->lo_scan_elements(jd, start,
+ ld, pass);
+ if (error)
+ return error;
+ }
+
+ return 0;
+}
+
+#define LO_AFTER_SCAN(jd, error, pass) \
+do \
+{ \
+ int __lops_x; \
+ for (__lops_x = 0; gfs2_log_ops[__lops_x]; __lops_x++) \
+ if (gfs2_log_ops[__lops_x]->lo_before_scan) \
+ gfs2_log_ops[__lops_x]->lo_after_scan((jd), (error), (pass)); \
+} \
+while (0)
static inline function, please.
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/spinlock.h>
+#include <asm/semaphore.h>
+#include <linux/completion.h>
+#include <linux/buffer_head.h>
+#include <asm/uaccess.h>
+#include <linux/pagemap.h>
+#include <linux/uio.h>
+#include <linux/blkdev.h>
+#include <linux/mm.h>
+#include <asm/uaccess.h>
+#include <linux/gfs2_ioctl.h>
Preferred order is to include linux/ first and asm/ after that.
+#define vma2state(vma) \
+((((vma)->vm_flags & (VM_MAYWRITE | VM_MAYSHARE)) == \
+ (VM_MAYWRITE | VM_MAYSHARE)) ? \
+ LM_ST_EXCLUSIVE : LM_ST_SHARED) \
static inline function, please. The above is completely unreadable.
+struct inode *gfs2_ip2v(struct gfs2_inode *ip, int create)
+{
+ struct inode *inode = NULL, *tmp;
+
+ gfs2_assert_warn(ip->i_sbd,
+ test_bit(GIF_MIN_INIT, &ip->i_flags));
+
+ spin_lock(&ip->i_spin);
+ if (ip->i_vnode)
+ inode = igrab(ip->i_vnode);
+ spin_unlock(&ip->i_spin);
Suggestion: make the above a separate function __gfs2_lookup_inode(),
use it here and where you pass NO_CREATE to get rid of the create
parameter.
+
+ if (inode || !create)
+ return inode;
+
+ tmp = new_inode(ip->i_sbd->sd_vfs);
+ if (!tmp)
+ return NULL;
[snip]
+ entries = gfs2_tune_get(sdp, gt_entries_per_readdir);
+ size = sizeof(struct filldir_bad) +
+ entries * (sizeof(struct filldir_bad_entry) + GFS2_FAST_NAME_SIZE);
+
+ fdb = kmalloc(size, GFP_KERNEL);
+ if (!fdb)
+ return -ENOMEM;
+ memset(fdb, 0, size);
kzalloc(), which is in 2.6.13-rc6-mm5 please. Appears in other places as
well.
+ if (error) {
+ printk("GFS2: fsid=%s: can't make FS RW: %d\n",
+ sdp->sd_fsname, error);
+ goto fail_proc;
+ }
+ }
+
+ gfs2_glock_dq_uninit(&mount_gh);
+
+ return 0;
+
+ gfs2_proc_fs_del(sdp);
+ init_threads(sdp, UNDO);
Please provide a release_threads instead and make it deal with partial
initialization. The above is very confusing.
+ parent,
+ strlen(system_utsname.nodename));
+ new = lookup_one_len(system_utsname.machine,
+ parent,
+ strlen(system_utsname.machine));
+ new = lookup_one_len(system_utsname.sysname,
+ parent,
+ strlen(system_utsname.sysname));
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%u", current->fsuid));
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%u", current->fsgid));
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%s_%s",
+ system_utsname.machine,
+ system_utsname.sysname));
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%u",
+ sdp->sd_jdesc->jd_jid));
Smells like policy in the kernel. Why can't this be done in
userspace?
+ parent,
+ strlen(system_utsname.nodename));
+ else if (gfs2_filecmp(&dentry->d_name, "{mach}", 6))
+ new = lookup_one_len(system_utsname.machine,
+ parent,
+ strlen(system_utsname.machine));
+ else if (gfs2_filecmp(&dentry->d_name, "{os}", 4))
+ new = lookup_one_len(system_utsname.sysname,
+ parent,
+ strlen(system_utsname.sysname));
+ else if (gfs2_filecmp(&dentry->d_name, "{uid}", 5))
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%u", current->fsuid));
+ else if (gfs2_filecmp(&dentry->d_name, "{gid}", 5))
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%u", current->fsgid));
+ else if (gfs2_filecmp(&dentry->d_name, "{sys}", 5))
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%s_%s",
+ system_utsname.machine,
+ system_utsname.sysname));
+ else if (gfs2_filecmp(&dentry->d_name, "{jid}", 5))
+ new = lookup_one_len(buf,
+ parent,
+ sprintf(buf, "%u",
+ sdp->sd_jdesc->jd_jid));
Ditto.
+int gfs2_statfs_slow(struct gfs2_sbd *sdp, struct gfs2_statfs_change *sc)
+{
+ struct gfs2_holder ri_gh;
+ struct gfs2_rgrpd *rgd_next;
+ struct gfs2_holder *gha, *gh;
+ unsigned int slots = 64;
+ unsigned int x;
+ int done;
+ int error = 0, err;
+
+ memset(sc, 0, sizeof(struct gfs2_statfs_change));
+ gha = kmalloc(slots * sizeof(struct gfs2_holder), GFP_KERNEL);
+ if (!gha)
+ return -ENOMEM;
+ memset(gha, 0, slots * sizeof(struct gfs2_holder));
kcalloc, please
+ line = kmalloc(256, GFP_KERNEL);
+ if (!line)
+ return -ENOMEM;
+
+ len = snprintf(line, 256, "GFS2: fsid=%s: quota %s for %s %u\r\n",
+ sdp->sd_fsname, type,
+ (test_bit(QDF_USER, &qd->qd_flags)) ? "user" : "group",
+ qd->qd_id);
Please use constant instead of magic number 256.
+struct lm_lockops gdlm_ops = {
+ lm_proto_name:"lock_dlm",
+ lm_mount:gdlm_mount,
+ lm_others_may_mount:gdlm_others_may_mount,
+ lm_unmount:gdlm_unmount,
+ lm_withdraw:gdlm_withdraw,
+ lm_get_lock:gdlm_get_lock,
+ lm_put_lock:gdlm_put_lock,
+ lm_lock:gdlm_lock,
+ lm_unlock:gdlm_unlock,
+ lm_plock:gdlm_plock,
+ lm_punlock:gdlm_punlock,
+ lm_plock_get:gdlm_plock_get,
+ lm_cancel:gdlm_cancel,
+ lm_hold_lvb:gdlm_hold_lvb,
+ lm_unhold_lvb:gdlm_unhold_lvb,
+ lm_sync_lvb:gdlm_sync_lvb,
+ lm_recovery_done:gdlm_recovery_done,
+ lm_owner:THIS_MODULE,
+};
C99 initializers, please.
Pekka J Enberg
2005-08-10 07:40:37 UTC
Permalink
Hi David,
+ return -EINVAL;
+ if (!access_ok(VERIFY_WRITE, buf, size))
+ return -EFAULT;
+
+ if (!(file->f_flags & O_LARGEFILE)) {
+ if (*offset >= 0x7FFFFFFFull)
+ return -EFBIG;
+ if (*offset + size > 0x7FFFFFFFull)
+ size = 0x7FFFFFFFull - *offset;
Please use a constant instead for 0x7FFFFFFFull. (Appears in various other
places as well.)

Pekka
Christoph Hellwig
2005-08-10 07:43:38 UTC
Permalink
Post by Pekka Enberg
Hi David,
+ return -EINVAL;
+ if (!access_ok(VERIFY_WRITE, buf, size))
+ return -EFAULT;
+
+ if (!(file->f_flags & O_LARGEFILE)) {
+ if (*offset >= 0x7FFFFFFFull)
+ return -EFBIG;
+ if (*offset + size > 0x7FFFFFFFull)
+ size = 0x7FFFFFFFull - *offset;
Please use a constant instead for 0x7FFFFFFFull. (Appears in various other
places as well.)
In fact this very much looks like it's duplicating generic_write_checks().
Folks, please use common code.
Hans Reiser
2005-08-03 01:00:02 UTC
Permalink
* Why use your own journalling layer and not say ... jbd ?
Why does reiser use its own journalling layer and not say ... jbd ?
because reiser got merged before jbd. Next question.
That is the wrong reason. We use our own journaling layer for the
reason that Vivaldi used his own melody.

I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant. Now, if
you got into details, and said jbd does X, Y and Z, and GFS does the
same X and Y, and does not do Z as well as jbd, that would be a more
serious comment. He might want to look at how reiser4 does wandering
logs instead of using jbd..... but I would never claim that for sure
some other author should be expected to use it..... and something like
changing one's journaling system is not something to do just before a
merge.....
Now the question for GFS is still a valid one; there might be reasons to
not use it (which is fair enough) but if there's no real reason then
using jbd sounds a lot better given its maturity (and it is used by 2
filesystems in -mm already).
Arjan van de Ven
2005-08-03 09:09:02 UTC
Permalink
I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant.
good that I didn't expect that then.
I think it's fair enough to ask people if they can use it. If the answer
is "No because it doesn't fit our model <here>" then that's fine. If the
answer is "eh yeah we could" then I think it's entirely reasonable to
expect people to use common code as opposed to adding new code.
Arjan van de Ven
2005-08-02 15:02:52 UTC
Permalink
* Why use your own journalling layer and not say ... jbd ?
Why does reiser use its own journalling layer and not say ... jbd ?
because reiser got merged before jbd. Next question.

Now the question for GFS is still a valid one; there might be reasons to
not use it (which is fair enough) but if there's no real reason then
using jbd sounds a lot better given its maturity (and it is used by 2
filesystems in -mm already).
Arjan van de Ven
2005-08-02 07:45:24 UTC
Permalink
Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel. The 14 patches total about 900K so I won't send
them to the list unless that's requested. Comments and suggestions are
welcome. Thanks
http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050801/broken-out/
* The on disk structures are defined in terms of uint32_t and friends,
which are NOT endian neutral. Why are they not le32/be32 and thus
endian-defined? Did you run bitwise-sparse on GFS yet ?

* None of your on disk structures are packed. Are you sure?

*
+#define gfs2_16_to_cpu be16_to_cpu
+#define gfs2_32_to_cpu be32_to_cpu
+#define gfs2_64_to_cpu be64_to_cpu

why this pointless abstracting?

* +static const uint32_t crc_32_tab[] = .....

why do you duplicate this? The kernel already has a perfectly good set of generic crc32 tables/functions.

* Why are you using bufferheads extensively in a new filesystem?

* + if (create)
+ down_write(&ip->i_rw_mutex);
+ else
+ down_read(&ip->i_rw_mutex);

why do you use a rwsem and not a regular semaphore? You are aware that rwsems are far more expensive than regular ones right?
How skewed is the read/write ratio?

* Why use your own journalling layer and not say ... jbd ?

* + while (!kthread_should_stop()) {
+ gfs2_scand_internal(sdp);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
+ }

you probably really want to check for signals if you do interruptible sleeps
(multiple places)

* why not use msleep() and friends instead of schedule_timeout(), you're not using the complex variants anyway

* +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800

ehhhh why?

* int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
+ unsigned int size)
+{
+ int error;
+
+ if (bh)
+ error = copy_to_user(*buf, bh->b_data + offset, size);
+ else
+ error = clear_user(*buf, size);

that looks to be missing a few kmaps.. what's the guarantee that b_data is actually, like, in lowmem?

* [PATCH 08/14] GFS: diaper device

The diaper device is a block device within gfs that gets transparently
inserted between the real device and the rest of the filesystem.

hmmmm why not use device mapper or something? Is this really needed? Should it live in drivers/block ? Doesn't
this wrapper just increase the risk for memory deadlocks?

* [PATCH 06/14] GFS: logging and recovery

quoting the Ren and Stimpy show is nice.. but did the Ren and Stimpy authors agree to license their stuff under the GPL?


* do_lock_wait

that almost screams for using wait_event and related APIs


*
+static inline void gfs2_log_lock(struct gfs2_sbd *sdp)
+{
+ spin_lock(&sdp->sd_log_lock);
+}
why the abstraction ?
Jan Engelhardt
2005-08-02 14:57:11 UTC
Permalink
* Why use your own journalling layer and not say ... jbd ?
Why does reiser use its own journalling layer and not say ... jbd ?



Jan Engelhardt
--
David Teigland
2005-08-05 07:14:15 UTC
Permalink
* +static const uint32_t crc_32_tab[] = .....
why do you duplicate this? The kernel already has a perfectly good set
of generic crc32 tables/functions.
The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted. Because of
this I think it's best for gfs to keep its own copy as part of its ondisk
format spec.
* Why are you using bufferheads extensively in a new filesystem?
bh's are used for metadata, the log, and journaled data which need to be
written at the block granularity, not page.
why do you use a rwsem and not a regular semaphore? You are aware that
rwsems are far more expensive than regular ones right? How skewed is
the read/write ratio?
Aware, yes, it's the only rwsem in gfs. Specific skew, no, we'll have to
measure that.
* +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800
ehhhh why?
I'm not sure, actually, apart from the comments:

do_div: /* For ia32 we need to pull some tricks to get past various versions
of the compiler which do not like us using do_div in the middle
of large functions. */

do_mod: /* Side effect free 64 bit mod operation */

fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
perhaps this is an old problem that's now fixed?
* int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
+ unsigned int size)
+{
+ int error;
+
+ if (bh)
+ error = copy_to_user(*buf, bh->b_data + offset, size);
+ else
+ error = clear_user(*buf, size);
that looks to be missing a few kmaps.. whats the guarantee that b_data
is actually, like in lowmem?
This is only used in the specific case of reading a journaled-data file.
That seems to effectively be the same as reading a buffer of fs metadata.
The diaper device is a block device within gfs that gets transparently
inserted between the real device and the rest of the filesystem.
hmmmm why not use device mapper or something? Is this really needed?
This is needed for the "withdraw" feature (described in the comment) which
is fairly important. We'll see if dm could be used instead.

Thanks,
Dave
Mike Christie
2005-08-05 07:27:08 UTC
Permalink
Post by David Teigland
Post by Arjan van de Ven
* Why are you using bufferheads extensively in a new filesystem?
bh's are used for metadata, the log, and journaled data which need to be
written at the block granularity, not page.
In a scsi tree
http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary
there is a function, bio_map_kern(), in fs.c that maps a buffer into a
bio. It does not have to be page granularity. Can something like that be
used in these places?
Mike Christie
2005-08-05 07:30:41 UTC
Permalink
Post by Mike Christie
Post by David Teigland
Post by Arjan van de Ven
* Why are you using bufferheads extensively in a new filesystem?
bh's are used for metadata, the log, and journaled data which need to be
written at the block granularity, not page.
In a scsi tree
http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary
oh yeah it is in -mm too.
Post by Mike Christie
there is a function, bio_map_kern(), in fs.c that maps a buffer into a
bio. It does not have to be page granularity. Can something like that be
used in these places?
Arjan van de Ven
2005-08-05 07:34:38 UTC
Permalink
Post by David Teigland
* +static const uint32_t crc_32_tab[] = .....
why do you duplicate this? The kernel already has a perfectly good set
of generic crc32 tables/functions.
The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted. Because of
this I think it's best for gfs to keep its own copy as part of its ondisk
format spec.
for userspace there's libcrc32 as well. If it's *the* bog standard crc32
I don't see a reason why your "spec" can't just reference that instead.
And esp in the kernel you should just use the in kernel one not your own
regardless; you can assume the in kernel one is optimized and it also
keeps size down.
David Teigland
2005-08-05 09:44:52 UTC
Permalink
Post by Arjan van de Ven
Post by David Teigland
* +static const uint32_t crc_32_tab[] = .....
why do you duplicate this? The kernel already has a perfectly good set of
generic crc32 tables/functions.
The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted. Because of
this I think it's best for gfs to keep its own copy as part of its ondisk
format spec.
for userspace there's libcrc32 as well. If it's *the* bog standard crc32
I don't see a reason why your "spec" can't just reference that instead.
And esp in the kernel you should just use the in kernel one not your own
regardless; you can assume the in kernel one is optimized and it also
keeps size down.
linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
This looks like a standard that's not going to change, as you've said, so
including crc32table.h and getting rid of our own table would work fine.

Do we go a step beyond this and use say the crc32() function from
linux/crc32.h? Is this _function_ as standard and unchanging as the table
of crcs? In my tests it doesn't produce the same results as our
gfs2_disk_hash() function, even with both using the same crc table. I
don't mind adopting a new function and just writing a user space
equivalent for the tools if it's a fixed standard.

Dave
David Teigland
2005-08-05 10:31:38 UTC
Permalink
Post by David Teigland
Do we go a step beyond this and use say the crc32() function from
linux/crc32.h? Is this _function_ as standard and unchanging as the table
of crcs? In my tests it doesn't produce the same results as our
gfs2_disk_hash() function, even with both using the same crc table. I
don't mind adopting a new function and just writing a user space
equivalent for the tools if it's a fixed standard.
The function is basically set in stone. Variants exist depending on
1. Initial value is 0
2. Initial value is 0xffffffff
a) Result is taken as-is
b) Result is XORed with 0xffffffff
Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
function or something similar?
You're right, initial value 0xffffffff and xor result with 0xffffffff
matches the results from our function. Great, we can get rid of
gfs2_disk_hash() and use crc32() directly.
Thanks,
Dave
Jörn Engel
2005-08-05 10:07:50 UTC
Permalink
Post by David Teigland
linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
This looks like a standard that's not going to change, as you've said, so
including crc32table.h and getting rid of our own table would work fine.
Do we go a step beyond this and use say the crc32() function from
linux/crc32.h? Is this _function_ as standard and unchanging as the table
of crcs? In my tests it doesn't produce the same results as our
gfs2_disk_hash() function, even with both using the same crc table. I
don't mind adopting a new function and just writing a user space
equivalent for the tools if it's a fixed standard.
The function is basically set in stone. Variants exists depending on
how it is called. I know of four variants, but there may be more:

1. Initial value is 0
2. Initial value is 0xffffffff
a) Result is taken as-is
b) Result is XORed with 0xffffffff

Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
function or something similar?

Jörn
--
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
Jan Engelhardt
2005-08-05 08:28:13 UTC
Permalink
Post by David Teigland
The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted. Because of
this I think it's best for gfs to keep its own copy as part of its ondisk
format spec.
Tune the spec to use kernel and libcrc32 tables and bump the version number of
the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out
eventually at some later date.



Jan Engelhardt
--
Arjan van de Ven
2005-08-05 08:34:32 UTC
Permalink
Post by Jan Engelhardt
Post by David Teigland
The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification. This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted. Because of
this I think it's best for gfs to keep its own copy as part of its ondisk
format spec.
Tune the spec to use kernel and libcrc32 tables and bump the version number of
the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out
eventually at some later date.
afaik the tables aren't actually different. So no need to bump the spec!
David Teigland
2005-08-08 06:26:36 UTC
Permalink
Post by David Teigland
Post by Arjan van de Ven
* +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800
ehhhh why?
do_div: /* For ia32 we need to pull some tricks to get past various versions
of the compiler which do not like us using do_div in the middle
of large functions. */
do_mod: /* Side effect free 64 bit mod operation */
fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
perhaps this is an old problem that's now fixed?
I've looked into getting rid of these:

- The existing do_div() works fine for me with 64 bit numerators, so I'll
get rid of the "fixed" version.

- The "fixed" do_mod() seems to be the only way to do 64 bit modulus.
It would be great if I was wrong about that...

Thanks,
Dave
David Teigland
2005-08-11 06:06:02 UTC
Permalink
Post by Arjan van de Ven
* + if (create)
+ down_write(&ip->i_rw_mutex);
+ else
+ down_read(&ip->i_rw_mutex);
why do you use a rwsem and not a regular semaphore? You are aware that
rwsems are far more expensive than regular ones right? How skewed is
the read/write ratio?
Rough tests show around 4/1, that high or low?
Arjan van de Ven
2005-08-11 06:55:49 UTC
Permalink
Post by David Teigland
Post by Arjan van de Ven
* + if (create)
+ down_write(&ip->i_rw_mutex);
+ else
+ down_read(&ip->i_rw_mutex);
why do you use a rwsem and not a regular semaphore? You are aware that
rwsems are far more expensive than regular ones right? How skewed is
the read/write ratio?
Rough tests show around 4/1, that high or low?
that's quite borderline; if it were my code I'd not use a rwsem for that
ratio. My own rule of thumb (based on not a lot other than gut feeling)
is a 10/1 ratio at minimum... but it's not so low that it screams for
removing it. However... replacing it might well make your code a lot
simpler, so it might still be worth simplifying.
Lars Marowsky-Bree
2005-08-10 10:30:41 UTC
Permalink
Kindly lose the "Context Dependent Pathname" crap.
Same for ocfs2.
Would a generic implementation of that higher up in the VFS be more
acceptable?

It's not like context-dependent symlinks are an arbitrary feature, but
rather very useful in practice.


Sincerely,
Lars Marowsky-Brée <***@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
 -- Charles Darwin
Lars Marowsky-Bree
2005-08-10 10:34:24 UTC
Permalink
Post by Lars Marowsky-Bree
Would a generic implementation of that higher up in the VFS be more
acceptable?
No. Use mount --bind
That's a working and less complex alternative for up to how many places
at once? That works for non-root users how...?


Sincerely,
Lars Marowsky-Brée <***@suse.de>
Lars Marowsky-Bree
2005-08-10 11:02:59 UTC
Permalink
It works now. Unlike context links, which steal totally valid symlink
targets for magic mushroom bullshit.
Right, that is a valid concern. Avoiding context dependent symlinks
entirely certainly is one possible path around this. But, let's just for
the sake of this discussion continue the other path for a bit, to
explore the options available for implementing CPS which don't result in
shivers running down the spine, because I believe CPS do have some
applications in which bind mounts are not entirely adequate
replacements.

(Unless, of course, you want a bind mount for each homedirectory which
might include architecture-specific subdirectories or for every
host-specific configuration file.)

What would a syntax look like which in your opinion does not remove
totally valid symlink targets for magic mushroom bullshit? Prefix with
// (which, according to POSIX, allows for implementation-defined
behaviour)? Something else, not allowed in a regular pathname?

If we can't find an acceptable way of implementing them, maybe it's time
to grab some magic mushrooms and come up with a new approach, then ;-)


Sincerely,
Lars Marowsky-Brée <***@suse.de>
Lars Marowsky-Bree
2005-08-10 11:09:17 UTC
Permalink
Post by Lars Marowsky-Bree
What would a syntax look like which in your opinion does not remove
totally valid symlink targets for magic mushroom bullshit? Prefix with
// (which, according to POSIX, allows for implementation-defined
behaviour)? Something else, not allowed in a regular pathname?
None, just don't do it. Use bind mounts; they're cheap and have sane,
defined semantics.
So for every directory hierarchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?

Sure. Very cheap and sane. I'm buying.


Sincerely,
Lars Marowsky-Brée <***@suse.de>
Christoph Hellwig
2005-08-10 11:11:10 UTC
Permalink
Post by Lars Marowsky-Bree
Post by Lars Marowsky-Bree
What would a syntax look like which in your opinion does not remove
totally valid symlink targets for magic mushroom bullshit? Prefix with
// (which, according to POSIX, allows for implementation-defined
behaviour)? Something else, not allowed in a regular pathname?
None, just don't do it. Use bind mounts; they're cheap and have sane,
defined semantics.
So for every directory hierarchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?
Do it in an initscript and let users simply not do it; they shouldn't
even know what kind of filesystem they are on.
AJ Lewis
2005-08-10 13:26:26 UTC
Permalink
Post by Christoph Hellwig
Post by Lars Marowsky-Bree
So for every directoy hiearchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?
Do it in an initscript and let users simply not do it; they shouldn't
even know what kind of filesystem they are on.
I'm just thinking of a 100-node cluster that has different mounts on different
nodes, and trying to update the bind mounts in a sane and efficient manner
without clobbering the various mount setups. Ouch.
--
AJ Lewis Voice: 612-638-0500
Red Hat E-Mail: ***@redhat.com
One Main Street SE, Suite 209
Minneapolis, MN 55414

Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...
Kyle Moffett
2005-08-10 15:43:02 UTC
Permalink
Post by AJ Lewis
Post by Christoph Hellwig
So for every directory hierarchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?
Do it in an initscript and let users simply not do it; they shouldn't
even know what kind of filesystem they are on.
I'm just thinking of a 100-node cluster that has different mounts on different
nodes, and trying to update the bind mounts in a sane and efficient manner
without clobbering the various mount setups. Ouch.
How about something like the following:
	cpslink()     => Create a Context Dependent Symlink
	readcpslink() => Return the Context Dependent path data
	readlink()    => Return the path of the Context Dependent Symlink as it
	                 would be evaluated in the current context, basically as
	                 a normal symlink.
	lstat()       => Return information on the Context Dependent Symlink in
	                 the same format as a regular symlink.
	unlink()      => Delete the Context Dependent Symlink.

You would need an extra userspace tool that understands cpslink/readcpslink
to create and get information on the links for now, but ls and ln could
eventually be updated, and until then they would provide sane behavior.
Perhaps this should be extended into a new API for some of the strange
things several filesystems want to do in the VFS:
	extlink()     => Create an extended filesystem link (with type specified)
	readextlink() => Return the path (and type) for the link

The filesystem could define how each type of link acts with respect to
other syscalls. OpenAFS could use extlink() instead of their symlink magic
for adjusting the AFS volume hierarchy. The new in-kernel AFS client could
use it in similar fashion (it has no method to adjust hierarchy, because
it's still read-only). GFS could use it for their Context Dependent
Symlinks. Since it would pass the type in as well, it would be possible to
use it for different kinds of links on the same filesystem.

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
-- Alan Kay
Joel Becker
2005-08-12 02:46:56 UTC
Permalink
Post by Christoph Hellwig
Post by Lars Marowsky-Bree
So for every directory hierarchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?
Do it in an initscript and let users simply not do it; they shouldn't
even know what kind of filesystem they are on.
Christoph,
Users know. They want to know. They want to install git on a
shared filesystem, and have it Just Work no matter what architecture
they're on ({arch} context symlink).
I've yet to see a sane way to replace symlinks with bind mounts
for anything but the most trivial of usages.

1) You can't make them as non-root
2) It's not stored in the filesystem, so permanence is separate. Other
nodes and namespaces don't see them automatically if you want it.
These both violate KISS and PoLS.
3) You pollute the output of "mount", and when you have as many bind
mounts as you might have symlinks, that's a ton of output you didn't
want to see when you were wondering what disks are mounted.
4) When I'm looking at a file, ls -l doesn't tell me what I'm really
looking at. With symlinks it does. In some circumstances, that's a
good thing. For most symlink-like uses it is not. The two uses
(security and "symlink-like") are both valid approaches, and one
should not preclude the other.

Now, (3) can easily be fixed with an option. (4) can probably
be massaged the same way. But (1) and (2) can't be, and that needs
fixing before this is even viable to most real users.
CDSL, or whatever you call it, exists in most can-be-shared
filesystems for a reason. On AFS and DFS, it was @foo.
/.../thisdcecell/foo/@sys/bin/git-ls-tree would be
/.../thisdcecell/foo/linux-i386/bin/git-ls-tree on my machine. I'd just
put the @sys path in my PATH, and never worry whether I was on x86, ppc,
or s390. I don't know how GFS/GFS2 do theirs, but OCFS2 copied straight
from VMS clustering, where they used it as well. They seem to have set
the standard on the topic of clustering. It would be
/usr/{arch}/bin/git-ls-tree -> /usr/i386/bin/git-ls-tree or whatever.
If you can't do this as a user, it's irrelevant to you. Major
installations, where the person installing the application never gets
root, would expect it to work easily and nicely. Bind mounts, as of
now, do not.

Joel
--
"Here's something to think about: How come you never see a headline
like ``Psychic Wins Lottery''?"
- Jay Leno

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: ***@oracle.com
Phone: (650) 506-8127
Christoph Hellwig
2005-08-10 11:05:11 UTC
Permalink
Post by Lars Marowsky-Bree
What would a syntax look like which in your opinion does not remove
totally valid symlink targets for magic mushroom bullshit? Prefix with
// (which, according to POSIX, allows for implementation-defined
behaviour)? Something else, not allowed in a regular pathname?
None, just don't do it. Use bind mounts; they're cheap and have sane,
defined semantics.
Christoph Hellwig
2005-08-10 10:54:50 UTC
Permalink
Post by Lars Marowsky-Bree
Post by Lars Marowsky-Bree
Would a generic implementation of that higher up in the VFS be more
acceptable?
No. Use mount --bind
That's a working and less complex alternative for up to how many places
at once? That works for non-root users how...?
It works now. Unlike context links, which steal totally valid symlink
targets for magic mushroom bullshit.
Christoph Hellwig
2005-08-10 10:32:56 UTC
Permalink
Post by Lars Marowsky-Bree
Kindly lose the "Context Dependent Pathname" crap.
Same for ocfs2.
Would a generic implementation of that higher up in the VFS be more
acceptable?
No. Use mount --bind
Al Viro
2005-08-09 15:20:45 UTC
Permalink
Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel. The 14 patches total about 900K so I won't send
them to the list unless that's requested. Comments and suggestions are
welcome. Thanks
http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050801/broken-out/
Kindly lose the "Context Dependent Pathname" crap.
Christoph Hellwig
2005-08-10 07:03:09 UTC
Permalink
Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel. The 14 patches total about 900K so I won't send
them to the list unless that's requested. Comments and suggestions are
welcome. Thanks
http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050801/broken-out/
Kindly lose the "Context Dependent Pathname" crap.
Same for ocfs2.
David Teigland
2005-08-11 08:17:29 UTC
Permalink
Thanks for all the review and comments. This is a new set of patches that
incorporates the suggestions we've received.

http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050811/broken-out/

Dave
Michael
2005-08-11 08:21:04 UTC
Permalink
I have the same question as I asked before: how can I see GFS in "make
menuconfig" after I apply gfs2-full.patch to a 2.6.12.2 kernel?

Michael
Post by David Teigland
Thanks for all the review and comments. This is a new set of patches that
incorporates the suggestions we've received.
http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050811/broken-out/
Dave
--
Linux-cluster mailing list
http://www.redhat.com/mailman/listinfo/linux-cluster
David Teigland
2005-08-11 08:46:45 UTC
Permalink
Post by Michael
I have the same question as I asked before: how can I see GFS in "make
menuconfig" after I apply gfs2-full.patch to a 2.6.12.2 kernel?
You need to select the dlm under drivers. It's in -mm, or apply
http://redhat.com/~teigland/dlm.patch
Michael
2005-08-11 08:49:42 UTC
Permalink
Yes, after applying dlm.patch I saw it! Although I don't know what "-mm" is.

Thanks,

Michael
Post by David Teigland
Post by Michael
I have the same question as I asked before: how can I see GFS in "make
menuconfig" after I apply gfs2-full.patch to a 2.6.12.2 kernel?
You need to select the dlm under drivers. It's in -mm, or apply
http://redhat.com/~teigland/dlm.patch
Arjan van de Ven
2005-08-11 08:32:38 UTC
Permalink
Post by David Teigland
Thanks for all the review and comments. This is a new set of patches that
incorporates the suggestions we've received.
all of them or only a subset?
David Teigland
2005-08-11 08:50:06 UTC
Permalink
Post by Arjan van de Ven
Post by David Teigland
Thanks for all the review and comments. This is a new set of patches that
incorporates the suggestions we've received.
all of them or only a subset?
All patches, now 01-13 (what was patch 08 disappeared entirely)
Arjan van de Ven
2005-08-11 08:50:32 UTC
Permalink
Post by David Teigland
Post by Arjan van de Ven
Post by David Teigland
Thanks for all the review and comments. This is a new set of patches that
incorporates the suggestions we've received.
all of them or only a subset?
All patches, now 01-13 (what was patch 08 disappeared entirely)
with them I meant the suggestions not the patches ;)
David Teigland
2005-08-11 09:16:51 UTC
Permalink
Post by Arjan van de Ven
Post by Arjan van de Ven
Post by David Teigland
Thanks for all the review and comments. This is a new set of
patches that incorporates the suggestions we've received.
all of them or only a subset?
with them I meant the suggestions not the patches ;)
The large majority, and I think all that people care about. If we ignored
something that someone thinks is important, a reminder would be useful.
Pekka Enberg
2005-08-11 10:04:10 UTC
Permalink
Hi,
Post by David Teigland
The large majority, and I think all that people care about. If we ignored
something that someone thinks is important, a reminder would be useful.
The only remaining issue for me is the vma walk. Thanks, David!

Pekka
Michael
2005-08-11 09:54:33 UTC
Permalink
Hi, Dave,

I quickly applied the gfs2 and dlm patches to kernel 2.6.12.2; it compiled,
but with some warnings (see the attached log). Maybe they're helpful to
you.

Thanks,

Michael
Post by David Teigland
Thanks for all the review and comments. This is a new set of patches that
incorporates the suggestions we've received.
http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050811/broken-out/
Dave
--
Linux-cluster mailing list
http://www.redhat.com/mailman/listinfo/linux-cluster
Pekka Enberg
2005-08-11 10:00:35 UTC
Permalink
Post by Michael
Hi, Dave,
I quickly applied the gfs2 and dlm patches to kernel 2.6.12.2; it compiled,
but with some warnings (see the attached log). Maybe they're helpful to
you.
kzalloc is not in Linus' tree yet. Try with 2.6.13-rc5-mm1.

Pekka
Pekka J Enberg
2005-08-11 07:10:16 UTC
Permalink
Hi Mark,
Reading and writing from other filesystems to a GFS2 mmap'd file
does not walk the vmas. Therefore, data consistency guarantees
What I meant was that, if a filesystem requires vma walks, we need to do
it at the VFS level with something like the following patch. With this, your
filesystem would implement a_ops->iolock_acquire that sorts the locks
and takes them all. In case of GFS2, this would replace walk_vm().

Thoughts?

Pekka

[PATCH] vfs: iolock

This patch introduces iolock which can be used by filesystems that require
special locking when accessing an mmap'd region.

Unfinished and untested.

Signed-off-by: Pekka Enberg <***@cs.helsinki.fi>
---

fs/Makefile | 2 -
fs/iolock.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/read_write.c | 15 ++++++++
include/linux/fs.h | 2 +
include/linux/iolock.h | 11 ++++++
5 files changed, 117 insertions(+), 1 deletion(-)

Index: 2.6-mm/fs/iolock.c
===================================================================
--- /dev/null
+++ 2.6-mm/fs/iolock.c
@@ -0,0 +1,88 @@
+/*
+ * fs/iolock.c
+ *
+ * Derived from GFS2.
+ */
+
+#include <linux/iolock.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+/*
+ * I/O lock contains all files that participate in locking a memory region.
+ * It is used for filesystems that require special locks to access mmap'd
+ * memory.
+ */
+struct iolock {
+ struct address_space *mapping;
+ unsigned long nr_files;
+ struct file **files;
+};
+
+struct iolock *iolock_region(const char __user *buf, size_t size)
+{
+ int err = -ENOMEM;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ unsigned long start = (unsigned long)buf;
+ unsigned long end = start + size;
+ struct iolock *ret;
+
+ ret = kcalloc(1, sizeof(*ret), GFP_KERNEL);
+ if (!ret)
+ return ERR_PTR(-ENOMEM);
+
+ down_read(&mm->mmap_sem);
+
+ ret->files = kcalloc(mm->map_count, sizeof(struct file*), GFP_KERNEL);
+ if (!ret->files)
+ goto error;
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
+ struct file *file;
+ struct address_space *mapping;
+
+ if (end <= vma->vm_start)
+ break;
+
+ file = vma->vm_file;
+ if (!file)
+ continue;
+
+ mapping = file->f_mapping;
+ if (!mapping->a_ops->iolock_acquire ||
+ !mapping->a_ops->iolock_release)
+ continue;
+
+ /* FIXME: This only works when one address_space participates
+ in the iolock. */
+ ret->mapping = mapping;
+ ret->files[ret->nr_files++] = file;
+ }
+out:
+ up_read(&mm->mmap_sem);
+
+ if (IS_ERR(ret))
+ return ret;
+
+ if (ret->mapping && ret->mapping->a_ops->iolock_acquire) {
+ err = ret->mapping->a_ops->iolock_acquire(ret->files, ret->nr_files);
+ if (err) {
+ /* mmap_sem is already dropped here; don't reuse the
+ error label below, which loops back through "out"
+ and would call up_read() a second time. */
+ iolock_release(ret);
+ return ERR_PTR(err);
+ }
+ }
+
+ return ret;
+
+error:
+ iolock_release(ret);
+ ret = ERR_PTR(err);
+ goto out;
+}
+
+void iolock_release(struct iolock *iolock)
+{
+ struct address_space *mapping = iolock->mapping;
+ if (mapping && mapping->a_ops->iolock_release)
+ mapping->a_ops->iolock_release(iolock->files, iolock->nr_files);
+ kfree(iolock->files);
+ kfree(iolock);
+}
Index: 2.6-mm/fs/read_write.c
===================================================================
--- 2.6-mm.orig/fs/read_write.c
+++ 2.6-mm/fs/read_write.c
@@ -14,6 +14,7 @@
#include <linux/security.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/iolock.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -247,14 +248,21 @@ ssize_t vfs_read(struct file *file, char
if (!ret) {
ret = security_file_permission (file, MAY_READ);
if (!ret) {
+ struct iolock * iolock = iolock_region(buf, count);
+ if (IS_ERR(iolock)) {
+ ret = PTR_ERR(iolock);
+ goto out;
+ }
if (file->f_op->read)
ret = file->f_op->read(file, buf, count, pos);
else
ret = do_sync_read(file, buf, count, pos);
+ iolock_release(iolock);
if (ret > 0) {
fsnotify_access(file->f_dentry);
current->rchar += ret;
}
+ out:
current->syscr++;
}
}
@@ -298,14 +306,21 @@ ssize_t vfs_write(struct file *file, con
if (!ret) {
ret = security_file_permission (file, MAY_WRITE);
if (!ret) {
+ struct iolock * iolock = iolock_region(buf, count);
+ if (IS_ERR(iolock)) {
+ ret = PTR_ERR(iolock);
+ goto out;
+ }
if (file->f_op->write)
ret = file->f_op->write(file, buf, count, pos);
else
ret = do_sync_write(file, buf, count, pos);
+ iolock_release(iolock);
if (ret > 0) {
fsnotify_modify(file->f_dentry);
current->wchar += ret;
}
+ out:
current->syscw++;
}
}
Index: 2.6-mm/include/linux/iolock.h
===================================================================
--- /dev/null
+++ 2.6-mm/include/linux/iolock.h
@@ -0,0 +1,11 @@
+#ifndef __LINUX_IOLOCK_H
+#define __LINUX_IOLOCK_H
+
+#include <linux/kernel.h>
+
+struct iolock;
+
+struct iolock *iolock_region(const char __user *buf, size_t count);
+void iolock_release(struct iolock *lock);
+
+#endif
Index: 2.6-mm/fs/Makefile
===================================================================
--- 2.6-mm.orig/fs/Makefile
+++ 2.6-mm/fs/Makefile
@@ -10,7 +10,7 @@ obj-y := open.o read_write.o file_table.
ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
- ioprio.o
+ ioprio.o iolock.o

obj-$(CONFIG_INOTIFY) += inotify.o
obj-$(CONFIG_EPOLL) += eventpoll.o
Index: 2.6-mm/include/linux/fs.h
===================================================================
--- 2.6-mm.orig/include/linux/fs.h
+++ 2.6-mm/include/linux/fs.h
@@ -334,6 +334,8 @@ struct address_space_operations {
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
+ int (*iolock_acquire)(struct file **, unsigned long);
+ void (*iolock_release)(struct file **, unsigned long);
};

struct backing_dev_info;
Zach Brown
2005-08-11 16:33:41 UTC
Permalink
Post by Pekka J Enberg
What I meant was that, if a filesystem requires vma walks, we need to do
it at the VFS level with something like the following patch.
I don't think this patch is the way to go at all. It imposes an
allocation and vma walking overhead for the vast majority of IOs that
aren't interested. It doesn't look like it will get a consistent
ordering when multiple file systems are concerned. It doesn't record
the ranges of the mappings involved so Lustre can't properly use its
range locks. And finally, it doesn't prohibit mapping operations for
the duration of the IO -- the whole reason we ended up in this thread in
the first place :)

Christoph, would you be interested in looking at a more thorough patch
if I threw one together?

- z
Zach Brown
2005-08-11 16:39:43 UTC
Permalink
That doesn't matter. Please don't put in any effort for lustre special
cases - they are unwilling to cooperate and they'll get what they deserve.
Sure, we can add that extra functional layer in another pass. I thought
I'd still bring it up, though, as OCFS2 is slated to care at some point
in the not too distant future.
Sure, I'm not sure that'll happen in a timely fashion, though.
Roger.

- z
Christoph Hellwig
2005-08-11 16:35:41 UTC
Permalink
Post by Zach Brown
ordering when multiple file systems are concerned. It doesn't record
the ranges of the mappings involved so Lustre can't properly use its
range locks.
That doesn't matter. Please don't put in any effort for lustre special
cases - they are unwilling to cooperate and they'll get what they deserve.
Post by Zach Brown
And finally, it doesn't prohibit mapping operations for
the duration of the IO -- the whole reason we ended up in this thread in
the first place :)
Christoph, would you be interested in looking at a more thorough patch
if I threw one together?
Sure, I'm not sure that'll happen in a timely fashion, though.
Pekka Enberg
2005-08-11 16:44:50 UTC
Permalink
Post by Zach Brown
I don't think this patch is the way to go at all. It imposes an
allocation and vma walking overhead for the vast majority of IOs that
aren't interested. It doesn't look like it will get a consistent
ordering when multiple file systems are concerned. It doesn't record
the ranges of the mappings involved so Lustre can't properly use its
range locks. And finally, it doesn't prohibit mapping operations for
the duration of the IO -- the whole reason we ended up in this thread in
the first place :)
Hmm. So how do you propose we get rid of the mandatory vma walk? I was
thinking of making iolock a config option so when you don't have any
filesystems that need it, it can go away. I have also optimized the
extra allocation away when there are none mmap'd files that require
locking.

As for the rest of your comments, I heartily agree with them and
hopefully some interested party will take care of them :-).

Pekka

Index: 2.6-mm/fs/iolock.c
===================================================================
--- /dev/null
+++ 2.6-mm/fs/iolock.c
@@ -0,0 +1,183 @@
+/*
+ * I/O locking for memory regions. Used by filesystems that need special
+ * locking for mmap'd files.
+ */
+
+#include <linux/iolock.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+/*
+ * TODO:
+ *
+ * - Deadlock when two nodes acquire iolocks in reverse order for two
+ * different filesystems. Solution: use rbtree in iolock_chain so we
+ * can walk iolocks in order. XXX: what order is stable for two nodes
+ * that don't know about each other?
+ */
+
+/*
+ * I/O lock contains all files that participate in locking a memory region
+ * in an address_space.
+ */
+struct iolock {
+ struct address_space *mapping;
+ unsigned long nr_files;
+ struct file **files;
+ struct list_head chain;
+};
+
+struct iolock_chain {
+ struct list_head list;
+};
+
+static struct iolock *iolock_new(unsigned long max_files)
+{
+ struct iolock *ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+ if (!ret)
+ goto out;
+ ret->files = kcalloc(max_files, sizeof(struct file *), GFP_KERNEL);
+ if (!ret->files) {
+ kfree(ret);
+ ret = NULL;
+ goto out;
+ }
+ INIT_LIST_HEAD(&ret->chain);
+out:
+ return ret;
+}
+
+static struct iolock_chain *iolock_chain_new(void)
+{
+ struct iolock_chain * ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+ if (ret) {
+ INIT_LIST_HEAD(&ret->list);
+ }
+ return ret;
+}
+
+static int iolock_chain_acquire(struct iolock_chain *chain)
+{
+ struct iolock * iolock;
+ int err = 0;
+
+ list_for_each_entry(iolock, &chain->list, chain) {
+ if (iolock->mapping->a_ops->iolock_acquire) {
+ err = iolock->mapping->a_ops->iolock_acquire(
+ iolock->files, iolock->nr_files);
+ if (err)
+ goto error;
+ }
+ }
+error:
+ return err;
+}
+
+static struct iolock *iolock_lookup(struct iolock_chain *chain,
+ struct address_space *mapping)
+{
+ struct iolock *ret = NULL;
+ struct iolock *iolock;
+
+ list_for_each_entry(iolock, &chain->list, chain) {
+ if (iolock->mapping == mapping) {
+ ret = iolock;
+ break;
+ }
+ }
+ return ret;
+}
+
+/**
+ * iolock_region - Lock memory region for file I/O.
+ * @buf: the buffer we want to lock.
+ * @size: size of the buffer.
+ *
+ * Returns a pointer to the iolock_chain, NULL to denote an empty chain,
+ * or ERR_PTR() on failure.
+ */
+struct iolock_chain *iolock_region(const char __user *buf, size_t size)
+{
+ struct iolock_chain *ret = NULL;
+ int err = -ENOMEM;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ unsigned long start = (unsigned long)buf;
+ unsigned long end = start + size;
+ int max_files;
+
+ down_read(&mm->mmap_sem);
+ max_files = mm->map_count;
+
+ for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
+ struct file *file;
+ struct address_space *mapping;
+ struct iolock *iolock;
+
+ if (end <= vma->vm_start)
+ break;
+
+ file = vma->vm_file;
+ if (!file)
+ continue;
+
+ mapping = file->f_mapping;
+ if (!mapping->a_ops->iolock_acquire ||
+ !mapping->a_ops->iolock_release)
+ continue;
+
+ /* Allocate chain lazily to avoid initialization overhead
+ when we don't have any files that require iolock. */
+ if (!ret) {
+ ret = iolock_chain_new();
+ if (!ret)
+ goto error;
+ }
+
+ iolock = iolock_lookup(ret, mapping);
+ if (!iolock) {
+ iolock = iolock_new(max_files);
+ if (!iolock)
+ goto error;
+ iolock->mapping = mapping;
+ list_add(&iolock->chain, &ret->list);
+ }
+
+ iolock->files[iolock->nr_files++] = file;
+ }
+ if (ret) {
+ err = iolock_chain_acquire(ret);
+ if (err)
+ goto error;
+ }
+
+out:
+ up_read(&mm->mmap_sem);
+ return ret;
+
+error:
+ iolock_release(ret);
+ ret = ERR_PTR(err);
+ goto out;
+}
+
+/**
+ * iolock_release - Release file I/O locks for a memory region.
+ * @chain: The I/O lock chain to release. Passing NULL means no-op.
+ */
+void iolock_release(struct iolock_chain *chain)
+{
+ struct iolock *iolock, *next;
+
+ if (!chain)
+ return;
+
+ list_for_each_entry_safe(iolock, next, &chain->list, chain) {
+ struct address_space *mapping = iolock->mapping;
+ if (mapping && mapping->a_ops->iolock_release)
+ mapping->a_ops->iolock_release(iolock->files, iolock->nr_files);
+ kfree(iolock->files);
+ kfree(iolock);
+ }
+ kfree(chain);
+}
Index: 2.6-mm/fs/read_write.c
===================================================================
--- 2.6-mm.orig/fs/read_write.c
+++ 2.6-mm/fs/read_write.c
@@ -14,6 +14,7 @@
#include <linux/security.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/iolock.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -247,14 +248,21 @@ ssize_t vfs_read(struct file *file, char
if (!ret) {
ret = security_file_permission (file, MAY_READ);
if (!ret) {
+ struct iolock_chain * lock = iolock_region(buf, count);
+ if (IS_ERR(lock)) {
+ ret = PTR_ERR(lock);
+ goto out;
+ }
if (file->f_op->read)
ret = file->f_op->read(file, buf, count, pos);
else
ret = do_sync_read(file, buf, count, pos);
+ iolock_release(lock);
if (ret > 0) {
fsnotify_access(file->f_dentry);
current->rchar += ret;
}
+ out:
current->syscr++;
}
}
@@ -298,14 +306,21 @@ ssize_t vfs_write(struct file *file, con
if (!ret) {
ret = security_file_permission (file, MAY_WRITE);
if (!ret) {
+ struct iolock_chain * lock = iolock_region(buf, count);
+ if (IS_ERR(lock)) {
+ ret = PTR_ERR(lock);
+ goto out;
+ }
if (file->f_op->write)
ret = file->f_op->write(file, buf, count, pos);
else
ret = do_sync_write(file, buf, count, pos);
+ iolock_release(lock);
if (ret > 0) {
fsnotify_modify(file->f_dentry);
current->wchar += ret;
}
+ out:
current->syscw++;
}
}
Index: 2.6-mm/include/linux/iolock.h
===================================================================
--- /dev/null
+++ 2.6-mm/include/linux/iolock.h
@@ -0,0 +1,11 @@
+#ifndef __LINUX_IOLOCK_H
+#define __LINUX_IOLOCK_H
+
+#include <linux/kernel.h>
+
+struct iolock_chain;
+
+extern struct iolock_chain *iolock_region(const char __user *, size_t);
+extern void iolock_release(struct iolock_chain *);
+
+#endif
Index: 2.6-mm/fs/Makefile
===================================================================
--- 2.6-mm.orig/fs/Makefile
+++ 2.6-mm/fs/Makefile
@@ -10,7 +10,7 @@ obj-y := open.o read_write.o file_table.
ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
- ioprio.o
+ ioprio.o iolock.o

obj-$(CONFIG_INOTIFY) += inotify.o
obj-$(CONFIG_EPOLL) += eventpoll.o
Index: 2.6-mm/include/linux/fs.h
===================================================================
--- 2.6-mm.orig/include/linux/fs.h
+++ 2.6-mm/include/linux/fs.h
@@ -334,6 +334,8 @@ struct address_space_operations {
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
+ int (*iolock_acquire)(struct file **, unsigned long);
+ void (*iolock_release)(struct file **, unsigned long);
};

struct backing_dev_info;