Discussion:
[ANNOUNCE] Ext3 vs Reiserfs benchmarks
Dax Kelson
2002-07-12 16:21:05 UTC
Permalink
Tested:

ext3 data=ordered
ext3 data=writeback
reiserfs
reiserfs notail

http://www.gurulabs.com/ext3-reiserfs.html

Any suggestions or comments appreciated.

Dax Kelson
Guru Labs

Andreas Dilger
2002-07-12 17:05:32 UTC
Permalink
Post by Dax Kelson
ext3 data=ordered
ext3 data=writeback
reiserfs
reiserfs notail
http://www.gurulabs.com/ext3-reiserfs.html
Any suggestions or comments appreciated.
Did you try data=journal mode on ext3? For real-life sync-I/O
workloads like mail (i.e. not benchmarks where the system is 100% busy)
you can get considerable performance benefits from doing the sync I/O
directly to the journal instead of partly to the journal and partly to
the rest of the filesystem.

The reason "real life" matters here is that data=journal mode writes
all the files to disk twice - once to the journal and again to the
filesystem - so you must have some "slack" in your disk bandwidth in
order to benefit from the increased throughput on the part of the mail
transport.
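
(For reference, the journaling mode is just a mount option - e.g.
"mount -o data=journal ..." or the matching fstab entry. A minimal
sketch of doing the same through mount(2), with a made-up device and
mount point:)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* hypothetical spool device and mount point */
        if (mount("/dev/sda6", "/var/spool/mail", "ext3",
                  0, "data=journal") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}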

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

k***@zianet.com
2002-07-12 17:26:48 UTC
Permalink
I compared reiserfs with notails and with tails to
ext3 in journaled mode about a month ago.
Strangely enough the machine that was being
built is eventually slated for a mail machine. I used
postmark to simulate the mail environment.

Benchmarks are available here:
http://labs.zianet.com

Let me know if I am missing any info on there.

Steven
Post by Andreas Dilger
Post by Dax Kelson
ext3 data=ordered
ext3 data=writeback
reiserfs
reiserfs notail
http://www.gurulabs.com/ext3-reiserfs.html
Any suggestions or comments appreciated.
Did you try data=journal mode on ext3? For real-life sync-I/O
workloads like mail (i.e. not benchmarks where the system is 100% busy)
you can get considerable performance benefits from doing the sync I/O
directly to the journal instead of partly to the journal and partly to
the rest of the filesystem.
The reason "real life" matters here is that data=journal mode writes
all the files to disk twice - once to the journal and again to the
filesystem - so you must have some "slack" in your disk bandwidth in
order to benefit from the increased throughput on the part of the mail
transport.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
Andreas Dilger
2002-07-12 17:36:49 UTC
Permalink
Post by k***@zianet.com
I compared reiserfs with notails and with tails to
ext3 in journaled mode about a month ago.
Strangely enough the machine that was being
built is eventually slated for a mail machine. I used
postmark to simulate the mail environment.
http://labs.zianet.com
Let me know if I am missing any info on there.
Yes, I saw this benchmark when it was first posted. It isn't clear
from the web pages that you are using data=journal for ext3. Note
that this is only a benefit for sync I/O workloads like mail and
NFS, not for other types of usage. Also, for sync I/O workloads
you can get a big boost by using an external journal device.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

Chris Mason
2002-07-12 20:34:53 UTC
Permalink
Post by Dax Kelson
ext3 data=ordered
ext3 data=writeback
reiserfs
reiserfs notail
http://www.gurulabs.com/ext3-reiserfs.html
Any suggestions or comments appreciated.
postmark is an interesting workload, but it does not do fsync or renames
on the working set, and postfix does lots of both while delivering.
postmark does do a good job of showing the difference between lots of
files in one directory (great for reiserfs) and lots of directories with
fewer files in each (better for ext3).

Andreas Dilger already mentioned -o data=journal on ext3; you can also
try the beta reiserfs patches that add support for data=journal and
data=ordered at:

ftp.suse.com/pub/people/mason/patches/data-logging

They improve reiserfs performance for just about everything, but
data=journal is especially good for fsync/O_SYNC heavy workloads.

Andrew Morton sent me a benchmark of his that tries to simulate
postfix. He has posted it to l-k before but a quick google search found
dead links only, so I'm attaching it. What I like about his synctest is
the results are consistent and you can play with various
fsync/rename/unlink options.

-chris
Daniel Phillips
2002-07-13 04:44:22 UTC
Permalink
Post by Dax Kelson
Any suggestions or comments appreciated.
"it is clear that IF your server is stable and not prone to crashing, and/or
you have the write cache on your hard drives battery backed, you should
strongly consider using the writeback journaling mode of Ext3 versus ordered."

You probably want to suggest a UPS there rather than a battery-backed
disk cache, since the writeback caching is predominantly on the CPU side.
--
Daniel
Dax Kelson
2002-07-14 20:40:45 UTC
Permalink
Post by Dax Kelson
Any suggestions or comments appreciated.
Thanks for the feedback. Look for more testing from us soon addressing
the suggestions brought up.

Dax
Sam Vilain
2002-07-15 08:26:03 UTC
Permalink
Post by Dax Kelson
Post by Dax Kelson
Any suggestions or comments appreciated.
Thanks for the feedback. Look for more testing from us soon addressing
the suggestions brought up.
One more thing - can I just make the comment that testing freshly formatted filesystems is not going to show up ext2's real weaknesses, which appear on old filesystems - particularly those that have been allowed to fill up.

I timed *15 minutes* for a system I admin to unlink a single 1G file on a fairly old ext2 filesystem the other day (perhaps ext3 would have improved this, I'm not sure). It took 30 minutes to scan a snort log directory on ext2, but less than 2 minutes on reiser - and only 3 seconds once it was in the buffercache.

You are testing for a mail server - how many mailboxes are in your spool directory for the tests? Try it with about five to ten thousand mailboxes and see how your results vary.
--
Sam Vilain, ***@vilain.net WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

Although Mr Chavez 'was democratically elected,' one had to bear in
mind that 'Legitimacy is something that is conferred not just by a
majority of the voters'"
- The office of George "Dubya" Bush commenting on the Venezuelan
election
Alan Cox
2002-07-15 12:30:51 UTC
Permalink
Post by Sam Vilain
You are testing for a mail server - how many mailboxes are in your spool
directory for the tests? Try it with about five to ten thousand
mailboxes and see how your results vary.
If your mail server can't get hierarchical mail spools right, get one
that can.

Alan
Sam Vilain
2002-07-15 12:02:01 UTC
Permalink
Post by Alan Cox
Post by Sam Vilain
You are testing for a mail server - how many mailboxes are in your spool
directory for the tests? Try it with about five to ten thousand
mailboxes and see how your results vary.
If your mail server can't get hierarchical mail spools right, get one
that can.
Translation

"Yes, we know that there is no directory hashing in ext2/3. You'll have to find another solution to the problem, I'm afraid. Why not ease the burden on the filesystem by breaking up the task for it, and giving it to it in small pieces. That way it's much less likely to choke."

:-)

Sure, you could set up hierarchical mail spools. But it sure stinks of a temporary solution for a long-term problem. What about the next application that grows to massive proportions?

Hey, while I've got your attention, how do you go about debugging your kernel? I'm trying to add fair scheduling to the new O(1) scheduler, something of a token bucket filter counting jiffies used by a process/user/s_context (in scheduler_tick()) and tweaking their priority accordingly (in effective_prio()). It'd be really nice if I could run it under UML or something like that so I can trace through it with gdb, but I couldn't get the UML patch to apply to your tree. Any hints?
--
Sam Vilain, ***@vilain.net WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13
Alan Cox
2002-07-15 13:23:03 UTC
Permalink
Post by Sam Vilain
"Yes, we know that there is no directory hashing in ext2/3. You'll have
to find another solution to the problem, I'm afraid. Why not ease the
burden on the filesystem by breaking up the task for it, and giving it
to it in small pieces. That way it's much less likely to choke."
Actually there are several other reasons for it. It sucks a lot less
when you need to use ls and friends to inspect part of the spool. It
also makes it much easier to split the mail spool over multiple disks as
it grows without having to backup/restore the spool area
Post by Sam Vilain
Sure, you could set up hierarchical mail spools. But it sure stinks of a
temporary solution for a long-term problem. What about the next
application that grows to massive proportions?
JFS ?
Post by Sam Vilain
Hey, while I've got your attention, how do you go about debugging your
kernel? I'm trying to add fair scheduling to the new O(1) scheduler,
something of a token bucket filter counting jiffies used by a
process/user/s_context (in scheduler_tick()) and tweaking their
priority accordingly (in effective_prio()). It'd be really nice if I
could run it under UML or something like that so I can trace through
it with gdb, but I couldn't get the UML patch to apply to your tree.
Any hints?
The UML tree and my tree don't quite merge easily. Your best bet is to
grab the Red Hat Limbo beta packages for the kernel source, which if I
remember rightly are both -ac based and include the option to build UML

Alan
Chris Mason
2002-07-15 13:40:49 UTC
Permalink
Post by Alan Cox
Post by Sam Vilain
"Yes, we know that there is no directory hashing in ext2/3. You'll have
to find another solution to the problem, I'm afraid. Why not ease the
burden on the filesystem by breaking up the task for it, and giving it
to it in small pieces. That way it's much less likely to choke."
Actually there are several other reasons for it. It sucks a lot less
when you need to use ls and friends to inspect part of the spool. It
also makes it much easier to split the mail spool over multiple disks as
it grows without having to backup/restore the spool area
Another good reason is i_sem. If you've got more than one process doing
something to that directory, you spend lots of time waiting for the
semaphore. I think it was Andrew who reminded me that i_sem is held on
fsync, so fsync(dir) to make things safe after a rename can slow things
down.
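
(For anyone who hasn't seen the pattern, the fsync(dir)-after-rename
step looks roughly like this - a sketch only, with made-up paths and
most error handling dropped:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* make a completed rename durable; the fsync() on the directory is
 * the call that has to wait on the directory's i_sem */
int publish(const char *tmppath, const char *newpath, const char *dirpath)
{
        int dirfd;

        if (rename(tmppath, newpath) < 0)
                return -1;
        if ((dirfd = open(dirpath, O_RDONLY)) < 0)
                return -1;
        if (fsync(dirfd) < 0) {
                close(dirfd);
                return -1;
        }
        return close(dirfd);
}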

reiserfs only needs fsync(file), ext3 needs fsync(anything on the fs).
If ext3 would promise to make fsync(file) sufficient forever, it might
help the mta authors tune.

-chris
Andrew Morton
2002-07-15 19:40:51 UTC
Permalink
Post by Chris Mason
...
If ext3 would promise to make fsync(file) sufficient forever, it might
help the mta authors tune.
ext3 promises. This side-effect is bolted firmly into the design
of ext3 and it's hard to see any way in which it will go away.

Andrea Arcangeli
2002-07-15 15:12:07 UTC
Permalink
Post by Sam Vilain
Hey, while I've got your attention, how do you go about debugging your
kernel? I'm trying to add fair scheduling to the new O(1) scheduler,
something of a token bucket filter counting jiffies used by a
process/user/s_context (in scheduler_tick()) and tweaking their
priority accordingly (in effective_prio()). It'd be really nice if I
could run it under UML or something like that so I can trace through
it with gdb, but I couldn't get the UML patch to apply to your tree.
Any hints?
-aa ships with both UML and the O(1) scheduler. I need UML for everything
non-hardware related, so expect it to always be up to date there. However,
since I merged the O(1) scheduler there is the annoyance that sometimes
wakeup events don't arrive, at least until kupdate reschedules or something
like that (of course only with UML, not with real hardware). Also, pressing
keys is enough to unblock it. I haven't debugged it hard yet. According
to Jeff it's a problem with cli that masks signals.

Andrea
Andreas Dilger
2002-07-15 16:03:57 UTC
Permalink
Post by Sam Vilain
"Yes, we know that there is no directory hashing in ext2/3. You'll
have to find another solution to the problem, I'm afraid. Why not ease
the burden on the filesystem by breaking up the task for it, and giving
it to it in small pieces. That way it's much less likely to choke."
Amusingly, there IS directory hashing available for ext2 and ext3, and
it is just as fast as reiserfs hashed directories. See:

http://people.nl.linux.org/~phillips/htree/paper/htree.html

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
Daniel Phillips
2002-07-15 16:12:50 UTC
Permalink
Post by Andreas Dilger
Post by Sam Vilain
"Yes, we know that there is no directory hashing in ext2/3. You'll
have to find another solution to the problem, I'm afraid. Why not ease
the burden on the filesystem by breaking up the task for it, and giving
it to it in small pieces. That way it's much less likely to choke."
Amusingly, there IS directory hashing available for ext2 and ext3, and
http://people.nl.linux.org/~phillips/htree/paper/htree.html
Faster, last time I checked. I really must test against XFS and JFS at
some point.
--
Daniel
Sam Vilain
2002-07-15 17:48:05 UTC
Permalink
Post by Andreas Dilger
Amusingly, there IS directory hashing available for ext2 and ext3, and
http://people.nl.linux.org/~phillips/htree/paper/htree.html
You learn something new every day. So, with that in mind - what has reiserfs got that ext2 doesn't?

- tail merging, giving much more efficient space usage for lots of small
files.
- B*Tree allocation offering ``a 1/3rd reduction in internal
fragmentation in return for slightly more complicated insertions and
deletion algorithms'' (from the htree paper).
- online resizing in the main kernel (ext2 needs a patch -
http://ext2resize.sourceforge.net/).
- Resizing does not require the use of `ext2prepare' run on the
filesystem while unmounted to resize over arbitrary boundaries.
- directory hashing in the main kernel

On the flipside, ext2 over reiserfs:

- support for attributes without a patch or 2.4.19-pre4+ kernel
- support for filesystem quotas without a patch
- there is a `dump' command (but it's useless, because it hangs when you
run it on mounted filesystems - come on, who REALLY unmounts their
filesystems for a nightly dump? You need a 3 way mirror to do it
while guaranteeing filesystem availability...)

I'd be very interested in seeing postmark results without the hierarchical directory structure (which an unpatched postfix doesn't support), with about 5000 mailboxes with and without the htree patch (or with the htree patch but without that directory indexed, if that is possible).
--
Sam Vilain, ***@vilain.net WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

Try to be the best of what you are, even if what you are is no good.
ASHLEIGH BRILLIANT
Mathieu Chouquet-Stringer
2002-07-15 18:47:04 UTC
Permalink
Post by Sam Vilain
- there is a `dump' command (but it's useless, because it hangs when you
run it on mounted filesystems - come on, who REALLY unmounts their
filesystems for a nightly dump? You need a 3 way mirror to do it
while guaranteeing filesystem availability...)
According to everybody, dump is deprecated (and it shouldn't work reliably
with 2.4, in two words: "forget it")...
--
Mathieu Chouquet-Stringer E-Mail : ***@newview.com
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde
Sam Vilain
2002-07-15 19:26:20 UTC
Permalink
Post by Mathieu Chouquet-Stringer
Post by Sam Vilain
- there is a `dump' command (but it's useless, because it hangs when you
run it on mounted filesystems - come on, who REALLY unmounts their
filesystems for a nightly dump? You need a 3 way mirror to do it
while guaranteeing filesystem availability...)
According to everybody, dump is deprecated (and it shouldn't work reliably
with 2.4, in two words: "forget it")...
It's a shame, because `tar' doesn't save things like inode attributes and
places unnecessary load on the VFS layer. It also takes considerably
longer than dump did on one backup server I admin - like ~12 hours to back
up ~26G in ~414k inodes to a tape capable of about 1MB/sec. But that's
probably the old directory hashing thing again, there are some
reeeeaaallllllly large directories on that machine...

Ah, the joys of legacy.
--
Sam Vilain, ***@vilain.net WWW: http://sam.vilain.net/
7D74 2A09 B2D3 C30F F78E GPG: http://sam.vilain.net/sam.asc
278A A425 30A9 05B5 2F13

If you think the United States has stood still, who built the
largest shopping center in the world?
RICHARD M NIXON
Stelian Pop
2002-07-16 08:18:09 UTC
Permalink
Post by Mathieu Chouquet-Stringer
According to everybody, dump is deprecated (and it shouldn't work reliably
with 2.4, in two words: "forget it")...
This needs to be "according to Linus, dump is deprecated". Given the
interest Linus has manifested for backups, I wouldn't really rely on
his statement :-)

Stelian.
--
Stelian Pop <***@fr.alcove.com>
Alcove - http://www.alcove.com
Gerhard Mack
2002-07-16 12:22:53 UTC
Permalink
Post by Stelian Pop
Post by Mathieu Chouquet-Stringer
According to everybody, dump is deprecated (and it shouldn't work reliably
with 2.4, in two words: "forget it")...
This needs to be "according to Linus, dump is deprecated". Given the
interest Linus has manifested for backups, I wouldn't really rely on
his statement :-)
Either way dump is not likely to give you a reliable backup when used
with a 2.4.x kernel.

Gerhard


--
Gerhard Mack

***@innerfire.net

<>< As a computer I find your faith in technology amusing.
Stelian Pop
2002-07-16 12:49:56 UTC
Permalink
Post by Gerhard Mack
Post by Stelian Pop
This needs to be "according to Linus, dump is deprecated". Given the
interest Linus has manifested for backups, I wouldn't really rely on
his statement :-)
Either way dump is not likely to give you a reliable backup when used
with a 2.4.x kernel.
Since you are so well informed, maybe you could share your knowledge
with us.

I'm the dump maintainer, so I'll be very interested in knowing how it
is that dump works for me and many other users... :-)

Stelian.
--
Stelian Pop <***@fr.alcove.com>
Alcove - http://www.alcove.com
Gerhard Mack
2002-07-16 15:11:20 UTC
Permalink
Post by Gerhard Mack
Post by Stelian Pop
This needs to be "according to Linus, dump is deprecated". Given the
interest Linus has manifested for backups, I wouldn't really rely on
his statement :-)
Either way dump is not likely to give you a reliable backup when used
with a 2.4.x kernel.
Since you are so well informed, maybe you could share your knowledge
with us.
I'm the dump maintainer, so I'll be very interested in knowing how
comes that dump works for me and many other users... :-)
I'll save myself the trouble, since Linus said it better than I could:

Note that dump simply won't work reliably at all even in
2.4.x: the buffer cache and the page cache (where all the
actual data is) are not coherent. This is only going to
get even worse in 2.5.x, when the directories are moved
into the page cache as well.

So anybody who depends on "dump" getting backups right is
already playing Russian roulette with their backups. It's
not at all guaranteed to get the right results - you may
end up having stale data in the buffer cache that ends up
being "backed up".


In other words you have a backup system that works some of the time or
even most of the time... brilliant!

Gerhard

--
Gerhard Mack

***@innerfire.net

<>< As a computer I find your faith in technology amusing.
Andrea Arcangeli
2002-07-16 15:22:11 UTC
Permalink
Post by Gerhard Mack
Post by Gerhard Mack
Post by Stelian Pop
This needs to be "according to Linus, dump is deprecated". Given the
interest Linus has manifested for backups, I wouldn't really rely on
his statement :-)
Either way dump is not likely to give you a reliable backup when used
with a 2.4.x kernel.
Since you are so well informed, maybe you could share your knowledge
with us.
I'm the dump maintainer, so I'll be very interested in knowing how
comes that dump works for me and many other users... :-)
Note that dump simply won't work reliably at all even in
2.4.x: the buffer cache and the page cache (where all the
actual data is) are not coherent. This is only going to
get even worse in 2.5.x, when the directories are moved
into the page cache as well.
So anybody who depends on "dump" getting backups right is
already playing Russian roulette with their backups. It's
not at all guaranteed to get the right results - you may
end up having stale data in the buffer cache that ends up
being "backed up".
In other words you have a backup system that works some of the time or
even most of the time... brilliant!
Just to clarify: the above implicitly assumes the fs is mounted
read-write while you're dumping it. If the fs is mounted read-only or if
it's unmounted, there is no problem with dumping it. Also note that dump
has the same problem with a read-write mounted fs in 2.2, and I guess
in 2.0 too; it's nothing new in 2.4, it just gets more visible the more
logical dirty caches we have.

Andrea
Stelian Pop
2002-07-16 15:39:26 UTC
Permalink
Post by Gerhard Mack
In other words you have a backup system that works some of the time or
even most of the time... brilliant!
Dump is a backup system that works 100% of the time when used as
it was designed to: on unmounted filesystems (or mounted R/O).

It is indeed brilliant to have it work, even most of the time,
in conditions it wasn't designed for.

Stelian.
--
Stelian Pop <***@fr.alcove.com>
Alcove - http://www.alcove.com
Matthias Andree
2002-07-16 19:45:42 UTC
Permalink
Post by Stelian Pop
Post by Gerhard Mack
In other words you have a backup system that works some of the time or
even most of the time... brilliant!
Dump is a backup system that works 100% of the time when used as
it was designed to: on unmounted filesystems (or mounted R/O).
Practical question: how do I get a file system mounted R/O for backup
with dump without putting that system into single-user mode?
Particularly when running automated backups, this is an issue. I cannot
kill all writers (syslog, Postfix, INN, CVS server, ...) on my
production machines just for the sake of taking a backup.
Shawn
2002-07-16 20:04:22 UTC
Permalink
You don't.

This is where you have a filesystem where syslog, xinetd, blogd,
bloatd-config-d2, raffle-ticketd DO NOT LIVE.

People forget so easily the wonders of multiple partitions.
Post by Matthias Andree
Post by Stelian Pop
Post by Gerhard Mack
In other words you have a backup system that works some of the time or
even most of the time... brilliant!
Dump is a backup system that works 100% of the time when used as
it was designed to: on unmounted filesystems (or mounted R/O).
Practical question: how do I get a file system mounted R/O for backup
with dump without putting that system into single-user mode?
Particularly when running automated backups, this is an issue. I cannot
kill all writers (syslog, Postfix, INN, CVS server, ...) on my
production machines just for the sake of taking a backup.
--
Shawn Leas
***@enodev.com

So, do you live around here often?
-- Stephen Wright
Mathieu Chouquet-Stringer
2002-07-16 20:11:58 UTC
Permalink
Post by Shawn
You don't.
This is where you have a filesystem where syslog, xinetd, blogd,
bloatd-config-d2, raffle-ticketd DO NOT LIVE.
People forget so easily the wonders of multiple partitions.
I'm sorry, but I don't understand how it's going to change anything. For
sure, it makes your life easier because you don't have to shut down all
your programs that have files opened in R/W mode. But in the end, you
will have to shut down something to remount the partition in R/O mode,
and usually you don't want to or can't afford to do that.
--
Mathieu Chouquet-Stringer E-Mail : ***@newview.com
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde
Shawn
2002-07-16 20:22:31 UTC
Permalink
In this case, can you use a RAID mirror or something, then break it?

Also, there's the LVM snapshot at the block layer someone already
mentioned, which when used with smaller partitions is less overhead
(less FS delta).

This problem isn't that complex.
Post by Mathieu Chouquet-Stringer
Post by Shawn
You don't.
This is where you have a filesystem where syslog, xinetd, blogd,
bloatd-config-d2, raffle-ticketd DO NOT LIVE.
People forget so easily the wonders of multiple partitions.
I'm sorry, but I don't understand how it's going to change anything. For
sure, it makes your life easier because you don't have to shutdown all your
programs that have files opened in R/W mode. But in the end, you will have
to shutdown something to remount the partition in R/O mode and usually you
don't want or can't afford to do that.
--
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde
--
Shawn Leas
***@enodev.com

I bought my brother some gift-wrap for Christmas. I took it to the Gift
Wrap department and told them to wrap it, but in a different print so he
would know when to stop unwrapping.
-- Stephen Wright
Mathieu Chouquet-Stringer
2002-07-16 20:27:34 UTC
Permalink
Post by Shawn
In this case, can you use a RAID mirror or something, then break it?
Also, there's the LVM snapshot at the block layer someone already
mentioned, which when used with smaller partitions is less overhead
(less FS delta).
This problem isn't that complex.
I agree but I guess that if Matthias asked the question that way, he
probably meant he doesn't have a raid mirror or "something" (as you
say)... If you didn't plan your install (meaning you don't have the nice
raid or anything else), you're basically screwed...
--
Mathieu Chouquet-Stringer E-Mail : ***@newview.com
It is exactly because a man cannot do a thing that he is a
proper judge of it.
-- Oscar Wilde
Matthias Andree
2002-07-17 11:45:01 UTC
Permalink
Post by Shawn
In this case, can you use a RAID mirror or something, then break it?
Also, there's the LVM snapshot at the block layer someone already
mentioned, which when used with smaller partitions is less overhead
(less FS delta).
None of these "solutions" works out. I cannot remount my partition R/O,
and LVM low-level snapshots or breaking a RAID mirror simply won't help:
I would have to remount the partition r/o to get a consistent image in
the first place, so the first step already fails...
Andreas Dilger
2002-07-15 21:14:48 UTC
Permalink
Post by Sam Vilain
Post by Andreas Dilger
Amusingly, there IS directory hashing available for ext2 and ext3, and
http://people.nl.linux.org/~phillips/htree/paper/htree.html
You learn something new every day. So, with that in mind - what has
reiserfs got that ext2 doesn't?
- tail merging, giving much more efficient space usage for lots of small
files.
Well, there was a tail merging patch for ext2, but it has been dropped
for now. In reality, any benchmarks with reiserfs (except the
very-small-files case) will run with tail packing disabled because it
kills performance.
Post by Sam Vilain
- B*Tree allocation offering ``a 1/3rd reduction in internal
fragmentation in return for slightly more complicated insertions and
deletion algorithms'' (from the htree paper).
- online resizing in the main kernel (ext2 needs a patch -
http://ext2resize.sourceforge.net/).
Yes, I wrote it...
Post by Sam Vilain
- Resizing does not require the use of `ext2prepare' run on the
filesystem while unmounted to resize over arbitrary boundaries.
That is coming this summer. It will be part of some changes to support
"meta blockgroups", and the resizing comes for free at the same time.
Post by Sam Vilain
- directory hashing in the main kernel
Probably will happen in 2.5, as Andrew is already testing htree support
for ext3. It is also in the ext3 CVS tree for 2.4, so I wouldn't be
surprised if it shows up in 2.4 also.
Post by Sam Vilain
- support for attributes without a patch or 2.4.19-pre4+ kernel
- support for filesystem quotas without a patch
- there is a `dump' command (but it's useless, because it hangs when you
run it on mounted filesystems - come on, who REALLY unmounts their
filesystems for a nightly dump? You need a 3 way mirror to do it
while guaranteeing filesystem availability...)
Well, the dump can only be inconsistent for files that are being changed
during the dump itself. As for hanging the system, that would be a bug
regardless of whether it was dump or "dd" reading from the block device.
A bug related to this was fixed, probably in 2.4.19-preX somewhere.
Post by Sam Vilain
I'd be very interested in seeing postmark results without the
hierarchical directory structure (which an unpatched postfix doesn't
support), with about 5000 mailboxes with and without the htree patch
(or with the htree patch but without that directory indexed, if that
is possible).
Let me know what you find. It is possible to use an htree-patched
kernel and not have indexed directories - just don't mount with
"-o index". Note that there is a data-corrupting bug somewhere in
the ext3 htree code, so I wouldn't suggest using indexed directories
except for testing.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
Stelian Pop
2002-07-16 08:15:31 UTC
Permalink
[...]
Post by Sam Vilain
- there is a `dump' command (but it's useless, because it hangs when you
run it on mounted filesystems - come on, who REALLY unmounts their
filesystems for a nightly dump? You need a 3 way mirror to do it
while guaranteeing filesystem availability...)
dump(8) doesn't hang when dumping mounted filesystems. You are referring
to a genuine bug which was fixed some time ago.

However, on some rare occasions, dump can save corrupted data when
saving a mounted and generally highly active filesystem. Even then,
in 99% of the cases it doesn't really matter, because the corrupted
files will get saved by the next incremental dump.

Come on, who REALLY expects to have consistent backups without
either unmounting the filesystem or using some snapshot techniques ?

Stelian.
--
Stelian Pop <***@fr.alcove.com>
Alcove - http://www.alcove.com
Matthias Andree
2002-07-16 12:27:56 UTC
Permalink
Post by Stelian Pop
Come on, who REALLY expects to have consistent backups without
either unmounting the filesystem or using some snapshot techniques ?
Those who use [s|g]tar, cpio, afio, dsmc (Tivoli distributed storage
manager), ...

Low-level snapshots don't do any good, they just freeze the "halfway
there" on-disk structure.
Stelian Pop
2002-07-16 12:43:31 UTC
Permalink
Post by Matthias Andree
Post by Stelian Pop
Come on, who REALLY expects to have consistent backups without
either unmounting the filesystem or using some snapshot techniques ?
The who uses [s|g]tar, cpio, afio, dsmc (Tivoli distributed storage
manager), ...
Low-level snapshots don't do any good, they just freeze the "halfway
there" on-disk structure.
But [s|g]tar, cpio, afio (don't know about dsmc) also freeze the
"halfway there" data, but at the file level instead (application
instead of filesystem)...

Stelian.
--
Stelian Pop <***@fr.alcove.com>
Alcove - http://www.alcove.com
Matthias Andree
2002-07-16 12:53:01 UTC
Permalink
Post by Stelian Pop
Post by Matthias Andree
Low-level snapshots don't do any good, they just freeze the "halfway
there" on-disk structure.
But [s|g]tar, cpio, afio (don't know about dsmc) also freeze the
"halfway there" data, but at the file level instead (application
instead of filesystem)...
Not if some day somebody implements file system level snapshots for
Linux. Until then, better have garbled file contents constrained to a
file than random data as on-disk layout changes with hefty directory
updates.

dsmc regularly fstat()s the file it is currently reading and retries the
dump as the file changes, and gives up if it is updated too often. Not
sure about the server side, and it's certainly not a useful option for
sequential devices that you write to directly. Looks like a cache for
the biggest file is necessary.
Christoph Hellwig
2002-07-16 13:05:49 UTC
Permalink
Post by Matthias Andree
Not if some day somebody implements file system level snapshots for
Linux. Until then, better have garbled file contents constrained to a
file than random data as on-disk layout changes with hefty directory
updates.
or the blockdevice-level snapshots already implemented in Linux..
Matthias Andree
2002-07-16 19:38:36 UTC
Permalink
Post by Christoph Hellwig
Post by Matthias Andree
Not if some day somebody implements file system level snapshots for
Linux. Until then, better have garbled file contents constrained to a
file than random data as on-disk layout changes with hefty directory
updates.
or the blockdevice-level snapshots already implemented in Linux..
That would require three atomic steps:

1. mount read-only, flushing all pending updates
2. take snapshot
3. mount read-write

and then back up the snapshot. A snapshot of a live file system won't
do, it can be as inconsistent as it desires -- whether your corrupt
target is moving or not, dumping it is not of much use.
--
Matthias Andree
Andreas Dilger
2002-07-16 19:49:29 UTC
Permalink
Post by Matthias Andree
Post by Christoph Hellwig
Post by Matthias Andree
Not if some day somebody implements file system level snapshots for
Linux. Until then, better have garbled file contents constrained to a
file than random data as on-disk layout changes with hefty directory
updates.
or the blockdevice-level snapshots already implemented in Linux..
1. mount read-only, flushing all pending updates
2. take snapshot
3. mount read-write
and then backup the snapshot. A snapshots of a live file system won't
do, it can be as inconsistent as it desires -- if your corrupt target is
moving or not, dumping it is not of much use.
Luckily, there is already an interface which does this -
sync_supers_lockfs(), which the LVM code will use if it is patched in.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
Thunder from the hill
2002-07-16 20:11:20 UTC
Permalink
Hi,
Post by Matthias Andree
Post by Christoph Hellwig
or the blockdevice-level snapshots already implemented in Linux..
1. mount read-only, flushing all pending updates
2. take snapshot
3. mount read-write
and then backup the snapshot. A snapshots of a live file system won't
do, it can be as inconsistent as it desires -- if your corrupt target is
moving or not, dumping it is not of much use.
Well, couldn't we just kind of lock the file system so that while backing
up no writes get through to the real filesystem? This will possibly
require a lot of memory (or another place to write to), but it might be
done?

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Matthias Andree
2002-07-16 21:06:39 UTC
Permalink
Post by Thunder from the hill
Hi,
Post by Matthias Andree
Post by Christoph Hellwig
or the blockdevice-level snapshots already implemented in Linux..
1. mount read-only, flushing all pending updates
2. take snapshot
3. mount read-write
and then backup the snapshot. A snapshots of a live file system won't
do, it can be as inconsistent as it desires -- if your corrupt target is
moving or not, dumping it is not of much use.
Well, couldn't we just kindof lock the file system so that while backing
up no writes get through to the real filesystem? This will possibly
require a lot of memory (or another space to write to), but it might be
done?
But you would want to back up a consistent file system, so when entering
the freeze or snapshot mode, you must flush all pending data in such a
way that the snapshot is consistent (i.e. needs no fsck action
whatsoever).
Andreas Dilger
2002-07-16 21:23:22 UTC
Permalink
Post by Matthias Andree
Post by Thunder from the hill
Post by Matthias Andree
1. mount read-only, flushing all pending updates
2. take snapshot
3. mount read-write
and then backup the snapshot. A snapshots of a live file system won't
do, it can be as inconsistent as it desires -- if your corrupt target is
moving or not, dumping it is not of much use.
Well, couldn't we just kindof lock the file system so that while backing
up no writes get through to the real filesystem? This will possibly
require a lot of memory (or another space to write to), but it might be
done?
But you would want to backup a consistent file system, so when entering
the freeze or snapshot mode, you must flush all pending data in such a
way that the snapshot is consistent (i. e. needs not fsck action
whatsoever).
This is all done already for both LVM and EVMS snapshots. The filesystem
(ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is
frozen, the snapshot is created, and the filesystem becomes active again.
It takes a second or less. Then dump will guarantee 100% correct backups
of the snapshot filesystem. You would have to do a backup on the snapshot
to guarantee 100% correctness even with tar.

Most people don't care, because they don't even do backups in the first
place, until they have lost a lot of their data and they learn. Even
without snapshots, while dump isn't guaranteed to be 100% correct for
rapidly changing filesystems, I have been using it for years on both
2.2 and 2.4 without any problems on my home systems. I have even
restored data from those same backups...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
Thunder from the hill
2002-07-16 21:38:50 UTC
Permalink
Hi,
Post by Andreas Dilger
This is all done already for both LVM and EVMS snapshots. The filesystem
(ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is
frozen, the snapshot is created, and the filesystem becomes active again.
It takes a second or less.
Anyway, we could do that in parallel if we did it like this:

sync -> significant data is being written
lock -> data writes stay cached, but aren't written
snapshot
unlock -> data is getting written
now unmount the snapshot (clean it)
write the modified snapshot to disk...

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Matthias Andree
2002-07-17 11:47:09 UTC
Permalink
Post by Andreas Dilger
This is all done already for both LVM and EVMS snapshots. The filesystem
(ext3, reiserfs, XFS, JFS) flushes the outstanding operations and is
frozen, the snapshot is created, and the filesystem becomes active again.
It takes a second or less. Then dump will guarantee 100% correct backups
of the snapshot filesystem. You would have to do a backup on the snapshot
to guarantee 100% correctness even with tar.
Sure. On some machines, they will go with dsmc anyhow which reads the
file and rereads if it changes under dsmc's hands.
--
Matthias Andree
s***@lucent.com
2002-07-16 22:19:35 UTC
Permalink
It's really quite simple in theory to do proper backups. But you need
to have application support to make it work in most cases. It would
flow like this:

1. lock application(s), flush any outstanding transactions.
2. lock filesystems, flush any outstanding transactions.

3a. lock mirrored volume, flush any outstanding transactions, break
mirror.
--or--
3b. snapshot filesystem to another volume.

4. unlock volume

5. unlock filesystem

6. unlock application(s).

7. do backup against quiescent volume/filesystem.

In reality, people didn't lock filesystems (remount R/O) unless they
had to (ClearCase, Oracle, any DBMS, etc. are the exceptions), since
the time hit was too much. The chance of getting a bad backup of
user home directories or mail spools wasn't worth the extra cost of
making sure to get a clean backup. For the exceptions, that's why god
made backup windows and such. These days, those windows are minuscule,
so the seven steps outlined above are what needs to happen for a truly
reliable backup of important data.

John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
***@lucent.com - http://www.lucent.com - 978-399-0479
Thunder from the hill
2002-07-16 22:33:18 UTC
Permalink
Hi,

I do it like this:

-> Reconfigure port switch to use B server
-> Backup A server
-> Replay B server journals on A server
-> Switch to A server
-> Backup B server
-> Replay A server journals on B server
-> Reconfigure port switch to dynamic mode

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Matti Aarnio
2002-07-15 12:09:04 UTC
Permalink
Post by Alan Cox
Post by Sam Vilain
You are testing for a mail server - how many mailboxes are in your spool
directory for the tests? Try it with about five to ten thousand
mailboxes and see how your results vary.
If your mail server can't get heirarchical mail spools right, get one
that can.
Long ago (10-15 internet-years ago..) I followed testing of
FFS-family of filesystems in Squid cache.

We noticed on Solaris machines using UFS that when the directory
data size grew above the number of blocks directly addressable by
the direct-index pointers in the i-node, system speed plummeted.
(Or perhaps it was something a bit smaller, like 32 kB.)

Consider: 4 kB block size, 12 direct indexes: 48 kB directory size.

Spend 16 bytes for each file name + auxiliary data: 3000 files/subdirs

Optimal would be to store the files inside only the first block,
e.g. the directory shall not grow over 4k (or 1k, or ..)

Name subdirs as: 00 thru 7F (128+2, 12 bytes ?)
Possibly do that in 2 layers: 128^2 = 16384 subdirs, each
with 50 long named users (even more files?): 820 000 users.

Tune the subdir hashing function to suit your application, and
you should be happy.
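
A trivial hash along those lines, purely as a sketch (the constants
and the layout are only an example):

#include <stdio.h>

/* map a mailbox name onto a two-level 00..7f spool path,
 * e.g. <spool>/xx/yy/jsmith where xx and yy come from the hash */
static unsigned int name_hash(const char *name)
{
        unsigned int h = 0;

        while (*name)
                h = h * 31 + (unsigned char)*name++;
        return h;
}

int spool_path(char *buf, size_t len, const char *spool, const char *name)
{
        unsigned int h = name_hash(name);

        return snprintf(buf, len, "%s/%02x/%02x/%s",
                        spool, (h >> 7) & 0x7f, h & 0x7f, name);
}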


Putting all your eggs in one basket (files in one directory)
is not a smart thing.
Post by Alan Cox
Alan
/Matti Aarnio
Patrick J. LoPresti
2002-07-15 15:22:19 UTC
Permalink
Consider this argument:

Given: On ext3, fsync() of any file on a partition commits all
outstanding transactions on that partition to the log.

Given: data=ordered forces pending data writes for a file to happen
before related transactions are committed to the log.

Therefore: With data=ordered, fsync() of any file on a partition
syncs the outstanding writes of EVERY file on that
partition.

Is this argument correct? If so, it suggests that data=ordered is
actually the *worst* possible journalling mode for a mail spool.

One other thing. I think this statement is misleading:

IF your server is stable and not prone to crashing, and/or you
have the write cache on your hard drives battery backed, you
should strongly consider using the writeback journaling mode of
Ext3 versus ordered.

This makes it sound like data=writeback is somehow unsafe when
machines crash. I do not think this is true. If your application
(e.g., Postfix) is written correctly (which it is), so it calls
fsync() when it is supposed to, then data=writeback is *exactly* as
safe as any other journalling mode. "Battery backed caches" and the
like have nothing to do with it. And if your application is written
incorrectly, then other journalling modes will reduce but not
eliminate the chances for things to break catastrophically on a crash.

So if the partition is dedicated to correct applications, like a mail
spool is, then data=writeback is perfectly safe. If it is faster,
too, then it really is a no-brainer.

- Pat
Chris Mason
2002-07-15 17:31:49 UTC
Permalink
Post by Patrick J. LoPresti
Given: On ext3, fsync() of any file on a partition commits all
outstanding transactions on that partition to the log.
Given: data=ordered forces pending data writes for a file to happen
before related transactions are committed to the log.
Therefore: With data=ordered, fsync() of any file on a partition
syncs the outstanding writes of EVERY file on that
partition.
Is this argument correct? If so, it suggests that data=ordered is
actually the *worst* possible journalling mode for a mail spool.
Yes. In practice this doesn't hurt as much as it could, because ext3
does a good job of letting more writers come in before forcing the
commit. What hurts you is when a forced commit comes in the middle of
creating the file. A data write that could have been contiguous gets
broken into two or more writes instead.
Post by Patrick J. LoPresti
IF your server is stable and not prone to crashing, and/or you
have the write cache on your hard drives battery backed, you
should strongly consider using the writeback journaling mode of
Ext3 versus ordered.
This makes it sound like data=writeback is somehow unsafe when
machines crash. I do not think this is true. If your application
(e.g., Postfix) is written correctly (which it is), so it calls
fsync() when it is supposed to, then data=writeback is *exactly* as
safe as any other journalling mode.
Almost. data=writeback makes it possible for the old contents of a
block to end up in a newly grown file. There are a few ways this can
screw you up:

1) that newly grown file is someone's inbox, and the old contents of the
new block include someone else's private message.

2) That newly grown file is a control file for the application, and the
application expects it to contain valid data within (think sendmail).
Post by Patrick J. LoPresti
"Battery backed caches" and the
like have nothing to do with it.
Nope, battery backed caches don't make data=writeback more or less safe
(with respect to the data anyway). They do make data=ordered and
data=journal more safe.
Post by Patrick J. LoPresti
And if your application is written
incorrectly, then other journalling modes will reduce but not
eliminate the chances for things to break catastrophically on a crash.
So if the partition is dedicated to correct applications, like a mail
spool is, then data=writeback is perfectly safe. If it is faster,
too, then it really is a no-brainer.
For mail servers, data=journal is your friend. ext3 sometimes needs a
bigger log for it (reiserfs data=journal patches don't), but the
performance increase can be significant.

-chris
Matthias Andree
2002-07-15 18:33:35 UTC
Permalink
Post by Patrick J. LoPresti
IF your server is stable and not prone to crashing, and/or you
have the write cache on your hard drives battery backed, you
should strongly consider using the writeback journaling mode of
Ext3 versus ordered.
This makes it sound like data=writeback is somehow unsafe when
machines crash. I do not think this is true. If your application
Well, if your fsync() completes...
Post by Patrick J. LoPresti
(e.g., Postfix) is written correctly (which it is), so it calls
fsync() when it is supposed to, then data=writeback is *exactly* as
safe as any other journalling mode. "Battery backed caches" and the
like have nothing to do with it. And if your application is written
incorrectly, then other journalling modes will reduce but not
eliminate the chances for things to break catastrophically on a crash.
...then you're right. If the machine crashes amidst the fsync()
operation, but has scheduled metadata before file contents, then
journal recovery can present you with a file that contains bogus data,
which will confuse some applications. I believe Postfix will recover
from this condition either way, see that its file is hosed, and ignore
or discard it (depending on what it is), but software that blindly
relies on a special format without checking will barf.

All of this assumes two things:

1. the application actually calls fsync()

2. the application can detect if fsync() succeeded before the crash
(like fsync -> fchmod -> fsync, structured file contents, whatever).
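
(The fsync -> fchmod -> fsync trick is just: sync the data, then flip
a permission bit as a "this file is complete" marker, then sync again
so the marker itself is on disk. Only a sketch, with made-up modes:)

#include <sys/stat.h>
#include <unistd.h>

/* sketch: 0600 = still being written, 0644 = known complete */
int commit_file(int fd)
{
        if (fsync(fd) < 0)              /* data (and size) on disk */
                return -1;
        if (fchmod(fd, 0644) < 0)       /* set the completion marker */
                return -1;
        return fsync(fd);               /* push the marker out too */
}
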
Post by Patrick J. LoPresti
So if the partition is dedicated to correct applications, like a mail
spool is, then data=writeback is perfectly safe. If it is faster,
too, then it really is a no-brainer.
These ordering promises also apply to applications that do not call
fsync() or that cannot detect hosed files. Been there, seen that, with
CVS on unpatched ReiserFS as of Linux-2.4.19-presomething: suddenly one
,v file contained NUL blocks. The server barfed, the (remote!) client
segfaulted... yes, it's almost as bad as it can get.

Not catastrophic, tape backup available, but it took some time to
restore the file and investigate this issue nonetheless. It boiled down
to "nobody's fault, but missing feature". With data=ordered or
data=journal, I would have either had my old ,v file around or a proper
new one.

I'm now using Chris Mason's data-logging patches to try and see how
things work out. I had one crash with an old version, then updated to
the -11 version and have yet to see something break again.

I'd certainly appreciate it if these patches were merged early in
2.4.20-pre so they get some testing, can be in 2.4.20, and Linux has
two file systems with data=ordered to choose from.

Disclaimer: I don't know anything about XFS or JFS except their bare
existence. Feel free to add comments.
--
Matthias Andree
Patrick J. LoPresti
2002-07-15 19:13:24 UTC
Permalink
Post by Chris Mason
Post by Patrick J. LoPresti
IF your server is stable and not prone to crashing, and/or you
have the write cache on your hard drives battery backed, you
should strongly consider using the writeback journaling mode of
Ext3 versus ordered.
This makes it sound like data=writeback is somehow unsafe when
machines crash. I do not think this is true. If your application
(e.g., Postfix) is written correctly (which it is), so it calls
fsync() when it is supposed to, then data=writeback is *exactly* as
safe as any other journalling mode.
Almost. data=writeback makes it possible for the old contents of a
block to end up in a newly grown file.
Only if the application is already broken.
Post by Chris Mason
1) that newly grown file is someone's inbox, and the old contents of the
new block include someone else's private message.
2) That newly grown file is a control file for the application, and the
application expects it to contain valid data within (think sendmail).
In a correctly-written application, neither of these things can
happen. (See my earlier message today on fsync() and MTAs.) To get a
file onto disk reliably, the application must 1) flush the data, and
then 2) flush a "validity" indicator. This could be a sequence like:

create temp file
flush data to temp file
rename temp file
flush rename operation

In this sequence, the file's existence under a particular name is the
indicator of its validity.

If you skip either of these flush operations, you are not behaving
reliably. Skipping the first flush means the validity indicator might
hit the disk before the data; so after a crash, you might see invalid
data in an allegedly valid file. Skipping the second flush means you
do not know that the validity indicator has been set, so you cannot
report success to whoever is waiting for this "reliable write" to
happen.
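
In C the whole sequence comes out roughly like this (a sketch only:
the paths are made up and error reporting is abbreviated):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* write buf to "path" reliably, via a temp file in the same directory */
int reliable_write(const char *dir, const char *tmp, const char *path,
                   const char *buf, size_t len)
{
        int fd, dirfd;

        /* create temp file */
        if ((fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
                return -1;

        /* flush data to temp file */
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
                close(fd);
                return -1;
        }
        close(fd);

        /* rename temp file - its existence under "path" marks validity */
        if (rename(tmp, path) < 0)
                return -1;

        /* flush the rename operation by syncing the directory */
        if ((dirfd = open(dir, O_RDONLY)) < 0)
                return -1;
        if (fsync(dirfd) < 0) {
                close(dirfd);
                return -1;
        }
        return close(dirfd);
}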

It is possible to make an application which relies on data=ordered
semantics; for example, skipping the "flush data to temp file" step
above. But such an application would be broken for every version of
Unix *except* Linux in data=ordered mode. I would call that an
incorrect application.
Post by Chris Mason
Nope, battery backed caches don't make data=writeback more or less safe
(with respect to the data anyway). They do make data=ordered and
data=journal more safe.
A theorist would say that "more safe" is a sloppy concept. Either an
operation is safe or it is not. As I said in my last message,
data=ordered (and data=journal) can reduce the risk for poorly written
apps. But they cannot eliminate that risk, and for a correctly
written app, data=writeback is 100% as safe.

- Pat
Matthias Andree
2002-07-15 20:55:05 UTC
Permalink
Post by Patrick J. LoPresti
In a correctly-written application, neither of these things can
happen. (See my earlier message today on fsync() and MTAs.) To get a
file onto disk reliably, the application must 1) flush the data, and
create temp file
flush data to temp file
rename temp file
flush rename operation
In this sequence, the file's existence under a particular name is the
indicator of its validity.
Assume that most applications are broken then.

I assume that most will just call close() or fclose() and exit() right
away. Does fclose() imply fsync()?

Some applications will not even check the [f]close() return value...
Post by Patrick J. LoPresti
It is possible to make an application which relies on data=ordered
semantics; for example, skipping the "flush data to temp file" step
above. But such an application would be broken for every version of
Unix *except* Linux in data=ordered mode. I would call that an
incorrect application.
Or very specific, at least.
Post by Patrick J. LoPresti
Post by Chris Mason
Nope, battery backed caches don't make data=writeback more or less safe
(with respect to the data anyway). They do make data=ordered and
data=journal more safe.
A theorist would say that "more safe" is a sloppy concept. Either an
operation is safe or it is not. As I said in my last message,
data=ordered (and data=journal) can reduce the risk for poorly written
apps. But they cannot eliminate that risk, and for a correctly
written app, data=writeback is 100% as safe.
IF that application uses a marker to mark completion. If it does not,
data=ordered will be the safe bet, regardless of fsync() or not. The
machine can crash BEFORE the fsync() is called.
--
Matthias Andree
Patrick J. LoPresti
2002-07-15 21:23:53 UTC
Permalink
Post by Matthias Andree
I assume that most will just call close() or fclose() and exit() right
away. Does fclose() imply fsync()?
Not according to my close(2) man page:

A successful close does not guarantee that the data has
been successfully saved to disk, as the kernel defers
writes. It is not common for a filesystem to flush the
buffers when the stream is closed. If you need to be sure
that the data is physically stored use fsync(2). (It will
depend on the disk hardware at this point.)

Note that this means writing a truly reliable shell or Perl script is
tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what
do you do for a shell script? /bin/sync :-) ?
Post by Matthias Andree
Some applications will not even check the [f]close() return value...
Such applications are broken, of course.
Post by Matthias Andree
Post by Patrick J. LoPresti
It is possible to make an application which relies on data=ordered
semantics; for example, skipping the "flush data to temp file" step
above. But such an application would be broken for every version of
Unix *except* Linux in data=ordered mode. I would call that an
incorrect application.
Or very specific, at least.
Hm. Does BSD with soft updates guarantee anything about write
ordering on fsync()? In particular, does it promise to commit the
data before the metadata?
Post by Matthias Andree
Post by Patrick J. LoPresti
A theorist would say that "more safe" is a sloppy concept. Either an
operation is safe or it is not. As I said in my last message,
data=ordered (and data=journal) can reduce the risk for poorly written
apps. But they cannot eliminate that risk, and for a correctly
written app, data=writeback is 100% as safe.
IF that application uses a marker to mark completion. If it does not,
data=ordered will be the safe bet, regardless of fsync() or not. The
machine can crash BEFORE the fsync() is called.
Without marking completion, there is no safe bet. Without calling
fsync(), you *never* know when the data will hit the disk. It is very
hard to build a reliable system that way... For an MTA, for example,
you can never safely inform the remote mailer that you have accepted
the message. But this problem goes beyond MTAs; very few applications
live in a vacuum.

Reliable systems are tricky. I guess this is why Oracle and Sybase
make all that money.

- Pat
Thunder from the hill
2002-07-15 21:38:37 UTC
Permalink
Hi,
Post by Patrick J. LoPresti
Note that this means writing a truly reliable shell or Perl script is
tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what do
you do for a shell script? /bin/sync :-) ?
Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be
done with it.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Matthias Andree
2002-07-16 12:31:22 UTC
Permalink
Post by Thunder from the hill
Hi,
Post by Patrick J. LoPresti
Note that this means writing a truly reliable shell or Perl script is
tricky. I suppose you can "use POSIX qw(fsync);" in Perl. But what do
you do for a shell script? /bin/sync :-) ?
Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be
done with it.
Or steal one from FreeBSD (written by Paul Saab), fix the err() function
and be done with it.

.../usr.bin/fsync/fsync.{1,c}

Interesting side note -- mind the O_RDONLY:

for (i = 1; i < argc; ++i) {
if ((fd = open(argv[i], O_RDONLY)) < 0)
err(1, "open %s", argv[i]);

if (fsync(fd) != 0)
err(1, "fsync %s", argv[1]);
close(fd);
}
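
For anyone who wants to try this locally, here is a self-contained sketch
along the same lines (my own illustration, not the FreeBSD original): it
adds the missing includes, prints argv[i] rather than argv[1] in the
fsync error message, and also checks the close() return value, which
comes up next in this thread.

/* fsync-files.c -- flush the named files to disk (illustrative sketch) */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        int i, fd;

        for (i = 1; i < argc; ++i) {
                fd = open(argv[i], O_RDONLY);
                if (fd < 0)
                        err(1, "open %s", argv[i]);

                if (fsync(fd) != 0)
                        err(1, "fsync %s", argv[i]);

                /* Report close() errors too; see the discussion below
                 * on how much (or little) that can tell you. */
                if (close(fd) != 0)
                        err(1, "close %s", argv[i]);
        }
        return 0;
}

Compiled with something like "cc -o fsync fsync-files.c", a shell script
could run it on a spool file before reporting success.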
Thunder from the hill
2002-07-16 15:53:48 UTC
Permalink
Hi,
Post by Matthias Andree
Post by Thunder from the hill
Write a binary (/usr/bin/fsync) which opens a fd, fsync it, close it, be
done with it.
Or steal one from FreeBSD (written by Paul Saab), fix the err() function
and be done with it.
.../usr.bin/fsync/fsync.{1,c}
for (i = 1; i < argc; ++i) {
if ((fd = open(argv[i], O_RDONLY)) < 0)
err(1, "open %s", argv[i]);
if (fsync(fd) != 0)
err(1, "fsync %s", argv[1]);
close(fd);
}
Pretty much the thing I had in mind, except that the close return code is
disregarded here...

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Matthias Andree
2002-07-16 19:26:51 UTC
Permalink
Post by Thunder from the hill
Post by Matthias Andree
if (fsync(fd) != 0)
err(1, "fsync %s", argv[1]);
close(fd);
}
Pretty much the thing I had in mind, except that the close return code is
disregarded here...
Indeed, but OTOH, what error is close to report when the file is opened
read-only?
Thunder from the hill
2002-07-16 19:38:27 UTC
Permalink
Hi,
Post by Matthias Andree
Indeed, but OTOH, what error is close to report when the file is opened
read-only?
Well, you can still get EIO, EINTR, EBADF. Whatever you say, disregarding
the close return code is never any good.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Zack Weinberg
2002-07-16 23:22:25 UTC
Permalink
Post by Thunder from the hill
Post by Matthias Andree
Indeed, but OTOH, what error is close to report when the file is
opened read-only?
Well, you can still get EIO, EINTR, EBADF. Whatever you say,
disregarding the close return code is never any good.
Making use of the close return value is also never any good.

Consider: There is no guarantee that close will detect errors. Only
NFS and Coda implement f_op->flush methods. For files on all other
file systems, sys_close will always return success (assuming the file
descriptor was open in the first place); the data may still be sitting
in the page cache. If you need the data pushed to the physical disk,
you have to call fsync.

Consider: If you have called fsync, and it returned successfully, an
immediate call to close is guaranteed to return successfully. (Any
hypothetical f_op->flush method would have nothing to do; if not, that
filesystem does not correctly implement fsync.)

Therefore, I would argue that it is wrong for any application ever to
inspect close's return value. Either the program does not need data
integrity guarantees, or it should be using fsync and paying attention
to that instead.

There's also an ugly semantic bind if you make close detect errors.
If close returns an error other than EBADF, has that file descriptor
been closed? The standards do not specify. If it has not been
closed, you have a descriptor leak. But if it has been closed, it is
too late to recover from the error. [As far as I know, Unix
implementations generally do close the descriptor.]

The manpage that was quoted earlier in this thread is incorrect in
claiming that errors will be detected by close; it should be fixed.

zw
Alan Cox
2002-07-17 01:03:02 UTC
Permalink
Post by Zack Weinberg
Making use of the close return value is also never any good.
This is untrue
Post by Zack Weinberg
Consider: There is no guarantee that close will detect errors. Only
NFS and Coda implement f_op->flush methods. For files on all other
file systems, sys_close will always return success (assuming the file
descriptor was open in the first place); the data may still be sitting
in the page cache. If you need the data pushed to the physical disk,
you have to call fsync.
close() checking is not about physical disk guarantees. It's about the more
basic "I/O completed". In some future Linux, only close() might tell you
about some kinds of I/O error. The fact that it doesn't do it now is no
excuse for sloppy programming.
Post by Zack Weinberg
There's also an ugly semantic bind if you make close detect errors.
If close returns an error other than EBADF, has that file descriptor
been closed? The standards do not specify. If it has not been
closed, you have a descriptor leak. But if it has been closed, it is
too late to recover from the error. [As far as I know, Unix
implementations generally do close the descriptor.]
If it bothers you close it again 8)
Post by Zack Weinberg
The manpage that was quoted earlier in this thread is incorrect in
claiming that errors will be detected by close; it should be fixed.
The man page matches the standard. Implementation may be a subset of the
allowed standard right now, but don't program to implementation
assumptions; it leads to nasty accidents.
David S. Miller
2002-07-16 23:52:41 UTC
Permalink
From: Alan Cox <***@lxorguk.ukuu.org.uk>
Date: 17 Jul 2002 02:03:02 +0100

close() checking is not about physical disk guarantees. It's about more
basic "I/O completed". In some future Linux only close() might tell you
about some kinds of I/O error. The fact it doesn't do it now is no
excuse for sloppy programming

Practice dictates that if you make close() return error values
your whole system will blow up. Try it out for yourself.
I can tell you of at least 1 app that is going to explode :-)

I believe Linus mentioned way back when that this is a "shall not"
when we had similar problems with NFS returning errors from close().
Alan Cox
2002-07-17 01:35:41 UTC
Permalink
Post by David S. Miller
Date: 17 Jul 2002 02:03:02 +0100
close() checking is not about physical disk guarantees. It's about more
basic "I/O completed". In some future Linux only close() might tell you
about some kinds of I/O error. The fact it doesn't do it now is no
excuse for sloppy programming
Practice dictates that if you make close() return error values
your whole system will blow up. Try it out for yourself.
I can tell you of at least 1 app that is going to explode :-)
I believe Linus mentioned way back when that this is a "shall not"
when we had similar problems with NFS returning errors from close().
Our NFS can return errors from close(). So I'd get fixing the
applications.
David S. Miller
2002-07-17 00:20:26 UTC
Permalink
From: Alan Cox <***@lxorguk.ukuu.org.uk>
Date: 17 Jul 2002 02:35:41 +0100

Our NFS can return errors from close().

Better tell Linus.

So I'd get fixing the applications.

I wish you luck, it is quite a daunting task and nothing I would
sanely sign up for :-)
Linus Torvalds
2002-07-17 01:05:00 UTC
Permalink
Post by David S. Miller
Date: 17 Jul 2002 02:35:41 +0100
Our NFS can return errors from close().
Better tell Linus.
Oh, Linus knows. In fact, Linus wrote some of the code in question.

But the thing is, Linus doesn't want to have people have the same issues
with local filesystems. I _know_ there are broken applications that do
not test the error return from close(), and I think it is a politeness
issue to return error codes that you can know about as soon as humanly
possible.

For NFS, you simply cannot do any reasonable performance without doing
deferred error reporting. The same isn't true of other filesystems.
Even in the presence of delayed block allocation, a local filesystem can
_reserve_ the blocks early, and has no excuse for giving errors late
(except, of course, for actual IO errors).

Linus
David S. Miller
2002-07-17 01:05:21 UTC
Permalink
From: ***@transmeta.com (Linus Torvalds)
Date: Wed, 17 Jul 2002 01:05:00 +0000 (UTC)
Post by David S. Miller
Better tell Linus.
Oh, Linus knows. In fact, Linus wrote some of the code in question.

Ok, I think the issue here is different.

Several years ago we were returning -EAGAIN from close() via NFS and
that is what caused the problems.
Linus Torvalds
2002-07-17 01:23:18 UTC
Permalink
Post by Linus Torvalds
Oh, Linus knows. In fact, Linus wrote some of the code in question.
Ok, I think the issue here is different.
Several years ago we were returning -EAGAIN from close() via NFS and
that is what caused the problems.
Oh.

Yes, EAGAIN doesn't really work as a close return value, simply because
_nobody_ expects that (and leaving the file descriptor open after a
close() is definitely unexpected, ie people can very validly complain
about buggy behaviour).

Linus
Matthias Andree
2002-07-17 11:51:25 UTC
Permalink
Post by Linus Torvalds
Yes, EAGAIN doesn't really work as a close return value, simply because
_nobody_ expects that (and leaving the file descriptor open after a
close() is definitely unexpected, ie people can very validly complain
about buggy behaviour).
non-issue, since EAGAIN would violate the specs, which don't list EAGAIN
(and EAGAIN in response does not make sense either; the kernel should
then try harder to get the I/O completed).
--
Matthias Andree
Andries Brouwer
2002-07-17 17:23:51 UTC
Permalink
Post by Matthias Andree
non-issue, since EAGAIN would violates the specs that don't list EGAIN
"Implementations may support additional errors not included in this
list, may generate errors included in this list under circumstances
other than those described here, or may contain extensions or
limitations that prevent some errors from occurring. The ERRORS
section on each reference page specifies whether an error shall be
returned, or whether it may be returned. Implementations shall not
generate a different error number from the ones described here for
error conditions described in this volume of IEEE Std 1003.1-2001, but
may generate additional errors unless explicitly disallowed for a
particular function."


Not listing an error in the spec does not mean it cannot occur.
Especially EFAULT is not usually listed.

Andries
Zack Weinberg
2002-07-17 00:10:32 UTC
Permalink
Post by Alan Cox
Post by Zack Weinberg
Making use of the close return value is also never any good.
This is untrue
I beg to differ.
Post by Alan Cox
Post by Zack Weinberg
Consider: There is no guarantee that close will detect errors. Only
NFS and Coda implement f_op->flush methods. For files on all other
file systems, sys_close will always return success (assuming the file
descriptor was open in the first place); the data may still be sitting
in the page cache. If you need the data pushed to the physical disk,
you have to call fsync.
close() checking is not about physical disk guarantees. It's about more
basic "I/O completed". In some future Linux only close() might tell you
about some kinds of I/O error.
I think we're talking past each other.

My first point is that a portable application cannot rely on close to
detect any error. Only fsync guarantees to detect any errors at all
(except ENOSPC/EDQUOT, which should come back on write; yes, I know
about the buggy NFS implementations that report them only on close).

My second point, which you deleted, is that if some hypothetical close
implementation reports an error under some circumstances, an
immediately preceding fsync call MUST also report the same error under
the same circumstances.

Therefore, if you've checked the return value of fsync, there's no
point in checking the subsequent close; and if you don't care to call
fsync, the close return value is useless since it isn't guaranteed to
detect anything.
Post by Alan Cox
Post by Zack Weinberg
There's also an ugly semantic bind if you make close detect errors.
If close returns an error other than EBADF, has that file descriptor
been closed? The standards do not specify. If it has not been
closed, you have a descriptor leak. But if it has been closed, it is
too late to recover from the error. [As far as I know, Unix
implementations generally do close the descriptor.]
If it bothers you close it again 8)
And watch it come back with an error again, repeat ad infinitum?
Post by Alan Cox
Post by Zack Weinberg
The manpage that was quoted earlier in this thread is incorrect in
claiming that errors will be detected by close; it should be fixed.
The man page matches the standard. Implementation may be a subset of the
allowed standard right now, but don't program to implementation
assumptions, it leads to nasty accidents
You missed the point. The manpage asserts that I/O errors are
guaranteed to be detected by close; there is no such guarantee.

zw
Alan Cox
2002-07-17 01:45:40 UTC
Permalink
Post by Zack Weinberg
My first point is that a portable application cannot rely on close to
detect any error. Only fsync guarantees to detect any errors at all
(except ENOSPC/EDQUOT, which should come back on write; yes, I know
about the buggy NFS implementations that report them only on close).
They are not buggy, merely inconvenient. The reality of the NFS protocol
makes it the only viable way to do it.
Post by Zack Weinberg
My second point, which you deleted, is that if some hypothetical close
implementation reports an error under some circumstances, an
immediately preceding fsync call MUST also report the same error under
the same circumstances.
I can't think of a case where I'd disagree.
Post by Zack Weinberg
Therefore, if you've checked the return value of fsync, there's no
point in checking the subsequent close; and if you don't care to call
fsync, the close return value is useless since it isn't guaranteed to
detect anything.
If you don't check the return code it might not detect anything. If you
do check the return code it might detect something. In fact you
contradict yourself IMHO by giving the NFS example.
Post by Zack Weinberg
Post by Alan Cox
If it bothers you close it again 8)
And watch it come back with an error again, repeat ad infinitum?
The use of a little intelligence does help. Come on, I know you aren't a COBOL
programmer. Check for -EBADF ...
Post by Zack Weinberg
You missed the point. The manpage asserts that I/O errors are
guaranteed to be detected by close; there is no such guarantee.
Disagree. It says

It is quite possible that errors on a previous write(2) operation
are first reported at the final close

Not checking the return value when closing the file may lead to silent
loss of data.

A successful close does not guarantee that the data has
been successfully saved to disk, as the kernel defers
writes. It is not common for a filesystem to flush the
buffers when the stream is closed. If you need to be sure
that the data is physically stored use fsync(2). (It will
depend on the disk hardware at this point.)

None of which guarantee what you say, and which agree about the use of
fsync being appropriate now and then
Lars Marowsky-Bree
2002-07-17 08:00:37 UTC
Permalink
On 2002-07-16T17:10:32,
Post by Zack Weinberg
Therefore, if you've checked the return value of fsync, there's no
point in checking the subsequent close; and if you don't care to call
fsync, the close return value is useless since it isn't guaranteed to
detect anything.
There is _always_ a point in checking a return value of non-void functions.

EOD.


Sincerely,
Lars Marowsky-Brée <***@suse.de>

--
Immortality is an adequate definition of high availability for me.
--- Gregory F. Pfister
Thunder from the hill
2002-07-17 15:49:47 UTC
Permalink
Hi,
the close return value is useless since it isn't guaranteed to detect
anything.
"Isn't guaranteed to detect anything" is still a lot more encouraging to
see if it does detect anything than "Is guaranteed not to detect anything".

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Elladan
2002-07-17 02:22:52 UTC
Permalink
Post by Alan Cox
Post by Zack Weinberg
There's also an ugly semantic bind if you make close detect errors.
If close returns an error other than EBADF, has that file descriptor
been closed? The standards do not specify. If it has not been
closed, you have a descriptor leak. But if it has been closed, it is
too late to recover from the error. [As far as I know, Unix
implementations generally do close the descriptor.]
If it bothers you close it again 8)
Consider:

Two threads share the file descriptor table.

1. Thread 1 performs close() on a file descriptor. close fails.
2. Thread 2 performs open().
* 3. Thread 1 performs close() again, just to make sure.


open() may return any file descriptor not currently in use.

Is step 3 necessary? Is it dangerous? The question is, is close
guaranteed to work, or isn't it?


Case 1: Close is guaranteed to close the file.

Thread 2 may have just re-used the file descriptor. Thus, Thread 1
closes a different file in step 3. Thread 2 is now using a bad file
descriptor, and becomes very angry because the kernel just said all was
right with the world, and then claims there was a mistake. Thread 2
leaves in a huff.


Case 2: Close is guaranteed to leave the file open on error.

Thread 2 can't have just re-used the descriptor, so the world is ok in
that sense. However, Thread 1 *must* perform step 3, or it leaks a
descriptor, the tables fill, and the world becomes a frozen wasteland.


Case 3: Close may or may not leave it open due to random chance or
filesystem peculiarities.

Thread 1 may be required to close it twice, or it may be required not to
close it twice. It doesn't know! Night is falling! The world is in
flames! Aaaaaaugh!


I believe this demonstrates the need for a standard, one way, or the
other. :-)

-J
Thunder from the hill
2002-07-17 02:54:54 UTC
Permalink
Hi,
Post by Elladan
Two threads share the file descriptor table.
1. Thread 1 performs close() on a file descriptor. close fails.
2. Thread 2 performs open().
* 3. Thread 1 performs close() again, just to make sure.
Thread 2 shouldn't be able to reuse a currently open fd. This application
design is seriously broken.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Elladan
2002-07-17 03:00:43 UTC
Permalink
Post by Thunder from the hill
Hi,
Post by Elladan
Two threads share the file descriptor table.
1. Thread 1 performs close() on a file descriptor. close fails.
2. Thread 2 performs open().
* 3. Thread 1 performs close() again, just to make sure.
Thread 2 shouldn't be able to reuse a currently open fd. This application
design is seriously broken.
No.

Thread 2 doesn't manage the file descriptor table, the kernel does.
Whether the kernel may re-use the descriptor or not depends on whether
the descriptor is closed or not. The kernel knows, but unless close()
behaves in a defined way, the application does not at this point. Thus,
step 3 may either be required, forbidden, or undefined.

-J
Thunder from the hill
2002-07-17 03:10:49 UTC
Permalink
Hi,
Post by Thunder from the hill
Thread 2 shouldn't be able to reuse a currently open fd. This application
design is seriously broken.
Okay, again. It's about doing a second close() in case the first one fails
with EAGAIN. If we have to do it again, the filehandle is not closed, and
if the filehandle is not closed, the kernel knows that, and if the kernel
knows that the filehandle is still open, it won't get reassigned. Problem
gone.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
Elladan
2002-07-17 03:31:02 UTC
Permalink
Post by Thunder from the hill
Hi,
Post by Thunder from the hill
Thread 2 shouldn't be able to reuse a currently open fd. This application
design is seriously broken.
Okay, again. It's about doing a second close() in case the first one fails
with EAGAIN. If we have to do it again, the filehandle is not closed, and
if the filehandle is not closed, the kernel knows that, and if the kernel
knows that the filehandle is still open, it won't get reassigned. Problem
gone.
This is case 2, "Close is guaranteed to leave the file open on error."

In this case, all applications are required to reissue close commands
upon certain errors, or leak a file descriptor. This would be a well
defined behavior, though perhaps error prone.

However, note that this is manifestly different from case 1, "Close is
guaranteed to close the file the first time." If the system behaves via
case 1, closing the handle again is broken as the example illustrated.

The worst, of course, would be undefined behavior for close. In this
case, the application effectively can't do the right thing without
extreme measures.

-J
Kai Henningsen
2002-07-17 07:34:00 UTC
Permalink
Stevie O
2002-07-17 04:17:40 UTC
Permalink
Post by Elladan
1. Thread 1 performs close() on a file descriptor. close fails.
2. Thread 2 performs open().
* 3. Thread 1 performs close() again, just to make sure.
open() may return any file descriptor not currently in use.
I'm confused here... the only way close() can fail is if the file descriptor is invalid (EBADF); wouldn't it be rather stupid to close() a known-to-be-bad descriptor?


--
Stevie-O

Real programmers use COPY CON PROGRAM.EXE
Elladan
2002-07-17 04:38:53 UTC
Permalink
Post by Stevie O
Post by Elladan
1. Thread 1 performs close() on a file descriptor. close fails.
2. Thread 2 performs open().
* 3. Thread 1 performs close() again, just to make sure.
open() may return any file descriptor not currently in use.
I'm confused here... the only way close() can fail is if the file
descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
a known-to-be-bad descriptor?
Well, obviously, if that's the case. However, the man page for close(2)
doesn't agree (see below). close() is allowed to return EBADF, EINTR,
or EIO.

The question is, does the OS standard guarantee that the fd is closed,
even if close() returns EINTR or EIO? Just going by the normal usage of
EINTR, one might think otherwise. It doesn't appear to be documented
one way or another.

Alan said you could just issue close again to make sure - the example
shows that this is not the case. A second close is either required or
forbidden in that example - and the behavior has to be well defined or
you won't know which to do.

-J

NAME
close - close a file descriptor

SYNOPSIS
#include <unistd.h>

int close(int fd);

DESCRIPTION
close closes a file descriptor, so that it no longer refers
to any file and may be reused. Any locks held on the file it
was associated with, and owned by the process, are removed
(regardless of the file descriptor that was used to obtain the
lock).

If fd is the last copy of a particular file descriptor the
resources associated with it are freed; if the descriptor was the
last reference to a file which has been removed using unlink(2)
the file is deleted.

RETURN VALUE
close returns zero on success, or -1 if an error occurred.

ERRORS
EBADF fd isn't a valid open file descriptor.

EINTR The close() call was interrupted by a signal.

EIO An I/O error occurred.

CONFORMING TO
SVr4, SVID, POSIX, X/OPEN, BSD 4.3. SVr4 documents an
additional ENOLINK error condition.

NOTES
Not checking the return value of close is a common but
nevertheless serious programming error. File system
implementations which use techniques as `write-behind' to
increase performance may lead to write(2) succeeding, although
the data has not been written yet. The error status may be
reported at a later write operation, but it is guaranteed to be
reported on closing the file. Not checking the return value when
closing the file may lead to silent loss of data. This can
especially be observed with NFS and disk quotas.

A successful close does not guarantee that the data has
been successfully saved to disk, as the kernel defers
writes. It is not common for a filesystem to flush the
buffers when the stream is closed. If you need to be sure
that the data is physically stored use fsync(2) or
sync(2), they will get you closer to that goal (it will
depend on the disk hardware at this point).

SEE ALSO
open(2), fcntl(2), shutdown(2), unlink(2), fclose(3)
Andreas Schwab
2002-07-17 14:39:28 UTC
Permalink
Elladan <***@eskimo.com> writes:

|> On Wed, Jul 17, 2002 at 12:17:40AM -0400, Stevie O wrote:
|> > At 07:22 PM 7/16/2002 -0700, Elladan wrote:
|> > > 1. Thread 1 performs close() on a file descriptor. close fails.
|> > > 2. Thread 2 performs open().
|> > >* 3. Thread 1 performs close() again, just to make sure.
|> > >
|> > >
|> > >open() may return any file descriptor not currently in use.
|> >
|> > I'm confused here... the only way close() can fail is if the file
|> > descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
|> > a known-to-be-bad descriptor?
|>
|> Well, obviously, if that's the case. However, the man page for close(2)
|> doesn't agree (see below). close() is allowed to return EBADF, EINTR,
|> or EIO.
|>
|> The question is, does the OS standard guarantee that the fd is closed,
|> even if close() returns EINTR or EIO? Just going by the normal usage of
|> EINTR, one might think otherwise. It doesn't appear to be documented
|> one way or another.

POSIX says the state of the file descriptor when close fails (with errno
!= EBADF) is unspecified, which means:

The value or behavior may vary among implementations that conform to
IEEE Std 1003.1-2001. An application should not rely on the existence
or validity of the value or behavior. An application that relies on
any particular value or behavior cannot be assured to be portable
across conforming implementations.

Andreas.

--
Andreas Schwab, SuSE Labs, ***@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Elladan
2002-07-17 16:49:33 UTC
Permalink
|> > > 1. Thread 1 performs close() on a file descriptor. close fails.
|> > > 2. Thread 2 performs open().
|> > >* 3. Thread 1 performs close() again, just to make sure.
|> > >
|> > >
|> > >open() may return any file descriptor not currently in use.
|> >
|> > I'm confused here... the only way close() can fail is if the file
|> > descriptor is invalid (EBADF); wouldn't it be rather stupid to close()
|> > a known-to-be-bad descriptor?
|>
|> Well, obviously, if that's the case. However, the man page for close(2)
|> doesn't agree (see below). close() is allowed to return EBADF, EINTR,
|> or EIO.
|>
|> The question is, does the OS standard guarantee that the fd is closed,
|> even if close() returns EINTR or EIO? Just going by the normal usage of
|> EINTR, one might think otherwise. It doesn't appear to be documented
|> one way or another.
POSIX says the state of the file descriptor when close fails (with errno
The value or behavior may vary among implementations that conform to
IEEE Std 1003.1-2001. An application should not rely on the existence
or validity of the value or behavior. An application that relies on
any particular value or behavior cannot be assured to be portable
across conforming implementations.
This doesn't mean an OS shouldn't specify the behavior. Just because
the cross-platform standard leaves it unspecified doesn't mean the OS
should.

Consider what this says, if a particular OS doesn't pick a standard
which the application can port to. It means that the *only way* to
correctly close a file descriptor is like this:

int ret;
do {
ret = close(fd);
} while(ret == -1 && errno != EBADF);

That means, if we get an error, we have to loop until the kernel throws
a BADF error! We can't detect that the file is closed from any other
error value, because only BADF has a defined behavior.

This would sort of work, though of course be hideous, for a single
threaded app. Now consider a multithreaded app. To correctly implement
this we have to lock around all calls to close and
open/socket/dup/pipe/creat/etc...

This is clearly ridiculous, and not at all as intended. Either standard
will work for an OS (though guaranteeing close the first time is much
simpler all around), but it needs to be specified and stuck to, or you
get horrible things like this to work around a bad spec:


void lock_syscalls();
void unlock_syscalls();

int threadsafe_open(const char *file, int flags, mode_t mode)
{
int fd;
lock_syscalls();
fd = open(file, flags, mode);
unlock_syscalls();
return fd;
}

int threadsafe_close(int fd)
{
int ret;
lock_syscalls();
do {
ret = close(fd);
} while(ret == -1 && errno != EBADF);
unlock_syscalls();
return ret;
}

int threadsafe_socket() ...
int threadsafe_pipe() ...
int threadsafe_dup() ...
int threadsafe_creat() ...
int threadsafe_socketpair() ...
int threadsafe_accept() ...

-J
Linus Torvalds
2002-07-17 17:43:57 UTC
Permalink
Post by Elladan
Consider what this says, if a particular OS doesn't pick a standard
which the application can port to. It means that the *only way* to
int ret;
do {
ret = close(fd);
} while(ret == -1 && errno != EBADF);
NO.

The above is
(a) not portable
(b) not current practice

The "not portable" part comes from the fact that (as somebody pointed
out), a threaded environment in which the kernel _does_ close the FD on
errors, the FD may have been validly re-used (by the kernel) for some
other thread, and closing the FD a second time is a BUG.

The "not practice" comes from the fact that applications do not do what
you suggest.

The fact is, what Linux does and has always done is the only reasonable
thing to do: the close _will_ tear down the FD, and the error value is
nothing but a warning to the application that there may still be IO
pending (or there may have been failed IO) on the file that the (now
closed) descriptor pointed to.

The application may want to take evasive action (ie try to write the
file again, make a backup, or just warn the user), but the file
descriptor is _gone_.
Post by Elladan
That means, if we get an error, we have to loop until the kernel throws
a BADF error! We can't detect that the file is closed from any other
error value, because only BADF has a defined behavior.
But your loop is _provably_ incorrect for a threaded application. Your
explicit system call locking approach doesn't work either, because I'm
pretty certain that POSIX already states that open/close are thread
safe, so you can't just invalidate that _other_ standard.

Linus
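
To make that concrete, here is a minimal sketch (hypothetical helper,
illustrative error policy only) of a writer that checks fsync() for real
durability errors and treats a close() error as a warning rather than
something to retry:

/* finish_write: sketch only.  fsync() is where durability errors are
 * expected; a close() error is a warning, and the descriptor is gone
 * either way, so it is never retried. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int finish_write(int fd, const char *name)
{
        int ret = 0;

        if (fsync(fd) != 0) {
                fprintf(stderr, "fsync %s: %s\n", name, strerror(errno));
                ret = -1;
        }
        if (close(fd) != 0) {
                fprintf(stderr, "close %s: %s\n", name, strerror(errno));
                ret = -1;       /* warn the caller, but do not close() again */
        }
        return ret;
}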

Andries Brouwer
2002-07-17 17:17:22 UTC
Permalink
Post by Elladan
The question is, does the OS standard guarantee that the fd is closed,
even if close() returns EINTR or EIO? Just going by the normal usage of
EINTR, one might think otherwise. It doesn't appear to be documented
one way or another.
Alan said you could just issue close again to make sure - the example
shows that this is not the case. A second close is either required or
forbidden in that example - and the behavior has to be well defined or
you won't know which to do.
No, the behaviour is not well-defined at all.
The standard explicitly leaves undefined what happens when close returns
EINTR or EIO.
Richard Gooch
2002-07-17 17:51:31 UTC
Permalink
Post by Andries Brouwer
Post by Elladan
The question is, does the OS standard guarantee that the fd is closed,
even if close() returns EINTR or EIO? Just going by the normal usage of
EINTR, one might think otherwise. It doesn't appear to be documented
one way or another.
Alan said you could just issue close again to make sure - the example
shows that this is not the case. A second close is either required or
forbidden in that example - and the behavior has to be well defined or
you won't know which to do.
No, the behaviour is not well-defined at all.
The standard explicitly leaves undefined what happens when close
returns EINTR or EIO.
However, the only sane thing to do is to explicitly define one way or
another. The standard is broken. Consider a threaded application,
where one thread tries to call close(), gets an error and re-tries,
because it's not sure if the fd was closed or not. If the fd *is*
closed, and the thread loops calling close(), checking for EBADF,
there is a race if another thread tries calling open()/creat()/dup().

The ambiguity in the standard thus results in the impossibility of
writing a race-free application. And no, forcing the application to
protect system calls with mutexes isn't a solution.

Linux should define explicitly what happens on error return from
close(). Let that be the new standard.

Regards,

Richard....
Permanent: ***@atnf.csiro.au
Current: ***@ras.ucalgary.ca
Ketil Froyn
2002-07-15 21:59:48 UTC
Permalink
Without calling fsync(), you *never* know when the data will hit the
disk.
Doesn't bdflush ensure that data is written to disk within 30 seconds or
some tunable number of seconds?

Ketil
Matti Aarnio
2002-07-15 23:08:45 UTC
Permalink
Post by Ketil Froyn
Without calling fsync(), you *never* know when the data will hit the
disk.
Doesn't bdflush ensure that data is written to disk within 30 seconds or
some tunable number of seconds?
It TRIES TO, it does not guarantee anything.

The MTA systems are an example of software suites which have
transaction requirements. The goal has been usually stated
as: must not fail to deliver.

Practical implementations without full-blown all encompassing
transactions will usually mean that the message "will be delivered
at least once", e.g. double-delivery can happen.

One view of MTA behaviour is moving the message from one substate
to another during its processing.

These days, usually, the transaction database for MTAs is UNIX
filesystem. For ZMailer I have considered (although not actually
done - yet) using SleepyCat DB files for the transaction subsystem.
There are great challenges in failure compartmentalisation, and
integrity, when using that kind of integrated database mechanism.
Getting a SEGV is potentially a _very_ bad thing!
Post by Ketil Froyn
Ketil
/Matti Aarnio
Matthias Andree
2002-07-16 12:33:24 UTC
Permalink
Post by Matti Aarnio
These days, usually, the transaction database for MTAs is UNIX
filesystem. For ZMailer I have considered (although not actually
done - yet) using SleepyCat DB files for the transaction subsystem.
There are great challenges in failure compartementalisation, and
integrity, when using that kind of integrated database mechanisms.
Getting SEGV is potentially _very_ bad thing!
Read: lethal to the spool. Has SleepyCat DB learned to recover from
ENOSPC in the meanwhile? I had a db1.85 file corrupt after ENOSPC once...
--
Matthias Andree
Alan Cox
2002-07-15 22:55:12 UTC
Permalink
Post by Matthias Andree
I assume that most will just call close() or fclose() and exit() right
away. Does fclose() imply fsync()?
It doesn't.
Post by Matthias Andree
Some applications will not even check the [f]close() return value...
We are only interested in reliable code. Anything else is already
fatally broken.

-- quote --
Not checking the return value of close is a common but
nevertheless serious programming error. File system
implementations which use techniques as ``write-behind''
to increase performance may lead to write(2) succeeding,
although the data has not been written yet. The error
status may be reported at a later write operation, but it
is guaranteed to be reported on closing the file. Not
checking the return value when closing the file may lead
to silent loss of data. This can especially be observed
with NFS and disk quotas.
Matthias Andree
2002-07-15 21:58:14 UTC
Permalink
Post by Alan Cox
We are only interested in reliable code. Anything else is already
fatally broken.
-- quote --
Not checking the return value of close is a common but
nevertheless serious programming error. File system
As in 6. on http://www.apocalypse.org/pub/u/paul/docs/commandments.html
(The Ten Commandments for C Programmers, by Henry Spencer).
Chris Mason
2002-07-15 21:14:36 UTC
Permalink
Post by Patrick J. LoPresti
Post by Chris Mason
1) that newly grown file is someone's inbox, and the old contents of the
new block include someone else's private message.
2) That newly grown file is a control file for the application, and the
application expects it to contain valid data within (think sendmail).
In a correctly-written application, neither of these things can
happen. (See my earlier message today on fsync() and MTAs.) To get a
file onto disk reliably, the application must 1) flush the data, and
create temp file
flush data to temp file
rename temp file
flush rename operation
Yes, most MTAs do this for queue files; I'm not sure how many do it for
the actual spool file. Mail server authors are more than welcome to
recommend the best safety/performance combo for their product, and to
ask the FS guys which combinations are safe.

-chris
Patrick J. LoPresti
2002-07-15 21:31:07 UTC
Permalink
Post by Chris Mason
Yes, most mtas do this for queue files, I'm not sure how many do it for
the actual spool file.
Maybe the control files are small enough to fit in one disk block,
making the operations atomic in practice. Or something.
Post by Chris Mason
mail server authors are more than welcome to recommend the best
safety/performance combo for their product, and to ask the FS guys
which combinations are safe.
Yeah, but it's a shame if those combinations require performance hits
like "synchronous directory updates" or, worse, "fsync() == sync()".

I really wish MTA authors would just support Linux's "fsync the
directory" approach. It is simple, reliable, and fast. Yes, it does
require Linux-specific support in the application, but that's what
application authors should expect when there is a gap in the
standards.

- Pat
Lawrence Greenfield
2002-07-16 01:02:11 UTC
Permalink
From: "Patrick J. LoPresti" <***@curl.com>
Date: 15 Jul 2002 17:31:07 -0400
[...]
I really wish MTA authors would just support Linux's "fsync the
directory" approach. It is simple, reliable, and fast. Yes, it does
require Linux-specific support in the application, but that's what
application authors should expect when there is a gap in the
standards.

Actually, it's not all that simple (you have to find the enclosing
directories of any files you're modifying, which might require string
manipulation) or necessarily all that fast (you're doubling the number
of system calls and now the application is imposing an ordering on the
filesystem that didn't exist before).

It's only necessary for ext2. Modern Linux filesystems (such as ext3
or reiserfs) don't require it.

Finally: ext2 isn't safe even if you do call fsync() on the directory!

Let's consider: some filesystem operation modifies two different
blocks. This operation is safe if block A is written before block
B.

. FFS guarantees this by performing the writes synchronously: block A
is written when it is changed, followed by block B when it is changed.

. Journalling filesystems (ext3, reiserfs) guarantee this by
journalling the operation and forcing that journal entry to disk
before either A or B can be modified.

. What does ext2 do (in the default mode)? It modifies A, it modifies
B, and then leaves it up to the buffer cache to write them back---and
the buffer cache might decide to write B before A.

We're finally getting to some decent shared semantics on
filesystems. Reiserfs, ext3, FFS w/ softupdates, vxfs, etc., all work
with just fsync()ing the file (though an fsync() is required after a
link() or rename() operation). Let's encourage all filesystems to
provide these semantics and make it slightly easier on us stupid
application programmers.

Larry
Patrick J. LoPresti
2002-07-16 01:43:34 UTC
Permalink
Post by Lawrence Greenfield
Actually, it's not all that simple (you have to find the enclosing
directories of any files you're modifying, which might require string
manipulation)
No, you have to find the directories you are modifying. And the
application knows darn well which directories it is modifying.

Don't speculate. Show some sample code, and let's see how hard it
would be to use the "Linux way". I am betting on "not hard at all".
Post by Lawrence Greenfield
or necessarily all that fast (you're doubling the number of system
calls and now the application is imposing an ordering on the
filesystem that didn't exist before).
No, you are not doubling the number of system calls. As I have tried
to point out repeatedly, doing this stuff reliably and portably
already requires a sequence like this:

write data
flush data
write "validity" indicator (e.g., rename() or fchmod())
flush validity indicator

On Linux, flushing a rename() means calling fsync() on the directory
instead of the file. That's it. Doing that instead of fsync'ing the
file adds at most two system calls (to open and close the directory),
and those can be amortized over many operations on that directory
(think "mail spool"). So the system call overhead is non-existent.

As for "imposing an ordering on the filesystem that didn't exist
before", that is complete nonsense. This is imposing *precisely* the
ordering required for reliable operation; no more, no less. Relying
on mount options, "chattr +S", or journaling artifacts for your
ordering is the inefficient approach; since they impose extra
ordering, they can never be faster and will usually be slower.
Post by Lawrence Greenfield
It's only necessary for ext2. Modern Linux filesystems (such as ext3
or reiserfs) don't require it.
Only because they take the performance hit of flushing the whole log
to disk on every fsync(). Combine that with "data=ordered" and see
what happens to your performance. (Perhaps "data=ordered" should be
called "fsync=sync".) I would rather get back the performance and
convince application authors to understand what they are doing.
Post by Lawrence Greenfield
Finally: ext2 isn't safe even if you do call fsync() on the directory!
Wrong.

write temp file
fsync() temp file
rename() temp file to actual file
fsync() directory

No matter where this crashes, it is perfectly safe on ext2. (If not,
ext2 is badly broken.) The worst that can happen after a crash is
that the file might exist with both the old name and the new name.
But an application can detect this case on startup and clean it up.

- Pat
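
A minimal C sketch of that four-step sequence, assuming a Linux system
where fsync() on a directory fd flushes the rename (names and error
handling are illustrative only; a real application would also loop on
short writes and clean up the temp file on failure):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* write_reliably: sketch of "write temp file, fsync temp file,
 * rename, fsync directory".  Error handling is reduced to -1 returns. */
int write_reliably(const char *dir, const char *tmpname,
                   const char *finalname, const void *buf, size_t len)
{
        char tmppath[4096], finalpath[4096];
        int fd, dirfd;

        snprintf(tmppath, sizeof(tmppath), "%s/%s", dir, tmpname);
        snprintf(finalpath, sizeof(finalpath), "%s/%s", dir, finalname);

        /* 1. write temp file */
        fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len)
                return -1;

        /* 2. fsync() temp file */
        if (fsync(fd) != 0)
                return -1;
        if (close(fd) != 0)
                return -1;

        /* 3. rename() temp file to actual file */
        if (rename(tmppath, finalpath) != 0)
                return -1;

        /* 4. fsync() the directory so the rename itself reaches the disk */
        dirfd = open(dir, O_RDONLY);
        if (dirfd < 0)
                return -1;
        if (fsync(dirfd) != 0)
                return -1;
        return close(dirfd);
}

After a crash, startup code can then remove or re-process any leftover
temp names, which is the cleanup Pat mentions.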
Thunder from the hill
2002-07-16 01:56:08 UTC
Permalink
Hi,
Post by Patrick J. LoPresti
Doing that instead of fsync'ing the
file adds at most two system calls (to open and close the directory),
Keep the directory fd open all the time, and fsync() it when needed. This
gets rid of the repeated open(dir); fsync(dd); close(dd); sequence: you have
one open(dir), then fsync(dd); fsync(dd); ... and finally one close(dd).

Not too much of an overhead, is it?

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------
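
A small sketch of the amortisation described above, assuming a
mail-spool-like workload; the file names and loop are purely
illustrative, the point being one open() of the directory, one
fsync(dirfd) per delivered file, and one close() at the end:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* deliver_many: sketch only -- keep the spool directory fd open for the
 * whole run and fsync() it after each rename(). */
int deliver_many(const char *dir, int count)
{
        char tmppath[4096], msgpath[4096];
        int dirfd, fd, i;

        dirfd = open(dir, O_RDONLY);            /* opened once */
        if (dirfd < 0)
                return -1;

        for (i = 0; i < count; i++) {
                snprintf(tmppath, sizeof(tmppath), "%s/tmp.%d", dir, i);
                snprintf(msgpath, sizeof(msgpath), "%s/msg.%d", dir, i);

                fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0600);
                if (fd < 0)
                        break;
                (void)write(fd, "message\n", 8);  /* placeholder data */
                if (fsync(fd) != 0 || close(fd) != 0)
                        break;

                if (rename(tmppath, msgpath) != 0)
                        break;
                if (fsync(dirfd) != 0)          /* flush just this rename */
                        break;
        }

        return close(dirfd);                    /* closed once */
}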
Matthias Andree
2002-07-16 12:47:59 UTC
Permalink
Post by Patrick J. LoPresti
On Linux, flushing a rename() means calling fsync() on the directory
instead of the file. That's it. Doing that instead of fsync'ing the
file adds at most two system calls (to open and close the directory),
and those can be amortized over many operations on that directory
(think "mail spool"). So the system call overhead is non-existent.
Indeed, but I can also leave the file descriptor open on any file system
on any system except SOME of Linux's. (Ok, this precludes systems that
don't offer POSIX synchronous completion semantics, but these systems
don't necessarily have fsync() either.)
Post by Patrick J. LoPresti
ordering required for reliable operation; no more, no less. Relying
on mount options, "chattr +S", or journaling artifacts for your
ordering is the inefficient approach; since they impose extra
ordering, they can never be faster and will usually be slower.
It is sometimes the only way, if the application is unaware. I hope I'm
not starting a flame war if I mention qmail now, which is not even
softupdates aware. Without chattr +S or mount -o sync, nothing is to be
gained. OTOH, where mount -o sync only makes directory updates
synchronous, it's not too expensive, which is why the +D approach is
still useful there.
Post by Patrick J. LoPresti
Post by Lawrence Greenfield
It's only necessary for ext2. Modern Linux filesystems (such as ext3
or reiserfs) don't require it.
Only because they take the performance hit of flushing the whole log
to disk on every fsync(). Combine that with "data=ordered" and see
what happens to your performance. (Perhaps "data=ordered" should be
called "fsync=sync".) I would rather get back the performance and
convince application authors to understand what they are doing.
1. data=ordered is more than fsync=sync. It guarantees that data blocks
are flushed before flushing the meta data blocks that reference the data
blocks. Try this on ext2fs and lose.

2. sync() is unreliable; it can return control to the caller earlier
than is sound. It can "complete" at any time it desires without
having completed.
(Probably so that it can return at all while new blocks are being written
by another process, but at least SUSv2 did not go into detail on this.)

3. Application authors do not desire fsync=sync semantics, but they want
to rely on "fsync(fd) also syncs recent renames". It comes as a
now-guaranteed side effect of how ext3fs works, so I am told.

I'm not sure how the ext3fs journal works internally, but it'd be fine with
all applications if only that part of the file system that is really
relevant to the current fsync(fd) were synced. No more. It seems as though
fsync==sync is an artifact that ext2 also suffers from.
--
Matthias Andree
James Antill
2002-07-16 21:09:05 UTC
Permalink
Post by Patrick J. LoPresti
Post by Lawrence Greenfield
Actually, it's not all that simple (you have to find the enclosing
directories of any files you're modifying, which might require string
manipulation)
No, you have to find the directories you are modifying. And the
application knows darn well which directories it is modifying.
Don't speculate. Show some sample code, and let's see how hard it
would be to use the "Linux way". I am betting on "not hard at all".
I added fsync() on directories to exim-3.31; it took about 2 hours of
coding and another hour of testing (with strace) to make sure it was
doing the right thing. That was from almost never having seen the source
before.
The only reason it took that long was because that version of exim
altered the spool in a couple of different places. Forward porting to
3.951 took about 20 minutes IIRC (that version only plays with the
spool in one place).
--
# James Antill -- ***@and.org
:0:
* ^From: .****@and\.org
/dev/null
Matthias Andree
2002-07-16 12:35:36 UTC
Permalink
Post by Chris Mason
Post by Patrick J. LoPresti
Post by Chris Mason
1) that newly grown file is someone's inbox, and the old contents of the
new block include someone else's private message.
2) That newly grown file is a control file for the application, and the
application expects it to contain valid data within (think sendmail).
In a correctly-written application, neither of these things can
happen. (See my earlier message today on fsync() and MTAs.) To get a
file onto disk reliably, the application must 1) flush the data, and
create temp file
flush data to temp file
rename temp file
flush rename operation
Yes, most mtas do this for queue files, I'm not sure how many do it for
the actual spool file. mail server authors are more than welcome to
Less. For one, Postfix' local(8) daemon relies on synchronous directory
update for Maildir spools. For mbox spool, the problem is less
prevalent, because spool files usually exist already and fsync() is
sufficient (and fsync() is done before local(8) reports success to the
queue manager).
--
Matthias Andree
Dax Kelson
2002-07-16 07:07:19 UTC
Permalink
Post by Patrick J. LoPresti
IF your server is stable and not prone to crashing, and/or you
have the write cache on your hard drives battery backed, you
should strongly consider using the writeback journaling mode of
Ext3 versus ordered.
I rewrote that statement on the website.

Dax Kelson
Guru Labs
Russ Allbery
2002-07-17 03:46:20 UTC
Permalink
Post by Zack Weinberg
Consider: There is no guarantee that close will detect errors. Only
NFS and Coda implement f_op->flush methods.
And AFS, I believe. (Not in the standard kernel, of course.)
--
Russ Allbery (***@stanford.edu) <http://www.eyrie.org/~eagle/>