Discussion:
xfs: very slow after mount, very slow at umount
Mark Lord
2011-01-27 01:22:25 UTC
Alex / Christoph,

My mythtv box here uses XFS on a 2TB drive for storing recordings and videos.
It is behaving rather strangely though, and has gotten worse recently.
Here is what I see happening:

The drive mounts fine at boot, but the very first attempt to write a new file
to the filesystem suffers from a very very long pause, 30-60 seconds, during which
time the disk activity light is fully "on".

This happens only on the first new file write after mounting.
From then on, the filesystem is fast and responsive as expected.
If I umount the filesystem, and then mount it again,
the exact same behaviour can be observed.

This of course screws up mythtv, as it causes me to lose the first 30-60
seconds of the first recording it attempts after booting. So as a workaround
I now have a startup script to create, sync, and delete a 64MB file before
starting mythtv. This still takes 30-60 seconds, but it all happens and
finishes before mythtv has a real-time need to write to the filesystem.
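For reference, that workaround amounts to something like the sketch below (the function name, mount point, and dot-file name are my own choices, not anything XFS-specific):

```shell
# Hypothetical warm-up sketch: create, sync, and delete a throwaway
# file so the first-write stall happens before mythtv needs the disk.
xfs_warmup() {
    dir=$1
    size_mb=${2:-64}
    f="$dir/.xfs-warmup.$$"
    # The first allocation after mount absorbs the 30-60 second pause:
    dd if=/dev/zero of="$f" bs=1M count="$size_mb" 2>/dev/null || return 1
    sync            # force it out to disk so the stall completes now
    rm -f "$f"      # the file's only job was to trigger allocation
}
```

Called as e.g. `xfs_warmup /var/lib/mythtv` from the startup script, before mythbackend is launched.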

The 2TB drive is fine -- zero errors, no events in the SMART logs,
and I've disabled the silly WD head-unload logic on it.

What's happening here? Why the big long burst of activity?
I've only just noticed this behaviour in the past few weeks,
running 2.6.35 and more recently 2.6.37.

* * *

The other issue is something I notice at umount time.
I have a second big drive used as a backup device for the drive discussed above.
I use "mirrordir" (similar to rsync) to clone directories/files from the main
drive to the backup drive. After mirrordir finishes, I then "umount /backup".
The umount promptly hangs, disk light on solid, for 30-60 seconds, then finishes.

If I type "sync" just before doing the umount, sync takes about 1 second,
and the umount finishes instantly.

Huh? What's happening there?

System is running 2.6.37 from kernel.org, but similar behaviour
has been there under 2.6.35 and 2.6.34. Dunno about earlier.

I can query any info you need from the filesystem.

Thanks
-ml
Mark Lord
2011-01-27 01:43:43 UTC
On 11-01-26 08:22 PM, Mark Lord wrote:
> Alex / Christoph,
>
> My mythtv box here uses XFS on a 2TB drive for storing recordings and videos.
> It is behaving rather strangely though, and has gotten worse recently.
> Here is what I see happening:
>
> The drive mounts fine at boot, but the very first attempt to write a new file
> to the filesystem suffers from a very very long pause, 30-60 seconds, during which
> time the disk activity light is fully "on".
>
> This happens only on the first new file write after mounting.
> From then on, the filesystem is fast and responsive as expected.
> If I umount the filesystem, and then mount it again,
> the exact same behaviour can be observed.
>
> This of course screws up mythtv, as it causes me to lose the first 30-60
> seconds of the first recording it attempts after booting. So as a workaround
> I now have a startup script to create, sync, and delete a 64MB file before
> starting mythtv. This still takes 30-60 seconds, but it all happens and
> finishes before mythtv has a real-time need to write to the filesystem.
>
> The 2TB drive is fine -- zero errors, no events in the SMART logs,
> and I've disabled the silly WD head-unload logic on it.
>
> What's happening here? Why the big long burst of activity?
> I've only just noticed this behaviour in the past few weeks,
> running 2.6.35 and more recently 2.6.37.
>
> * * *
>
> The other issue is something I notice at umount time.
> I have a second big drive used as a backup device for the drive discussed above.
> I use "mirrordir" (similar to rsync) to clone directories/files from the main
> drive to the backup drive. After mirrordir finishes, I then "umount /backup".
> The umount promptly hangs, disk light on solid, for 30-60 seconds, then finishes.
>
> If I type "sync" just before doing the umount, sync takes about 1 second,
> and the umount finishes instantly.
>
> Huh? What's happening there?
>
> System is running 2.6.37 from kernel.org, but similar behaviour
> has been there under 2.6.35 and 2.6.34. Dunno about earlier.
>
> I can query any info you need from the filesystem.


Thinking about it some more: the first problem very much appears as if
it is due to a filesystem check happening on the already-mounted filesystem,
if that makes any kind of sense (?).

Because.. running xfs_check on the umounted drive takes about the same
30-60 seconds, with the disk activity light fully "on".

The other thought that came to mind: this behaviour has only been noticed recently,
probably because I have recently added about 1000 new files (hundreds of MB each)
to the videos/ directory on that filesystem. Whereas before, it had fewer
than 500 (multi-GB) files in total.

So if it really is doing some kind of internal filesystem check, then the time
required has only recently become 3X larger than before.. so the behaviour may
not be new/recent, but now is very noticeable.

I wonder what is really happening?

Cheers
Dave Chinner
2011-01-27 03:43:14 UTC
On Wed, Jan 26, 2011 at 08:43:43PM -0500, Mark Lord wrote:
> On 11-01-26 08:22 PM, Mark Lord wrote:
> > Alex / Christoph,
> >
> > My mythtv box here uses XFS on a 2TB drive for storing recordings and videos.
> > It is behaving rather strangely though, and has gotten worse recently.
> > Here is what I see happening:
> >
> > The drive mounts fine at boot, but the very first attempt to write a new file
> > to the filesystem suffers from a very very long pause, 30-60 seconds, during which
> > time the disk activity light is fully "on".
> >
> > This happens only on the first new file write after mounting.
> > From then on, the filesystem is fast and responsive as expected.
> > If I umount the filesystem, and then mount it again,
> > the exact same behaviour can be observed.
> >
> > This of course screws up mythtv, as it causes me to lose the first 30-60
> > seconds of the first recording it attempts after booting. So as a workaround
> > I now have a startup script to create, sync, and delete a 64MB file before
> > starting mythtv. This still takes 30-60 seconds, but it all happens and
> > finishes before mythtv has a real-time need to write to the filesystem.
> >
> > The 2TB drive is fine -- zero errors, no events in the SMART logs,
> > and I've disabled the silly WD head-unload logic on it.
> >
> > What's happening here? Why the big long burst of activity?
> > I've only just noticed this behaviour in the past few weeks,
> > running 2.6.35 and more recently 2.6.37.
> >
> > * * *
> >
> > The other issue is something I notice at umount time.
> > I have a second big drive used as a backup device for the drive discussed above.
> > I use "mirrordir" (similar to rsync) to clone directories/files from the main
> > drive to the backup drive. After mirrordir finishes, I then "umount /backup".
> > The umount promptly hangs, disk light on solid, for 30-60 seconds, then finishes.
> >
> > If I type "sync" just before doing the umount, sync takes about 1 second,
> > and the umount finishes instantly.
> >
> > Huh? What's happening there?
> >
> > System is running 2.6.37 from kernel.org, but similar behaviour
> > has been there under 2.6.35 and 2.6.34. Dunno about earlier.
> >
> > I can query any info you need from the filesystem.
>
>
> Thinking about it some more: the first problem very much appears as if
> it is due to a filesystem check happening on the already-mounted filesystem,
> if that makes any kind of sense (?).

Not to me. You can check this simply by looking at the output of
top while the problem is occurring...

> Because.. running xfs_check on the umounted drive takes about the same
> 30-60 seconds, with the disk activity light fully "on".

Well, yeah - XFS check reads all the metadata in the filesystem, so
of course it's going to thrash your disk when it is run. The fact it
takes the same length of time as whatever problem you are having is
likely to be coincidental.

> The other thought that came to mind: this behaviour has only been
> noticed recently, probably because I have recently added about
> 1000 new files (hundreds of MB each) to the videos/ directory on
> that filesystem. Whereas before, it had fewer than 500 (multi-GB)
> files in total.
>
> So if it really is doing some kind of internal filesystem check,
> then the time required has only recently become 3X larger than
> before.. so the behaviour may not be new/recent, but now is very
> noticeable.

Where does that 3x figure come from? Have you measured it? If so,
what are the numbers?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Lord
2011-01-27 03:53:17 UTC
On 11-01-26 10:43 PM, Dave Chinner wrote:
> On Wed, Jan 26, 2011 at 08:43:43PM -0500, Mark Lord wrote:
>> On 11-01-26 08:22 PM, Mark Lord wrote:
..
>> Thinking about it some more: the first problem very much appears as if
>> it is due to a filesystem check happening on the already-mounted filesystem,
>> if that makes any kind of sense (?).
>
> Not to me. You can check this simply by looking at the output of
> top while the problem is occurring...

Top doesn't show anything interesting, since disk I/O uses practically zero CPU.

>> running xfs_check on the umounted drive takes about the same 30-60 seconds,
>> with the disk activity light fully "on".
>
> Well, yeah - XFS check reads all the metadata in the filesystem, so
> of course it's going to thrash your disk when it is run. The fact it
> takes the same length of time as whatever problem you are having is
> likely to be coincidental.

I find it interesting that the mount takes zero-time,
as if it never actually reads much from the filesystem.
Something has to eventually read the metadata etc.

>> The other thought that came to mind: this behaviour has only been
>> noticed recently, probably because I have recently added about
>> 1000 new files (hundreds of MB each) to the videos/ directory on
>> that filesystem. Whereas before, it had fewer than 500 (multi-GB)
>> files in total.
>>
>> So if it really is doing some kind of internal filesystem check,
>> then the time required has only recently become 3X larger than
>> before.. so the behaviour may not be new/recent, but now is very
>> noticeable.
>
> Where does that 3x figure come from?

Well, it used to have about 500 files/subdirs on it,
and now it has somewhat over 1500 files/subdirs.
That's a ballpark estimate of 3X the amount of meta data.

All of these files are at least large (hundreds of MB),
and a lot are huge (many GB) in size.

Cheers
Mark Lord
2011-01-27 04:54:01 UTC
On 11-01-26 10:53 PM, Mark Lord wrote:
> On 11-01-26 10:43 PM, Dave Chinner wrote:
>> On Wed, Jan 26, 2011 at 08:43:43PM -0500, Mark Lord wrote:
>>> On 11-01-26 08:22 PM, Mark Lord wrote:
> ..
>>> Thinking about it some more: the first problem very much appears as if
>>> it is due to a filesystem check happening on the already-mounted filesystem,
>>> if that makes any kind of sense (?).
>>
>> Not to me. You can check this simply by looking at the output of
>> top while the problem is occurring...
>
> Top doesn't show anything interesting, since disk I/O uses practically zero CPU.
>
>>> running xfs_check on the umounted drive takes about the same 30-60 seconds,
>>> with the disk activity light fully "on".
>>
>> Well, yeah - XFS check reads all the metadata in the filesystem, so
>> of course it's going to thrash your disk when it is run. The fact it
>> takes the same length of time as whatever problem you are having is
>> likely to be coincidental.
>
> I find it interesting that the mount takes zero-time,
> as if it never actually reads much from the filesystem.
> Something has to eventually read the metadata etc.
>
>>> The other thought that came to mind: this behaviour has only been
>>> noticed recently, probably because I have recently added about
>>> 1000 new files (hundreds of MB each) to the videos/ directory on
>>> that filesystem. Whereas before, it had fewer than 500 (multi-GB)
>>> files in total.
>>>
>>> So if it really is doing some kind of internal filesystem check,
>>> then the time required has only recently become 3X larger than
>>> before.. so the behaviour may not be new/recent, but now is very
>>> noticeable.
>>
>> Where does that 3x figure come from?
>
> Well, it used to have about 500 files/subdirs on it,
> and now it has somewhat over 1500 files/subdirs.
> That's a ballpark estimate of 3X the amount of meta data.
>
> All of these files are at least large (hundreds of MB),
> and a lot are huge (many GB) in size.

I've rebuilt the kernel with the various config options to enable blktrace
and XFS_DEBUG, but in the meanwhile we have also watched and deleted
a few GB of recordings.

The result is that the mysterious first-write delay has vanished, for now,
so there's nothing to trace.

I think I'll pick up an extra 2TB drive, so that next time
it surfaces I can simply bit-clone the filesystem or something,
to preserve the buggered state for further examination.

The second issue is probably still there, and I'll blktrace that instead.
But it will have to wait a spell -- I've run out of time here right now.

Cheers
Dave Chinner
2011-01-27 23:34:09 UTC
On Wed, Jan 26, 2011 at 10:53:17PM -0500, Mark Lord wrote:
> On 11-01-26 10:43 PM, Dave Chinner wrote:
> > On Wed, Jan 26, 2011 at 08:43:43PM -0500, Mark Lord wrote:
> >> On 11-01-26 08:22 PM, Mark Lord wrote:
> ..
> >> Thinking about it some more: the first problem very much appears as if
> >> it is due to a filesystem check happening on the already-mounted filesystem,
> >> if that makes any kind of sense (?).
> >
> > Not to me. You can check this simply by looking at the output of
> > top while the problem is occurring...
>
> Top doesn't show anything interesting, since disk I/O uses practically zero CPU.

My point is that xfs_check doesn't use zero cpu or memory - it uses
quite a lot of both, so if it is not present in top output while the
disk is being thrashed, it ain't running...

>
> >> running xfs_check on the umounted drive takes about the same 30-60 seconds,
> >> with the disk activity light fully "on".
> >
> > Well, yeah - XFS check reads all the metadata in the filesystem, so
> > of course it's going to thrash your disk when it is run. The fact it
> > takes the same length of time as whatever problem you are having is
> > likely to be coincidental.
>
> I find it interesting that the mount takes zero-time,
> as if it never actually reads much from the filesystem.
> Something has to eventually read the metadata etc.

Sure, for a clean log it has basically nothing to do - a few disk
reads to read the superblock, find the head/tail of the log, and
little else needs doing. Only when log recovery needs to be done
does mount do any significant IO.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Dave Chinner
2011-01-27 03:30:11 UTC
[Please cc ***@oss.sgi.com on XFS bug reports. Added.]

On Wed, Jan 26, 2011 at 08:22:25PM -0500, Mark Lord wrote:
> Alex / Christoph,
>
> My mythtv box here uses XFS on a 2TB drive for storing recordings and videos.
> It is behaving rather strangely though, and has gotten worse recently.
> Here is what I see happening:
>
> The drive mounts fine at boot, but the very first attempt to write a new file
> to the filesystem suffers from a very very long pause, 30-60 seconds, during which
> time the disk activity light is fully "on".

Please post the output of xfs_info <mtpt> so we can see what your
filesystem configuration is.

> This happens only on the first new file write after mounting.
> From then on, the filesystem is fast and responsive as expected.
> If I umount the filesystem, and then mount it again,
> the exact same behaviour can be observed.

I can't say I've seen this. Can you capture a blktrace of the IO so
we can see what IO is actually being done, and perhaps also record
an XFS event trace as well (i.e. of all the events in
/sys/kernel/debug/tracing/events/xfs).
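(For concreteness, a capture might look like the sketch below; the device name and output file names are assumptions, adjust them to the actual system. Requires root and a kernel with blktrace and the XFS tracepoints enabled.)

```shell
# Enable all XFS tracepoints (assumes debugfs is mounted at the
# usual place):
echo 1 > /sys/kernel/debug/tracing/events/xfs/enable

# Record block-layer IO on the suspect device in the background:
blktrace -d /dev/sdb -o firstwrite &

# ... mount the filesystem and write the first new file here ...

# Stop tracing and collect the results:
kill -INT %1 && wait
echo 0 > /sys/kernel/debug/tracing/events/xfs/enable
cat /sys/kernel/debug/tracing/trace > xfs-events.txt
blkparse -i firstwrite > firstwrite.parsed
```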

> This of course screws up mythtv, as it causes me to lose the first 30-60
> seconds of the first recording it attempts after booting. So as a workaround
> I now have a startup script to create, sync, and delete a 64MB file before
> starting mythtv. This still takes 30-60 seconds, but it all happens and
> finishes before mythtv has a real-time need to write to the filesystem.
>
> The 2TB drive is fine -- zero errors, no events in the SMART logs,
> and I've disabled the silly WD head-unload logic on it.
>
> What's happening here? Why the big long burst of activity?
> I've only just noticed this behaviour in the past few weeks,
> running 2.6.35 and more recently 2.6.37.

Can you be a bit more precise? What were you running before 2.6.35,
when you didn't notice this?

> * * *
>
> The other issue is something I notice at umount time.
> I have a second big drive used as a backup device for the drive discussed above.
> I use "mirrordir" (similar to rsync) to clone directories/files from the main
> drive to the backup drive. After mirrordir finishes, I then "umount /backup".
> The umount promptly hangs, disk light on solid, for 30-60 seconds, then finishes.

Same again - blktrace and event traces for the different cases.

Also, how many files are you syncing? how much data, number of
inodes, etc...

> If I type "sync" just before doing the umount, sync takes about 1 second,
> and the umount finishes instantly.
>
> Huh? What's happening there?

Sounds like something is broken w.r.t. writeback during unmount.
Perhaps also adding the writeback events to the trace would help
understand what is happening here....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Lord
2011-01-27 03:49:03 UTC
On 11-01-26 10:30 PM, Dave Chinner wrote:
> [Please cc ***@oss.sgi.com on XFS bug reports. Added.]
>
> On Wed, Jan 26, 2011 at 08:22:25PM -0500, Mark Lord wrote:
>> Alex / Christoph,
>>
>> My mythtv box here uses XFS on a 2TB drive for storing recordings and videos.
>> It is behaving rather strangely though, and has gotten worse recently.
>> Here is what I see happening:
>>
>> The drive mounts fine at boot, but the very first attempt to write a new file
>> to the filesystem suffers from a very very long pause, 30-60 seconds, during which
>> time the disk activity light is fully "on".
>
> Please post the output of xfs_info <mtpt> so we can see what you
> filesystem configuration is.

/dev/sdb1 on /var/lib/mythtv type xfs
(rw,noatime,allocsize=64M,logbufs=8,largeio)

[~] xfs_info /var/lib/mythtv
meta-data=/dev/sdb1              isize=256    agcount=7453, agsize=65536 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=488378638, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
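Cross-checking those numbers with plain shell arithmetic (nothing XFS-specific here):

```shell
ag_blocks=65536          # agsize from xfs_info, in 4096-byte blocks
fs_blocks=488378638      # total data blocks from xfs_info
# Size of one allocation group, in MiB:
echo $(( ag_blocks * 4096 / 1048576 ))                 # -> 256
# Number of AGs needed to cover the filesystem (rounding up):
echo $(( (fs_blocks + ag_blocks - 1) / ag_blocks ))    # -> 7453
```

So each AG is only 256 MiB, and 7453 of them cover the whole ~1.8 TiB filesystem.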

>> This happens only on the first new file write after mounting.
>> From then on, the filesystem is fast and responsive as expected.
>> If I umount the filesystem, and then mount it again,
>> the exact same behaviour can be observed.
>
> I can't say I've seen this. Can you capture a blktrace of the IO so
> we can see what IO is actually being done, and perhaps also record
> an XFS event trace as well (i.e. of all the events in
> /sys/kernel/debug/tracing/events/xfs).

I'll have to reconfig/rebuild the kernel to include support for blktrace first.
Can you specify the exact commands/args you'd like for running blktrace etc?

>> This of course screws up mythtv, as it causes me to lose the first 30-60
>> seconds of the first recording it attempts after booting. So as a workaround
>> I now have a startup script to create, sync, and delete a 64MB file before
>> starting mythtv. This still takes 30-60 seconds, but it all happens and
>> finishes before mythtv has a real-time need to write to the filesystem.
>>
>> The 2TB drive is fine -- zero errors, no events in the SMART logs,
>> and I've disabled the silly WD head-unload logic on it.
>>
>> What's happening here? Why the big long burst of activity?
>> I've only just noticed this behaviour in the past few weeks,
>> running 2.6.35 and more recently 2.6.37.
>
> Can you be a bit more precise? what were you running before 2.6.35
> when you didn't notice this?

Those details are in my earlier follow-up posting.

>> The other issue is something I notice at umount time.

I'm going to let that issue rest for now,
until we figure out the first issue.
Heck, they might even be the exact same thing.. :)

Thanks!
Mark Lord
2011-01-27 15:12:23 UTC
On 11-01-27 12:30 AM, Stan Hoeppner wrote:
> Mark Lord put forth on 1/26/2011 9:49 PM:
>
>> agcount=7453
>
> That's probably a bit high Mark, and very possibly the cause of your problems.
> :) Unless the disk array backing this filesystem has something like 400-800
> striped disk drives. You said it's a single 2TB drive right?
>
> The default agcount for a single drive filesystem is 4 allocation groups. For
> mdraid (of any number of disks/configuration) it's 16 allocation groups.
>
> Why/how did you end up with 7452 allocation groups? That can definitely cause
> some performance issues due to massively excessive head seeking, and possibly
> all manner of weirdness.

This is great info, exactly the kind of feedback I was hoping for!

The filesystem is about a year old now, and I probably used agsize=nnnnn
when creating it or something.

So if this resulted in what you consider to be many MANY too MANY ags,
then I can imagine the first new file write wanting to go out and read
in all of the ag data to determine the "best fit" or something.
Which might explain some of the delay.

Once I get the new 2TB drive, I'll re-run mkfs.xfs and then copy everything
over onto a fresh xfs filesystem.

Can you recommend a good set of mkfs.xfs parameters to suit the characteristics
of this system? Eg. Only a few thousand active inodes, and nearly all files are
in the 600MB -> 20GB size range. The usage pattern it must handle is up to
six concurrent streaming writes at the same time as up to three streaming reads,
with no significant delays permitted on the reads.

That's the kind of workload that I find XFS handles nicely,
and EXT4 has given me trouble with in the past.

Thanks

-ml
Justin Piszcz
2011-01-27 15:40:31 UTC
On Thu, 27 Jan 2011, Mark Lord wrote:

> On 11-01-27 12:30 AM, Stan Hoeppner wrote:
>> Mark Lord put forth on 1/26/2011 9:49 PM:
>>
>>> agcount=7453
>>
>> That's probably a bit high Mark, and very possibly the cause of your problems.
>> :) Unless the disk array backing this filesystem has something like 400-800
>> striped disk drives. You said it's a single 2TB drive right?
>>
>> The default agcount for a single drive filesystem is 4 allocation groups. For
>> mdraid (of any number of disks/configuration) it's 16 allocation groups.
>>
>> Why/how did you end up with 7452 allocation groups? That can definitely cause
>> some performance issues due to massively excessive head seeking, and possibly
>> all manner of weirdness.
>
> This is great info, exactly the kind of feedback I was hoping for!
>
> The filesystem is about a year old now, and I probably used agsize=nnnnn
> when creating it or something.
>
> So if this resulted in what you consider to be many MANY too MANY ags,
> then I can imagine the first new file write wanting to go out and read
> in all of the ag data to determine the "best fit" or something.
> Which might explain some of the delay.
>
> Once I get the new 2TB drive, I'll re-run mkfs.xfs and then copy everything
> over onto a fresh xfs filesystem.
>
> Can you recommend a good set of mkfs.xfs parameters to suit the characteristics
> of this system? Eg. Only a few thousand active inodes, and nearly all files are
> in the 600MB -> 20GB size range. The usage pattern it must handle is up to
> six concurrent streaming writes at the same time as up to three streaming reads,
> with no significant delays permitted on the reads.
>
> That's the kind of workload that I find XFS handles nicely,
> and EXT4 has given me trouble with in the past.
>
> Thanks


Hi Mark,

I did a load of benchmarks a long time ago testing every mkfs.xfs option
there was, and I found that most of the time (if not all), the defaults
were the best.

Justin.
Mark Lord
2011-01-27 16:03:49 UTC
On 11-01-27 10:40 AM, Justin Piszcz wrote:
>
>
> On Thu, 27 Jan 2011, Mark Lord wrote:
..
>> Can you recommend a good set of mkfs.xfs parameters to suit the characteristics
>> of this system? Eg. Only a few thousand active inodes, and nearly all files are
>> in the 600MB -> 20GB size range. The usage pattern it must handle is up to
>> six concurrent streaming writes at the same time as up to three streaming reads,
>> with no significant delays permitted on the reads.
>>
>> That's the kind of workload that I find XFS handles nicely,
>> and EXT4 has given me trouble with in the past.
..
> I did a load of benchmarks a long time ago testing every mkfs.xfs option there
> was, and I found that most of the time (if not all), the defaults were the best.
..

I am concerned with fragmentation on the very special workload in this case.
I'd really like the 20GB files, written over a 1-2 hour period, to consist
of a very few very large extents, as much as possible.

Rather than hundreds or thousands of "tiny" MB sized extents.
I wonder what the best mkfs.xfs parameters might be to encourage that?
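For what it's worth, the extent layout of a finished recording is easy to inspect after the fact with xfs_bmap from xfsprogs (the path below is just an example):

```shell
# Full extent map, with block ranges and extent sizes:
xfs_bmap -v /var/lib/mythtv/videos/example-recording.mpg

# Or just count the extents -- ideally a handful, not thousands
# (the first line of xfs_bmap output is the filename, so skip it):
xfs_bmap /var/lib/mythtv/videos/example-recording.mpg | tail -n +2 | wc -l
```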

Cheers
Stan Hoeppner
2011-01-27 19:40:06 UTC
[Post body hidden by the archive; it is quoted in the reply below.]
d***@lang.hm
2011-01-27 20:11:54 UTC
On Thu, 27 Jan 2011, Stan Hoeppner wrote:

>> Rather than hundreds or thousands of "tiny" MB sized extents.
>> I wonder what the best mkfs.xfs parameters might be to encourage that?
>
> You need to use the mkfs.xfs defaults for any single drive filesystem, and trust
> the allocator to do the right thing. XFS uses variable size extents and the
> size is chosen dynamically--you don't have direct or indirect control of the
> extent size chosen for a given file or set of files AFAIK.
>
> As Dave Chinner is fond of pointing out, it's those who don't know enough about
> XFS and choose custom settings that most often get themselves into trouble (as
> you've already done once). :)
>
> The defaults exist for a reason, and they weren't chosen willy nilly. The vast
> bulk of XFS' configurability exists for tuning maximum performance on large to
> very large RAID arrays. There isn't much, if any, additional performance to be
> gained with parameter tweaks on a single drive XFS filesystem.

how do I understand how to setup things on multi-disk systems? the
documentation I've found online is not that helpful, and in some ways
contradictory.

If there really are good rules for how to do this, it would be very
helpful if you could just give mkfs.xfs the information about your system
(this partition is on a 16 drive raid6 array) and have it do the right
thing.

David Lang


> A brief explanation of agcount: the filesystem is divided into agcount regions
> called allocation groups, or AGs. The allocator writes to all AGs in parallel
> to increase performance. With extremely fast storage (SSD, large high RPM RAID)
> this increases throughput as the storage can often sink writes faster than a
> serial writer can push data. In your case, you have a single slow spindle with
> over 7,000 AGs. Thus, the allocator is writing to over 7,000 locations on that
> single disk simultaneously, or, at least, it's trying to. Thus, the poor head
> on that drive is being whipped all over the place without actually getting much
> writing done. To add insult to injury, this is one of those low-RPM, low
> head-performance "green" drives, correct?
>
> Trust the defaults. If they give you problems (unlikely) then we can't talk. ;)
>
>
Stan Hoeppner
2011-01-27 23:53:18 UTC
[Post body hidden by the archive; it is quoted in the reply below.]
d***@lang.hm
2011-01-28 02:09:58 UTC
On Thu, 27 Jan 2011, Stan Hoeppner wrote:

> ***@lang.hm put forth on 1/27/2011 2:11 PM:
>
>> how do I understand how to setup things on multi-disk systems? the documentation
>> I've found online is not that helpful, and in some ways contradictory.
>
> Visit http://xfs.org There you will find:
>
> Users guide:
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html
>
> File system structure:
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html
>
> Training labs:
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html

thanks for the pointers.

>> If there really are good rules for how to do this, it would be very helpful if
>> you could just give mkfs.xfs the information about your system (this partition
>> is on a 16 drive raid6 array) and have it do the right thing.
>
> If your disk array is built upon Linux mdraid, recent versions of mkfs.xfs will
> read the parameters and automatically make the filesystem accordingly, properly.
>
> mkfs.xfs will not do this for PCIe/x hardware RAID arrays or external FC/iSCSI
> based SAN arrays as there is no standard place to acquire the RAID configuration
> information for such systems. For these you will need to configure mkfs.xfs
> manually.
>
> At minimum you will want to specify stripe width (sw) which needs to match the
> hardware stripe width. For RAID0 sw=[#of_disks]. For RAID 10, sw=[#disks/2].
> For RAID5 sw=[#disks-1]. For RAID6 sw=[#disks-2].
>
> You'll want at minimum agcount=16 for striped hardware arrays. Depending on the
> number and spindle speed of the disks, the total size of the array, the
> characteristics of the RAID controller (big or small cache), you may want to
> increase agcount. Experimentation may be required to find the optimum
> parameters for a given hardware RAID array. Typically all other parameters may
> be left at defaults.

does this value change depending on the number of disks in the array?

> Picking the perfect mkfs.xfs parameters for a hardware RAID array can be
> somewhat of a black art, mainly because no two vendor arrays act or perform
> identically.

if mkfs.xfs can figure out how to do the 'right thing' for md raid arrays,
can there be a mode where it asks the users for the same information that
it gets from the kernel?

> Systems of a caliber requiring XFS should be thoroughly tested before going into
> production. Testing _with your workload_ of multiple parameters should be
> performed to identify those yielding best performance.

<rant>
the problem with this is that for large arrays, formatting the array and
loading it with data can take a day or more, even before you start running
the test. This is made even worse if you are scaling up an existing system
a couple of orders of magnitude, because you may not have the full workload
available to you. Saying that you should test out every option before
going into production is a cop-out. The better you can test it, the better
off you are, but without knowing what the knobs do, just doing a test and
twiddling the knobs to do another test isn't very useful. If there is a
way to set the knobs in the general ballpark, then you can test and see if
the performance seems adequate; if not, you can try tweaking one of the
knobs a little bit and see if it helps or hurts. But if the knobs aren't
even in the ballpark when you start, this doesn't help much.
</rant>

David Lang
Dave Chinner
2011-01-28 13:56:29 UTC
On Thu, Jan 27, 2011 at 06:09:58PM -0800, ***@lang.hm wrote:
> On Thu, 27 Jan 2011, Stan Hoeppner wrote:
> >***@lang.hm put forth on 1/27/2011 2:11 PM:
> >
> >>how do I understand how to setup things on multi-disk systems? the documentation
> >>I've found online is not that helpful, and in some ways contradictory.
> >
> >Visit http://xfs.org There you will find:
> >
> >Users guide:
> >http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html
> >
> >File system structure:
> >http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html
> >
> >Training labs:
> >http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html
>
> thanks for the pointers.
>
> >>If there really are good rules for how to do this, it would be very helpful if
> >>you could just give mkfs.xfs the information about your system (this partition
> >>is on a 16 drive raid6 array) and have it do the right thing.
> >
> >If your disk array is built upon Linux mdraid, recent versions of mkfs.xfs will
> >read the parameters and automatically make the filesystem accordingly, properly.
> >
> >mkfs.xfs will not do this for PCIe/x hardware RAID arrays or external FC/iSCSI
> >based SAN arrays as there is no standard place to acquire the RAID configuration
> >information for such systems. For these you will need to configure mkfs.xfs
> >manually.
> >
> >At minimum you will want to specify stripe width (sw) which needs to match the
> >hardware stripe width. For RAID0 sw=[#of_disks]. For RAID 10, sw=[#disks/2].
> >For RAID5 sw=[#disks-1]. For RAID6 sw=[#disks-2].
> >
> >You'll want at minimum agcount=16 for striped hardware arrays. Depending on the
> >number and spindle speed of the disks, the total size of the array, the
> >characteristics of the RAID controller (big or small cache), you may want to
> >increase agcount. Experimentation may be required to find the optimum
> >parameters for a given hardware RAID array. Typically all other parameters may
> >be left at defaults.
>
> does this value change depending on the number of disks in the array?

Only depending on block device capacity. Once at the maximum AG size
(1TB), mkfs has to add more AGs. So once above 4TB for hardware RAID
LUNs and 16TB for md/dm devices, you will get an AG per TB of
storage by default.

As it is, the optimal number and size of AGs will depend on many
geometry factors as well as workload factors, such as the size of the luns,
the way they are striped, whether you are using linear concatenation
of luns or striping them or a combination of both, the amount of
allocation concurrency you require, etc. In these sorts of
situations, mkfs can only make a best guess - to do better you
really need someone proficient in the dark arts to configure the
storage and filesystem optimally.
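One low-risk way to see what that best guess looks like for a particular device is mkfs.xfs's dry-run mode (the device name below is a placeholder):

```shell
# -N makes mkfs.xfs print the geometry it *would* create (agcount, agsize,
# sunit/swidth, log size) without writing anything to the device.
mkfs.xfs -N /dev/sdX
```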

> >Picking the perfect mkfs.xfs parameters for a hardware RAID array can be
> >somewhat of a black art, mainly because no two vendor arrays act or perform
> >identically.
>
> if mkfs.xfs can figure out how to do the 'right thing' for md raid
> arrays, can there be a mode where it asks the users for the same
> information that it gets from the kernel?

mkfs.xfs can get the information it needs directly from dm and md
devices. However, when hardware RAID luns present themselves to the
OS in an identical manner to single drives, how does mkfs tell the
difference between a 2TB hardware RAID lun made up of 30x73GB drives
and a single 2TB SATA drive? The person running mkfs should already
know this little detail....
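When the admin does know that little detail, it can be fed to mkfs by hand. A sketch with invented numbers - a 12-disk hardware RAID6 with a 64KiB per-disk chunk, i.e. 10 data disks:

```shell
# su = controller chunk size; sw = data-bearing disks (12 - 2 parity = 10).
# /dev/sdX stands in for the hardware RAID lun.
mkfs.xfs -d su=64k,sw=10 /dev/sdX
```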

> >Systems of a caliber requiring XFS should be thoroughly tested before going into
> >production. Testing _with your workload_ of multiple parameters should be
> >performed to identify those yielding best performance.
>
> <rant>
> the problem with this is that for large arrays, formatting the array
> and loading it with data can take a day or more, even before you
> start running the test. This is made even worse if you are scaling
> up an existing system a couple orders of magnitude, because you may
> not have the full workload available to you.

If your hardware procurement-to-production process doesn't include
testing performance of potential equipment on a representative
workload, then I'd say you have a process problem that we can't help
you solve....

> Saying that you should
> test out every option before going into production is a cop-out.

I never test every option. I know what the options do, so to decide
what to tweak (if anything) what I first need to know is how a
workload performs on a given storage layout with default options. I
need to have:

a) some idea of the expected performance of the workload
b) a baseline performance characterisation of the underlying
block devices
c) a set of baseline performance metrics from a
representative workload on a default filesystem
d) spent some time analysing the baseline metrics for
evidence of sub-optimal performance characteristics.

Once I have that information, I can suggest meaningful ways (if any)
to change the storage and filesystem configuration that may improve
the performance of the workload.
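As a hedged illustration of step (c), a baseline run might look something like the following - the fio job parameters and mount point are invented for the example, not recommendations:

```shell
# Write-heavy baseline on a freshly made default filesystem mounted at
# /mnt/test (placeholder). Substitute a job that mirrors your real workload.
fio --name=baseline --directory=/mnt/test \
    --rw=write --bs=1M --size=4g --numjobs=4 \
    --direct=1 --group_reporting
```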

BTW, if you ask me how to optimise an ext4 filesystem for the same
workload, I'll tell you straight up that I have no idea and that you
should ask an ext4 expert....

> The better you can test it, the better off you are, but without
> knowing what the knobs do, just doing a test and twiddling the
> knobs to do another test isn't very useful.

Well, yes, that is precisely the reason you should use the defaults.
It's also the reason we have experts - they know what knob to
twiddle to fix specific problems. If you prefer to twiddle knobs
like Blind Freddy, then you should expect things to go wrong....

> If there is a way to
> set the knobs in the general ballpark,

Have you ever considered that this is exactly what mkfs does when
you use the defaults? And that this is the fundamental reason we
keep saying "use the defaults"?

> then you can test and see
> if the performance seems adequate; if not, you can try tweaking one
> of the knobs a little bit and see if it helps or hurts. But if the
> knobs aren't even in the ballpark when you start, this doesn't
> help much.

The thread has now come full circle - you're ranting about not
knowing what the knobs do or how to set reasonable values, so you
want to twiddle random knobs to see if they do anything as the basis
of your optimisation process. This is the exact process that led to
the bug report that started this thread - a tweak-without-
understanding configuration leading to undesirable behavioural
characteristics from the filesystem.....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
d***@lang.hm
2011-01-28 19:26:00 UTC
Permalink
On Sat, 29 Jan 2011, Dave Chinner wrote:

> On Thu, Jan 27, 2011 at 06:09:58PM -0800, ***@lang.hm wrote:
>> On Thu, 27 Jan 2011, Stan Hoeppner wrote:
>>> ***@lang.hm put forth on 1/27/2011 2:11 PM:
>>>
>>> Picking the perfect mkfs.xfs parameters for a hardware RAID array can be
>>> somewhat of a black art, mainly because no two vendor arrays act or perform
>>> identically.
>>
>> if mkfs.xfs can figure out how to do the 'right thing' for md raid
>> arrays, can there be a mode where it asks the users for the same
>> information that it gets from the kernel?
>
> mkfs.xfs can get the information it needs directly from dm and md
> devices. However, when hardware RAID luns present themselves to the
> OS in an identical manner to single drives, how does mkfs tell the
> difference between a 2TB hardware RAID lun made up of 30x73GB drives
> and a single 2TB SATA drive? The person running mkfs should already
> know this little detail....

that's my point, the person running mkfs knows this information, and can
easily answer questions that mkfs asks (or provide this information on the
command line). but mkfs doesn't ask for this information, instead it asks
the user to define a whole bunch of parameters that are not well
understood. An XFS guru can tell you how to configure these parameters
based on different hardware layouts, but as long as it remains a 'black
art', getting new people up to speed is really hard. If this can be
reduced down to

is this a hardware raid device
if yes
how many drives are there
what raid type is used (linear, raid 0, 1, 5, 6, 10)

and whatever questions are needed, it would _greatly_ improve the quality
of the settings that non-guru people end up using.

David Lang
Dave Chinner
2011-01-29 05:40:21 UTC
Permalink
On Fri, Jan 28, 2011 at 11:26:00AM -0800, ***@lang.hm wrote:
> On Sat, 29 Jan 2011, Dave Chinner wrote:
>
> >On Thu, Jan 27, 2011 at 06:09:58PM -0800, ***@lang.hm wrote:
> >>On Thu, 27 Jan 2011, Stan Hoeppner wrote:
> >>>***@lang.hm put forth on 1/27/2011 2:11 PM:
> >>>
> >>>Picking the perfect mkfs.xfs parameters for a hardware RAID array can be
> >>>somewhat of a black art, mainly because no two vendor arrays act or perform
> >>>identically.
> >>
> >>if mkfs.xfs can figure out how to do the 'right thing' for md raid
> >>arrays, can there be a mode where it asks the users for the same
> >>information that it gets from the kernel?
> >
> >mkfs.xfs can get the information it needs directly from dm and md
> >devices. However, when hardware RAID luns present themselves to the
> >OS in an identical manner to single drives, how does mkfs tell the
> >difference between a 2TB hardware RAID lun made up of 30x73GB drives
> >and a single 2TB SATA drive? The person running mkfs should already
> >know this little detail....
>
> that's my point, the person running mkfs knows this information, and
> can easily answer questions that mkfs asks (or provide this
> information on the command line). but mkfs doesn't ask for this
> information, instead it asks the user to define a whole bunch of
> parameters that are not well understood.

I'm going to be blunt - XFS is not a filesystem suited to use by
clueless noobs. XFS is a highly complex filesystem designed for high
end, high performance storage and therefore has the configurability
and flexibility required by such environments. Hence I expect that
anyone configuring an XFS filesystem for a production environment
is a professional and has, at minimum, done their homework before
they go fiddling with knobs. And we have a FAQ for a reason. ;)

> An XFS guru can tell you
> how to configure these parameters based on different hardware
> layouts, but as long as it remains a 'black art' getting new people
> up to speed is really hard. If this can be reduced down to
>
> is this a hardware raid device
> if yes
> how many drives are there
> what raid type is used (linear, raid 0, 1, 5, 6, 10)
>
> and whatever questions are needed, it would _greatly_ improve the
> quality of the settings that non-guru people end up using.

As opposed to just making mkfs DTRT without needing to ask
questions?

If you really think an interactive mkfs-for-dummies script is
necessary, then go ahead and write one - you don't need to modify
mkfs at all to do it.....

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
d***@lang.hm
2011-01-29 06:08:42 UTC
Permalink
On Sat, 29 Jan 2011, Dave Chinner wrote:

> On Fri, Jan 28, 2011 at 11:26:00AM -0800, ***@lang.hm wrote:
>> On Sat, 29 Jan 2011, Dave Chinner wrote:
>>
>>> On Thu, Jan 27, 2011 at 06:09:58PM -0800, ***@lang.hm wrote:
>>>> On Thu, 27 Jan 2011, Stan Hoeppner wrote:
>>>>> ***@lang.hm put forth on 1/27/2011 2:11 PM:
>>>>>
>>>>> Picking the perfect mkfs.xfs parameters for a hardware RAID array can be
>>>>> somewhat of a black art, mainly because no two vendor arrays act or perform
>>>>> identically.
>>>>
>>>> if mkfs.xfs can figure out how to do the 'right thing' for md raid
>>>> arrays, can there be a mode where it asks the users for the same
>>>> information that it gets from the kernel?
>>>
>>> mkfs.xfs can get the information it needs directly from dm and md
>>> devices. However, when hardware RAID luns present themselves to the
>>> OS in an identical manner to single drives, how does mkfs tell the
>>> difference between a 2TB hardware RAID lun made up of 30x73GB drives
>>> and a single 2TB SATA drive? The person running mkfs should already
>>> know this little detail....
>>
>> that's my point, the person running mkfs knows this information, and
>> can easily answer questions that mkfs asks (or provide this
>> information on the command line). but mkfs doesn't ask for this
>> information, instead it asks the user to define a whole bunch of
>> parameters that are not well understood.
>
> I'm going to be blunt - XFS is not a filesystem suited to use by
> clueless noobs. XFS is a highly complex filesystem designed for high
> end, high performance storage and therefore has the configurability
> and flexibility required by such environments. Hence I expect that
> anyone configuring an XFS filesystem for a production environment
> is a professional and has, at minimum, done their homework before
> they go fiddling with knobs. And we have a FAQ for a reason. ;)
>
>> An XFS guru can tell you
>> how to configure these parameters based on different hardware
>> layouts, but as long as it remains a 'black art' getting new people
>> up to speed is really hard. If this can be reduced down to
>>
>> is this a hardware raid device
>> if yes
>> how many drives are there
>> what raid type is used (linear, raid 0, 1, 5, 6, 10)
>>
>> and whatever questions are needed, it would _greatly_ improve the
>> quality of the settings that non-guru people end up using.
>
> As opposed to just making mkfs DTRT without needing to ask
> questions?

but you just said that mkfs couldn't do this with hardware raid because it
can't "tell the difference between a 2TB hardware RAID lun made up of
30x73GB drives and a single 2TB SATA drive". If it could tell the
difference, it should just do the right thing, but if it can't tell the
difference, it should ask the user who can give it the answer.

also, keep in mind that what it learns about the 'disks' from md and dm
may not be the complete picture. I have one system that thinks it's doing
a raid0 across 10 drives, but it's really 160 drives, grouped into 10
raid6 sets by hardware raid, that then get combined by md.

I am all for the defaults and auto-config being as good as possible (one
of my biggest gripes about postgres is how bad its defaults are), but when
you can't tell what reality is, ask the admin who knows (or at least have
the option of asking the admin).

> If you really think an interactive mkfs-for-dummies script is
> necessary, then go ahead and write one - you don't need to modify
> mkfs at all to do it.....

it doesn't have to be interactive, the answers to the questions could be
command-line options.

as for the reason that I don't do this, that's simple. I don't know enough
of the black arts to know what the logic is to convert from knowing the
disk layout to setting the existing parameters.

David Lang
Dave Chinner
2011-01-29 07:35:54 UTC
Permalink
On Fri, Jan 28, 2011 at 10:08:42PM -0800, ***@lang.hm wrote:
> On Sat, 29 Jan 2011, Dave Chinner wrote:
>
> >On Fri, Jan 28, 2011 at 11:26:00AM -0800, ***@lang.hm wrote:
> >>On Sat, 29 Jan 2011, Dave Chinner wrote:
> >>
> >>>On Thu, Jan 27, 2011 at 06:09:58PM -0800, ***@lang.hm wrote:
> >>>>On Thu, 27 Jan 2011, Stan Hoeppner wrote:
> >>>>>***@lang.hm put forth on 1/27/2011 2:11 PM:
> >>>>>
> >>>>>Picking the perfect mkfs.xfs parameters for a hardware RAID array can be
> >>>>>somewhat of a black art, mainly because no two vendor arrays act or perform
> >>>>>identically.
> >>>>
> >>>>if mkfs.xfs can figure out how to do the 'right thing' for md raid
> >>>>arrays, can there be a mode where it asks the users for the same
> >>>>information that it gets from the kernel?
> >>>
> >>>mkfs.xfs can get the information it needs directly from dm and md
> >>>devices. However, when hardware RAID luns present themselves to the
> >>>OS in an identical manner to single drives, how does mkfs tell the
> >>>difference between a 2TB hardware RAID lun made up of 30x73GB drives
> >>>and a single 2TB SATA drive? The person running mkfs should already
> >>>know this little detail....
> >>
> >>that's my point, the person running mkfs knows this information, and
> >>can easily answer questions that mkfs asks (or provide this
> >>information on the command line). but mkfs doesn't ask for this
> >>information, instead it asks the user to define a whole bunch of
> >>parameters that are not well understood.
> >
> >I'm going to be blunt - XFS is not a filesystem suited to use by
> >clueless noobs. XFS is a highly complex filesystem designed for high
> >end, high performance storage and therefore has the configurability
> >and flexibility required by such environments. Hence I expect that
> >anyone configuring an XFS filesystem for a production environment
> >is a professional and has, at minimum, done their homework before
> >they go fiddling with knobs. And we have a FAQ for a reason. ;)
> >
> >>An XFS guru can tell you
> >>how to configure these parameters based on different hardware
> >>layouts, but as long as it remains a 'black art' getting new people
> >>up to speed is really hard. If this can be reduced down to
> >>
> >>is this a hardware raid device
> >> if yes
> >> how many drives are there
> >> what raid type is used (linear, raid 0, 1, 5, 6, 10)
> >>
> >>and whatever questions are needed, it would _greatly_ improve the
> >>quality of the settings that non-guru people end up using.
> >
> >As opposed to just making mkfs DTRT without needing to ask
> >questions?
>
> but you just said that mkfs couldn't do this with hardware raid
> because it can't "tell the difference between a 2TB hardware RAID
> lun made up of 30x73GB drives and a single 2TB SATA drive" if it
> could tell the difference, it should just do the right thing, but if
> it can't tell the difference, it should ask the user who can give it
> the answer.

Just because we can't do it right now doesn't mean it is not
possible. Array/raid controller vendors need to implement the SCSI
block limits VPD page, and if they do then stripe unit/stripe width
may be exposed for the device in sysfs. However, I haven't seen any
devices except for md and dm that actually export values that
reflect sunit/swidth in the files:

/sys/block/<dev>/queue/minimum_io_size
/sys/block/<dev>/queue/optimal_io_size

There's information about it here:

http://www.kernel.org/doc/ols/2009/ols2009-pages-235-238.pdf

But what we really need here is for RAID vendors to implement the
part of the SCSI protocol that gives us the necessary information.
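When a device does export its topology, the sysfs values above map onto stripe geometry roughly as follows (a sketch; the sample numbers and device name are invented):

```shell
# minimum_io_size ~ stripe unit, optimal_io_size ~ full stripe width.
# Equal or zero values mean no stripe geometry was exported.
io_hints() {
    min=$1 opt=$2                       # bytes, as read from sysfs
    if [ "$opt" -gt "$min" ] && [ "$min" -gt 0 ]; then
        echo "su=${min} sw=$(( opt / min ))"
    else
        echo "no stripe geometry exported"
    fi
}

# Real usage would read the files directly, e.g.:
#   io_hints "$(cat /sys/block/sdX/queue/minimum_io_size)" \
#            "$(cat /sys/block/sdX/queue/optimal_io_size)"
io_hints 65536 655360    # prints: su=65536 sw=10
```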

> also, keep in mind that what it learns about the 'disks' from md and
> dm may not be the complete picture. I have one system that thinks
> it's doing a raid0 across 10 drives, but it's really 160 drives,
> grouped into 10 raid6 sets by hardware raid, than then gets combined
> by md.

MD doesn't care whether the block devices are single disks or RAID
LUNS. In this case, it's up to you to configure the md chunk size
appropriately for those devices. i.e. the MD chunk size needs to be
the RAID6 lun stripe width. If you get the MD config right, then
mkfs will do exactly the right thing without needing to be tweaked.
The same goes for any sort of hierarchical aggregation of storage -
if you don't get the geometry right at each level, then performance
will suck.
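A sketch of what "get the MD config right" could mean for a stack like that, with invented LUN geometry (10 hardware RAID6 LUNs, each with 8 data disks and a 128KiB chunk, so a 1024KiB full stripe per LUN):

```shell
# md chunk (in KiB) set to one full stripe of the underlying RAID6 luns:
# 8 data disks * 128KiB chunk = 1024KiB. Device names are placeholders.
mdadm --create /dev/md0 --level=0 --raid-devices=10 --chunk=1024 \
    /dev/mapper/lun0 /dev/mapper/lun1   # ... through lun9

# mkfs.xfs can then read the (now meaningful) md geometry automatically:
mkfs.xfs /dev/md0
```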

FWIW, SGI has been using XFS in complex, multilayer, multipath,
hierarchical configurations like this for 15 years. What you
describe is a typical, everyday configuration that XFS is used on
and it is this sort of configuration we tend to optimise the
default behaviour for....

> I am all for the defaults and auto-config being as good as possible
> (one of my biggest gripes about postgres is how bad it's defaults
> are), but whe you can't tell what reality is, ask the admin who
> knows (or at least have the option of asking the admin)
>
> >If you really think an interactive mkfs-for-dummies script is
> >necessary, then go ahead and write one - you don't need to modify
> >mkfs at all to do it.....
>
> it doesn't have to be interactive, the answers to the questions
> could be comand-line options.

Which means you're assuming a competent admin is running the tool,
in which case they could just run mkfs directly. Anyway, it still
doesn't need mkfs changes.

> as for the reason that I don't do this, that's simple. I don't know
> enough of the black arts to know what the logic is to convert from
> knowing the disk layout to setting the existing parameters.

Writing such a script would be a good way to learn the art and
document the information that people are complaining that is
lacking. I don't have the time (or need) to write such a script, but
I can answer questions when they arise should someone decide to do
it....
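A minimal sketch of such a script, to show the shape of the logic rather than guru-grade settings - the level-to-sw mapping follows the rules quoted earlier in the thread, and every name and number here is an illustrative assumption:

```shell
#!/bin/sh
# data_disks: number of data-bearing spindles for a given RAID level.
data_disks() {
    level=$1 ndisks=$2
    case "$level" in
        0)  echo "$ndisks" ;;
        10) echo $(( ndisks / 2 )) ;;
        5)  echo $(( ndisks - 1 )) ;;
        6)  echo $(( ndisks - 2 )) ;;
        *)  echo 0 ;;               # linear/unknown: no stripe alignment
    esac
}

# mkxfs_raid <level> <ndisks> <chunk_kib> <device>
# Prints (rather than runs) the mkfs.xfs command it would use.
mkxfs_raid() {
    sw=$(data_disks "$1" "$2")
    if [ "$sw" -gt 0 ]; then
        echo mkfs.xfs -d "su=${3}k,sw=${sw}" "$4"
    else
        echo mkfs.xfs "$4"
    fi
}

mkxfs_raid 6 16 64 /dev/sdX    # prints: mkfs.xfs -d su=64k,sw=14 /dev/sdX
```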

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Christoph Hellwig
2011-01-31 19:17:20 UTC
Permalink
On Sat, Jan 29, 2011 at 06:35:54PM +1100, Dave Chinner wrote:
> Just because we can't do it right now doesn't mean it is not
> possible. Array/raid controller vendors need to implement the SCSI
> block limit VPD page, and if they do then stripe unit/stripe width
> may be exposed for the device in sysfs. However, I haven't seen any
> devices except for md and dm that actually export values that
> reflect sunit/swidth in the files:

I have access to a few big vendor arrays that export it, but I think
they are still running beta firmware versions.
Mark Lord
2011-01-27 21:56:20 UTC
Permalink
On 11-01-27 02:40 PM, Stan Hoeppner wrote:
..
> You need to use the mkfs.xfs defaults for any single drive filesystem, and trust
> the allocator to do the right thing.

But it did not do the right thing when I used the defaults.
Big files ended up with tons of (exactly) 64MB extents, ISTR.

With the increased number of ags, I saw much less fragmentation,
and the drive was still very light on I/O despite multiple simultaneous
recordings, commflaggers, and playback at once.

The only, ONLY, glitch, was this recent "first write takes 45 seconds" glitch.
After that initial write after boot, throughput was normal (great).

Thus the attempts to tweak.

> Trust the defaults.

I imagine the defaults are designed to handle a typical Linux install,
with 100,000 to 1,000,000 files varying from a few bytes to a few megabytes.

That's not what this filesystem will have. It will have only a few thousand
(max) inodes at any given time, but each file will be HUGE.

XFS is fantastic at adapting to the workload, but I'd like to try and have
it tuned more closely for the known workload this system is throwing at it.

I'm now trying again, but with 8 ags instead of 8000+.

Thanks!
Dave Chinner
2011-01-28 00:17:35 UTC
Permalink
On Thu, Jan 27, 2011 at 04:56:20PM -0500, Mark Lord wrote:
> On 11-01-27 02:40 PM, Stan Hoeppner wrote:
> ..
> > You need to use the mkfs.xfs defaults for any single drive filesystem, and trust
> > the allocator to do the right thing.
>
> But it did not do the right thing when I used the defaults.
> Big files ended up with tons of (exactly) 64MB extents, ISTR.

Because your AG size is 64MB. An extent can't be larger than an AG.
Hence you are fragmenting your large files unnecessarily, as extents
can be up to 8GB in size on a 4k block size filesystem.

> With the increased number of ags, I saw much less fragmentation,
> and the drive was still very light on I/O despite multiple simultaneous
> recordings, commflaggers, and playback at once.

The allocsize mount option is the preferred method of keeping
fragmentation down for DVR-style workloads.
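For anyone following along, allocsize is set at mount time; the 512m value and mount point below are only illustrations, not recommendations from this thread:

```shell
# Preallocate space beyond EOF in 512MiB chunks while files are being
# written, so concurrent streams don't interleave their blocks on disk.
mount -o allocsize=512m /dev/sdX /var/lib/mythtv
```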

> > Trust the defaults.
>
> I imagine the defaults are designed to handle a typical Linux install,
> with 100,000 to 1,000,000 files varying from a few bytes to a few megabytes.

Why would we optimise a filesystem designed for use on high end
storage and large amounts of IO concurrency for what a typical Linux
desktop needs? For such storage (i.e. single spindle) mkfs optimises
the layout for minimal seeks and relatively low amounts of
concurrency. This gives _adequate_ performance on desktop machines
without compromising scalability on high end storage.

In my experience with XFS, most people who tweak mkfs parameters end
up with some kind of problem they can't explain and don't know how
to solve. And they are typically problems that would not have
occurred had they simply used the defaults in the first place. What
you've done is a perfect example of this.

Yes, I know we are taking the fun out of tweaking knobs so you can
say it's 1% faster than the default, but that's our job: to
determine the right default settings so the filesystem works as well
as possible out of the box with no tweaking for most workloads on a
wide range of storage....

> That's not what this filesystem will have. It will have only a few thousand
> (max) inodes at any given time, but each file will be HUGE.

Which is exactly the use case XFS was designed for, and....

> XFS is fantastic at adapting to the workload, but I'd like to try and have
> it tuned more closely for the known workload this system is throwing at it.

.... as such the mkfs defaults are already tuned as well as they can
be for such usage.

> I'm now trying again, but with 8 ags instead of 8000+.

Why 8 AGs and not the default?

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Lord
2011-01-28 01:22:48 UTC
Permalink
On 11-01-27 07:17 PM, Dave Chinner wrote:
>
> In my experience with XFS, most people who tweak mkfs parameters end
> up with some kind of problem they can't explain and don't know how
> to solve. And they are typically problems that would not have
> occurred had they simply used the defaults in the first place. What
> you've done is a perfect example of this.

Maybe. But what I read from the paragraph above,
is that the documentation could perhaps explain things better,
and then people other than the coders might understand how
best to tweak it.

> Why 8 AGs and not the default?

How AGs are used is not really explained anywhere I've looked,
so I am guessing at what they do and how the system might respond
to different values there (that documentation thing again).

Lacking documentation, my earlier experiences suggest that more AGs
gives me less fragmentation when multiple simultaneous recording streams
are active. I got higher fragmentation with the defaults than with
the tweaked value.

Now, that might be due to differences in kernel versions too,
as things in XFS are continuously getting even better (thanks!),
and the original "defaults" assessment was with the kernel-of-the-day
back in early 2010 (2.6.34?), and now the system is using 2.6.37.

But I just don't know. My working theory, likely entirely wrong,
is that if I have N streams active, odds are that each of those
streams might get assigned to different AGs, given sufficient AGs >= N.

Since the box often has 3-7 recording streams active,
I'm trying it out with 8 AGs now.

Cheers
Mark Lord
2011-01-28 01:36:26 UTC
Permalink
On 11-01-27 08:22 PM, Mark Lord wrote:
> On 11-01-27 07:17 PM, Dave Chinner wrote:
>>
>> In my experience with XFS, most people who tweak mkfs parameters end
>> up with some kind of problem they can't explain and don't know how
>> to solve. And they are typically problems that would not have
>> occurred had they simply used the defaults in the first place. What
>> you've done is a perfect example of this.
>
> Maybe. But what I read from the paragraph above,
> is that the documentation could perhaps explain things better,
> and then people other than the coders might understand how
> best to tweak it.


By the way, the documentation is excellent, for a developer who
wants to work on the codebase. It describes the data structures
and layouts etc.. better than perhaps any other Linux filesystem.

But it doesn't seem to describe the algorithms, such as how it
decides where to store a recording stream.

I'm not complaining, far from it. XFS is simply wonderful,
and my DVR literally couldn't work without it.

But I am as technical as you are, and I like to experiment and
understand the technology I use. That's partly why we both work
on the Linux kernel.

Cheers
David Rees
2011-01-28 04:14:50 UTC
Permalink
On Thu, Jan 27, 2011 at 5:22 PM, Mark Lord <***@teksavvy.com> wrote:
> But I just don't know. My working theory, likely entirely wrong,
> is that if I have N streams active, odds are that each of those
> streams might get assigned to different AGs, given sufficient AGs >= N.
>
> Since the box often has 3-7 recording streams active,
> I'm trying it out with 8 AGs now.

As suggested before - why are you messing with AGs instead of allocsize?

I suspect that with the default configuration, XFS was trying to
maximize throughput by reducing seeks with multiple processes writing
streams.

But now, you're telling XFS that it's OK to write in up to 8 different
locations on the disk without worrying about seek performance.

I think this is likely to result in overall worse performance at the
worst time - under write load.

If you are trying to optimize single thread read performance by
minimizing file fragments, why don't you simply figure out at what
point increasing allocsize stops increasing read performance?

I suspect that the defaults do a good job because even if your files
are fragmented in 64MB chunks because you have multiple streams
writing, those chunks are very likely to be very close together, so
there isn't much of a seek penalty.

-Dave
Mark Lord
2011-01-28 14:22:18 UTC
Permalink
On 11-01-27 11:14 PM, David Rees wrote:
>
> As suggested before - why are you messing with AGs instead of allocsize?

Who said "instead of"? I'm using both.

Cheers
Dave Chinner
2011-01-28 07:31:19 UTC
Permalink
On Thu, Jan 27, 2011 at 08:22:48PM -0500, Mark Lord wrote:
> On 11-01-27 07:17 PM, Dave Chinner wrote:
> >
> > In my experience with XFS, most people who tweak mkfs parameters end
> > up with some kind of problem they can't explain and don't know how
> > to solve. And they are typically problems that would not have
> > occurred had they simply used the defaults in the first place. What
> > you've done is a perfect example of this.
>
> Maybe. But what I read from the paragraph above,
> is that the documentation could perhaps explain things better,
> and then people other than the coders might understand how
> best to tweak it.

A simple google search turns up discussions like this:

http://oss.sgi.com/archives/xfs/2009-01/msg01161.html

Where someone reads the docco and asks questions to fill in gaps in
their knowledge that the docco didn't explain fully before they
try to twiddle knobs.

Configuring XFS filesystems for optimal performance has always been
a black art because it requires you to understand your storage, your
application workload(s) and XFS from the ground up. Most people
can't even tick one of those boxes, let alone all three....

> > Why 8 AGs and not the default?
>
> How AGs are used is not really explained anywhere I've looked,
> so I am guessing at what they do and how the system might respond
> to different values there (that documentation thing again).

Section 5.1 of this 1996 whitepaper tells you what allocation groups
are and the general allocation strategy around them:

http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html

> Lacking documentation, my earlier experiences suggest that more AGs
> gives me less fragmentation when multiple simultaneous recording streams
> are active. I got higher fragmentation with the defaults than with
> the tweaked value.

Fragmentation is not a big problem if you've got extents larger than
a typical IO. Once extents get to a few megabytes in size, it just
doesn't matter if they are any bigger for small DVR workloads
because the seek cost between streams is sufficiently amortised with
a few MB of sequential access per stream....
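An easy way to check whether a file's extents have already reached that "doesn't matter" size is to dump its extent map (the path is a placeholder):

```shell
# Lists each extent with its offset and length; if most extents are
# multi-megabyte, further anti-fragmentation tuning is unlikely to help.
xfs_bmap -v /mnt/dvr/recording.mpg
```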

> Now, that might be due to differences in kernel versions too,
> as things in XFS are continuously getting even better (thanks!),
> and the original "defaults" assessment was with the kernel-of-the-day
> back in early 2010 (2.6.34?), and now the system is using 2.6.37.
>
> But I just don't know. My working theory, likely entirely wrong,
> is that if I have N streams active, odds are that each of those
> streams might get assigned to different AGs, given sufficient AGs >= N.

Streaming into different AGs is not necessarily the right solution;
it causes seeks between every stream, and the stream in AG 0 will be
able to read/write faster than the stream in AG 7 because of their
locations on disk.

IOWs, interleaving streams within an AG might give better IO
patterns, lower latency and better throughput. Of course, it depends
on the storage subsystem, the application, etc. And yes, you can
change this sort of allocation behaviour by fiddling with XFS knobs
in the right way - start to see what I mean about tuning XFS really
being a "black art"?
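
No specific knob is named here, but as one illustration, an allocator switch that targets exactly this multi-stream case is the filestreams mount option, which tries to keep each directory's new files grouped in their own AG. Whether separating or interleaving streams wins is workload-dependent, so treat this as a sketch, not a recommendation (device and mount point taken from elsewhere in the thread):

```shell
# Sketch only: the filestreams allocator groups each directory's new
# files into their own AG, changing how concurrent recording streams
# are laid out on disk. Benchmark before and after; it can also hurt.
mount -o noatime,filestreams /dev/sdb1 /var/lib/mythtv
```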

> Since the box often has 3-7 recording streams active,
> I'm trying it out with 8 AGs now.

Seems like a reasonable decision. Good luck.
>
> Cheers
>
>

--
Dave Chinner
***@fromorbit.com
Mark Lord
2011-01-28 14:33:02 UTC
Permalink
On 11-01-28 02:31 AM, Dave Chinner wrote:
>
> A simple google search turns up discussions like this:
>
> http://oss.sgi.com/archives/xfs/2009-01/msg01161.html

"in the long term we still expect fragmentation to degrade the performance of
XFS file systems"
Other than that, no hints there about how changing agcount affects things.


> Configuring XFS filesystems for optimal performance has always been
> a black art because it requires you to understand your storage, your
> application workload(s) and XFS from the ground up. Most people
> can't even tick one of those boxes, let alone all three....

Well, I've got 2/3 of those down just fine, thanks.
But it's the "XFS" part that is still the "black art" part,
because so little is written about *how* it works
(as opposed to how it is laid out on disk).

Again, that's only a minor complaint -- XFS is way better documented
than the alternatives, and also works way better than the others I've
tried here on this workload.

>>> Why 8 AGs and not the default?
>>
>> How AGs are used is not really explained anywhere I've looked,
>> so I am guessing at what they do and how the system might respond
>> to different values there (that documentation thing again).
>
> Section 5.1 of this 1996 whitepaper tells you what allocation groups
> are and the general allocation strategy around them:
>
> http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html

Looks a bit dated: "Allocation groups are typically 0.5 to 4 gigabytes in size."
But it does suggest that "processes running concurrently can allocate space
in the file system concurrently without interfering with each other".

Dunno if that's still true today, but it sounds pretty close to what I was
theorizing about how it might work.

> start to see what I mean about tuning XFS really being a "black art"?

No, I've seen that "black" (aka. undefined, undocumented) part from the start. :)
Thanks for chipping in here, though -- it's been really useful.

Cheers!
Dave Chinner
2011-01-28 23:58:47 UTC
Permalink
On Fri, Jan 28, 2011 at 09:33:02AM -0500, Mark Lord wrote:
> On 11-01-28 02:31 AM, Dave Chinner wrote:
> >
> > A simple google search turns up discussions like this:
> >
> > http://oss.sgi.com/archives/xfs/2009-01/msg01161.html
>
> "in the long term we still expect fragmentation to degrade the performance of
> XFS file systems"

"so we intend to add an on-line file system defragmentation utility
to optimize the file system in the future"

You are quoting from the wrong link - that's from the 1996
whitepaper. And sure, at the time that was written, nobody had any
real experience with long term aging of XFS filesystems so it was
still a guess at that point. XFS has had that online defragmentation
utility since 1998, IIRC, even though in most cases it is
unnecessary to use it.

> Other than that, no hints there about how changing agcount affects things.

If the reason given in the whitepaper for multiple AGs (i.e. they
are for increasing the concurrency of allocation) doesn't help you
understand why you'd want to increase the number of AGs in the
filesystem, then you haven't really thought about what you read.

As it is, from the same google search that found the above link
as #1 hit, this was #6:

http://oss.sgi.com/archives/xfs/2010-11/msg00497.html

| > AG count has a
| > direct relationship to the storage hardware, not the number of CPUs
| > (cores) in the system
|
| Actually, I used 16 AGs because it's twice the number of CPU cores
| and I want to make sure that CPU parallel workloads (e.g. make -j 8)
| don't serialise on AG locks during allocation. IOWs, I laid it out
| that way precisely because of the number of CPUs in the system...
|
| And to point out the not-so-obvious, this is the _default layout_
| that mkfs.xfs in the debian squeeze installer came up with. IOWs,
| mkfs.xfs did exactly what I wanted without me having to tweak
| _anything_.
|
[...]
|
| In that case, you are right. Single spindle SRDs go backwards in
| performance pretty quickly once you go over 4 AGs...

It seems to me that you haven't really done much looking for
information; there's lots of relevant advice in xfs mailing list
archives...

(and before you ask - SRD == Spinning Rust Disk)

> > Configuring XFS filesystems for optimal performance has always been
> > a black art because it requires you to understand your storage, your
> > application workload(s) and XFS from the ground up. Most people
> > can't even tick one of those boxes, let alone all three....
>
> Well, I've got 2/3 of those down just fine, thanks.
> But it's the "XFS" part that is still the "black art" part,
> because so little is written about *how* it works
> (as opposed to how it is laid out on disk).

If you want to know exactly how it works, there's plenty of code to
read. I know, you're going to call that a cop-out, but I've got more
important things to do than document 20,000 lines of allocation
code just for you.

In a world of infinite resources then everything would be documented
just the way you want, but we don't have infinite resources so it
remains documented by the code that implements it. However, if you
want to go and understand it and document it all for us, then we'll
happily take the patches. :)

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Martin Steigerwald
2011-01-28 19:18:11 UTC
Permalink
Am Thursday 27 January 2011 schrieb Stan Hoeppner:
> Trust the defaults. If they give you problems (unlikely) then we can't
> talk. ;)

With one addition: Use a recent xfsprogs! ;)

Earlier ones created more AGs, didn't activate lazy super block counter
(likely no issue here) and whatnot...

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
John Stoffel
2011-01-27 20:24:25 UTC
Permalink
>>>>> "Mark" == Mark Lord <***@teksavvy.com> writes:

Mark> On 11-01-27 10:40 AM, Justin Piszcz wrote:
>>
>>
>> On Thu, 27 Jan 2011, Mark Lord wrote:
Mark> ..
>>> Can you recommend a good set of mkfs.xfs parameters to suit the characteristics
>>> of this system? Eg. Only a few thousand active inodes, and nearly all files are
>>> in the 600MB -> 20GB size range. The usage pattern it must handle is up to
>>> six concurrent streaming writes at the same time as up to three streaming reads,
>>> with no significant delays permitted on the reads.
>>>
>>> That's the kind of workload that I find XFS handles nicely,
>>> and EXT4 has given me trouble with in the past.
Mark> ..
>> I did a load of benchmarks a long time ago testing every mkfs.xfs option there
>> was, and I found that most of the time (if not all), the defaults were the best.
Mark> ..

Mark> I am concerned with fragmentation on the very special workload
Mark> in this case. I'd really like the 20GB files, written over a
Mark> 1-2 hour period, to consist of a very few very large extents, as
Mark> much as possible.

Mark> Rather than hundreds or thousands of "tiny" MB sized extents. I
Mark> wonder what the best mkfs.xfs parameters might be to encourage
Mark> that?

Hmmm, should the application be pre-allocating the disk space then, so
that the writes get into nice large extents automatically? Isn't this
what the fallocate() system call is for? Doesn't MythTV use this?
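
The suggestion above can be tried from the shell with the fallocate(1) utility, which wraps the fallocate() system call; the path and size here are made up for illustration:

```shell
# Preallocate space so the filesystem can reserve one large contiguous
# extent up front instead of growing the file write by write.
fallocate -l 1M /tmp/demo-prealloc.bin
stat -c %s /tmp/demo-prealloc.bin    # prints 1048576
rm /tmp/demo-prealloc.bin
```

A DVR would do the equivalent with something like `fallocate -l 20G` on the recording file at creation time (or call posix_fallocate() directly) so that later writes land in already-reserved extents.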

I don't use XFS, or MythTV, but I like keeping track of this stuff.

John
Dave Chinner
2011-01-27 23:41:52 UTC
Permalink
On Thu, Jan 27, 2011 at 10:12:23AM -0500, Mark Lord wrote:
> On 11-01-27 12:30 AM, Stan Hoeppner wrote:
> > Mark Lord put forth on 1/26/2011 9:49 PM:
> >
> >> agcount=7453
> >
> > That's probably a bit high Mark, and very possibly the cause of your problems.
> > :) Unless the disk array backing this filesystem has something like 400-800
> > striped disk drives. You said it's a single 2TB drive right?
> >
> > The default agcount for a single drive filesystem is 4 allocation groups. For
> > mdraid (of any number of disks/configuration) it's 16 allocation groups.
> >
> > Why/how did you end up with 7452 allocation groups? That can definitely cause
> > some performance issues due to massively excessive head seeking, and possibly
> > all manner of weirdness.
>
> This is great info, exactly the kind of feedback I was hoping for!
>
> The filesystem is about a year old now, and I probably used agsize=nnnnn
> when creating it or something.
>
> So if this resulted in what you consider to be many MANY too MANY ags,
> then I can imagine the first new file write wanting to go out and read
> in all of the ag data to determine the "best fit" or something.
> Which might explain some of the delay.
>
> Once I get the new 2TB drive, I'll re-run mkfs.xfs and then copy everything
> over onto a fresh xfs filesystem.
>
> Can you recommend a good set of mkfs.xfs parameters to suit the characteristics
> of this system?

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

And perhaps you want to consider the allocsize mount option, though
that shouldn't be necessary for 2.6.38+...

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Mark Lord
2011-01-28 00:59:51 UTC
Permalink
On 11-01-27 06:41 PM, Dave Chinner wrote:
> On Thu, Jan 27, 2011 at 10:12:23AM -0500, Mark Lord wrote:
..
>> Can you recommend a good set of mkfs.xfs parameters to suit the characteristics
>> of this system?
>
> http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

That entry says little beyond "blindly trust the defaults".
But thanks anyway (really).

> And perhaps you want to consider the allocsize mount option, though
> that shouldn't be necessary for 2.6.38+...

That's a good tip, thanks.
From my earlier posting:

> /dev/sdb1 on /var/lib/mythtv type xfs
> (rw,noatime,allocsize=64M,logbufs=8,largeio)

Maybe that allocsize value could be increased though.
Perhaps something on the order of 256MB might do it.
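
For reference, the corresponding /etc/fstab line would just carry the larger value in the options field. This is a sketch of the change being speculated about, not a tested recommendation:

```shell
# /etc/fstab sketch: same options as the current mount, with allocsize
# bumped from 64m to the speculated 256m (an untested guess, not advice)
/dev/sdb1  /var/lib/mythtv  xfs  rw,noatime,allocsize=256m,logbufs=8,largeio  0  0
```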

Thanks again!
Dave Chinner
2011-01-27 23:39:29 UTC
Permalink
On Wed, Jan 26, 2011 at 10:49:03PM -0500, Mark Lord wrote:
> On 11-01-26 10:30 PM, Dave Chinner wrote:
> > [Please cc ***@oss.sgi.com on XFS bug reports. Added.]
> >
> > On Wed, Jan 26, 2011 at 08:22:25PM -0500, Mark Lord wrote:
> >> Alex / Christoph,
> >>
> >> My mythtv box here uses XFS on a 2TB drive for storing recordings and videos.
> >> It is behaving rather strangely though, and has gotten worse recently.
> >> Here is what I see happening:
> >>
> >> The drive mounts fine at boot, but the very first attempt to write a new file
> >> to the filesystem suffers from a very very long pause, 30-60 seconds, during which
> >> time the disk activity light is fully "on".
> >
> > Please post the output of xfs_info <mtpt> so we can see what you
> > filesystem configuration is.
>
> /dev/sdb1 on /var/lib/mythtv type xfs
> (rw,noatime,allocsize=64M,logbufs=8,largeio)
>
> [~] xfs_info /var/lib/mythtv
> meta-data=/dev/sdb1 isize=256 agcount=7453, agsize=65536 blks
> = sectsz=512 attr=2
> data = bsize=4096 blocks=488378638, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal bsize=4096 blocks=32768, version=2
> = sectsz=512 sunit=0 blks, lazy-count=0
> realtime =none extsz=4096 blocks=0, rtextents=0

7453 AGs means that the first write could cause up to ~7500 disk
reads to occur as the AGF headers are read in to find where the
best free space extent for allocation lies.

That'll be your problem.
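
The ~7500 figure falls straight out of the xfs_info numbers above; a quick recomputation (pure arithmetic, no XFS needed):

```shell
# Recompute the AG count from the xfs_info output quoted above.
blocks=488378638   # data blocks ("blocks=" in the data section)
agsize=65536       # blocks per AG ("agsize=" in the meta-data line)
full=$((blocks / agsize))
rem=$((blocks % agsize))
ags=$full
[ "$rem" -gt 0 ] && ags=$((full + 1))
echo "$ags"        # 7453: 7452 full-size AGs plus one short AG at the end
```

So the tiny 256 MB agsize (65536 x 4 KB blocks) on a 2 TB device is what produced thousands of AGs, and each AG has an AGF header that may need reading before the first allocation.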

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com