Discussion: atime and filesystems with snapshots (especially Btrfs)
Alexander Block
2012-05-25 15:35:37 UTC
Hello,

(this is a resend with proper CC for linux-fsdevel and linux-kernel)

I would like to start a discussion on atime in Btrfs (and other
filesystems with snapshot support).

As atime is updated on every access of a file or directory, we get
many changes to the trees in btrfs, which as always trigger CoW
operations. This is no problem as long as the changed tree blocks are
not shared by other subvolumes. Performance is also not a problem,
whether the blocks are shared or not (thanks to relatime, which is
the default).
The problems start when someone starts to use snapshots. If you, for
example, snapshot your root and continue working on it, after some
time big parts of the tree will be CoW'd and unshared. In the worst
case, the whole tree gets unshared and thus takes up double the
space. Normally, a user would expect a tree to take up extra space
only when he changes something.
A worst-case scenario would be someone taking regular snapshots for
backup purposes and later grepping the contents of all snapshots to
find a specific file. This would touch all inodes in all trees and
thus unshare big parts of every tree.

relatime (which is the default) reduces this problem a little, as it
only updates atime once a day. This means that if you want to test
this problem, you have to mount with relatime disabled or change the
system date before trying to trigger atime updates (that's how I
tested it).
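
For example, mounting with strictatime should force an atime update on every access (a sketch; the device and mount point are placeholders):

# mount -o strictatime /dev/sdX /mnt

Alternatively, with relatime you can jump the system clock past the 24-hour window (GNU date):

# date -s "$(date -d '+2 days')"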

As a solution, I would suggest making noatime the default for btrfs.
I'm not sure, however, whether Linux allows different default mount
options for different filesystem types. I know this discussion pops
up every few years (last time it resulted in making relatime the
default), but btrfs is a special case: atime is already bad on other
filesystems, and it's much, much worse in btrfs.

Alex.
Josef Bacik
2012-05-25 15:42:50 UTC
Just mount with -o noatime; there's no chance of turning something like that on
by default, since it would break some applications (notably mutt). Thanks,

Josef
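
To make that suggestion permanent, an fstab entry along these lines should work (a sketch; the UUID is a placeholder):

UUID=<fs-uuid>  /  btrfs  defaults,noatime  0  0

or remount a live filesystem without rebooting:

# mount -o remount,noatime /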
Alexander Block
2012-05-25 15:59:45 UTC
I know about the discussions regarding compatibility with existing
applications. The problem here is that it is not only a compatibility
problem: having atime enabled by default may give you ENOSPC
for reasons that a normal user does not understand or expect.
As a normal user, I would think: if I never change anything, why
does it take up more space just because I read it?
Andreas Dilger
2012-05-25 16:28:16 UTC
Are you talking about the atime for the primary copy, or the atime for the snapshots? IMHO, the atime should not be updated for a snapshot unless it is explicitly mounted r/w, or it isn't really a good snapshot.
Alexander Block
2012-05-25 16:38:54 UTC
Snapshots are r/w by default but can be created r/o explicitly. That
doesn't matter for the normal use case, though, where you snapshot /
and continue working on /. After snapshotting, all metadata is shared
between the two subvolumes, but when a metadata block changes in
either subvolume (no matter which one), that metadata block gets
CoW'd and unshared and uses up more space.
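
A sketch of the two variants at creation time (paths are placeholders):

# btrfs subvolume snapshot / /snapshots/root-rw
# btrfs subvolume snapshot -r / /snapshots/root-ro

The first creates the default writable snapshot; the -r flag makes the snapshot read-only.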
Alexander Block
2012-05-25 16:48:30 UTC
Atime is metadata. Thus, by reading a file, only the metadata block
for that file is CoW'd... not the actual file data blocks. IOW, your
snapshots won't change and suddenly balloon in size from reading
files (metadata blocks are tiny).
And, if they do, then something is horribly wrong with the snapshot
system. Fixing that would be more important than changing the default
mount options. :)
That's true, metadata blocks are tiny. But they still cost space, and
if you run through the whole tree and access all files/directories
(e.g. with grep, rsync, diff, or whatever), a lot of (probably all)
metadata blocks are affected, which can be megabytes or even
gigabytes. All those metadata blocks get CoW'd and unshared, and thus
use up more and more space. If you use snapshots and get to a point
where nearly no space is left, a simple search for files to delete
may already result in no space left. If you use hundreds (or
millions... there is no limit on snapshot count) of snapshots, the
problem gets worse and worse.
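
To put rough numbers on that (taking the metadata size from the test further down purely as an illustration): ~300 MB of shared tree metadata, fully unshared across 100 snapshots, would be on the order of 100 x 300 MB = ~30 GB.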
Alexander Block
2012-05-25 19:10:43 UTC
Just to show some numbers, I did a simple test on a fresh btrfs
filesystem. I copied my host's /usr folder (4 GB) to it and checked
metadata usage with "btrfs fi df /mnt", which was around 300 MB. Then
I created 10 snapshots and checked metadata usage again; it didn't
change much. Then I ran "grep foobar /mnt -R" to update the atime of
all files. After this finished, metadata usage was 2.59 GB. So I lost
2.2 GB just because I searched for something. Someone who already has
nearly no space left probably won't even be able to move some data to
another disk, as the copy itself may hit ENOSPC.

Here is the output of the final "btrfs fi df":

# btrfs fi df /mnt
Data: total=6.01GB, used=4.19GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=3.25GB, used=2.59GB
Metadata: total=8.00MB, used=0.00
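
A rough sketch of the steps to reproduce this (device and paths are placeholders; strictatime is used so the system date doesn't have to be changed):

# mkfs.btrfs /dev/sdX
# mount -o strictatime /dev/sdX /mnt
# cp -a /usr /mnt/usr
# btrfs fi df /mnt
# for i in $(seq 10); do btrfs subvolume snapshot /mnt /mnt/snap$i; done
# btrfs fi df /mnt
# grep -R foobar /mnt > /dev/null
# btrfs fi df /mnt

The first df shows the baseline metadata usage, the second should barely change (the metadata is still shared with the snapshots), and the third shows the growth caused by the atime updates.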

I don't know much about other filesystems that support snapshots, but
I have the feeling that most of them would have the same problem.
Other filesystems in combination with LVM snapshots may also run into
it (I'm not very familiar with LVM). Filesystem image formats like
qcow, vmdk, vbox and so on may also have problems with atime.
Peter Maloney
2012-05-25 20:27:53 UTC
Did you run the recursive grep after each snapshot (which I would
expect to result in 11 times as many metadata blocks, max 3.3 GB), or
just once after all 10 snapshots (which I think would mean only 2x as
many metadata blocks, max 600 MB)?
Alexander Block
2012-05-25 20:42:21 UTC
I ran it only once, after creating all the snapshots. My expectation
is that the result would be the same in both cases. If all snapshots
have the file /foo/bar, then each individual snapshotted copy of it
gets a different atime and thus its own metadata block. As this
happens with all files, no matter in which order I iterate over them,
nearly all metadata blocks get their own copy.
Alexander Block
2012-05-25 20:48:41 UTC
Hmm, you may have assumed the snapshots were r/o. In my test, the
snapshots were all r/w. In the r/o case, I would have to do the
recursive grep after each snapshot creation to get the same result.
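
To make the difference concrete (paths are placeholders, and strictatime is assumed so every access counts): with r/w snapshots a single pass is enough, because the grep also updates the atime copies inside every snapshot:

# grep -R foobar /mnt > /dev/null

With r/o snapshots, the snapshots themselves should reject atime updates, so the unsharing only accumulates when the writable tree is touched between snapshot creations:

# for i in $(seq 10); do btrfs subvolume snapshot -r /mnt /mnt/snap$i; grep -R foobar /mnt > /dev/null; done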
