Discussion: Writeback threads and freezable
Tejun Heo
2013-12-13 17:49:32 UTC
Permalink
Hello, guys.

This was discovered while investigating bug 62801 - "EliteBook hangs at
dock, suspend, undock, resume".

https://bugzilla.kernel.org/show_bug.cgi?id=62801

The laptop locks up during resume if undocked while suspended. The
dock contains a hard drive and the drive removal path and resume path
get stuck. This got bisected to 839a8e8660b6 ("writeback: replace
custom worker pool implementation with unbound workqueue") by the
reporter. The problem can be reproduced by just removing a mounted
hard drive while the machine is suspended. I first thought it was some
dumb mistake but it turns out to be something fundamental.

So, here's the lock up.

* Resume starts and libata resume is kicked off.

* libata EH determines that the device is gone. Eventually it invokes
device_del() through a work item.

* device_del() tries to delete the block device, which invokes
writeback_inodes_sb() on the mounted filesystem, which in turn
schedules and flushes a bdi work item. Note that at this point, the
kworker is holding multiple driver layer locks.

Now, if the rest of dpm resume races with the above and gets stuck
behind one of the locks the libata device removal path is holding, it
deadlocks because thaw_workqueues() is called only after dpm resume is
complete. libata device removal waits for the scheduled writeback
work item to finish which in turn is waiting for the workqueue to be
thawed, and dpm resume can't make it to thaw_workqueues() because it's
stuck behind one of the mutexes held by the libata device removal
path.
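
For reference, the blocking step looks roughly like this - a simplified
sketch of writeback_inodes_sb_nr() following the post-839a8e8660b6
fs/fs-writeback.c, with checks and some fields trimmed:

/*
 * Simplified sketch, not the verbatim kernel code: the caller queues a
 * wb_writeback_work on the bdi and then waits for its completion.  The
 * work is executed by a kworker off bdi_wq, so while bdi_wq is frozen
 * the wait_for_completion() below never returns.
 */
void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
			    enum wb_reason reason)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct wb_writeback_work work = {
		.sb		= sb,
		.sync_mode	= WB_SYNC_NONE,
		.nr_pages	= nr,
		.reason		= reason,
		.done		= &done,
	};

	bdi_queue_work(sb->s_bdi, &work);	/* kicks bdi->wb.dwork onto bdi_wq */
	wait_for_completion(&done);		/* stuck here while bdi_wq is frozen */
}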

I think the issue already existed before 839a8e8660b6 ("writeback:
replace custom worker pool implementation with unbound workqueue"),
although the patch seems to have exposed it a lot more. The issue was
probably a lot less visible before because, if the bdi already had a
flusher spawned, kthread_stop() overrides the freezer, and if not, the
device had been idle and clean for some time and was unlikely to
require any action on shutdown; but the forker thread was still
freezable, and I think it should be possible to trigger a condition
where forker action is waited upon during the shutdown sequence.

The fundamental problem, really, is that device drivers make use of
kthreads and/or workqueues in their operation and freezable kthreads
and workqueues essentially behave as a giant lock. If any
suspend/resume operation somehow ends up depending on a freezable
kthread or workqueue, which can happen through a quite convoluted and
long dependency chain, there are possibilities for deadlock even if
those operations themselves aren't interlocked. In this case, libata
resume finishes fine and device removal is run completely
asynchronously but the dependency chain is still completed through
driver layer locks.

I've always thought that using the freezer for kernel threads /
workqueues is something fundamentally undesirable, exactly because the
freezer acts as a giant lock and we don't have any meaningful debug
support around it. Even in leaf drivers proper, where the dependency
isn't complicated, this has ample possibility to lead to surprise lockups.

I'm unsure why bdi should be freezable. It's not like corking things
at the bdi can reliably prevent the block layer from issuing further
commands down to drivers. Maybe there are things which directly hook
into bdi and depend on it being frozen, I don't know, but it just seems
dangerous to inject a large locking dependency into the chain from
such a high-level construct.

I'm not completely sure what to do. We can make an exception for
bdi_wq and thaw it early but that breaks the guarantee that bdi is
frozen between device suspend and resume invocations and we might as
well just remove WQ_FREEZABLE from it. I don't know.
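
For reference, the flag in question is set where bdi_wq is allocated; a
minimal sketch of the "just remove WQ_FREEZABLE" option, assuming the
3.12-era default_bdi_init() in mm/backing-dev.c:

	/*
	 * Sketch only: bdi_wq is currently allocated with WQ_FREEZABLE in
	 * addition to the flags below; the option above amounts to simply
	 * dropping that one flag.
	 */
	bdi_wq = alloc_workqueue("bdi", WQ_MEM_RECLAIM | WQ_UNBOUND |
				 WQ_SYSFS, 0);
	if (!bdi_wq)
		return -ENOMEM;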

Ideas?

Thanks.
--
tejun
Tejun Heo
2013-12-13 18:52:37 UTC
Permalink
Post by Tejun Heo
I'm not completely sure what to do. We can make an exception for
bdi_wq and thaw it early but that breaks the guarantee that bdi is
frozen between device suspend and resume invocations and we might as
well just remove WQ_FREEZABLE from it. I don't know.
Ugh, so I removed WQ_FREEZABLE on bdi and the same deadlock now
happens through jbd2 which is a freezable kthread. This whole kthread
freezer thing is rotten and fundamentally broken. :(
--
tejun
Tejun Heo
2013-12-13 20:40:34 UTC
Permalink
Hello,

So, this is the laughable workaround that I came up with. Seriously,
this is tragic. :(

Thanks.

------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(

Signed-off-by: Tejun Heo <***@kernel.org>
Reported-by: Tomaž Šolc <***@tablix.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/***@htj.dyndns.org
Cc: "Rafael J. Wysocki" <***@rjwysocki.net>
Cc: Greg Kroah-Hartman <***@linuxfoundation.org>
Cc: Len Brown <***@intel.com>
Cc: Oleg Nesterov <***@redhat.com>
Cc: ***@vger.kernel.org
---
drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index ab58556..60a01d3 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3872,6 +3872,25 @@ void ata_scsi_hotplug(struct work_struct *work)
return;
}
 
+ /*
+ * XXX - UGLY HACK
+ *
+ * The core suspend/resume path is fundamentally broken due to
+ * freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
+ *
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+ * http://marc.info/?l=linux-kernel&m=138695698516487
+ */
+ while (pm_freezing)
+ msleep(100);
+
DPRINTK("ENTER\n");
mutex_lock(&ap->scsi_scan_mutex);
 
Nigel Cunningham
2013-12-13 22:45:59 UTC
Permalink
Hi Tejun.

Thanks for your work on this.

In your first email, in the first substantial paragraph (starting "Now,
if the rest.."), you say "libata device removal waits for the scheduled
writeback work item to finish". I wonder if that's the lynchpin. If we
know the device is gone, why are we trying to write to it?

All pending I/O should have been flushed when suspend/hibernate started,
and there's no point in trying to update metadata on a device we can't
access, so there should be no writeback needed (and anything that does
somehow get there should just be discarded since it will never succeed
anyway).

Or am I missing something?

Having said the above, I agree that we shouldn't need to freeze kernel
threads and workqueues themselves. I think we should be giving the
producers of I/O the nous needed to avoid producing I/O during
suspend/hibernate. But perhaps I'm missing something here, too.

Regards,

Nigel
Tejun Heo
2013-12-13 23:07:44 UTC
Permalink
Hello, Nigel.
Post by Nigel Cunningham
In your first email, in the first substantial paragraph (starting
"Now, if the rest.."), you say "libata device removal waits for the
scheduled writeback work item to finish". I wonder if that's the
lynchpin. If we know the device is gone, why are we trying to write
to it?
It's just a standard part of block device removal -
invalidate_partition(), bdi_wb_shutdown().
Post by Nigel Cunningham
All pending I/O should have been flushed when suspend/hibernate
started, and there's no point in trying to update metadata on a
Frozen or not, it isn't guaranteed that the bdi wb queue is empty when
the system goes to suspend. It's likely to be empty but there's no
guarantee. Conversion to workqueue only makes the behavior more
deterministic.
Post by Nigel Cunningham
device we can't access, so there should be no writeback needed (and
anything that does somehow get there should just be discarded since
it will never succeed anyway).
Even if they'll never succeed, they still need to be issued and
drained; otherwise, we'll end up with leaked items and hung issuers.
Post by Nigel Cunningham
Having said the above, I agree that we shouldn't need to freeze
kernel threads and workqueues themselves. I think we should be
giving the producers of I/O the nous needed to avoid producing I/O
during suspend/hibernate. But perhaps I'm missing something here,
too.
I never understood that part. Why do we need to control the
producers? The chain between the producer and consumer is a long one
and no matter what we do with the producers, the consumers need to be
plugged all the same. Why bother with the producers at all? I think
that's where all this freezable kthreads business started but I don't
understand what the benefit of that is. Not only that, the freezer is
awfully inadequate in its role too. There is a flurry of activities
which happen in the IO path without any thread involved and many of
them can lead to issuance of new IO, so the only thing the freezer is
achieving is making existing bugs less visible, which is a bad thing
especially for suspend/resume as the failure mode often doesn't yield
to easy debugging.

I asked the same question years ago and ISTR getting only fairly vague
answers, but this whole freezable kthread business is, as expected,
proving to be a continuous source of problems. Let's at least find out
whether we need it and, if so, why. Not some "I feel better knowing things are
calmer" type vagueness but actual technical necessity of it.

Thanks.
--
tejun
Nigel Cunningham
2013-12-13 23:15:21 UTC
Permalink
Hi again.
Post by Tejun Heo
Hello, Nigel.
Post by Nigel Cunningham
In your first email, in the first substantial paragraph (starting
"Now, if the rest.."), you say "libata device removal waits for the
scheduled writeback work item to finish". I wonder if that's the
lynchpin. If we know the device is gone, why are we trying to write
to it?
It's just a standard part of block device removal -
invalidate_partition(), bdi_wb_shutdown().
Mmm. But perhaps there needs to be some special code in there to handle
the "we can't write to this device anymore" case?
Post by Tejun Heo
Post by Nigel Cunningham
All pending I/O should have been flushed when suspend/hibernate
started, and there's no point in trying to update metadata on a
Frozen or not, it isn't guaranteed that the bdi wb queue is empty when
the system goes to suspend. It's likely to be empty but there's no
guarantee. Conversion to workqueue only makes the behavior more
deterministic.
Post by Nigel Cunningham
device we can't access, so there should be no writeback needed (and
anything that does somehow get there should just be discarded since
it will never succeed anyway).
Even if they'll never succeed, they still need to be issued and
drained; otherwise, we'll end up with leaked items and hung issuers.
Yeah - I get that, but drained needs to work differently if the device
doesn't exist?
Post by Tejun Heo
Post by Nigel Cunningham
Having said the above, I agree that we shouldn't need to freeze
kernel threads and workqueues themselves. I think we should be
giving the producers of I/O the nous needed to avoid producing I/O
during suspend/hibernate. But perhaps I'm missing something here,
too.
I never understood that part. Why do we need to control the
producers? The chain between the producer and consumer is a long one
and no matter what we do with the producers, the consumers need to be
plugged all the same. Why bother with the producers at all? I think
that's where all this freezable kthreads business started but I don't
understand what the benefit of that is. Not only that, the freezer is
awfully inadequate in its role too. There is a flurry of activities
which happen in the IO path without any thread involved and many of
them can lead to issuance of new IO, so the only thing the freezer is
achieving is making existing bugs less visible, which is a bad thing
especially for suspend/resume as the failure mode often doesn't yield
to easy debugging.
I asked the same question years ago and ISTR getting only fairly vague
answers, but this whole freezable kthread business is, as expected,
proving to be a continuous source of problems. Let's at least find out
whether we need it and, if so, why. Not some "I feel better knowing things are
calmer" type vagueness but actual technical necessity of it.
My understanding is that the point is ensuring that - particularly in
the case of hibernation - we don't cause filesystem corruption by
writing one thing while writing the image and then doing something else
(without knowledge of what happened while the image was being written)
while reading the image or after restoring it.

Regards,

Nigel
Dave Chinner
2013-12-14 01:55:18 UTC
Permalink
Post by Nigel Cunningham
Hi again.
Post by Tejun Heo
Hello, Nigel.
Post by Nigel Cunningham
In your first email, in the first substantial paragraph (starting
"Now, if the rest.."), you say "libata device removal waits for the
scheduled writeback work item to finish". I wonder if that's the
lynchpin. If we know the device is gone, why are we trying to write
to it?
It's just a standard part of block device removal -
invalidate_partition(), bdi_wb_shutdown().
Mmm. But perhaps there needs to be some special code in there to
handle the "we can't write to this device anymore" case?
I've been asking for that "special code" for years, Nigel... :/

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Tejun Heo
2013-12-14 20:31:21 UTC
Permalink
Hello, Nigel.
Post by Nigel Cunningham
My understanding is that the point is ensuring that - particularly
in the case of hibernation - we don't cause filesystem corruption by
writing one thing while writing the image and then doing something
else (without knowledge of what happened while the image was being
written) while reading the image or after restoring it.
So, all this is about hibernation? Does that mean that it's safe to
unfreeze before invoking resume? ie. we currently do,

freeze
suspend devs
resume devs
unfreeze

If we can just do,

freeze
suspend devs
unfreeze
resume devs

it should be a lot better and would remove all locking dependency
chains in the resume path.
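
For context, a rough sketch of how the current (first) ordering above
maps onto the suspend code path - simplified from the 3.13-era
kernel/power/suspend.c, with error handling and notifier calls omitted:

static int sketch_enter_state(suspend_state_t state)
{
	int error;

	/* "freeze": userspace first, then freezable kthreads/workqueues */
	error = freeze_processes();
	if (!error)
		error = freeze_kernel_threads();
	if (error)
		goto Thaw;

	/* "suspend devs" ... "resume devs" both happen inside here */
	error = suspend_devices_and_enter(state);

 Thaw:
	/* "unfreeze" runs only after device resume has fully completed */
	thaw_processes();
	return error;
}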

Thanks.
--
tejun
Tejun Heo
2013-12-14 20:36:52 UTC
Permalink
Post by Tejun Heo
So, all this is about hibernation? Does that mean that it's safe to
unfreeze before invoking resume? ie. we currently do,
freeze
suspend devs
resume devs
unfreeze
If we can just do,
freeze
suspend devs
unfreeze
resume devs
Ummm... even better, does that mean, in the long term, we can decouple
freezing and device pm operations? If this is just about hibernation,
there's no reason to wrap device driver pm ops with freezer, right?

Thanks.
--
tejun
Nigel Cunningham
2013-12-14 21:21:36 UTC
Permalink
Hi.
Post by Tejun Heo
Post by Tejun Heo
So, all this is about hibernation? Does that mean that it's safe to
unfreeze before invoking resume? ie. we currently do,
freeze
suspend devs
resume devs
unfreeze
If we can just do,
freeze
suspend devs
unfreeze
resume devs
Ummm... even better, does that mean, in the long term, we can decouple
freezing and device pm operations? If this is just about hibernation,
there's no reason to wrap device driver pm ops with freezer, right?
Thanks.
Rafael has spent far more time than me thinking about the semantics
here, so I'd like to defer to him. Rafael?...

Regards,

Nigel
Rafael J. Wysocki
2013-12-17 02:35:14 UTC
Permalink
Post by Nigel Cunningham
Hi.
Post by Tejun Heo
Post by Tejun Heo
So, all this is about hibernation? Does that mean that it's safe to
unfreeze before invoking resume? ie. we currently do,
freeze
suspend devs
resume devs
unfreeze
If we can just do,
freeze
suspend devs
unfreeze
resume devs
Ummm... even better, does that mean, in the long term, we can decouple
freezing and device pm operations? If this is just about hibernation,
there's no reason to wrap device driver pm ops with freezer, right?
Thanks.
Rafael has spent far more time than me thinking about the semantics
here, so I'd like to defer to him. Rafael?...
Please see my reply to Tejun.

Thanks,
Rafael
Rafael J. Wysocki
2013-12-17 02:34:00 UTC
Permalink
Post by Tejun Heo
Hello, Nigel.
Post by Nigel Cunningham
My understanding is that the point is ensuring that - particularly
in the case of hibernation - we don't cause filesystem corruption by
writing one thing while writing the image and then doing something
else (without knowledge of what happened while the image was being
written) while reading the image or after restoring it.
So, all this is about hibernation?
No, it isn't. [I guess it was originally, but it has not been the case
for a very long time.] It is about not getting user space interactions (all of
the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can do to
devices) when we're calling device suspend/resume routines. The reason is
that otherwise all of them would have had to do a "oh, are we suspending by
the way?" check pretty much on every code path that can be triggered by
user space.

And I'd appreciate it if people didn't repeat the nonsense about possible
filesystem corruption. The freezing of tasks doesn't prevent that from
happening (which is easy to demonstrate by artificially failing restore from
hibernation multiple times in a row).
Post by Tejun Heo
Does that mean that it's safe to
unfreeze before invoking resume?
No, it isn't.

Thanks,
Rafael
Tejun Heo
2013-12-17 12:34:57 UTC
Permalink
Hello, Rafael.
Post by Rafael J. Wysocki
No, it isn't. [I guess it was originally, but it has not been the case
for a very long time.] It is about not getting user space interactions (all of
Heh... no wonder people are all so confused about this thing.
Post by Rafael J. Wysocki
the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can do to
devices) when we're calling device suspend/resume routines. The reason is
that otherwise all of them would have had to do a "oh, are we suspending by
the way?" check pretty much on every code path that can be triggered by
user space.
Freezing userland is fine. I have no problem with that but up until
now the only use case that seems fundamentally valid to me is freezing
an IO processing kthread in a driver as a cheap way to implement
suspend/resume. At this point, given the general level of confusion,
it seems to be costing more than benefiting.
Post by Rafael J. Wysocki
Does that mean that it's safe to unfreeze before invoking resume?
No, it isn't.
So, are you saying it's really about giving device drivers an easy way to
implement suspend/resume? If that's the case, let's please make it
*way* more specific and clear - ie. things like helpers to implement
suspend/resume hooks trivially or whatnot. Freezable kthreads (and
now workqueues) have been becoming a giant mess for a while now.

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-18 00:35:13 UTC
Permalink
Post by Tejun Heo
Hello, Rafael.
Post by Rafael J. Wysocki
No, it isn't. [I guess it was originally, but it has not been the case
for a very long time.] It is about not getting user space interactions (all of
Heh... no wonder people are all so confused about this thing.
Post by Rafael J. Wysocki
the sysfs/ioctl/mmap/read/write/you-name-it thingies user space can do to
devices) when we're calling device suspend/resume routines. The reason is
that otherwise all of them would have had to do a "oh, are we suspending by
the way?" check pretty much on every code path that can be triggered by
user space.
Freezing userland is fine.
OK

So do I understand correctly that you're talking about kernel threads/workqueues
freezing below?
Post by Tejun Heo
I have no problem with that but up until
now the only use case that seems fundamentally valid to me is freezing
an IO processing kthread in a driver as a cheap way to implement
suspend/resume. At this point, given the general level of confusion,
it seems to be costing more than benefiting.
There are corner cases like the runtime PM workqueue that's freezable by
design.
Post by Tejun Heo
Post by Rafael J. Wysocki
Does that mean that it's safe to unfreeze before invoking resume?
No, it isn't.
So, are you saying it's really about giving device drivers an easy way to
implement suspend/resume?
Well, that's a side effect rather than a recommended interface. A *few* pieces
of code need to freeze kernel threads/workqueues, but they should know who they
are and they really really should know *why* they need that (the above-mentioned
runtime PM workqueue is one example). The rest is just doing that because they
can, which may not be entirely reasonable (or because they did that in the past
and the original author is not responsive and everyone else does not dare to try
removing that).
Post by Tejun Heo
If that's the case, let's please make it
*way* more specific and clear - ie. things like helpers to implement
suspend/resume hooks trivially or whatnot. Freezable kthreads (and
now workqueues) have been becoming a giant mess for a while now.
There were a lot more of them to start with, really. We removed quite a number
of try_to_freeze() instances from the kernel a few years ago, but apparently
people are taking shortcuts.

The rule should be to require patch submitters to put in comments explaining
why they need their kernel threads/workqueues to be freezable and generally
there should be no such things in drivers.

Thanks,
Rafael
Tejun Heo
2013-12-18 11:17:26 UTC
Permalink
Hello, Rafael.
Post by Rafael J. Wysocki
So do I understand correctly that you're talking about kernel threads/workqueues
freezing below?
Yeap, I'm strictly talking about kernel freezables.
Post by Rafael J. Wysocki
Post by Tejun Heo
So, are you saying it's really about giving device drivers an easy way to
implement suspend/resume?
Well, that's a side effect rather than a recommended interface. A *few* pieces
of code need to freeze kernel threads/workqueues, but they should know who they
are and they really really should know *why* they need that (the above-mentioned
runtime PM workqueue is one example). The rest is just doing that because they
can, which may not be entirely reasonable (or because they did that in the past
and the original author is not responsive and everyone else does not dare to try
removing that).
I see. In the long term, I think the right thing to do is making the
freezer interface more specific so that only the ones which actually
need it do so explicitly. Right now, kernel freezables are
conceptually at a very high level - it's a global task attribute and a
major knob in workqueue. I suppose most of that is historical but by
perpetuating the model we're encouraging misuse of freezer in large
swaths of the kernel. Even in this specific case, both writeback and
jbd workers have no fundamental reason to be freezable, and yet they
are, eventually developing into totally unnecessary deadlocks.
Post by Rafael J. Wysocki
There were a lot more of them to start with, really. We removed quite a number
of try_to_freeze() instances from the kernel a few years ago, but apparently
people are taking shortcuts.
Great, so we're at least headed towards the right direction.
Post by Rafael J. Wysocki
The rule should be to require patch submitters to put in comments explaining
why they need their kernel threads/workqueues to be freezable and generally
there should be no such things in drivers.
I'm not so sure whether that's something which can be effectively
enforced given the current high visibility and confusion around
freezer. I think the only way to get this under control is to weed out
the current spurious users actively, deprecate the existing interface
and switch the real ones to something a lot more specific.

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-18 21:48:31 UTC
Permalink
Post by Tejun Heo
Hello, Rafael.
Post by Rafael J. Wysocki
So do I understand correctly that you're talking about kernel threads/workqueues
freezing below?
Yeap, I'm strictly talking about kernel freezables.
Post by Rafael J. Wysocki
Post by Tejun Heo
So, are you saying it's really about giving device drivers an easy way to
implement suspend/resume?
Well, that's a side effect rather than a recommended interface. A *few* pieces
of code need to freeze kernel threads/workqueues, but they should know who they
are and they really really should know *why* they need that (the above-mentioned
runtime PM workqueue is one example). The rest is just doing that because they
can, which may not be entirely reasonable (or because they did that in the past
and the original author is not responsive and everyone else does not dare to try
removing that).
I see. In the long term, I think the right thing to do is making the
freezer interface more specific so that only the ones which actually
need it do so explicitly. Right now, kernel freezables are
conceptually at a very high level - it's a global task attribute and a
major knob in workqueue. I suppose most of that is historical but by
perpetuating the model we're encouraging misuse of freezer in large
swaths of the kernel. Even in this specific case, both writeback and
jbd workers have no fundamental reason to be freezable, and yet they
are, eventually developing into totally unnecessary deadlocks.
You're right, but I'm not sure how we can make the interface for workqueues
more specific, for example. I guess we can simply drop create_freezable_workqueue()
so that whoever wants to create a freezable workqueue has to use the right
combination of flags. Can we make it more specific than that?

BTW, pm_start_workqueue(), which is a legitimate user, doesn't even use that macro. :-)
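
(For illustration, and assuming the 3.13-era definition: the macro is
just a thin convenience wrapper, so dropping it would mean open-coding
something like the call below at each current call site, which at least
makes the flag choice explicit and easy to grep for.)

	/*
	 * Roughly what create_freezable_workqueue(name) expands to today;
	 * callers would spell this out themselves and justify WQ_FREEZABLE
	 * in a comment, or drop the flag entirely.
	 */
	wq = alloc_workqueue(name, WQ_FREEZABLE | WQ_UNBOUND |
			     WQ_MEM_RECLAIM, 1);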
Post by Tejun Heo
Post by Rafael J. Wysocki
There were a lot more of them to start with, really. We removed quite a number
of try_to_freeze() instances from the kernel a few years ago, but apparently
people are taking shortcuts.
Great, so we're at least headed towards the right direction.
Post by Rafael J. Wysocki
The rule should be to require patch submitters to put in comments explaining
why they need their kernel threads/workqueues to be freezable and generally
there should be no such things in drivers.
I'm not so sure whether that's something which can be effectively
enforced given the current high visibility and confusion around
freezer. I think the only way to get this under control is to weed out
the current spurious users actively, deprecate the existing interface
and switch the real ones to something a lot more specific.
That'd be fine by me modulo the above remarks.

Thanks,
Rafael
Tejun Heo
2013-12-18 21:39:36 UTC
Permalink
Hello, Rafael.
Post by Rafael J. Wysocki
Post by Tejun Heo
I see. In the long term, I think the right thing to do is making the
freezer interface more specific so that only the ones which actually
need it do so explicitly. Right now, kernel freezables are
conceptually at a very high level - it's a global task attribute and a
major knob in workqueue. I suppose most of that is historical but by
perpetuating the model we're encouraging misuse of freezer in large
swaths of the kernel. Even in this specific case, both writeback and
jbd workers have no fundamental reason to be freezable, and yet they
are, eventually developing into totally unnecessary deadlocks.
You're right, but I'm not sure how we can make the interface for workqueues
more specific, for example. I guess we can simply drop create_freezable_workqueue()
so that whoever wants to create a freezable workqueue has to use the right
combination of flags. Can we make it more specific than that?
BTW, pm_start_workqueue(), which is a legitimate user, doesn't even use that macro. :-)
Yeah, we can just rip out the whole freezer support and let the
caller's pm notification hooks implement it by doing

workqueue_set_max_active(wq, 0);
flush_workqueue(wq);

when it needs to "freeze" the workqueue and then reverse it by doing
the following.

workqueue_set_max_active(wq, WQ_DFL_ACTIVE);

It'll be a bit more code in the specific users but given the
specificity of the usage I think that's the appropriate way to do it.
It'll drop quite a bit of complexity from the core freezer and
workqueue code paths too.

Thanks.
--
tejun
Tejun Heo
2013-12-18 21:41:28 UTC
Permalink
Post by Tejun Heo
Yeah, we can just rip out the whole freezer support and let the
caller's pm notification hooks implement it by doing
workqueue_set_max_active(wq, 0);
flush_workqueue(wq);
Oops, the above doesn't work but I can trivially add a new interface
for this. Something which waits for the max_active to drop below the
newly set level before returning.

workqueue_set_max_active_and_wait(wq, 0);

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-18 22:04:40 UTC
Permalink
Post by Tejun Heo
Post by Tejun Heo
Yeah, we can just rip out the whole freezer support and let the
caller's pm notification hooks implement it by doing
workqueue_set_max_active(wq, 0);
flush_workqueue(wq);
Oops, the above doesn't work but I can trivially add a new interface
for this. Something which waits for the max_active to drop below the
newly set level before returning.
workqueue_set_max_active_and_wait(wq, 0);
If you add that, I can convert the PM workqueue to using it and then
we can go through all of the existing users and make similar changes - or
just make them non-freezable if there's no good reason for freezing those
particular ones.

Thanks,
Rafael
Tejun Heo
2013-12-19 23:35:26 UTC
Permalink
From bdd220b2a1b86fee14a12b69fb0cadafe60a1dac Mon Sep 17 00:00:00 2001
From: Tejun Heo <***@kernel.org>
Date: Thu, 19 Dec 2013 18:33:09 -0500

@max_active handling is currently inconsistent.

* In alloc_workqueue(), 0 gets converted to the default and the value
gets clamped to [1, lim].

* In workqueue_set_max_active(), 0 is not converted to the default and
the value is clamped to [1, lim].

* When a workqueue is exposed through sysfs, input < 1 fails with
-EINVAL; otherwise, the value is clamped to [1, lim].

* fscache exposes max_active through a sysctl and clamps the value in
[1, lim].

We want to be able to express zero @max_active but as it's a special
case and 0 is already used for the default, we don't want to use 0 for it.
Update @max_active clamping so that the following rules are followed.

* In both alloc_workqueue() and workqueue_set_max_active(), 0 is
converted to the default, and a new special value WQ_FROZEN_ACTIVE
becomes 0; otherwise, the value is clamped to [1, lim].

* In both sysfs and fscache sysctl, input < 1 fails with -EINVAL;
otherwise, the value is clamped to [1, lim].

Signed-off-by: Tejun Heo <***@kernel.org>
Cc: Lai Jiangshan <***@cn.fujitsu.com>
Cc: David Howells <***@redhat.com>
---
fs/fscache/main.c | 10 +++++++---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 6 +++++-
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 7c27907..9d5a716 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -62,9 +62,13 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
int ret;

ret = proc_dointvec(table, write, buffer, lenp, ppos);
- if (ret == 0)
- workqueue_set_max_active(*wqp, *datap);
- return ret;
+ if (ret < 0)
+ return ret;
+ if (*datap < 1)
+ return -EINVAL;
+
+ workqueue_set_max_active(*wqp, *datap);
+ return 0;
}

ctl_table fscache_sysctls[] = {
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 594521b..334daa3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -338,6 +338,7 @@ enum {
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */

+ WQ_FROZEN_ACTIVE = -1, /* special value for frozen wq */
WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 987293d..6748fbf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4136,6 +4136,11 @@ static int wq_clamp_max_active(int max_active, unsigned int flags,
{
int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE;

+ if (max_active == 0)
+ return WQ_DFL_ACTIVE;
+ if (max_active == WQ_FROZEN_ACTIVE)
+ return 0;
+
if (max_active < 1 || max_active > lim)
pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
max_active, name, 1, lim);
@@ -4176,7 +4181,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);

- max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);

/* init wq */
--
1.8.4.2
Rafael J. Wysocki
2013-12-20 01:26:36 UTC
Permalink
From bdd220b2a1b86fee14a12b69fb0cadafe60a1dac Mon Sep 17 00:00:00 2001
Date: Thu, 19 Dec 2013 18:33:09 -0500
@max_active handling is currently inconsistent.
* In alloc_workqueue(), 0 gets converted to the default and the value
gets clamped to [1, lim].
* In workqueue_set_max_active(), 0 is not converted to the default and
the value is clamped to [1, lim].
* When a workqueue is exposed through sysfs, input < 1 fails with
-EINVAL; otherwise, the value is clamped to [1, lim].
* fscache exposes max_active through a sysctl and clamps the value in
[1, lim].
case and 0 is already used for default, don't want to use 0 for it.
* In both alloc_workqueue() and workqueue_set_max_active(), 0 is
converted to the default, and a new special value WQ_FROZEN_ACTIVE
becomes 0; otherwise, the value is clamped to [1, lim].
* In both sysfs and fscache sysctl, input < 1 fails with -EINVAL;
otherwise, the value is clamped to [1, lim].
---
fs/fscache/main.c | 10 +++++++---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 6 +++++-
3 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 7c27907..9d5a716 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -62,9 +62,13 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
int ret;
ret = proc_dointvec(table, write, buffer, lenp, ppos);
- if (ret == 0)
- workqueue_set_max_active(*wqp, *datap);
- return ret;
+ if (ret < 0)
+ return ret;
+ if (*datap < 1)
+ return -EINVAL;
+
+ workqueue_set_max_active(*wqp, *datap);
+ return 0;
}
ctl_table fscache_sysctls[] = {
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 594521b..334daa3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -338,6 +338,7 @@ enum {
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
+ WQ_FROZEN_ACTIVE = -1, /* special value for frozen wq */
WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
WQ_DFL_ACTIVE = WQ_MAX_ACTIVE / 2,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 987293d..6748fbf 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4136,6 +4136,11 @@ static int wq_clamp_max_active(int max_active, unsigned int flags,
{
int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE;
+ if (max_active == 0)
+ return WQ_DFL_ACTIVE;
+ if (max_active == WQ_FROZEN_ACTIVE)
+ return 0;
+
if (max_active < 1 || max_active > lim)
pr_warn("workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
max_active, name, 1, lim);
@@ -4176,7 +4181,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);
- max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);
/* init wq */
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Tejun Heo
2013-12-19 23:37:20 UTC
Permalink
Hello, Rafael.

If this looks good to you, I'll commit it to wq/for-3.14 and we can at
least start to clean up things.

Thanks!

------- 8< -------
From 6b5182f3f193d0ff9296f53a4665a55b6477aa77 Mon Sep 17 00:00:00 2001
From: Tejun Heo <***@kernel.org>
Date: Thu, 19 Dec 2013 18:33:12 -0500

workqueue_set_max_active() currently doesn't wait for the number of
in-flight work items to fall under the new @max_active. This patch
adds a @drain parameter to workqueue_set_max_active(); if set, the
function sleeps until nr_active on each pool_workqueue of the target
workqueue drops below the current saved_max_active.

This is planned to replace freezable workqueues. It is determined
that kernel freezables - both kthreads and workqueues - aren't
necessary and just add to the confusion and unnecessary deadlocks.
There are only a handful which actually need to stop processing for
system power events and they can be converted to use
workqueue_set_max_active(WQ_FROZEN_ACTIVE) instead of freezable
workqueues. Ultimately, all other uses of freezables will be dropped
and the whole system freezer machinery will be excised. Well, that's
the plan anyway.

The implementation is fairly straight-forward. As this is expected to
be used by only a handful and most likely not concurrently, a single
wait queue is used. set_max_active drainers wait on it as necessary
and pwq_dec_nr_in_flight() triggers it if nr_active == max_active
after nr_active is decremented. This unfortunately adds an unlikely
branch to the work item execution path but this is extremely unlikely
to be noticeable and I think it's worthwhile to avoid polling here as
there may be multiple calls to this function in succession during
suspend and some platforms use suspend quite frequently.

Signed-off-by: Tejun Heo <***@kernel.org>
Link: http://lkml.kernel.org/g/***@mtj.dyndns.org
Cc: Lai Jiangshan <***@cn.fujitsu.com>
Cc: David Howells <***@redhat.com>
Cc: "Rafael J. Wysocki" <***@rjwysocki.net>
---
fs/fscache/main.c | 2 +-
include/linux/workqueue.h | 2 +-
kernel/workqueue.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 9d5a716..e2837f2 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -67,7 +67,7 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
if (*datap < 1)
return -EINVAL;

- workqueue_set_max_active(*wqp, *datap);
+ workqueue_set_max_active(*wqp, *datap, false);
return 0;
}

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 334daa3..8b9c628 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -488,7 +488,7 @@ extern bool cancel_delayed_work(struct delayed_work *dwork);
extern bool cancel_delayed_work_sync(struct delayed_work *dwork);

extern void workqueue_set_max_active(struct workqueue_struct *wq,
- int max_active);
+ int max_active, bool drain);
extern bool current_is_workqueue_rescuer(void);
extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
extern unsigned int work_busy(struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6748fbf..f18c35b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -293,6 +293,12 @@ static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */
static LIST_HEAD(workqueues); /* PL: list of all workqueues */
static bool workqueue_freezing; /* PL: have wqs started freezing? */

+/*
+ * Wait for nr_active to drain after max_active adjustment. This is a cold
+ * path and not expected to have many users. A global waitq should do.
+ */
+static DECLARE_WAIT_QUEUE_HEAD(wq_nr_active_drain_waitq);
+
/* the per-cpu worker pools */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
cpu_worker_pools);
@@ -1123,6 +1129,14 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
pwq->nr_in_flight[color]--;

pwq->nr_active--;
+
+ /*
+ * This can happen only after max_active is lowered. Tell the
+ * waiters that draining might be complete.
+ */
+ if (unlikely(pwq->nr_active == pwq->max_active))
+ wake_up_all(&wq_nr_active_drain_waitq);
+
if (!list_empty(&pwq->delayed_works)) {
/* one down, submit a delayed one */
if (pwq->nr_active < pwq->max_active)
@@ -3140,7 +3154,7 @@ static ssize_t max_active_store(struct device *dev,
if (sscanf(buf, "%d", &val) != 1 || val <= 0)
return -EINVAL;

- workqueue_set_max_active(wq, val);
+ workqueue_set_max_active(wq, val, false);
return count;
}
static DEVICE_ATTR_RW(max_active);
@@ -4339,16 +4353,22 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
* workqueue_set_max_active - adjust max_active of a workqueue
* @wq: target workqueue
* @max_active: new max_active value.
+ * @drain: wait until the actual level of concurrency becomes <= @max_active
*
- * Set max_active of @wq to @max_active.
+ * Set max_active of @wq to @max_active. If @drain is true, wait until the
+ * in-flight work items are drained to the new level.
*
* CONTEXT:
* Don't call from IRQ context.
*/
-void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
+void workqueue_set_max_active(struct workqueue_struct *wq, int max_active,
+ bool drain)
{
+ DEFINE_WAIT(wait);
struct pool_workqueue *pwq;

+ might_sleep_if(drain);
+
/* disallow meddling with max_active for ordered workqueues */
if (WARN_ON(wq->flags & __WQ_ORDERED))
return;
@@ -4363,6 +4383,38 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
pwq_adjust_max_active(pwq);

mutex_unlock(&wq->mutex);
+
+ /*
+ * If we have increased max_active, pwq_dec_nr_in_flight() might
+ * not trigger for other instances of this function waiting for
+ * drain. Force them to check.
+ */
+ wake_up_all(&wq_nr_active_drain_waitq);
+
+ if (!drain)
+ return;
+
+ /* let's wait for the drain to complete */
+restart:
+ mutex_lock(&wq->mutex);
+ prepare_to_wait(&wq_nr_active_drain_waitq, &wait, TASK_UNINTERRUPTIBLE);
+
+ for_each_pwq(pwq, wq) {
+ /*
+ * nr_active should be monotonically decreasing as long as
+ * it's over max_active, so no need to grab pool lock.
+ * Also, test against saved_max_active in case multiple
+ * instances and/or system freezer are racing.
+ */
+ if (pwq->nr_active > wq->saved_max_active) {
+ mutex_unlock(&wq->mutex);
+ schedule();
+ goto restart;
+ }
+ }
+
+ finish_wait(&wq_nr_active_drain_waitq, &wait);
+ mutex_unlock(&wq->mutex);
}
EXPORT_SYMBOL_GPL(workqueue_set_max_active);
--
1.8.4.2
Rafael J. Wysocki
2013-12-20 01:31:37 UTC
Permalink
Post by Tejun Heo
Hello, Rafael.
Hi,
Post by Tejun Heo
If this looks good to you, I'll commit it to wq/for-3.14 and we can at
least start to clean up things.
Yes, it does.

So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
my workqueue, right?

Rafael
Post by Tejun Heo
------- 8< -------
From 6b5182f3f193d0ff9296f53a4665a55b6477aa77 Mon Sep 17 00:00:00 2001
Date: Thu, 19 Dec 2013 18:33:12 -0500
workqueue_set_max_active() currently doesn't wait for the number of
function sleeps until nr_active on each pool_workqueue of the target
workqueue drops below the current saved_max_active.
This is planned to replace freezable workqueues. It is determined
that kernel freezables - both kthreads and workqueues - aren't
necessary and just add to the confusion and unnecessary deadlocks.
There are only a handful which actually need to stop processing for
system power events and they can be converted to use
workqueue_set_max_active(WQ_FROZEN_ACTIVE) instead of freezable
workqueues. Ultimately, all other uses of freezables will be dropped
and the whole system freezer machinery will be excised. Well, that's
the plan anyway.
The implementation is fairly straight-forward. As this is expected to
be used by only a handful and most likely not concurrently, a single
wait queue is used. set_max_active drainers wait on it as necessary
and pwq_dec_nr_in_flight() triggers it if nr_active == max_active
after nr_active is decremented. This unfortunately adds an unlikely
branch to the work item execution path but this is extremely unlikely
to be noticeable and I think it's worthwhile to avoid polling here as
there may be multiple calls to this function in succession during
suspend and some platforms use suspend quite frequently.
---
fs/fscache/main.c | 2 +-
include/linux/workqueue.h | 2 +-
kernel/workqueue.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 57 insertions(+), 5 deletions(-)
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 9d5a716..e2837f2 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -67,7 +67,7 @@ static int fscache_max_active_sysctl(struct ctl_table *table, int write,
if (*datap < 1)
return -EINVAL;
- workqueue_set_max_active(*wqp, *datap);
+ workqueue_set_max_active(*wqp, *datap, false);
return 0;
}
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 334daa3..8b9c628 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -488,7 +488,7 @@ extern bool cancel_delayed_work(struct delayed_work *dwork);
extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
extern void workqueue_set_max_active(struct workqueue_struct *wq,
- int max_active);
+ int max_active, bool drain);
extern bool current_is_workqueue_rescuer(void);
extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
extern unsigned int work_busy(struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6748fbf..f18c35b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -293,6 +293,12 @@ static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */
static LIST_HEAD(workqueues); /* PL: list of all workqueues */
static bool workqueue_freezing; /* PL: have wqs started freezing? */
+/*
+ * Wait for nr_active to drain after max_active adjustment. This is a cold
+ * path and not expected to have many users. A global waitq should do.
+ */
+static DECLARE_WAIT_QUEUE_HEAD(wq_nr_active_drain_waitq);
+
/* the per-cpu worker pools */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
cpu_worker_pools);
@@ -1123,6 +1129,14 @@ static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
pwq->nr_in_flight[color]--;
pwq->nr_active--;
+
+ /*
+ * This can happen only after max_active is lowered. Tell the
+ * waiters that draining might be complete.
+ */
+ if (unlikely(pwq->nr_active == pwq->max_active))
+ wake_up_all(&wq_nr_active_drain_waitq);
+
if (!list_empty(&pwq->delayed_works)) {
/* one down, submit a delayed one */
if (pwq->nr_active < pwq->max_active)
@@ -3140,7 +3154,7 @@ static ssize_t max_active_store(struct device *dev,
if (sscanf(buf, "%d", &val) != 1 || val <= 0)
return -EINVAL;
- workqueue_set_max_active(wq, val);
+ workqueue_set_max_active(wq, val, false);
return count;
}
static DEVICE_ATTR_RW(max_active);
@@ -4339,16 +4353,22 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
* workqueue_set_max_active - adjust max_active of a workqueue
*
+ * in-flight work items are drained to the new level.
*
* Don't call from IRQ context.
*/
-void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
+void workqueue_set_max_active(struct workqueue_struct *wq, int max_active,
+ bool drain)
{
+ DEFINE_WAIT(wait);
struct pool_workqueue *pwq;
+ might_sleep_if(drain);
+
/* disallow meddling with max_active for ordered workqueues */
if (WARN_ON(wq->flags & __WQ_ORDERED))
return;
@@ -4363,6 +4383,38 @@ void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
pwq_adjust_max_active(pwq);
mutex_unlock(&wq->mutex);
+
+ /*
+ * If we have increased max_active, pwq_dec_nr_in_flight() might
+ * not trigger for other instances of this function waiting for
+ * drain. Force them to check.
+ */
+ wake_up_all(&wq_nr_active_drain_waitq);
+
+ if (!drain)
+ return;
+
+ /* let's wait for the drain to complete */
+ mutex_lock(&wq->mutex);
+ prepare_to_wait(&wq_nr_active_drain_waitq, &wait, TASK_UNINTERRUPTIBLE);
+
+ for_each_pwq(pwq, wq) {
+ /*
+ * nr_active should be monotonically decreasing as long as
+ * it's over max_active, so no need to grab pool lock.
+ * Also, test against saved_max_active in case multiple
+ * instances and/or system freezer are racing.
+ */
+ if (pwq->nr_active > wq->saved_max_active) {
+ mutex_unlock(&wq->mutex);
+ schedule();
+ goto restart;
+ }
+ }
+
+ finish_wait(&wq_nr_active_drain_waitq, &wait);
+ mutex_unlock(&wq->mutex);
}
EXPORT_SYMBOL_GPL(workqueue_set_max_active);
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Tejun Heo
2013-12-20 13:32:20 UTC
Permalink
Hello, Rafael.
Post by Rafael J. Wysocki
Post by Tejun Heo
If this looks good to you, I'll commit it to wq/for-3.14 and we can at
least start to clean up things.
Yes, it does.
So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
my workqueue, right?
Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to
restore the default value, which is what's used by alloc_workqueue()
anyway.

I'll apply the two patches to wq/for-3.14 once David reviews them or
if you wanna take them through the pm tree, that's fine too. I don't
have anything scheduled for wq for the next merge window.

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-20 13:56:09 UTC
Permalink
Post by Tejun Heo
Hello, Rafael.
Post by Rafael J. Wysocki
Post by Tejun Heo
If this looks good to you, I'll commit it to wq/for-3.14 and we can at
least start to clean up things.
Yes, it does.
So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
my workqueue, right?
Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to
restore the default value, which is what's used by alloc_workqueue()
anyway.
OK
Post by Tejun Heo
I'll apply the two patches to wq/for-3.14 once David reviews them or
if you wanna take them through the pm tree, that's fine too. I don't
have anything scheduled for wq for the next merge window.
I have some material queued up, so I can take them once they've been reviewed.

Thanks,
Rafael
Tejun Heo
2013-12-20 14:23:21 UTC
Permalink
Post by Tejun Heo
Hello, Rafael.
Post by Rafael J. Wysocki
Post by Tejun Heo
If this looks good to you, I'll commit it to wq/for-3.14 and we can at
least start to clean up things.
Yes, it does.
So with that I need to do workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE) during
suspend and then workqueue_set_max_active(wq, WQ_DFL_ACTIVE) during resume on
my workqueue, right?
Yeah, you can also do workqueue_set_max_active(wq, 0) during resume to
restore the default value, which is what's used by alloc_workqueue()
anyway.
Oops, it should be obvious but just in case. The call to use during
suspend is workqueue_set_max_active(wq, WQ_FROZEN_ACTIVE, true). :)
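
As a purely hypothetical illustration of that usage: the three-argument
signature and WQ_FROZEN_ACTIVE come from the patches in this thread,
not from mainline, and the notifier and workqueue names below are made up.

#include <linux/notifier.h>
#include <linux/suspend.h>
#include <linux/workqueue.h>

/* hypothetical driver workqueue, allocated without WQ_FREEZABLE */
static struct workqueue_struct *my_wq;

static int my_pm_notifier(struct notifier_block *nb,
			  unsigned long action, void *data)
{
	switch (action) {
	case PM_SUSPEND_PREPARE:
	case PM_HIBERNATION_PREPARE:
		/* quiesce: zero concurrency, wait for in-flight work to drain */
		workqueue_set_max_active(my_wq, WQ_FROZEN_ACTIVE, true);
		break;
	case PM_POST_SUSPEND:
	case PM_POST_HIBERNATION:
		/* 0 restores the default max_active; no need to drain */
		workqueue_set_max_active(my_wq, 0, false);
		break;
	}
	return NOTIFY_OK;
}

/* registered from driver init with register_pm_notifier(&my_pm_nb) */
static struct notifier_block my_pm_nb = {
	.notifier_call = my_pm_notifier,
};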
--
tejun
Tejun Heo
2013-12-16 16:05:42 UTC
Permalink
Hello, Ming.
You mean there are still some write I/O scheduled after processes are
frozen? by unfreezable kernel threads?
By unfreezable kernel threads, timer, bh, freezable kernel threads,
whatever, really. Please note that the freezer doesn't enforce any order
in how the target tasks are frozen. There's no mechanism to guarantee
that the flusher is frozen after all other freezable tasks are frozen.
There isn't even a mechanism which guarantees the flusher flushes all its
queues before getting frozen. There's no interlocking whatsoever.
The only thing which happens is that we flush all filesystems before
freezing the kernel threads, so the queues are *likely* to be empty.

I think trying to control all IO sources using the freezer is a
fundamentally flawed idea. There are N sources which are difficult to
track down reliably while there is a *single* queue all those have to go
through, which needs to be quiesced anyway. The only thing which
makes sense is controlling the queue.

Maybe it really is necessary for hibernation. If so, let's please
make it something tailored for that purpose - quiesce only the ones
which are actually relevant and in the places where it's necessary.
Not this "we stopped most of the system, whatever that means, and it
feels good" thing which doesn't really solve anything while
introducing this fuzzy wishful idea of a mostly stopped system and giant
lock semantics all over the place.

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-17 02:38:05 UTC
Permalink
Post by Tejun Heo
Hello,

So, this is the laughable workaround that I came up with. Seriously,
this is tragic. :(

Thanks.

------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.
OK, so I'm too tired now to go through all that, but I'll look at it tomorrow.

The rule of thumb is to get rid of freezable kernel threads in the first
place if possible anyway if they are causing problems to happen.

Thanks!
Post by Tejun Heo
During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.
=20
839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.
=20
I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block devic=
e
Post by Tejun Heo
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(
=20
Link: https://bugzilla.kernel.org/show_bug.cgi?id=3D62801
---
drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
=20
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index ab58556..60a01d3 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3872,6 +3872,25 @@ void ata_scsi_hotplug(struct work_struct *work=
)
Post by Tejun Heo
return;
}
=20
+ /*
+ * XXX - UGLY HACK
+ *
+ * The core suspend/resume path is fundamentally broken due to
+ * freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
+ *
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=3D62801
+ * http://marc.info/?l=3Dlinux-kernel&m=3D138695698516487
+ */
+ while (pm_freezing)
+ msleep(100);
+
DPRINTK("ENTER\n");
mutex_lock(&ap->scsi_scan_mutex);
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Tejun Heo
2013-12-17 12:36:43 UTC
Permalink
Hello, Rafael.
Post by Tejun Heo
Hello,
So, this is the laughable workaround that I came up with. Seriously,
this is tragic. :(
Thanks.
------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.
OK, so I'm too tired now to go through all that, but I'll look at it tomorrow.
The rule of thumb is to get rid of freezable kernel threads in the first
place if possible anyway if they are causing problems to happen.
Yes, that'd be awesome. In fact, getting rid of all kernel freezables
in non-low-level-drivers would be a great step forward. That said, we
need something easily backportable so I think we probably need this
bandaid as an immediate fix for now.

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-18 00:23:57 UTC
Permalink
Post by Tejun Heo
Hello, Rafael.
Post by Tejun Heo
Hello,
So, this is the laughable workaround that I came up with. Seriously,
this is tragic. :(
Thanks.
------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.
OK, so I'm too tired now to go through all that, but I'll look at it tomorrow.
The rule of thumb is to get rid of freezable kernel threads in the first
place if possible anyway if they are causing problems to happen.
Yes, that'd be awesome. In fact, getting rid of all kernel freezables
in non-low-level-drivers would be a great step forward.
Agreed.
Post by Tejun Heo
That said, we need something easily backportable so I think we probably need
this bandaid for immediate fix for now.
Well, we need something, but I have a couple of concerns about this particular
patch (I'll reply to it with comments shortly).

Thanks,
Rafael
Tejun Heo
2013-12-17 12:50:42 UTC
Permalink
Hello,

Rafael, if you're okay with the workaround, I'll route it through
libata/for-3.13-fixes.

Thanks.
------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.

Signed-off-by: Tejun Heo <***@kernel.org>
Reported-by: Tomaž Šolc <***@tablix.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/***@htj.dyndns.org
Cc: "Rafael J. Wysocki" <***@rjwysocki.net>
Cc: Greg Kroah-Hartman <***@linuxfoundation.org>
Cc: Len Brown <***@intel.com>
Cc: Oleg Nesterov <***@redhat.com>
Cc: ***@vger.kernel.org
---
drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
kernel/freezer.c | 6 ++++++
2 files changed, 25 insertions(+)

--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct
return;
}

+ /*
+ * XXX - UGLY HACK
+ *
+ * The core suspend/resume path is fundamentally broken due to
+ * freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
+ *
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+ * http://marc.info/?l=linux-kernel&m=138695698516487
+ */
+ while (pm_freezing)
+ msleep(100);
+
DPRINTK("ENTER\n");
mutex_lock(&ap->scsi_scan_mutex);

--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
bool pm_freezing;
bool pm_nosig_freezing;

+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
/* protects freezing and frozen transitions */
static DEFINE_SPINLOCK(freezer_lock);

Rafael J. Wysocki
2013-12-18 01:04:35 UTC
Permalink
Post by Tejun Heo
Hello,

Rafael, if you're okay with the workaround, I'll route it through
libata/for-3.13-fixes.

Thanks.
------- 8< -------
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
---
drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
kernel/freezer.c | 6 ++++++
2 files changed, 25 insertions(+)

--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct *work)
=20
+ /*
+ * XXX - UGLY HACK
+ *
+ * The core suspend/resume path is fundamentally broken due to
+ * freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
Do you mean the block device core or the SCSI core or something else? It would
be good to clarify that here to avoid confusion.
Post by Tejun Heo
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+ * http://marc.info/?l=linux-kernel&m=138695698516487
+ */
+ while (pm_freezing)
+ msleep(100);
Why is the sleep time 100 ms exactly? And why does it matter?

For example, what would change if it were 10 ms?
Post by Tejun Heo
+
DPRINTK("ENTER\n");
mutex_lock(&ap->scsi_scan_mutex);

--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
bool pm_freezing;
bool pm_nosig_freezing;

+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
/* protects freezing and frozen transitions */
static DEFINE_SPINLOCK(freezer_lock);

Thanks,
Rafael
Tejun Heo
2013-12-18 11:08:07 UTC
Permalink
Hey, Rafael.
Post by Tejun Heo
+ * The core suspend/resume path is fundamentally broken due to
+ * freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
Do you mean the block device core or the SCSI core or something else? It would
be good to clarify that here to avoid confusion.
Will clarify.
Post by Tejun Heo
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+ * http://marc.info/?l=linux-kernel&m=138695698516487
+ */
+ while (pm_freezing)
+ msleep(100);
Why is the sleep time 100 ms exactly? And why does it matter?
Just a number I pulled out of my ass.
For example, what would change if it were 10 ms?
Yeah, 10ms is my favorite human-visible polling duration too (because
it's slow enough not to cause overhead issues while fast enough to be
mostly unnoticeable to humans). This one doesn't really matter
because the operation's latency isn't something which the user would
wait for actively. That said, yeah, why not, I'll change it to 10ms.

Thanks.
--
tejun
Tejun Heo
2013-12-18 12:07:32 UTC
Permalink
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.

v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.

Signed-off-by: Tejun Heo <***@kernel.org>
Reported-by: Tomaž Šolc <***@tablix.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/***@htj.dyndns.org
Cc: "Rafael J. Wysocki" <***@rjwysocki.net>
Cc: Greg Kroah-Hartman <***@linuxfoundation.org>
Cc: Len Brown <***@intel.com>
Cc: Oleg Nesterov <***@redhat.com>
Cc: ***@vger.kernel.org
---
drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
kernel/freezer.c | 6 ++++++
2 files changed, 25 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index db6dfcf..f519868 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct *work)
return;
}

+ /*
+ * XXX - UGLY HACK
+ *
+ * The block layer suspend/resume path is fundamentally broken due
+ * to freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
+ *
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+ * http://marc.info/?l=linux-kernel&m=138695698516487
+ */
+ while (pm_freezing)
+ msleep(10);
+
DPRINTK("ENTER\n");
mutex_lock(&ap->scsi_scan_mutex);

diff --git a/kernel/freezer.c b/kernel/freezer.c
index b462fa1..aa6a8aa 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
bool pm_freezing;
bool pm_nosig_freezing;

+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
/* protects freezing and frozen transitions */
static DEFINE_SPINLOCK(freezer_lock);

Rafael J. Wysocki
2013-12-18 22:08:41 UTC
Permalink
Post by Tejun Heo
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues.
I agree. It may be useful to block them over suspend/resume, but that doesn't
have to be done through the freezer.
Post by Tejun Heo
This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.

v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.
This one is fine by me FWIW.

Thanks!
Post by Tejun Heo
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
---
drivers/ata/libata-scsi.c | 19 +++++++++++++++++++
kernel/freezer.c | 6 ++++++
2 files changed, 25 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index db6dfcf..f519868 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,25 @@ void ata_scsi_hotplug(struct work_struct *work)
return;
}

+ /*
+ * XXX - UGLY HACK
+ *
+ * The block layer suspend/resume path is fundamentally broken due
+ * to freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
+ *
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+ * http://marc.info/?l=linux-kernel&m=138695698516487
+ */
+ while (pm_freezing)
+ msleep(10);
+
DPRINTK("ENTER\n");
mutex_lock(&ap->scsi_scan_mutex);

diff --git a/kernel/freezer.c b/kernel/freezer.c
index b462fa1..aa6a8aa 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
bool pm_freezing;
bool pm_nosig_freezing;

+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
/* protects freezing and frozen transitions */
static DEFINE_SPINLOCK(freezer_lock);
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Tejun Heo
2013-12-19 17:24:12 UTC
Permalink
Hello,
Post by Rafael J. Wysocki
Post by Tejun Heo
This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(
v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.
v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.
This one is fine by my FWIW.
Applied to libata/for-3.13-fixes. I liberally added your reviewed-by
from the above response. Please let me know if I should remove it.
I'll push it out to Linus after it spends a couple days in -next.

Thanks!
--
tejun
Tejun Heo
2013-12-19 18:54:59 UTC
Permalink
Hello, again.

Oops, it was missing CONFIG_FREEZER. Updated version committed. How
did we even survive before kbuild test robot? :)

Thanks.

------ 8< ------
From 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557 Mon Sep 17 00:00:00 2001
From: Tejun Heo <***@kernel.org>
Date: Wed, 18 Dec 2013 07:07:32 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios. This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices. If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks. Unfortunately, this can race with the rest of device
resume. Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660b6 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too. In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues. This is something fundamentally broken. For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen. Kernel engineering at its
finest. :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
as a module.

v3: Comment updated and polling interval changed to 10ms as suggested
by Rafael.

v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
defined when FREEZER is not configured, thus breaking the build.
Reported by kbuild test robot.

Signed-off-by: Tejun Heo <***@kernel.org>
Reported-by: Tomaž Šolc <***@tablix.org>
Reviewed-by: "Rafael J. Wysocki" <***@rjwysocki.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/***@htj.dyndns.org
Cc: Greg Kroah-Hartman <***@linuxfoundation.org>
Cc: Len Brown <***@intel.com>
Cc: Oleg Nesterov <***@redhat.com>
Cc: ***@vger.kernel.org
Cc: kbuild test robot <***@intel.com>
---
drivers/ata/libata-scsi.c | 21 +++++++++++++++++++++
kernel/freezer.c | 6 ++++++
2 files changed, 27 insertions(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index db6dfcf..176f629 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3871,6 +3871,27 @@ void ata_scsi_hotplug(struct work_struct *work)
return;
}

+ /*
+ * XXX - UGLY HACK
+ *
+ * The block layer suspend/resume path is fundamentally broken due
+ * to freezable kthreads and workqueue and may deadlock if a block
+ * device gets removed while resume is in progress. I don't know
+ * what the solution is short of removing freezable kthreads and
+ * workqueues altogether.
+ *
+ * The following is an ugly hack to avoid kicking off device
+ * removal while freezer is active. This is a joke but does avoid
+ * this particular deadlock scenario.
+ *
+ * https://bugzilla.kernel.org/show_bug.cgi?id=62801
+ * http://marc.info/?l=linux-kernel&m=138695698516487
+ */
+#ifdef CONFIG_FREEZER
+ while (pm_freezing)
+ msleep(10);
+#endif
+
DPRINTK("ENTER\n");
mutex_lock(&ap->scsi_scan_mutex);

diff --git a/kernel/freezer.c b/kernel/freezer.c
index b462fa1..aa6a8aa 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -19,6 +19,12 @@ EXPORT_SYMBOL(system_freezing_cnt);
bool pm_freezing;
bool pm_nosig_freezing;

+/*
+ * Temporary export for the deadlock workaround in ata_scsi_hotplug().
+ * Remove once the hack becomes unnecessary.
+ */
+EXPORT_SYMBOL_GPL(pm_freezing);
+
/* protects freezing and frozen transitions */
static DEFINE_SPINLOCK(freezer_lock);

--
1.8.4.2
Dave Chinner
2013-12-14 01:53:43 UTC
Permalink
Post by Tejun Heo
Hello, guys.
This is discovered while investigating bug 62801 - "EliteBoot hangs at
dock, suspend, undock, resume".
https://bugzilla.kernel.org/show_bug.cgi?id=62801
The laptop locks up during resume if undocked while suspended. The
dock contains a hard drive and the drive removal path and resume path
get stuck. This got bisected to 839a8e8660b6 ("writeback: replace
custom worker pool implementation with unbound workqueue") by the
reporter. The problem can be reproduced by just removing mounted
harddrive while a machine is suspended. I first thought it was some
dumb mistake but it turns out to be something fundamental.
So, here's the lock up.
* Resume starts and libata resume is kicked off.
* libata EH determines that the device is gone. Eventually it invokes
device_del() through a work item.
* device_del() tries to delete the block device which invokes
writeback_inodes_sb() on the mounted filesystem which in turn
schedules and flushes bdi work item. Note that at this point, the
kworker is holding multiple driver layer locks.
That's the fundamental problem here - device removal asks the device
to fsync the filesystem on top of the device that was just removed.
The simple way to trigger this is to pull a device from underneath
an active filesystem (e.g. user unplugs a USB device without first
unmounting it). There are many years worth of bug reports showing
that this attempt by the device removal code to sync the filesystem
leads to deadlocks.

It's simply not a valid thing to do - just how is the filesystem
supposed to sync to a non-existent device?

I've raised this each time a user reports it over the past few years
and never been able to convince any to fix the filesystem
re-entrancy problem device removal causes. Syncing the filesystem
will require taking locks that are held by IO in progress, and so can't
make progress until the IO is completed, but that can't happen
until the error handling completes the sync of the filesystem....

Preventing fs/io re-entrancy from contexts where we might be holding
locks is the same reason we have GFP_NOFS and GFP_NOIO for memory
allocation: re-entering a filesystem or IO subsystem whenever we are
holding locks or serialised context in the fs/io path can deadlock
the fs/io path.
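
To make that analogy concrete, here's a minimal sketch; GFP_NOFS,
GFP_NOIO, kmalloc() and bio_alloc() are the stock interfaces, but the
surrounding function and its name are made up for illustration:

/*
 * Hypothetical fs-internal path that already holds fs locks /
 * transaction state, so allocations must not recurse back into the
 * filesystem or the IO stack.
 */
static int example_alloc_under_fs_locks(struct bio **biop)
{
	void *buf;
	struct bio *bio;

	buf = kmalloc(PAGE_SIZE, GFP_NOFS);	/* may not re-enter the fs */
	if (!buf)
		return -ENOMEM;

	bio = bio_alloc(GFP_NOIO, 1);		/* may not re-enter the IO stack */
	if (!bio) {
		kfree(buf);
		return -ENOMEM;
	}

	bio->bi_private = buf;
	*biop = bio;
	return 0;
}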

IOWs, syncing the filesystem from the device delete code is just
plain wrong. That's what needs fixing - removing the cause of
re-entrancy, not the workqueue or writeback code...
Post by Tejun Heo
Ideas?
Fix the device delete error handling not to re-enter the
filesystem and IO path. It's just wrong.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Greg Kroah-Hartman
2013-12-14 17:30:11 UTC
Permalink
Post by Dave Chinner
Post by Tejun Heo
Ideas?
Fix the device delete error handling not to re-enter the
filesystem and IO path. It's just wrong.
That sounds totally reasonable to me, what is preventing someone from
making this change?

thanks,

greg k-h
Tejun Heo
2013-12-14 20:23:24 UTC
Permalink
Hello, Dave.
Post by Dave Chinner
That's the fundamental problem here - device removal asks the device
to fsync the filesystem on top of the device that was just removed.
The simple way to trigger this is to pull a device from underneath
an active filesystem (e.g. user unplugs a USB device without first
unmounting it). There are many years worth of bug reports showing
that this attempt by the device removal code to sync the filesystem
leads to deadlocks.
What are you suggesting? Implementing separate warm and hot unplug
paths? That makes no sense whatsoever. Device hot unplug is just a
sub-operation of general device unplug, which should be able to succeed
whether the underlying device is failing IOs or not.

Another thing to consider is that you want to exercise your error
handling path as often as possible. IOs could still fail even during
device warm unplugs. If anything atop the driver can't handle that,
that's a bug and we want to fix them. The fact that we have bug
reports going back years on the subject simply means that we've been
traditionally lousy with error handling throughout the layers - those
were the occasions where we fixed actual error handling bugs, hot
unplug or not. Special casing hot unplug would just make existing
error handling bugs more difficult to find.
Post by Dave Chinner
It's simply not a valid thing to do - just how is the filesystem
supposed to sync to a non-existent device?
By issuing all IOs and then handling the failures appropriately.
Exactly just as it would handle *ANY* io errors. This actually is the
simplest thing to do to drain the pipe - pushing things down all the
way and fail them at single point. What's the alternative? Implement
quick exit path at each layer? That's just silly and error-prone.
Post by Dave Chinner
I've raised this each time a user reports it over the past few years
and never been able to convince any to fix the filesystem
re-entrancy problem device removal causes. Syncing the filesystem
Well, it isn't a good idea.
Post by Dave Chinner
will require taking locks that are by IO in progress, and so can't
make progress until the IO is completed, but that can't happen
until the error handling completes the sync of the filesystem....
IOs in progress shouldn't ever be stalled by filesystem sync. That's
a massive layering violation. Where and how does that happen?
Post by Dave Chinner
Preventing fs/io re-entrancy from contexts where we might be holding
locks is the same reason we have GFP_NOFS and GFP_NOIO for memory
allocation: re-entering a filesystem or IO subsystem whenever we are
holding locks or serialised context in the fs/io path can deadlock
the fs/io path.
IOWs, syncing the filesystem from the device delete code is just
plain wrong. That's what needs fixing - removing the cause of
re-entrancy, not the workqueue or writeback code...
No, the wrong thing there is having IO failures depending on
filesystem sync. Nothing else.
Post by Dave Chinner
Post by Tejun Heo
Ideas?
Fix the device delete error handling not to re-enter the
filesystem and IO path. It's just wrong.
Nope, bad idea.

Thanks.
--
tejun
Dave Chinner
2013-12-18 00:35:10 UTC
Permalink
Post by Tejun Heo
What are you suggesting? Implementing separate warm and hot unplug
paths? That makes no sense whatsoever. Device hot unplug is just a
sub operation of general device unplug which should be able to succeed
whether the underlying device is failing IOs or not.
I don't care. Trying to issue IO from an an IO error handling path
where the device has just been removed is fundamentally broken.
What? Have you even read the original message? IO error handling
path isn't issuing the IO here. The hot unplug operation is
completely asynchronous to the IO path. What's deadlocking is not
the filesystem and IO path but the device driver layer and hot unplug
path. IOs are not stalled.
In fact, this deadlock can be reproduced without hotunplug at all. If
you initiate warm unplug and warm unplugging races with suspend/resume
cycle, it'll behave exactly the same - the IOs from flushing in that
scenario would succeed but IOs failing or not has *NOTHING* to do with
this deadlock. It'd still happen. It's freezer behaving as a giant
lock and other locks of course getting dragged into giant dependency
loop. Can you please at least *try* to understand what's going on
before throwing strong assertions?
Sure I understand the problem. Part of that dependency loop is
caused by filesystem re-entrancy from the hot unplug code.
Regardless of this /specific freezer problem/, that filesystem
re-entrancy path is *never OK* during a hot-unplug event and it
needs to be fixed.

Perhaps the function "invalidate_partition()" is badly named. To
state the obvious, fsync != invalidation. What it does is:

1. sync filesystem
2. shrink the dcache
3. invalidates inodes and kills dirty inodes
4. invalidates block device (removes cached bdev pages)

Basically, the first step is "flush", the remainder is "invalidate".
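
For reference, a from-memory sketch of roughly what that function
boils down to in this era's block/genhd.c - an approximation, not a
verbatim quote:

static void invalidate_partition(struct gendisk *disk, int partno)
{
	struct block_device *bdev = bdget_disk(disk, partno);

	if (!bdev)
		return;

	fsync_bdev(bdev);			/* step 1: the "flush" */
	__invalidate_device(bdev, true);	/* steps 2-4: shrink the dcache,
						 * toss (even dirty) inodes,
						 * drop cached bdev pages */
	bdput(bdev);
}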

Indeed, step 3 throws away dirty inodes, so why on earth would we
even bother with step 1 to try to clean them in the first place?
IOWs, the flush is irrelevant in the hot-unplug case as it will
fail to flush stuff, and then we just throw the stuff we
failed to write away.

But in attempting to flush all the dirty data and metadata, we can
cause all sorts of other potential re-entrancy based deadlocks due
to attempting to issue IO. Whether they be freezer based or through
IO error handling triggering device removal or some other means, it
is irrelevant - it is the flush that causes all the problems.

We need to either get rid of the flush on device failure/hot-unplug,
or turn it into a callout for the superblock to take an appropriate
action (e.g. shutting down the filesystem) rather than trying to
issue IO. i.e. allow the filesystem to take appropriate action of
shutting down the filesystem and invalidating its caches.

Indeed, in XFS there's several other caches that could contain dirty
metadata that isn't invalidated by invalidate_partition(), and so
unless the filesystem is shut down it can continue to try to issue
IO on those buffers to the removed device until the filesystem is
shut down or unmounted.

Seriously, Tejun, the assumption that invalidate_partition() knows
enough about filesystems to safely "invalidate" them is just broken.
These days, filesystems often reference multiple block devices, and
so the way hotplug currently treats them as "one device, one
filesystem" is also fundamentally wrong.

So there's many ways in which the hot-unplug code is broken in its
use of invalidate_partition(), the least of which is the
dependencies caused by re-entrancy. We really need a
"sb->shutdown()" style callout as step one in the above process, not
fsync_bdev().
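
To sketch what I mean - entirely hypothetical, neither the callback
nor this helper exists today, and get_super()/sync_filesystem() are
only used here the way fsync_bdev() already uses them:

/*
 * Hypothetical replacement for the fsync_bdev() call in
 * invalidate_partition(): let the filesystem shut itself down
 * instead of being flushed from the outside.
 */
static void shutdown_bdev_super(struct block_device *bdev)
{
	struct super_block *sb = get_super(bdev);

	if (!sb) {
		sync_blockdev(bdev);		/* no fs mounted: old behaviour */
		return;
	}

	if (sb->s_op->shutdown)			/* hypothetical new method */
		sb->s_op->shutdown(sb);		/* e.g. wired to XFS's existing
						 * forced-shutdown machinery */
	else
		sync_filesystem(sb);		/* fallback: today's flush */

	drop_super(sb);
}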

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Tejun Heo
2013-12-18 11:43:43 UTC
Permalink
Hello, Dave.
Post by Dave Chinner
Sure I understand the problem. Part of that dependency loop is
caused by filesystem re-entrancy from the hot unplug code.
Regardless of this /specific freezer problem/, that filesystem
re-entrancy path is *never OK* during a hot-unplug event and it
needs to be fixed.
I don't think you do. The only relation between the two problems is
that they're traveling the same code path. Again, the deadlock can be
triggered *the same* with warm unplug && it *doesn't* need writeback
or jbd - we can deadlock with any freezables and we have drivers which
use freezables in IO issue path.

I don't really mind the discussion branching off to a related topic
but you've been consistently misunderstanding the issue at hand and
derailing the larger discussion. If you want to discuss the issue
that you brought up, that's completely fine but *PLEASE* stop getting
confused and mixing up the two.
Post by Dave Chinner
Perhaps the function "invalidate_partition()" is badly named. To
1. sync filesystem
2. shrink the dcache
3. invalidates inodes and kills dirty inodes
4. invalidates block device (removes cached bdev pages)
Basically, the first step is "flush", the remainder is "invalidate".
Indeed, step 3 throws away dirty inodes, so why on earth would we
even bother with step 1 to try to clean them in the first place?
IOWs, the flush is irrelevant in the hot-unplug case as it will
fail to flush stuff, and then we just throw the stuff we
failed to write away.
But in attempting to flush all the dirty data and metadata, we can
cause all sorts of other potential re-entrancy based deadlocks due
to attempting to issue IO. Whether they be freezer based or through
IO error handling triggering device removal or some other means, it
is irrelevant - it is the flush that causes all the problems.
Isn't the root cause there hotunplug reentering anything above it in
the first place. The problem with your proposal is that filesystem
isn't the only place where this could happen. Even with no filesystem
involved, block device could still be dirty and IOs pending in
whatever form - dirty bdi, bios queued in dm, requests queued in
request_queue, whatever really - and if the hotunplug path reenters
any of the higher layers in a way which blocks IO processing, it will
deadlock.

If knowing that the underlying device has gone away somehow helps
filesystem, maybe we can expose that interface and avoid flushing
after hotunplug but that merely hides the possible deadlock scenario
that you're concerned about. Nothing is really solved.

We can try to do the same thing at each layer and implement quick exit
path for hot unplug all the way down to the driver but that kinda
sounds complex and fragile to me. It's a lot larger surface to cover
when the root cause is hotunplug allowed to reenter anything at all
from IO path. This is especially true because hotunplug can trivially
be made fully asynchronous in most cases. In terms of destruction of
higher level objects, warm and hot unplugs can and should behave
identical.

In fact, it isn't possible to draw a strict line between warm and hot
unplugs - what if the device gets detached while warm unplug is in
progress? We'd be already traveling warm unplug path when the
operating condition becomes identical to hot unplug.
Post by Dave Chinner
We need to either get rid of the flush on device failure/hot-unplug,
or turn it into a callout for the superblock to take an appropriate
action (e.g. shutting down the filesystem) rather than trying to
issue IO. i.e. allow the filesystem to take appropriate action of
shutting down the filesystem and invalidating it's caches.
There could be cases where some optimizations for hot unplug could be
useful. Maybe suppressing pointless duplicate warning messages or
whatnot but I'm highly doubtful anything will be actually fixed that
way. We'll be most likely making bugs just less reproducible.
Post by Dave Chinner
Indeed, in XFS there's several other caches that could contain dirty
metadata that isn't invalidated by invalidate_partition(), and so
unless the filesystem is shut down it can continue to try to issue
IO on those buffers to the removed device until the filesystem is
shutdown or unmounted.
Do you mean xfs never gives up after IO failures?
Post by Dave Chinner
Seriously, Tejun, the assumption that invalidate_partition() knows
enough about filesystems to safely "invalidate" them is just broken.
These days, filesystems often reference multiple block devices, and
so the way hotplug currently treats them as "one device, one
filesystem" is also fundamentally wrong.
So there's many ways in which the hot-unplug code is broken in it's
use of invalidate_partition(), the least of which is the
dependencies caused by re-entrancy. We really need a
"sb->shutdown()" style callout as step one in the above process, not
fsync_bdev().
If filesystems need an indication that the underlying device is no
longer functional, please go ahead and add it, but please keep in mind
all these are completely asynchronous. Nothing guarantees you that
such events would happen in any specific order. IOW, you can be at
*ANY* point in your warm unplug path and the device is hot unplugged,
which essentially forces all the code paths to be ready for the worst,
and that's exactly why there isn't much effort in trying to separate
out warm and hot unplug paths.

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-18 22:14:24 UTC
Permalink
On Wednesday, December 18, 2013 06:43:43 AM Tejun Heo wrote:

[...]
Post by Tejun Heo
If filesystems need an indication that the underlying device is no
longer functional, please go ahead and add it, but please keep in mind
all these are completely asynchronous. Nothing guarantees you that
such events would happen in any specific order. IOW, you can be at
*ANY* point in your warm unplug path and the device is hot unplugged,
which essentially forces all the code paths to be ready for the worst,
and that's exactly why there isn't much effort in trying to separate
out warm and hot unplug paths.
Yes. Devices can go away at any point without notice. Even PCI devices
that have never been assumed to be hot-removable. Any piece of code in the
kernel needs to be prepared to deal with such situations.

Thanks,
Rafael
Dave Chinner
2013-12-19 04:08:21 UTC
Permalink
Post by Tejun Heo
Hello, Dave.
Post by Dave Chinner
Perhaps the function "invalidate_partition()" is badly named. To
1. sync filesystem
2. shrink the dcache
3. invalidates inodes and kills dirty inodes
4. invalidates block device (removes cached bdev pages)
Basically, the first step is "flush", the remainder is "invalidate".
Indeed, step 3 throws away dirty inodes, so why on earth would we
even bother with step 1 to try to clean them in the first place?
IOWs, the flush is irrelevant in the hot-unplug case as it will
fail to flush stuff, and then we just throw the stuff we
failed to write away.
But in attempting to flush all the dirty data and metadata, we can
cause all sorts of other potential re-entrancy based deadlocks due
to attempting to issue IO. Whether they be freezer based or through
IO error handling triggering device removal or some other means, it
is irrelevant - it is the flush that causes all the problems.
Isn't the root cause there hotunplug reentering anything above it in
the first place. The problem with your proposal is that filesystem
isn't the only place where this could happen. Even with no filesystem
involved, block device could still be dirty and IOs pending in
whatever form - dirty bdi, bios queued in dm, requests queued in
request_queue, whatever really - and if the hotunplug path reenters
any of the higher layers in a way which blocks IO processing, it will
deadlock.
Entirely possible.
Post by Tejun Heo
If knowing that the underlying device has gone away somehow helps
filesystem, maybe we can expose that interface and avoid flushing
after hotunplug but that merely hides the possible deadlock scenario
that you're concerned about. Nothing is really solved.
Except that a user of the block device has been informed that it is
now gone and has been freed from under it. i.e. we can *immediately*
inform the user that their mounted filesystem is now stuffed and
suppress all the errors that are going to occur as a result of
sync_filesystem() triggering IO failures all over the place and then
having to react to that.

Indeed, there is no guarantee that sync_filesystem will result in
the filesystem being shut down - if the filesystem is clean then
nothing will happen, and it won't be until the user modifies some
metadata that a shutdown will be triggered. That could be a long
time after the device has been removed....
Post by Tejun Heo
We can try to do the same thing at each layer and implement quick exit
path for hot unplug all the way down to the driver but that kinda
sounds complex and fragile to me. It's a lot larger surface to cover
when the root cause is hotunplug allowed to reenter anything at all
from IO path. This is especially true because hotunplug can trivially
be made fully asynchronous in most cases. In terms of destruction of
higher level objects, warm and hot unplugs can and should behave
identical.
I don't see that there is a difference between a warm and hot unplug
from a filesystem point of view - both result in the filesystem's
backing device being deleted and freed, and in both cases we have to
take the same action....
Post by Tejun Heo
Post by Dave Chinner
We need to either get rid of the flush on device failure/hot-unplug,
or turn it into a callout for the superblock to take an appropriate
action (e.g. shutting down the filesystem) rather than trying to
issue IO. i.e. allow the filesystem to take appropriate action of
shutting down the filesystem and invalidating it's caches.
There could be cases where some optimizations for hot unplug could be
useful. Maybe suppressing pointless duplicate warning messages or
whatnot but I'm highly doubtful anything will be actually fixed that
way. We'll be most likely making bugs just less reproducible.
Post by Dave Chinner
Indeed, in XFS there's several other caches that could contain dirty
metadata that isn't invalidated by invalidate_partition(), and so
unless the filesystem is shut down it can continue to try to issue
IO on those buffers to the removed device until the filesystem is
shutdown or unmounted.
Do you mean xfs never gives up after IO failures?
There's this thing called a transient IO failure which we have to
handle, e.g. multipath taking several minutes to detect a path
failure and fail over, whilst in the meantime IO errors are
reported after a 30s timeout. So some types of async metadata write
IO failures are simply rescheduled for a short time in the future.
They'll either succeed, or continual failure will eventually trigger
some kind of filesystem failure.

If it's a synchronous write or a write that we cannot tolerate even
transient errors on (e.g. journal writes), then we'll shut down the
filesystem immediately.
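
Very roughly, and glossing over the real buffer and log error
handling, the policy reads like the sketch below; error_is_transient()
and requeue_for_retry() are stand-ins rather than real functions,
though XBF_ASYNC and xfs_force_shutdown() are the real identifiers:

/* Simplified illustration of the policy described above, not the
 * actual XFS code paths. */
static void metadata_write_error(struct xfs_buf *bp, int error)
{
	if ((bp->b_flags & XBF_ASYNC) && error_is_transient(error)) {
		/* e.g. multipath still failing over: retry a bit later */
		requeue_for_retry(bp);
		return;
	}

	/* sync writes and journal IO can't tolerate even transient errors */
	xfs_force_shutdown(bp->b_target->bt_mount, SHUTDOWN_META_IO_ERROR);
}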
Post by Tejun Heo
Post by Dave Chinner
Seriously, Tejun, the assumption that invalidate_partition() knows
enough about filesystems to safely "invalidate" them is just broken.
These days, filesystems often reference multiple block devices, and
so the way hotplug currently treats them as "one device, one
filesystem" is also fundamentally wrong.
So there's many ways in which the hot-unplug code is broken in it's
use of invalidate_partition(), the least of which is the
dependencies caused by re-entrancy. We really need a
"sb->shutdown()" style callout as step one in the above process, not
fsync_bdev().
If filesystems need an indication that the underlying device is no
longer functional, please go ahead and add it, but please keep in mind
all these are completely asynchronous. Nothing guarantees you that
such events would happen in any specific order. IOW, you can be at
*ANY* point in your warm unplug path and the device is hot unplugged,
which essentially forces all the code paths to be ready for the worst,
and that's exactly why there isn't much effort in trying to separate
out warm and hot unplug paths.
I'm not concerned about the problems that might happen if you hot
unplug during a warm unplug. All I care about is when a device is
invalidated the filesystem on top of it can take appropriate action.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Tejun Heo
2013-12-19 16:24:11 UTC
Permalink
Yo, Dave.
Post by Dave Chinner
Post by Tejun Heo
If knowing that the underlying device has gone away somehow helps
filesystem, maybe we can expose that interface and avoid flushing
after hotunplug but that merely hides the possible deadlock scenario
that you're concerned about. Nothing is really solved.
Except that a user of the block device has been informed that it is
now gone and has been freed from under it. i.e. we can *immediately*
inform the user that their mounted filesystem is now stuffed and
supress all the errors that are going to occur as a result of
sync_filesystem() triggering IO failures all over the place and then
having to react to that.i
Please note that there's no real "immediacy" in that it's inherently
racy and that the extent of the usefulness of such notification can't
reach much further than suppressing error messages. Even that benefit
is kinda dubious. Don't we want to generate errors when a device is
removed while dirty data / IOs are pending on it? I fail to see how
"suppressing all the errors" would be a sane thing to do.

Another thing is that I think it's actually healthier in terms of
exercise of code paths to travel those error paths on hot unplugs,
which are relatively common, than taking a different behavior on them.
It'll inevitably lower our test coverage.
Post by Dave Chinner
Indeed, there is no guarantee that sync_filesystem will result in
the filesystem being shut down - if the filesystem is clean then
nothing will happen, and it won't be until the user modifies some
metadata that a shutdown will be triggered. That could be a long
time after the device has been removed....
I still fail to see why that is a problem. Filesystems should be
able to handle hot unplug or IO failures at any point in a reasonable
way, so what difference would having a notification make other than
introducing yet another exception code path?
Post by Dave Chinner
I don't see that there is a difference between a warm and hot unplug
from a filesystem point of view - both result in the filesystem's
backing device being deleted and freed, and in both cases we have to
take the same action....
Yeah, exactly, so what'd be the point of getting separate notification
for hot unplug events?
Post by Dave Chinner
Post by Tejun Heo
Do you mean xfs never gives up after IO failures?
There's this thing called a transient IO failure which we have to
handle. e.g multipath taking several minutes to detect a path
failure and fail over, whilst in the mean time IO errors are
reported after a 30s timeout. So some types of async metadata write
IO failures are simply rescheduled for a short time in the future.
They'll either succeed, or continual failure will eventually trigger
some kind of filesystem failure.
If it's a synchronous write or a write that we cannot tolerate even
transient errors on (e.g. journal writes), then we'll shut down the
filesystem immediately.
Sure, filesystems should (be able to) react to different types of
errors in different ways. We still have a long way to go to do that
properly, but that should be done through IO failures, not some
side-channel, one-off "hotunplug happened" call. Again, it doesn't solve
anything. It just sidesteps one very specific case in a half-assed
way.
Post by Dave Chinner
Post by Tejun Heo
If filesystems need an indication that the underlying device is no
longer functional, please go ahead and add it, but please keep in mind
all these are completely asynchronous. Nothing guarantees you that
such events would happen in any specific order. IOW, you can be at
*ANY* point in your warm unplug path and the device is hot unplugged,
which essentially forces all the code paths to be ready for the worst,
and that's exactly why there isn't much effort in trying to separate
out warm and hot unplug paths.
I'm not concerned about the problems that might happen if you hot
unplug during a warm unplug. All I care about is when a device is
invalidated the filesystem on top of it can take appropriate action.
I can't follow your logic here. You started with a deadlock scenario
where lower layer calls into upper layer while blocking its own
operation, which apparently is a bug to be fixed in the lower layer as
discussed above; otherwise, we'd be chasing the symptoms rather than
plugging the source. Combined with the fact that you can't really
prevent hotunplug happening during warmunplug (it doesn't even have to
be hotunplug, there are other conditions which would match such IO
failure pattern), this reduces the benefit of such notification to
optimizations far from correctness issues.

These are all logically connected; yet, you claim that you're not
concerned about part of it and then continue to assert your original
position. It doesn't compute. You proposed it as a fix for a
deadlock issue, but it turns out your proposal can't fix the deadlock
issue exactly because of the part you aren't concerned about and you
continue to assert the original proposal. What's going on here?

Thanks.
--
tejun
Dave Chinner
2013-12-20 00:51:14 UTC
Permalink
Post by Tejun Heo
Yo, Dave.
Post by Dave Chinner
Post by Tejun Heo
If knowing that the underlying device has gone away somehow helps
filesystem, maybe we can expose that interface and avoid flushing
after hotunplug but that merely hides the possible deadlock scenario
that you're concerned about. Nothing is really solved.
Except that a user of the block device has been informed that it is
now gone and has been freed from under it. i.e. we can *immediately*
inform the user that their mounted filesystem is now stuffed and
supress all the errors that are going to occur as a result of
sync_filesystem() triggering IO failures all over the place and then
having to react to that.i
Please note that there's no real "immediacy" in that it's inherently
racy and that the extent of the usefulness of such notification can't
reach much further than suppressing error messages.
So says you. What about btrfs, which might use such a notification
to start replicating metadata and data to other devices to maintain
redundancy levels? Or even XFS, which might have a real-time device
go away - that's not a fatal error for the filesystem, but we sure
as hell want the user to know about it and be able to set a flag
indicating access to data in real-time files need to be errored out
at the .aio_write level rather than some time later when writeback
fails with EIO and the application cannot capture the IO error.

Seriously, Tejun, you need to stop focussing on tearing apart
micro-examples and consider the wider context of the issue that the
examples are being used to demonstrate. The notifications being
talked about here are already being delivered to userspace because
there are OS management applications that need to know about devices
going away. Filesystems are no different - there's all sorts of
optimisations we can make if we get such a notification.

Cheers,

Dave.
--
Dave Chinner
***@fromorbit.com
Tejun Heo
2013-12-20 14:51:51 UTC
Permalink
Hello, Dave.
Post by Dave Chinner
Post by Tejun Heo
Please note that there's no real "immediacy" in that it's inherently
racy and that the extent of the usefulness of such notification can't
reach much further than suppressing error messages.
So says you. What about btrfs, which might use such a notification
to start replicating metadata and data to other devices to maintain
redundancy levels? Or even XFS, which might have a real-time device
go away - that's not a fatal error for the filesystem, but we sure
as hell want the user to know about it and be able to set a flag
indicating access to data in real-time files need to be errored out
at the .aio_write level rather than some time later when writeback
fails with EIO and the application cannot capture the IO error.
Hmmm... but shouldn't such facilities also kick in on other fatal
failure conditions? I have no fundamental objection against adding
some way to distinguish / notify hotunplug events but am a bit worried
that it has the potential to branch out the hotunplug path
unnecessarily and it seems that such tendency has been shown amply in
this thread too.

As I've pointed out multiple times in the thread, the hotunplug path
serves very well as a continual testbed for catastrophic failure
cases for the whole stack, from filesystems all the way down to
low-level drivers, and accounts for a high ratio of the error
handling improvements / bugfixes we made over the years, so even if we
add something to distinguish hotunplug for upper layers, I think it
should be something which at least doesn't encourage implementing a
completely separate path and preferably something more generic.
Post by Dave Chinner
Seriously, Tejun, you need to stop focussing on tearing apart
micro-examples and consider the wider context of the issue that the
examples are being used to demonstrate. The notifications being
Dear Dave, would you please drop the attitude? You've been
consistently wrong in this thread. Being aggressively assertive is
okay. Being wrong is okay too. Continuing the combination of the two
is not. That's actively harmful, especially when the one doing so
commands respect in the community - it has high chance of derailing
the discussion and/or leading to the wrong conclusions. Should I
reiterate the messed up confusions you showed at each turn of this
very thread?

Let's please stay technical. You apparently didn't have much idea how
things work in the lower layers, which is completely fine. Just don't
over-extend your assertiveness without substance. If you had just made
your technical input from the get-go, viewed from your side of the
stack, we could have been a lot more productive.

I still think there's something to be extracted from this thread - how
we should handle catastrophic failure scenarios and the shortcomings
of the current shutdown procedure, both of which are far from perfect.
I'm not yet convinced with the idea of handling hot-unplug as
something completely special for the reasons I've stated multiple
times now and curious about the specifics of what you have on mind
because I don't really know what would be convenient from the
filesystem side.
Post by Dave Chinner
talked about here are already being delivered to userspace because
there are OS management applications that need to know about devices
going away. Filesystems are no different - there's all sorts of
optimisations we can make if we get such a notification.
I'm not sure that analogy is too relevant. The notifications
delivered to userland aren't different for hot or warm unplugs, and
they don't participate in any of the exception handling details, which
is what I'm really worried about. Again, I'm not necessarily against
the idea of flagging / notifying upper layers, but please do keep in
mind that hotunplug handling shouldn't deviate much from other parts
of exception handling, which, at least from the discussions we had in
this thread, doesn't seem to be what you had in mind. There are other
exception cases which share many, if not most, characteristics, making
such an approach a natural fit, and hotunplug is one of the best tools
we have for ensuring those paths stay in healthy shape.
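To make the "generic rather than hotunplug-specific" point concrete,
one possible shape - purely hypothetical, none of these names exist
today - is a single callback that carries a reason code, so the same
path serves any fatal backing-device event:

/*
 * Hypothetical sketch: one notification for all fatal backing-device
 * events, with the reason passed as a parameter, instead of a
 * dedicated hot-unplug-only callback.
 */
enum bdev_fail_reason {
        BDEV_FAIL_HOTUNPLUG,            /* device physically removed */
        BDEV_FAIL_MEDIA_ERROR,          /* unrecoverable media failure */
        BDEV_FAIL_TRANSPORT_LOST,       /* e.g. all paths to the device gone */
};

struct sb_stub;                         /* stand-in for struct super_block */

struct sb_failure_ops {
        /* Called once when the backing device is deemed unusable. */
        void (*backing_dev_failed)(struct sb_stub *sb,
                                   enum bdev_fail_reason reason);
};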

Thanks.
--
tejun
Rafael J. Wysocki
2013-12-20 14:00:17 UTC
Permalink
Post by Tejun Heo
Yo, Dave.
Post by Dave Chinner
Post by Tejun Heo
If knowing that the underlying device has gone away somehow helps
the filesystem, maybe we can expose that interface and avoid flushing
after hotunplug but that merely hides the possible deadlock scenario
that you're concerned about. Nothing is really solved.
Except that a user of the block device has been informed that it is
now gone and has been freed from under it. i.e. we can *immediately*
inform the user that their mounted filesystem is now stuffed and
suppress all the errors that are going to occur as a result of
sync_filesystem() triggering IO failures all over the place and then
having to react to that.
Please note that there's no real "immediacy" in that it's inherently
racy and that the extent of the usefulness of such a notification can't
reach much further than suppressing error messages. Even that benefit
is kinda dubious. Don't we want to generate errors when a device is
removed while dirty data / IOs are pending on it? I fail to see how
"supressing all the errors" would be a sane thing to do.
Another thing is that I think it's actually healthier, in terms of
exercising code paths, to travel those error paths on hot unplugs,
which are relatively common, than to take a different behavior on
them. The latter would inevitably lower our test coverage.
Post by Dave Chinner
Indeed, there is no guarantee that sync_filesystem will result in
the filesystem being shut down - if the filesystem is clean then
nothing will happen, and it won't be until the user modifies some
metadata that a shutdown will be triggered. That could be a long
time after the device has been removed....
I still fail to see why that is a problem. Filesystems should be
able to handle hot unplug or IO failures at any point in a reasonable
way, so what difference would having a notification make other than
introducing yet another exception code path?
Post by Dave Chinner
I don't see that there is a difference between a warm and hot unplug
from a filesystem point of view - both result in the filesystem's
backing device being deleted and freed, and in both cases we have to
take the same action....
Yeah, exactly, so what'd be the point of getting a separate
notification for hot unplug events?
Post by Dave Chinner
Post by Tejun Heo
Do you mean xfs never gives up after IO failures?
There's this thing called a transient IO failure which we have to
handle, e.g. multipath taking several minutes to detect a path
failure and fail over, whilst in the meantime IO errors are
reported after a 30s timeout. So some types of async metadata write
IO failures are simply rescheduled for a short time in the future.
They'll either succeed, or continual failure will eventually trigger
some kind of filesystem failure.
If it's a synchronous write or a write that we cannot tolerate even
transient errors on (e.g. journal writes), then we'll shut down the
filesystem immediately.
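Reduced to its decision logic, that policy might look like the
following sketch (the names and callbacks are made up; this is not the
real XFS error handler):

/*
 * Sketch only: retry async metadata writes on transient errors,
 * shut the filesystem down immediately on failures that cannot be
 * tolerated (journal / synchronous writes).
 */
enum io_kind { IO_ASYNC_METADATA, IO_JOURNAL, IO_SYNC };

static void fs_handle_write_error(enum io_kind kind, int transient,
                                  void (*requeue)(void),
                                  void (*shutdown_fs)(void))
{
        if (kind == IO_ASYNC_METADATA && transient) {
                /* e.g. multipath still failing over: try again shortly */
                requeue();
                return;
        }
        /* journal or synchronous write failure is fatal */
        shutdown_fs();
}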
Sure, filesystems should (be able to) react to different types of
errors in different ways. We still have a long way to go to do that
properly, but it should be done through IO failures, not some
side-channel, one-off "hotunplug happened" call. Again, it doesn't
solve anything. It just sidesteps one very specific case in a
half-assed way.
Another problem is distinguishing "hotunplug" from "power failure",
for example, because it may not be entirely clear what happened to
start with until well after the fact, when we get a "these devices are
gone" notification from the platform firmware.

So even if something may be regarded as a "hotunplug" from the order of the
Universe perspective, it still may look like a regular IO error at the device
level until device_del() is called for that device (which may not happen for
quite a while).

[That said, there are situations in which we may want to wait until it
is "safe" to eject stuff, or may even want to allow subsystems to fail
"offline" requests, as in the memory eject case. In those cases the
hot-removal actually consists of two parts, an offline and a removal,
where devices are supposed to be no longer in use after the offline
(which may fail). But that is a different story. :-)]
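As a rough illustration of that two-part model (all names here are
hypothetical), the offline step is allowed to fail while the removal
step is not, and removal only follows a successful offline:

/*
 * Hypothetical two-phase removal: offline first (may be refused if
 * the device is still in use), then the actual removal, which is
 * only attempted after a successful offline.
 */
struct managed_device {
        int  (*offline)(struct managed_device *dev);    /* may fail */
        void (*remove)(struct managed_device *dev);     /* cannot fail */
};

static int try_eject(struct managed_device *dev)
{
        int ret = dev->offline(dev);
        if (ret)
                return ret;     /* subsystem refused: device still in use */
        dev->remove(dev);
        return 0;
}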

Thanks,
Rafael