Discussion:
Please revert 5adc55da4a7758021bcc374904b0f8b076508a11 (PCI_MULTITHREAD_PROBE)
(too old to reply)
Cornelia Huck
2007-05-08 13:37:13 UTC
Permalink
As I wanted to add multithreaded probing for some s390 busses, I
discovered that the commit above removed the multithreaded probing
infrastructure again, while just some days before some of my patches
reworking it had been merged... I thought
http://marc.info/?l=linux-kernel&m=117591868412593&w=2 meant that we
should go ahead with per-subsystem multithreaded probing?

(OK, PCI_MULTITHREAD_PROBE should not depend on BROKEN, but on
EXPERIMENTAL with the reworked probing infrastructure. This got mixed
up, I can send a patch that changes it.)
Greg KH
2007-05-08 14:07:53 UTC
Permalink
On Tue, May 08, 2007 at 03:37:13PM +0200, Cornelia Huck wrote:
> As I wanted to add multithreaded probing for some s390 busses, I
> discovered that the commit above removed the multithreaded probing
> infrastructure again, while just some days before some of my patches
> reworking it had been merged... I thought
> http://marc.info/?l=linux-kernel&m=117591868412593&w=2 meant that we
> should go ahead with per-subsystem multithreaded probing?

Only if it ends up working properly. The commit you reference above
(which removed the PCI_MULTITHREAD_PROBE option), is fine, pci
multi-threaded probing is still broken as it is a model that PCI drivers
are not yet ready to handle properly yet.

Why do you want it to be enabled again? It shouldn't affect you doing
multi-threaded probing for some other subsystem.

> (OK, PCI_MULTITHREAD_PROBE should not depend on BROKEN, but on
> EXPERIMENTAL with the reworked probing infrastructure. This got mixed
> up, I can send a patch that changes it.)

No, it turns out that no matter how hard and loud you put the warnings
in the config file help, people will ignore it and enable the option
anyway, and then get mad at you when your machine breaks just like the
config option said it would.

It also upsets other developers who get blamed for bugs that were caused
by this change, not their code, making debugging very difficult.

So it should stay removed for now.

thanks,

greg k-h
David Miller
2007-05-08 20:58:10 UTC
Permalink
From: Greg KH <***@kroah.com>
Date: Tue, 8 May 2007 07:07:53 -0700

> Only if it ends up working properly. The commit you reference above
> (which removed the PCI_MULTITHREAD_PROBE option), is fine, pci
> multi-threaded probing is still broken as it is a model that PCI drivers
> are not yet ready to handle properly yet.

FWIW I would like to see this working properly at some point.

It seems to me that the issues are two-fold:

1) A proper dependency system is necessary

2) Proper mutual exclusion for shared system resources/registers/etc.
that are poked at in an ad-hoc unlocked manner currently

Is that basically what it boils down to?
Greg KH
2007-05-09 09:44:18 UTC
Permalink
On Tue, May 08, 2007 at 01:58:10PM -0700, David Miller wrote:
> From: Greg KH <***@kroah.com>
> Date: Tue, 8 May 2007 07:07:53 -0700
>
> > Only if it ends up working properly. The commit you reference above
> > (which removed the PCI_MULTITHREAD_PROBE option), is fine, pci
> > multi-threaded probing is still broken as it is a model that PCI drivers
> > are not yet ready to handle properly yet.
>
> FWIW I would like to see this working properly at some point.
>
> It seems to me that the issues are two-fold:
>
> 1) A proper dependency system is necessary
>
> 2) Proper mutual exclusion for shared system resources/registers/etc.
> that are poked at in an ad-hoc unlocked manner currently
>
> Is that basically what it boils down to?

Yes, that's about it.

Number 1 seemed to cause the most crashes, I don't think number 2 ever
caused any problems, but it might have, there were too many weird oopses
to be able to rule that out.

But in the end, I don't think that PCI really will benefit from this
speed wise, but I think the kernel overall will benefit if we can
document those dependencies somehow.

thanks,

greg k-h
Linus Torvalds
2007-05-09 17:21:31 UTC
Permalink
On Wed, 9 May 2007, Greg KH wrote:

> On Tue, May 08, 2007 at 01:58:10PM -0700, David Miller wrote:
> >
> > 1) A proper dependency system is necessary
> >
> > 2) Proper mutual exclusion for shared system resources/registers/etc.
> > that are poked at in an ad-hoc unlocked manner currently
> >
> > Is that basically what it boils down to?
>
> Yes, that's about it.
>
> Number 1 seemed to cause the most crashes, I don't think number 2 ever
> caused any problems, but it might have, there were too many weird oopses
> to be able to rule that out.

One issue is that a lot of shared resources and their locking really
aren't known until *after* you've done a first-level probing.

The classic example of this really is a cardbus controller, and almost any
multi-function PCI device. Yes, they are "independent" PCI devices in
their own right, but they almost invariably have some shared state.

A bus driver that probes them concurrently is simply broken.

And no, the solution is not to special-case multi-function devices and
always probe the subfunctions serially. That would suck for many things
(disk controllers are *also* often subfunctions). The solution really *is*
to probe the devices serially, and let the layer that actually knows what
it is doing (the low-level driver) decide how it goes from there.

I can almost guarantee that the same is true of most other buses. For
example, I wouldn't be surprised at all to hear that you shouldn't probe
the individual LUN's of many "SCSI" devices concurrently. The number of
bugs in things like multifunction card readers (total lockup if you read
the wrong config pages or even try to read past the end of the flash etc)
is just scary.

(Server people seem to think that "SCSI" == "high-end expensive hardware
that we paid too much for", and yes, that's sometimes true, but "SCSI"
also equals "el-cheapo stuff that sells for $5 and talks something that
looks enough like SCSI commands that we want to consider it SCSI").

So I'm really convinced that the bus layer should be serial and then have
some capability to allow lower levels to do the things *they* know is fine
to do independently in a parallel way. But anything that makes that be a
bus-level choice is almost guaranteed to be broken on just about all
buses!

Linus
Adrian Bunk
2007-05-08 14:11:49 UTC
Permalink
On Tue, May 08, 2007 at 03:37:13PM +0200, Cornelia Huck wrote:
> As I wanted to add multithreaded probing for some s390 busses, I
> discovered that the commit above removed the multithreaded probing
> infrastructure again, while just some days before some of my patches
> reworking it had been merged... I thought
> http://marc.info/?l=linux-kernel&m=117591868412593&w=2 meant that we
> should go ahead with per-subsystem multithreaded probing?

My patch (that started the discussion you mentioned and seems to have
been applied unchanged) contained both the wanted PCI_MULTITHREAD_PROBE
removal and a removal of some of your new infrastructure.

Let's revert it, and I'll then send a new patch containing only the
PCI_MULTITHREAD_PROBE removal.

> (OK, PCI_MULTITHREAD_PROBE should not depend on BROKEN, but on
> EXPERIMENTAL with the reworked probing infrastructure. This got mixed
> up, I can send a patch that changes it.)

You meant to say:
depends on EXPERIMENTAL && !ATA && !IDE && !...

Multithreaded probing on the PCI bus currently breaks too much kernel
code.

And we've already seen how many systems it therefore breaks in practice.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Cornelia Huck
2007-05-08 14:41:23 UTC
Permalink
On Tue, 8 May 2007 16:11:49 +0200,
Adrian Bunk <***@stusta.de> wrote:

> My patch (that started the discussion you mentioned and seems to have
> been applied unchanged) contained both the wanted PCI_MULTITHREAD_PROBE
> removal and a removal of some of your new infrastructure.
>
> Let's revert it, and I'll then send a new patch containing only the
> PCI_MULTITHREAD_PROBE removal.

Yes, that would be much preferred, thanks.

> You meant to say:
> depends on EXPERIMENTAL && !ATA && !IDE && !...
>
> Multithreaded probing on the PCI bus currently breaks too much kernel
> code.
>
> And we've already seen how many systems it therefore breaks in practice.

I'm not quite sure about that, that was with the older broken
infrastructure. But I'm not too emotionally involved with pci
multithread probing, so if the consensus is that it should be removed,
that's fine with me.
Linus Torvalds
2007-05-08 15:27:34 UTC
Permalink
On Tue, 8 May 2007, Adrian Bunk wrote:
>
> Let's revert it, and I'll then send a new patch containing only the
> PCI_MULTITHREAD_PROBE removal.

I really don't want to revert that removal. If somebody wants to resurrect
a part of the patch that has nothing to do with PCI, in order to do it for
some other bus, just send that as a patch (not as a revert). But no, I'm
not going to revert that patch.

And no, we should not do it at the device core level. In fact, I don't
think we should do it at that level at all.

I'm pretty sure that the performance problems are at individual device
drivers, and that the right solution is to thread at *that* level. Not
higher up.

Threading at the bus level just inevitably means things like random
numbers for devices depending on some timing/scheduling issue. That's
nasty.

Threading at a driver level still does that (ie individual disks may be
attached in some order that depends on how fast they are to respond), but
in a much more controlled fashion, and only for drivers that explicitly
say that they can do it.

Linus
Cornelia Huck
2007-05-08 16:38:46 UTC
Permalink
On Tue, 8 May 2007 08:27:34 -0700 (PDT),
Linus Torvalds <***@linux-foundation.org> wrote:

> And no, we should not do it at the device core level. In fact, I don't
> think we should do it at that level at all.
>
> I'm pretty sure that the performance problems are at individual device
> drivers, and that the right solution is to thread at *that* level. Not
> higher up.

These are two different problems:

1. Probing taking long for individual device drivers. I agree, this
should be solved at the driver level.

2. Sheer volume of devices on a bus. Even if the indivdual probing
doesn't take long, having all devices probed one after the other may
take a lot of time. Putting the actual probe on a thread makes it
possible to run several probes in parallel, thereby cutting probing
time.

(FWIW, the s390 cio layer does asynchronous probing at the bus level,
where work is usually outstanding for a lot of devices at once. Where a
2.4 kernel might take half an hour to detect all devices, we slashed it
down to half a minute. I believe we could make rescans work even better
with multithreaded probing with some tweaking.)

> Threading at the bus level just inevitably means things like random
> numbers for devices depending on some timing/scheduling issue. That's
> nasty.
>
> Threading at a driver level still does that (ie individual disks may be
> attached in some order that depends on how fast they are to respond), but
> in a much more controlled fashion, and only for drivers that explicitly
> say that they can do it.

How is that better? You still must rely on udev for persistent device
names. And controlling device names in the driver can still be done in
the device driver with multithreaded probing (on s390 ccw, devices
already pop up in a random order, and the dasd driver manages to
conjure up consistent names for those devices that the user specified.)
d***@lang.hm
2007-05-08 16:47:51 UTC
Permalink
On Tue, 8 May 2007, Cornelia Huck wrote:

> 2. Sheer volume of devices on a bus. Even if the indivdual probing
> doesn't take long, having all devices probed one after the other may
> take a lot of time. Putting the actual probe on a thread makes it
> possible to run several probes in parallel, thereby cutting probing
> time.
>
> (FWIW, the s390 cio layer does asynchronous probing at the bus level,
> where work is usually outstanding for a lot of devices at once. Where a
> 2.4 kernel might take half an hour to detect all devices, we slashed it
> down to half a minute. I believe we could make rescans work even better
> with multithreaded probing with some tweaking.)
>
>> Threading at the bus level just inevitably means things like random
>> numbers for devices depending on some timing/scheduling issue. That's
>> nasty.
>>
>> Threading at a driver level still does that (ie individual disks may be
>> attached in some order that depends on how fast they are to respond), but
>> in a much more controlled fashion, and only for drivers that explicitly
>> say that they can do it.
>
> How is that better? You still must rely on udev for persistent device
> names. And controlling device names in the driver can still be done in
> the device driver with multithreaded probing (on s390 ccw, devices
> already pop up in a random order, and the dasd driver manages to
> conjure up consistent names for those devices that the user specified.)

I have used userspace tools for managing multiple machines that spawn
tasks in parallel on multiple machines and then buffers them per-machine
to display the output of the commands in order, no matter which one
finished first.

for example 'dsh m1,m2 ls /' would always result in

m1 .
m1 ..
.
.
m2 .
m2 ..
.
.

even if m2 finished before m1

why can't the detection be done in parallel, but the regestration of
detected devices be done in order?

David Lang
Stefan Richter
2007-05-08 21:45:15 UTC
Permalink
***@lang.hm wrote:
> I have used userspace tools for managing multiple machines that spawn
> tasks in parallel on multiple machines and then buffers them per-machine
> to display the output of the commands in order, no matter which one
> finished first.
>
> for example 'dsh m1,m2 ls /' would always result in
>
> m1 .
> m1 ..
> .
> .
> m2 .
> m2 ..
> .
> .
>
> even if m2 finished before m1
>
> why can't the detection be done in parallel, but the regestration of
> detected devices be done in order?

Userspace would have to tell the kernel which set of devices is
available. (In your example: The set consists of m1 and m2; but does
not include m0 nor m3 nor m6391.)

So instead of telling the kernel beforehand what to enqueue for
registration and in which order to enqueue it, we simply let userspace
give persistent alias names to devices after the fact.
--
Stefan Richter
-=====-=-=== -=-= -=---
http://arcgraph.de/sr/
Linus Torvalds
2007-05-08 18:30:03 UTC
Permalink
On Tue, 8 May 2007, Cornelia Huck wrote:
>
> > Threading at a driver level still does that (ie individual disks may be
> > attached in some order that depends on how fast they are to respond), but
> > in a much more controlled fashion, and only for drivers that explicitly
> > say that they can do it.
>
> How is that better? You still must rely on udev for persistent device
> names.

It's better because then the *driver* can decide whether it supports
threaded probing or not.

The threaded PCI bus probing was totally and utterly broken exactly
because many drivers were simply not willing or able to handle it.

Also, quite frankly, when this came up I already posted a much better
approach for allowing arbitrary threading. It was ignored, but maybe you
want to take a look.

See

http://lkml.org/lkml/2006/10/27/185

for last time this came up. Any driver can decide to do

execute_in_parallel(myprobe, myarguments);

and that really *simple* thing allows synchronization points
("wait_for_parallel()") and arbitrary setting of the allowed parallelism.

I really think that is a *much* better approach. It allows, for example,
the driver to do certain things (like name allocations and any
particularly *fast* parts of probing) synchronously, and then do slow
things (like spinning up disks and actually reading the partition tables)
asynchronously.

Anyway, I will refuse to merge anything that does generic bus-level
(unless the bus is something off-the-board like SCSI) parallelism, unless
you can really show me how superior and stable it is. I seriously doubt
you can. We tried it once, and it was a total disaster.

Linus
Cornelia Huck
2007-05-08 19:21:17 UTC
Permalink
On Tue, 8 May 2007 11:30:03 -0700 (PDT),
Linus Torvalds <***@linux-foundation.org> wrote:

> It's better because then the *driver* can decide whether it supports
> threaded probing or not.

Is there any reason why a driver shouldn't be able to handle probing
several devices at once? What if two devices become hotplugged at the
same time? Does this imply that the bus _always_ needs to do some
serializing there?

> Also, quite frankly, when this came up I already posted a much better
> approach for allowing arbitrary threading. It was ignored, but maybe you
> want to take a look.
>
> See
>
> http://lkml.org/lkml/2006/10/27/185
>
> for last time this came up. Any driver can decide to do
>
> execute_in_parallel(myprobe, myarguments);
>
> and that really *simple* thing allows synchronization points
> ("wait_for_parallel()") and arbitrary setting of the allowed parallelism.

Yeah, I saw this back then. But this is much too inflexible. How do you
determine the settings? On a large system, you could throttle too much,
obliterating most of the benefits of the parallel approach; on a small
system, you could overwhelm the system.

> I really think that is a *much* better approach. It allows, for example,
> the driver to do certain things (like name allocations and any
> particularly *fast* parts of probing) synchronously, and then do slow
> things (like spinning up disks and actually reading the partition tables)
> asynchronously.

Hm, I think "asynchronous" is the key word here, not "multithreaded".
Thinking some more of it, we really want to do the following:
- look at the device:
- do the fast stuff
- kick off the slow stuff
- look at the next device
- ...
- wait until all devices said they are finished
- each device needs to signal completion when the slow stuff has been
done (or hit a timeout)

(By some curious accident, that is what s390 cio does :))

This way, we are much more flexible than with the execute_in_parallel()
approach. This async approach especially looks suited to stuff where
some kind of operation is submitted to the hardware and we need to wait
for an interrupt/completion. And for that kind of operations, it scales
equally well on small or big machines.

> Anyway, I will refuse to merge anything that does generic bus-level
> (unless the bus is something off-the-board like SCSI) parallelism, unless
> you can really show me how superior and stable it is. I seriously doubt
> you can. We tried it once, and it was a total disaster.

Maybe the criterion should be more "a resonably unified set of
devices". s390 channel attached devices are really all the same (at a
certain not driver-specific level), and the same may be true for
something like scsi. pci may be just too diverse with too many
dissimilar drivers.
Linus Torvalds
2007-05-08 19:31:22 UTC
Permalink
On Tue, 8 May 2007, Cornelia Huck wrote:
>
> Is there any reason why a driver shouldn't be able to handle probing
> several devices at once?

Internal driver reasons?

Look, Cornelia, there's about a million drivers in the kernel. They are
not all safe. End of discussion.

If you are willing to go through them all, fix everything, test it, talk
with all maintainers, all the more power to you.

As it is, that's not going to happen. So you should *assume* that drivers
are not expecting to be called multiple times in parallel. And if they
are, you may then assume that people don't want them to be, because they
want stable naming.

> What if two devices become hotplugged at the same time? Does this imply
> that the bus _always_ needs to do some serializing there?

You want to test it? Be my guest.

But the final nail in the coffin is that doing it at the bus level is
INFERIOR. As you yourself admitted, it's better to do some things
synchronously.

Linus
Linus Torvalds
2007-05-08 20:01:21 UTC
Permalink
On Tue, 8 May 2007, Linus Torvalds wrote:
>
> But the final nail in the coffin is that doing it at the bus level is
> INFERIOR. As you yourself admitted, it's better to do some things
> synchronously.

Side note: there may well be clever combinations of "bus side" support
*together* with per-device rules.

For example, right now we probe devices by calling their "probe()"
routines synchronously. Changing that to be asynchronous simply isn't an
option, because we've seen drivers that get unhappy (and the hotplug
argument isn't an argument: *most* drivers aren't even hotplug-capable
anyway).

BUT.

Instead of changing existign probe functionality to be asynchronous, we
could *add* a new and asynchronous part to it. For example, we could make
the rule for PCI - or other bus - devices be:

- the bus will *first* call the "probe()" function synchronously.

- after that one has completed, we will call "probe_async()"
asynchronously at some point (it ie might be scheduled immediately
after the "probe()" call, but delayed by some arbitrary issues like
just already having too many asynchronous probes on-going or similar)

(A variation of the above might be that *everybody*s synchronous probe
function will be called first, and then the asynchronous probe functions
will be called only when they are all done. That might help with drivers
that have dependencies between different PCI functions - Cardbus comes to
mind, where the different slots look like independent PCI devices, but
slot zero is literally the master and controls some of the functions on
slot 1 too - similar issues may well happen in other multi-function
devices, and it might simplify things if you knew that the serial probe
had completed fully before the asynchronous parallel part even starts).

So an unmodified driver would basically work exactly like it does now, but
if a driver is happy with being called asynchronously, it could just
change it's

.probe = mydriver_probe

thing into a

.probe_async = mydriver_probe

and we can do that ona per-driver basis with that kind of really simple
one-liner change.

In fact, there is nothing wrong with having *both* a synchronous part, and
an async part:

.probe = mydriver_setup,
.probe_async = mydriver_spin_up_and_probe_devices,

and it would do basic setup (including, for example, the fast enumeration
of whatever devices are connected) synchronously, but then do anything
more in the async part - and the async part would still be guaranteed that
the setup has been run by the time it is scheduled (but not really have
any other guarantees).

Hmm? Would something like this work? I dunno, but it seems a hell of a lot
safer and more capable than the aborted PCI multithreaded probing that was
an "all or nothing" approach.

Linus
d***@lang.hm
2007-05-08 20:26:00 UTC
Permalink
On Tue, 8 May 2007, Linus Torvalds wrote:

> Instead of changing existign probe functionality to be asynchronous, we
> could *add* a new and asynchronous part to it. For example, we could make
> the rule for PCI - or other bus - devices be:
>
> - the bus will *first* call the "probe()" function synchronously.
>
> - after that one has completed, we will call "probe_async()"
> asynchronously at some point (it ie might be scheduled immediately
> after the "probe()" call, but delayed by some arbitrary issues like
> just already having too many asynchronous probes on-going or similar)
>
> (A variation of the above might be that *everybody*s synchronous probe
> function will be called first, and then the asynchronous probe functions
> will be called only when they are all done. That might help with drivers
> that have dependencies between different PCI functions - Cardbus comes to
> mind, where the different slots look like independent PCI devices, but
> slot zero is literally the master and controls some of the functions on
> slot 1 too - similar issues may well happen in other multi-function
> devices, and it might simplify things if you knew that the serial probe
> had completed fully before the asynchronous parallel part even starts).

is it really enough to do the sync followed by the async piece?

it seems to me that there may be a need for three steps

1. prep (sync, what we do today)
which will all complete before
2. parallel (async)
which will all complete before
3. cleanup (sync)

for some busses you can do the device enumeration in the prep phase, but I
would expect that for others you can't enumerate the devices until you
actually go looking for them. so by allowing the cleanup phase you have a
spot that can order everything as if the second phase was sequential.

I would expect that almost no drivers would use all three, but I could see
some useing #1 and #2 while others use #2 and #3

in any case there needs to be a way to wait until the async portion
finishes before going on to next steps (you don't want to try to probe
drives for md partitions until the drives have finished spinning up for
example)

David Lang
Cornelia Huck
2007-05-09 07:58:12 UTC
Permalink
On Tue, 8 May 2007 13:26:00 -0700 (PDT),
***@lang.hm wrote:

> is it really enough to do the sync followed by the async piece?
>
> it seems to me that there may be a need for three steps
>
> 1. prep (sync, what we do today)
> which will all complete before
> 2. parallel (async)
> which will all complete before
> 3. cleanup (sync)
>
> for some busses you can do the device enumeration in the prep phase, but I
> would expect that for others you can't enumerate the devices until you
> actually go looking for them. so by allowing the cleanup phase you have a
> spot that can order everything as if the second phase was sequential.
>
> I would expect that almost no drivers would use all three, but I could see
> some useing #1 and #2 while others use #2 and #3

Hm, how about:
1. ->probe() called (wait until it is finished)
2. ->probe_async() called (only kick off, don't wait, go to next device)
3. bus waits until all async probes have returned

This would imply a wait mechanism on the bus, like upping a counter for
each successfully called ->probe_async(), the driver downing the
counter when all actions triggered by ->probe_async() have finshed, and
the bus waiting until the counter has reached 0.

(Individual drivers can have 1., 2., or 1. & 2. A bus may support 1.
(like now), 2. & 3. or 1., 2. and 3.)

> in any case there needs to be a way to wait until the async portion
> finishes before going on to next steps (you don't want to try to probe
> drives for md partitions until the drives have finished spinning up for
> example)

Agreed. Would it be enough if we only have one bus in flight
simuntaneously?
d***@lang.hm
2007-05-09 08:33:14 UTC
Permalink
On Wed, 9 May 2007, Cornelia Huck wrote:

> On Tue, 8 May 2007 13:26:00 -0700 (PDT),
> ***@lang.hm wrote:
>
>> is it really enough to do the sync followed by the async piece?
>>
>> it seems to me that there may be a need for three steps
>>
>> 1. prep (sync, what we do today)
>> which will all complete before
>> 2. parallel (async)
>> which will all complete before
>> 3. cleanup (sync)
>>
>> for some busses you can do the device enumeration in the prep phase, but I
>> would expect that for others you can't enumerate the devices until you
>> actually go looking for them. so by allowing the cleanup phase you have a
>> spot that can order everything as if the second phase was sequential.
>>
>> I would expect that almost no drivers would use all three, but I could see
>> some useing #1 and #2 while others use #2 and #3
>
> Hm, how about:
> 1. ->probe() called (wait until it is finished)
> 2. ->probe_async() called (only kick off, don't wait, go to next device)
> 3. bus waits until all async probes have returned
>
> This would imply a wait mechanism on the bus, like upping a counter for
> each successfully called ->probe_async(), the driver downing the
> counter when all actions triggered by ->probe_async() have finshed, and
> the bus waiting until the counter has reached 0.
>
> (Individual drivers can have 1., 2., or 1. & 2. A bus may support 1.
> (like now), 2. & 3. or 1., 2. and 3.)
>
>> in any case there needs to be a way to wait until the async portion
>> finishes before going on to next steps (you don't want to try to probe
>> drives for md partitions until the drives have finished spinning up for
>> example)
>
> Agreed. Would it be enough if we only have one bus in flight
> simuntaneously?

two things I see

1. why should different, unrelated busses need to wait for each other?
picking two, why can't you have SCSI and USB going through their timeouts
at the same time?

2. I'm not sure that you can always do the device enumeration before you
do the slow probing, there's another message in this thread that talks
about a USB device where you need to load firmware to it before you can
find out what is really there. in a case like this devices would either
show up in a random order during step #2, or they should not be added to
the system until step #3, which makes step #3 more then just waiting for
the async portions to finish.

David Lang
Cornelia Huck
2007-05-09 09:15:15 UTC
Permalink
On Wed, 9 May 2007 01:33:14 -0700 (PDT),
***@lang.hm wrote:

> 1. why should different, unrelated busses need to wait for each other?
> picking two, why can't you have SCSI and USB going through their timeouts
> at the same time?

If they don't have dependencies on each other, yes. Some busses should
be finished before probing for others start (e.g. low-level busses).

> 2. I'm not sure that you can always do the device enumeration before you
> do the slow probing, there's another message in this thread that talks
> about a USB device where you need to load firmware to it before you can
> find out what is really there. in a case like this devices would either
> show up in a random order during step #2, or they should not be added to
> the system until step #3, which makes step #3 more then just waiting for
> the async portions to finish.

I'm not sure whether that is not really a question of "one depends upon
the other". If the low level bus knows where to point its probe at, the
higher level should be able to look at sane devices after the firmware
has been loaded (or am I misunderstanding the situation here?)
d***@lang.hm
2007-05-09 09:25:10 UTC
Permalink
On Wed, 9 May 2007, Cornelia Huck wrote:

> On Wed, 9 May 2007 01:33:14 -0700 (PDT),
> ***@lang.hm wrote:
>
>> 1. why should different, unrelated busses need to wait for each other?
>> picking two, why can't you have SCSI and USB going through their timeouts
>> at the same time?
>
> If they don't have dependencies on each other, yes. Some busses should
> be finished before probing for others start (e.g. low-level busses).

I think we are agreeing on this, I was responding to your question of if
we could get away to limiting things to one bus at a time.

>> 2. I'm not sure that you can always do the device enumeration before you
>> do the slow probing, there's another message in this thread that talks
>> about a USB device where you need to load firmware to it before you can
>> find out what is really there. in a case like this devices would either
>> show up in a random order during step #2, or they should not be added to
>> the system until step #3, which makes step #3 more then just waiting for
>> the async portions to finish.
>
> I'm not sure whether that is not really a question of "one depends upon
> the other". If the low level bus knows where to point its probe at, the
> higher level should be able to look at sane devices after the firmware
> has been loaded (or am I misunderstanding the situation here?)

makeing up an example

today the process would be
call driver X to initialize it's devices.

driver X follows these steps

1. see if a compatible chipset/device appears to be available

2. initialize the chipset/device (loading firmware)

3. query the now partially initialized chipset/device to see what specific
options are enabled in the hardware

4. for each 'thing' that's enabled in hardware, initialize and register
it.

step 2 can take a long time and so you want it to be async so that it's
timeouts can be running in parallel (for that matter step 1 may have some
timeouts as well)

however, you can't tell what devices to register until after step 3

so in at least some cases, it seems as if the registratin of devices needs
to be after the async section.

Linus made a good case fo an example (hard drives) where the registration
can happen first and fast, but the async section (spinning up) takes a
while and could come second.

to acommodate both of these device models it seems to me that you can't
defice either
async followed by sync
or
sync followed by async
but instead need to support the combined
sync followed by async followed by sync.

now, it's very possible that I'm mistaken and one of the two-part models
can be used for everything, if so then it's definantly simpler to do so.

David Lang
Cornelia Huck
2007-05-09 13:20:24 UTC
Permalink
On Wed, 9 May 2007 02:25:10 -0700 (PDT),
***@lang.hm wrote:

> makeing up an example
>
> today the process would be
> call driver X to initialize it's devices.
>
> driver X follows these steps
>
> 1. see if a compatible chipset/device appears to be available
>
> 2. initialize the chipset/device (loading firmware)
>
> 3. query the now partially initialized chipset/device to see what specific
> options are enabled in the hardware
>
> 4. for each 'thing' that's enabled in hardware, initialize and register
> it.

The sync ->probe() could return after step 1 (when it is clear the
driver wants to bind to the device). The remainder should be done in
the async function, so that the bus could go on with other devices. The
driver should then signal when it is done with its registrations.

> step 2 can take a long time and so you want it to be async so that it's
> timeouts can be running in parallel (for that matter step 1 may have some
> timeouts as well)
>
> however, you can't tell what devices to register until after step 3
>
> so in at least some cases, it seems as if the registratin of devices needs
> to be after the async section.

If you're talking about devices registered by the driver, these should
be in the async section.

> Linus made a good case fo an example (hard drives) where the registration
> can happen first and fast, but the async section (spinning up) takes a
> while and could come second.

I don't really see the difference?

> to acommodate both of these device models it seems to me that you can't
> defice either
> async followed by sync
> or
> sync followed by async
> but instead need to support the combined
> sync followed by async followed by sync.
>
> now, it's very possible that I'm mistaken and one of the two-part models
> can be used for everything, if so then it's definantly simpler to do so.

Hm, maybe I'm dense, but I really don't understand the problem here.
Why should (optional) sync followed by (optional) async followed by bus
synchonization not cover all cases? The async portion can also do some
"sync" stuff and then signal completion, can't it?

If you're worried about notifications to userspace, you could first
suppress the add uevent via uevent_suppress and then generate it when
you signal completion to the bus.
d***@lang.hm
2007-05-09 16:18:10 UTC
Permalink
On Wed, 9 May 2007, Cornelia Huck wrote:

> On Wed, 9 May 2007 02:25:10 -0700 (PDT),
> ***@lang.hm wrote:
>
>> makeing up an example
>>
>> today the process would be
>> call driver X to initialize it's devices.
>>
>> driver X follows these steps
>>
>> 1. see if a compatible chipset/device appears to be available
>>
>> 2. initialize the chipset/device (loading firmware)
>>
>> 3. query the now partially initialized chipset/device to see what specific
>> options are enabled in the hardware
>>
>> 4. for each 'thing' that's enabled in hardware, initialize and register
>> it.
>
> The sync ->probe() could return after step 1 (when it is clear the
> driver wants to bind to the device). The remainder should be done in
> the async function, so that the bus could go on with other devices. The
> driver should then signal when it is done with its registrations.

but then won't the devices get registered in a random order? (i.e.
whenever the async portion finishes the probing and finds the details of
what there is to register)

>> to acommodate both of these device models it seems to me that you can't
>> defice either
>> async followed by sync
>> or
>> sync followed by async
>> but instead need to support the combined
>> sync followed by async followed by sync.
>>
>> now, it's very possible that I'm mistaken and one of the two-part models
>> can be used for everything, if so then it's definantly simpler to do so.
>
> Hm, maybe I'm dense, but I really don't understand the problem here.
> Why should (optional) sync followed by (optional) async followed by bus
> synchonization not cover all cases? The async portion can also do some
> "sync" stuff and then signal completion, can't it?
>
> If you're worried about notifications to userspace, you could first
> suppress the add uevent via uevent_suppress and then generate it when
> you signal completion to the bus.

I'm not worried about notification to userspace, I'm worried about devices
getting registered during the async stage appearing in different orders on
different boots due to different timeings.

if you can identify the device well enough to register it quickly then the
approach that Linus proposed works well, which is

sync probe, identify devices, register devices
async initialize devices
wait for all async to complete

however I'm talking about cases where you can't fully identify the devices
(at least not well enough to register them with the kernel) and am saying
that doing

sync probe
async initialize device, register device
wait for all async to complete

results in unpredictable device ordering, but if instead you did

sync probe
async initialize device
sync wait for all async to complete, register device

you can get back to predictable ordering

David Lang
Cornelia Huck
2007-05-09 17:07:36 UTC
Permalink
On Wed, 9 May 2007 09:18:10 -0700 (PDT),
***@lang.hm wrote:

> I'm not worried about notification to userspace, I'm worried about devices
> getting registered during the async stage appearing in different orders on
> different boots due to different timeings.

Ah, now I think I understand what you mean. You're talking about stuff
like block devices or net devices, right?

>
> if you can identify the device well enough to register it quickly then the
> approach that Linus proposed works well, which is
>
> sync probe, identify devices, register devices
> async initialize devices
> wait for all async to complete
>
> however I'm talking about cases where you can't fully identify the devices
> (at least not well enough to register them with the kernel) and am saying
> that doing
>
> sync probe
> async initialize device, register device
> wait for all async to complete
>
> results in unpredictable device ordering, but if instead you did
>
> sync probe
> async initialize device
> sync wait for all async to complete, register device
>
> you can get back to predictable ordering

Hm, so that sound like a case for a distinct setup() routine:

1. bus calls ->probe(), which return synchronously
2. bus calls ->probe_async() for all devices (optional)
3. bus waits for all probes to finish
4. bus calls ->setup() for all devices (which does the registering)

(->setup() can but need not be sync, although it should be for your
case)

Note that ordering is not guaranteed on hotpluggable busses anyway, and
if you use ->setup() as a function that may or may not be called later
on, there's no ordering guarantee either. (Unless the bus/device driver
implements something to that effect, or you have udev rules in place.)
d***@lang.hm
2007-05-09 17:09:33 UTC
Permalink
On Wed, 9 May 2007, Cornelia Huck wrote:

> On Wed, 9 May 2007 09:18:10 -0700 (PDT),
> ***@lang.hm wrote:
>
>> I'm not worried about notification to userspace, I'm worried about devices
>> getting registered during the async stage appearing in different orders on
>> different boots due to different timeings.
>
> Ah, now I think I understand what you mean. You're talking about stuff
> like block devices or net devices, right?

they will work as examples, but I am not sure that that is the limit of
this sort of problem

>>
>> if you can identify the device well enough to register it quickly then the
>> approach that Linus proposed works well, which is
>>
>> sync probe, identify devices, register devices
>> async initialize devices
>> wait for all async to complete
>>
>> however I'm talking about cases where you can't fully identify the devices
>> (at least not well enough to register them with the kernel) and am saying
>> that doing
>>
>> sync probe
>> async initialize device, register device
>> wait for all async to complete
>>
>> results in unpredictable device ordering, but if instead you did
>>
>> sync probe
>> async initialize device
>> sync wait for all async to complete, register device
>>
>> you can get back to predictable ordering
>
> Hm, so that sound like a case for a distinct setup() routine:
>
> 1. bus calls ->probe(), which return synchronously
> 2. bus calls ->probe_async() for all devices (optional)
> 3. bus waits for all probes to finish
> 4. bus calls ->setup() for all devices (which does the registering)

this is exactly what I've been trying to describe.

> (->setup() can but need not be sync, although it should be for your
> case)

if it's not sync then you have race conditions again

> Note that ordering is not guaranteed on hotpluggable busses anyway, and
> if you use ->setup() as a function that may or may not be called later
> on, there's no ordering guarantee either. (Unless the bus/device driver
> implements something to that effect, or you have udev rules in place.)

correct.

David Lang
Cornelia Huck
2007-05-09 17:48:45 UTC
Permalink
On Wed, 9 May 2007 10:09:33 -0700 (PDT),
***@lang.hm wrote:

> > Hm, so that sound like a case for a distinct setup() routine:
> >
> > 1. bus calls ->probe(), which return synchronously
> > 2. bus calls ->probe_async() for all devices (optional)
> > 3. bus waits for all probes to finish
> > 4. bus calls ->setup() for all devices (which does the registering)
>
> this is exactly what I've been trying to describe.

Nearly, but with a slightly different spin...

>
> > (->setup() can but need not be sync, although it should be for your
> > case)
>
> if it's not sync then you have race conditions again

...but not all busses will care. If your bus wants to enforce ordering,
it must enforce it to be sync. If your bus allows hotplug and calling
setup() later on, it may also allow async. (setup() is a bit
dual-purpose in this idea.)
d***@lang.hm
2007-05-09 17:53:52 UTC
Permalink
On Wed, 9 May 2007, Cornelia Huck wrote:

> On Wed, 9 May 2007 10:09:33 -0700 (PDT),
> ***@lang.hm wrote:
>
>>> Hm, so that sound like a case for a distinct setup() routine:
>>>
>>> 1. bus calls ->probe(), which return synchronously
>>> 2. bus calls ->probe_async() for all devices (optional)
>>> 3. bus waits for all probes to finish
>>> 4. bus calls ->setup() for all devices (which does the registering)
>>
>> this is exactly what I've been trying to describe.
>
> Nearly, but with a slightly different spin...
>
>>
>>> (->setup() can but need not be sync, although it should be for your
>>> case)
>>
>> if it's not sync then you have race conditions again
>
> ...but not all busses will care. If your bus wants to enforce ordering,
> it must enforce it to be sync. If your bus allows hotplug and calling
> setup() later on, it may also allow async. (setup() is a bit
> dual-purpose in this idea.)

if you want the registration to use hotplug and be async then do do the
registration during step 2 and make step 4 be a noop, in fact some drivers
may do all their work in step 2, while others (everything currently) will
do all their work during step 1

David Lang
Cornelia Huck
2007-05-09 18:36:19 UTC
Permalink
On Wed, 9 May 2007 10:53:52 -0700 (PDT),
***@lang.hm wrote:

> if you want the registration to use hotplug and be async then do do the
> registration during step 2 and make step 4 be a noop, in fact some drivers
> may do all their work in step 2, while others (everything currently) will
> do all their work during step 1

The added benefit of setup() I was thinking of was being able to
enable/disable devices later on (where enable entails the "heavy
lifting").

Say you have a range of devices you only use seldomly for some special
purpose. They may take a long time to get up and use a lot of resources
when they're on. You don't want to slow the boot process down for them,
so they start disabled with just very basic setup (the fast stuff)
done. (It is also useful if you want some manual configuration between
the fast stuff and the heavy lifting.) When you need them, you spin
them up (may take some time) and then use them. Afterwards, you could
spin them down again. (This is what the ccw/ccwgroup/ap bus online
attribute does today.)
d***@lang.hm
2007-05-09 18:52:59 UTC
Permalink
On Wed, 9 May 2007, Cornelia Huck wrote:

> On Wed, 9 May 2007 10:53:52 -0700 (PDT),
> ***@lang.hm wrote:
>
>> if you want the registration to use hotplug and be async then do do the
>> registration during step 2 and make step 4 be a noop, in fact some drivers
>> may do all their work in step 2, while others (everything currently) will
>> do all their work during step 1
>
> The added benefit of setup() I was thinking of was being able to
> enable/disable devices later on (where enable entails the "heavy
> lifting").
>
> Say you have a range of devices you only use seldomly for some special
> purpose. They may take a long time to get up and use a lot of resources
> when they're on. You don't want to slow the boot process down for them,
> so they start disabled with just very basic setup (the fast stuff)
> done. (It is also useful if you want some manual configuration between
> the fast stuff and the heavy lifting.) When you need them, you spin
> them up (may take some time) and then use them. Afterwards, you could
> spin them down again. (This is what the ccw/ccwgroup/ap bus online
> attribute does today.)

this sounds like something that you want to load the driver module for
when you want to use it and unload the driver module when you are done.

if you want to make a seperate enable/disable function I think it would
probably be more appropriate to use the power management functions.

David Lang
Cornelia Huck
2007-05-10 07:38:24 UTC
Permalink
On Wed, 9 May 2007 11:52:59 -0700 (PDT),
***@lang.hm wrote:

> this sounds like something that you want to load the driver module for
> when you want to use it and unload the driver module when you are done.

A common case is that you only want to use some of all your devices of
a certain model. (My view may be strongly influenced by the s390
perspective, where we only have a quite limited amount of different
models and device drivers. In the highly diverse PC world, module
load/unload may indeed be the better way, since it doesn't require
additional work.)

> if you want to make a seperate enable/disable function I think it would
> probably be more appropriate to use the power management functions.

It might. (s390 doesn't currently have power management.)
Greg KH
2007-05-09 17:07:42 UTC
Permalink
On Wed, May 09, 2007 at 09:18:10AM -0700, ***@lang.hm wrote:
> On Wed, 9 May 2007, Cornelia Huck wrote:
>
> > On Wed, 9 May 2007 02:25:10 -0700 (PDT),
> > ***@lang.hm wrote:
> >
> >> makeing up an example
> >>
> >> today the process would be
> >> call driver X to initialize it's devices.
> >>
> >> driver X follows these steps
> >>
> >> 1. see if a compatible chipset/device appears to be available
> >>
> >> 2. initialize the chipset/device (loading firmware)
> >>
> >> 3. query the now partially initialized chipset/device to see what specific
> >> options are enabled in the hardware
> >>
> >> 4. for each 'thing' that's enabled in hardware, initialize and register
> >> it.
> >
> > The sync ->probe() could return after step 1 (when it is clear the
> > driver wants to bind to the device). The remainder should be done in
> > the async function, so that the bus could go on with other devices. The
> > driver should then signal when it is done with its registrations.
>
> but then won't the devices get registered in a random order? (i.e. whenever
> the async portion finishes the probing and finds the details of what there
> is to register)

So? We have busses today that have devices get registered in random
order, our userspace tools can handle it just fine now.

> >> to acommodate both of these device models it seems to me that you can't
> >> defice either
> >> async followed by sync
> >> or
> >> sync followed by async
> >> but instead need to support the combined
> >> sync followed by async followed by sync.
> >>
> >> now, it's very possible that I'm mistaken and one of the two-part models
> >> can be used for everything, if so then it's definantly simpler to do so.
> >
> > Hm, maybe I'm dense, but I really don't understand the problem here.
> > Why should (optional) sync followed by (optional) async followed by bus
> > synchonization not cover all cases? The async portion can also do some
> > "sync" stuff and then signal completion, can't it?
> >
> > If you're worried about notifications to userspace, you could first
> > suppress the add uevent via uevent_suppress and then generate it when
> > you signal completion to the bus.
>
> I'm not worried about notification to userspace, I'm worried about devices
> getting registered during the async stage appearing in different orders on
> different boots due to different timeings.
>
> if you can identify the device well enough to register it quickly then the
> approach that Linus proposed works well, which is
>
> sync probe, identify devices, register devices
> async initialize devices
> wait for all async to complete
>
> however I'm talking about cases where you can't fully identify the devices
> (at least not well enough to register them with the kernel) and am saying
> that doing
>
> sync probe
> async initialize device, register device
> wait for all async to complete
>
> results in unpredictable device ordering, but if instead you did
>
> sync probe
> async initialize device
> sync wait for all async to complete, register device
>
> you can get back to predictable ordering

No, don't worry about that, the ordering can be made different due to
any number of different things (bios upgrade, new device added, phase of
the moon, etc.) So stop worrying about being predictable, we have never
really had that...

thanks,

greg k-h
Linus Torvalds
2007-05-09 17:25:57 UTC
Permalink
On Wed, 9 May 2007, Greg KH wrote:
> >
> > but then won't the devices get registered in a random order? (i.e. whenever
> > the async portion finishes the probing and finds the details of what there
> > is to register)
>
> So? We have busses today that have devices get registered in random
> order, our userspace tools can handle it just fine now.

However, that does *not* translate to: "..so we can do it for all buses".

People *do* depend on simple things like internal harddisks showing up in
a particular order. The fact that you *can* handle it with UUID's etc does
not mean that people do, or that they should be forced to do so.

There is no inherent goodness in making the bootup more random. In fact,
there's a lot of badness to it, ranging from just not being very nice to
"really a bitch to debug". And no amount of "user level _could_ handle it"
changes that.

Linus
Stefan Richter
2007-05-09 09:30:30 UTC
Permalink
Cornelia Huck wrote:
> On Wed, 9 May 2007 01:33:14 -0700 (PDT),
> ***@lang.hm wrote:
>
>> 1. why should different, unrelated busses need to wait for each other?
>> picking two, why can't you have SCSI and USB going through their timeouts
>> at the same time?
>
> If they don't have dependencies on each other, yes. Some busses should
> be finished before probing for others start (e.g. low-level busses).

If there are dependencies, then a driver on the lower-level bus simply
hot-inserts a bus to the higher-level bus's driver when the lower-level
probe is done.

E.g. PCI or CardBus detects insertion of a IEEE 1394 controller,
ohci1394 adds a new IEEE 1394 bus to ieee1394 core, ieee1394core
discovers e.g. storage devices on this bus and kicks off sbp2's probe,
sbp2 adds them to SCSI core, SCSI core does the inquiry and kicks off a
SCSI command set driver's probe. Meanwhile, userspace just waits for a
certain SCSI device to appear. (And userspace inserts drivers according
to uevents a.k.a. hotplug events if the drivers weren't statically linked.)
--
Stefan Richter
-=====-=-=== -=-= -=--=
http://arcgraph.de/sr/
Cornelia Huck
2007-05-08 20:51:59 UTC
Permalink
On Tue, 8 May 2007 13:01:21 -0700 (PDT),
Linus Torvalds <***@linux-foundation.org> wrote:

> Instead of changing existign probe functionality to be asynchronous, we
> could *add* a new and asynchronous part to it. For example, we could make
> the rule for PCI - or other bus - devices be:
>
> - the bus will *first* call the "probe()" function synchronously.
>
> - after that one has completed, we will call "probe_async()"
> asynchronously at some point (it ie might be scheduled immediately
> after the "probe()" call, but delayed by some arbitrary issues like
> just already having too many asynchronous probes on-going or similar)

Hm. This would mean that probe() would be the decision "I want to use
this device", while probe_async() would be "setup my device, may take
some time"? Could work.

> So an unmodified driver would basically work exactly like it does now, but
> if a driver is happy with being called asynchronously, it could just
> change it's
>
> .probe = mydriver_probe
>
> thing into a
>
> .probe_async = mydriver_probe
>
> and we can do that ona per-driver basis with that kind of really simple
> one-liner change.

But probe would return int, while probe_async would return void? Two
liner :)

>
> In fact, there is nothing wrong with having *both* a synchronous part, and
> an async part:
>
> .probe = mydriver_setup,
> .probe_async = mydriver_spin_up_and_probe_devices,
>
> and it would do basic setup (including, for example, the fast enumeration
> of whatever devices are connected) synchronously, but then do anything
> more in the async part - and the async part would still be guaranteed that
> the setup has been run by the time it is scheduled (but not really have
> any other guarantees).

Looks like an idea.

I'll sleep on it, just too tired to make sense.
David Miller
2007-05-08 21:41:54 UTC
Permalink
From: Linus Torvalds <***@linux-foundation.org>
Date: Tue, 8 May 2007 13:01:21 -0700 (PDT)

> In fact, there is nothing wrong with having *both* a synchronous part, and
> an async part:
>
> .probe = mydriver_setup,
> .probe_async = mydriver_spin_up_and_probe_devices,
...
> Hmm? Would something like this work? I dunno, but it seems a hell of a lot
> safer and more capable than the aborted PCI multithreaded probing that was
> an "all or nothing" approach.

I definitely agree that we need a transitonary approach to this.

Although I kind of preferred the idea you mentioned where the
device could launch the asynchronous probe and just return from
the normal ->probe() immediately.

This might get tricky if the callers do some kind of reference
counting or other resource management based upon the ->probe()
return value since it wouldn't know what happened to the
launched asynchronous probe when it returns from ->probe().
Greg KH
2007-05-09 09:55:32 UTC
Permalink
On Tue, May 08, 2007 at 02:41:54PM -0700, David Miller wrote:
> From: Linus Torvalds <***@linux-foundation.org>
> Date: Tue, 8 May 2007 13:01:21 -0700 (PDT)
>
> > In fact, there is nothing wrong with having *both* a synchronous part, and
> > an async part:
> >
> > .probe = mydriver_setup,
> > .probe_async = mydriver_spin_up_and_probe_devices,
> ...
> > Hmm? Would something like this work? I dunno, but it seems a hell of a lot
> > safer and more capable than the aborted PCI multithreaded probing that was
> > an "all or nothing" approach.
>
> I definitely agree that we need a transitonary approach to this.
>
> Although I kind of preferred the idea you mentioned where the
> device could launch the asynchronous probe and just return from
> the normal ->probe() immediately.

Yes, let this be a decision the individual PCI driver does, I don't want
to put this two-stage thing in the driver core, but any individual bus
can implement it if they really want to.

> This might get tricky if the callers do some kind of reference
> counting or other resource management based upon the ->probe()
> return value since it wouldn't know what happened to the
> launched asynchronous probe when it returns from ->probe().

As long as the ->probe() call returns that the driver has clamed the
device, and the ->remove() call can be handled properly while the driver
is off doing whatever it wants to in the initialization, the driver core
should work just fine, no changes needed.

thanks,

greg k-h
Duncan Sands
2007-05-09 08:14:16 UTC
Permalink
Hi,

> Instead of changing existign probe functionality to be asynchronous, we
> could *add* a new and asynchronous part to it. For example, we could make
> the rule for PCI - or other bus - devices be:
>
> - the bus will *first* call the "probe()" function synchronously.
>
> - after that one has completed, we will call "probe_async()"
> asynchronously at some point (it ie might be scheduled immediately
> after the "probe()" call, but delayed by some arbitrary issues like
> just already having too many asynchronous probes on-going or similar)

the usbatm USB ADSL modem drivers have this functionality. These drivers
need to load firmware before they become useful. The natural place to do
this is in the probe() method, but because firmware loading can take quite
some time (10 seconds, or even an infinite amount of time if the firmware
is not available and the timeout has been turned off) and would block the
USB hub thread if done from probe(), it's done in a separate kernel thread.
Clients of usbatm, like the speedtch driver, register themselves with usbatm,
providing a "bind" and a "heavy_init" method. "bind" is like probe(), while
"heavy_init" is like probe_async(). First bind is called, and if successful
and heavy_init has been defined, then heavy_init is run in its own thread.
If the device is unplugged, usbatm takes care of making sure that the heavy_init
thread has stopped before calling unbind and destroying device related structures.

A bunch of other USB drivers could do with similar functionality for the
same reason (slow probe), and there was some discussion about generalizing
this functionality to the USB layer but I didn't find time to do anything
about it yet. See http://marc.info/?l=linux-usb-devel&m=116551653026075&w=2

Best wishes,

Duncan.
Cornelia Huck
2007-05-09 08:45:41 UTC
Permalink
On Wed, 9 May 2007 10:14:16 +0200,
Duncan Sands <***@math.u-psud.fr> wrote:

> the usbatm USB ADSL modem drivers have this functionality. These drivers
> need to load firmware before they become useful. The natural place to do
> this is in the probe() method, but because firmware loading can take quite
> some time (10 seconds, or even an infinite amount of time if the firmware
> is not available and the timeout has been turned off) and would block the
> USB hub thread if done from probe(), it's done in a separate kernel thread.
> Clients of usbatm, like the speedtch driver, register themselves with usbatm,
> providing a "bind" and a "heavy_init" method. "bind" is like probe(), while
> "heavy_init" is like probe_async(). First bind is called, and if successful
> and heavy_init has been defined, then heavy_init is run in its own thread.
> If the device is unplugged, usbatm takes care of making sure that the heavy_init
> thread has stopped before calling unbind and destroying device related structures.
>
> A bunch of other USB drivers could do with similar functionality for the
> same reason (slow probe), and there was some discussion about generalizing
> this functionality to the USB layer but I didn't find time to do anything
> about it yet. See http://marc.info/?l=linux-usb-devel&m=116551653026075&w=2

Would a general split between probe() (check if we can handle the
device, do very basic stuff) and setup() (get the device up and
running) make sense? Drivers could stay with today's probe() function
if they want to (and still be working). setup() could be, but need not
be async. As an added benefit for huge systems, setup() might only be
called if explicitly requested (like "only do this heavy lifting
if/when we really want to use the device"). probe() could call bind()
and setup() (doing the firmware load etc.) heavy_init(). (This may be
orthogonal to the probe()/probe_async() idea.)
Duncan Sands
2007-05-09 09:16:06 UTC
Permalink
Hi Cornelia,

> > the usbatm USB ADSL modem drivers have this functionality. These drivers
> > need to load firmware before they become useful. The natural place to do
> > this is in the probe() method, but because firmware loading can take quite
> > some time (10 seconds, or even an infinite amount of time if the firmware
> > is not available and the timeout has been turned off) and would block the
> > USB hub thread if done from probe(), it's done in a separate kernel thread.
> > Clients of usbatm, like the speedtch driver, register themselves with usbatm,
> > providing a "bind" and a "heavy_init" method. "bind" is like probe(), while
> > "heavy_init" is like probe_async(). First bind is called, and if successful
> > and heavy_init has been defined, then heavy_init is run in its own thread.
> > If the device is unplugged, usbatm takes care of making sure that the heavy_init
> > thread has stopped before calling unbind and destroying device related structures.
> >
> > A bunch of other USB drivers could do with similar functionality for the
> > same reason (slow probe), and there was some discussion about generalizing
> > this functionality to the USB layer but I didn't find time to do anything
> > about it yet. See http://marc.info/?l=linux-usb-devel&m=116551653026075&w=2
>
> Would a general split between probe() (check if we can handle the
> device, do very basic stuff) and setup() (get the device up and
> running) make sense? Drivers could stay with today's probe() function
> if they want to (and still be working). setup() could be, but need not
> be async. As an added benefit for huge systems, setup() might only be
> called if explicitly requested (like "only do this heavy lifting
> if/when we really want to use the device"). probe() could call bind()
> and setup() (doing the firmware load etc.) heavy_init(). (This may be
> orthogonal to the probe()/probe_async() idea.)

yes that would be fine. Like with your setup() suggestion, it is possible
to specify whether heavy_init should be called or not. This is useful if
the firmware has already been uploaded - the probe() can detect this and
signal that heavy_init is not needed.

The main difficulties encountered with usbatm were: (1) avoiding race
conditions on device disconnect; (2) providing a way to cancel heavy_init.
As mentioned in my original email, heavy_init could run forever depending
on how the system is configured. In the current implementation a signal
is sent on disconnect(), to indicate that the game is up, then disconnect()
blocks until heavy_init terminates, if it hasn't already. A better method
would be to allow registration of a cancel() routine, which disconnect
would call before blocking on heavy_init termination. This could send a
signal, or complete a completion, or whatever. Sadly, the generic firmware
loading infrastructure doesn't have a way of asynchronously cancelling a
firmware load, so right now cancellation only does something if the firmware
has already been loaded from disk, in which case it cancels the upload to the
modem (which is the most time consuming part).

Best wishes,

Duncan.
Cornelia Huck
2007-05-09 12:37:35 UTC
Permalink
On Wed, 9 May 2007 11:16:06 +0200,
Duncan Sands <***@math.u-psud.fr> wrote:

> The main difficulties encountered with usbatm were: (1) avoiding race
> conditions on device disconnect; (2) providing a way to cancel heavy_init.
> As mentioned in my original email, heavy_init could run forever depending
> on how the system is configured. In the current implementation a signal
> is sent on disconnect(), to indicate that the game is up, then disconnect()
> blocks until heavy_init terminates, if it hasn't already. A better method
> would be to allow registration of a cancel() routine, which disconnect
> would call before blocking on heavy_init termination. This could send a
> signal, or complete a completion, or whatever.

Yes, a cancel routine would make sense also in the generic
probe()/setup() case.

> Sadly, the generic firmware
> loading infrastructure doesn't have a way of asynchronously cancelling a
> firmware load, so right now cancellation only does something if the firmware
> has already been loaded from disk, in which case it cancels the upload to the
> modem (which is the most time consuming part).

Hm, a firmware load cancellation interface looks like a worthwile
addition to the firmware code (especially since fw_load_abort() already
exists and can even be triggered by userspace).
Cornelia Huck
2007-05-08 20:36:53 UTC
Permalink
On Tue, 8 May 2007 12:31:22 -0700 (PDT),
Linus Torvalds <***@linux-foundation.org> wrote:

> > What if two devices become hotplugged at the same time? Does this imply
> > that the bus _always_ needs to do some serializing there?
>
> You want to test it? Be my guest.

It was more a rhetorical question. If not all drivers can handle it,
serialize at the bus level. But if all drivers can, there's no reason
for the bus to do it. On a bus with a small number of drivers, I
prefer to have the drivers in good shape. (This kind of implies that e.
g. pci needs to serialize, but css/ccw doesn't, and doesn't have to.)

> But the final nail in the coffin is that doing it at the bus level is
> INFERIOR. As you yourself admitted, it's better to do some things
> synchronously.

How so? It is possible to do some stuff synchronously. But why
shouldn't other things be done asynchronously?
(And there is a place both for bus-level parallelism and driver-level
parallelism. Just depends on your bus/driver/device.)
Greg KH
2007-05-09 09:53:02 UTC
Permalink
On Tue, May 08, 2007 at 06:38:46PM +0200, Cornelia Huck wrote:
> On Tue, 8 May 2007 08:27:34 -0700 (PDT),
> Linus Torvalds <***@linux-foundation.org> wrote:
>
> > And no, we should not do it at the device core level. In fact, I don't
> > think we should do it at that level at all.
> >
> > I'm pretty sure that the performance problems are at individual device
> > drivers, and that the right solution is to thread at *that* level. Not
> > higher up.
>
> These are two different problems:
>
> 1. Probing taking long for individual device drivers. I agree, this
> should be solved at the driver level.

For your bus perhaps, but not for PCI.

> 2. Sheer volume of devices on a bus. Even if the indivdual probing
> doesn't take long, having all devices probed one after the other may
> take a lot of time. Putting the actual probe on a thread makes it
> possible to run several probes in parallel, thereby cutting probing
> time.

Again, not for PCI, right?

If you want to implement this for your bus type, fine, I have no
objection to that at all, but not for PCI, it's just not worth it.

thanks,

greg k-h
Cornelia Huck
2007-05-09 13:38:13 UTC
Permalink
On Wed, 9 May 2007 02:53:02 -0700,
Greg KH <***@kroah.com> wrote:

> > 2. Sheer volume of devices on a bus. Even if the indivdual probing
> > doesn't take long, having all devices probed one after the other may
> > take a lot of time. Putting the actual probe on a thread makes it
> > possible to run several probes in parallel, thereby cutting probing
> > time.
>
> Again, not for PCI, right?

It seems that everyone agrees now that moving PCI over to a new probing
model without individual driver support was a bad idea. So generic
multithreaded probing is dead.

> If you want to implement this for your bus type, fine, I have no
> objection to that at all, but not for PCI, it's just not worth it.

Infrastructure for async probing might not be such a bad idea, though.
(Aren't there some huge PCI-based machines?) However, I don't really
care whether PCI or SCSI or $HUGE_BUS use it, but serial synchronous
probing on a bus looks like a killer on most large systems.
Greg KH
2007-05-09 16:42:07 UTC
Permalink
On Wed, May 09, 2007 at 03:38:13PM +0200, Cornelia Huck wrote:
> On Wed, 9 May 2007 02:53:02 -0700,
> Greg KH <***@kroah.com> wrote:
>
> > > 2. Sheer volume of devices on a bus. Even if the indivdual probing
> > > doesn't take long, having all devices probed one after the other may
> > > take a lot of time. Putting the actual probe on a thread makes it
> > > possible to run several probes in parallel, thereby cutting probing
> > > time.
> >
> > Again, not for PCI, right?
>
> It seems that everyone agrees now that moving PCI over to a new probing
> model without individual driver support was a bad idea. So generic
> multithreaded probing is dead.

Why is it dead? Since when is PCI the only bus in the system?

> > If you want to implement this for your bus type, fine, I have no
> > objection to that at all, but not for PCI, it's just not worth it.
>
> Infrastructure for async probing might not be such a bad idea, though.
> (Aren't there some huge PCI-based machines?) However, I don't really
> care whether PCI or SCSI or $HUGE_BUS use it, but serial synchronous
> probing on a bus looks like a killer on most large systems.

PCI probing is just reading stuff from memory. The slowness happens
when storage devices need to spin up and/or discover attached drives.

So, the async stuff can be done in those buses if wanted/needed, PCI is
not the problem here.

thanks,

greg k-h
d***@lang.hm
2007-05-09 16:50:09 UTC
Permalink
On Wed, 9 May 2007, Greg KH wrote:

>>>> doesn't take long, having all devices probed one after the other may
>>>> take a lot of time. Putting the actual probe on a thread makes it
>>>> possible to run several probes in parallel, thereby cutting probing
>>>> time.
>>>
>>> Again, not for PCI, right?
>>
>> It seems that everyone agrees now that moving PCI over to a new probing
>> model without individual driver support was a bad idea. So generic
>> multithreaded probing is dead.
>
> Why is it dead? Since when is PCI the only bus in the system?

correct, while this discussion started over the PCI_MULTITHREAD_PROBE it's
focusing on the generic problem that would apply to any bus

>>> If you want to implement this for your bus type, fine, I have no
>>> objection to that at all, but not for PCI, it's just not worth it.
>>
>> Infrastructure for async probing might not be such a bad idea, though.
>> (Aren't there some huge PCI-based machines?) However, I don't really
>> care whether PCI or SCSI or $HUGE_BUS use it, but serial synchronous
>> probing on a bus looks like a killer on most large systems.
>
> PCI probing is just reading stuff from memory. The slowness happens
> when storage devices need to spin up and/or discover attached drives.
>
> So, the async stuff can be done in those buses if wanted/needed, PCI is
> not the problem here.

I don't think anyone is saying that PCI is the problem here. We're trying
to identify what needs to be done for any arbatrary bus, if a particular
bus doesn't need to split it's initialization up becouse there are no
delays (and therefor no gains for multithreading) then drivers for that
bus don't bother to implement any async portion.

David Lang
Cornelia Huck
2007-05-09 17:14:35 UTC
Permalink
On Wed, 9 May 2007 09:50:09 -0700 (PDT),
***@lang.hm wrote:

> On Wed, 9 May 2007, Greg KH wrote:

> > Why is it dead? Since when is PCI the only bus in the system?
>
> correct, while this discussion started over the PCI_MULTITHREAD_PROBE it's
> focusing on the generic problem that would apply to any bus

s/multithreaded/async/ (more generic and better describing what we
want imo) That's why I said the multithreaded approach was dead (sorry
if that was confusing).

> I don't think anyone is saying that PCI is the problem here. We're trying
> to identify what needs to be done for any arbatrary bus, if a particular
> bus doesn't need to split it's initialization up becouse there are no
> delays (and therefor no gains for multithreading) then drivers for that
> bus don't bother to implement any async portion.

Exactly. A general idea/framework on how to do async probing would be
good.
Linus Torvalds
2007-05-09 17:09:14 UTC
Permalink
On Wed, 9 May 2007, Greg KH wrote:
>
> Why is it dead? Since when is PCI the only bus in the system?

Quite frankly, any probing strategy that isn't relevant to PCI simply
isn't relevant - full stop!

No, it's not the only bus, but if something isn't relevant to PCI it
shouldn't be in any general bus layer abstraction. PCI is _that_ dominant
(and all the modern variations are just extensions on PCI, PCI didn't go
away just because it's called PCI-X or whatever).

I think buses like SCSI, USB etc are valid things to worry about as being
very different from PCI, but they are not something that the generic bus
probing code (as exemplified by the commit in question) should know/care
about. Whatever probing semantics that a SCSI adapter ends up using for
probing its bus should be up to the SCSI layer, and there is no point in
thinking that it should be a "generic bus" abstraction.

In fact, different SCSI adapters would likely have different rules. iSCSI
probably won't have anything in common with "normal" SCSI when it comes to
probing, for example.

Linus
Phillip Susi
2007-05-09 22:21:23 UTC
Permalink
***@lang.hm wrote:
> is it really enough to do the sync followed by the async piece?
>
> it seems to me that there may be a need for three steps
>
> 1. prep (sync, what we do today)
> which will all complete before
> 2. parallel (async)
> which will all complete before
> 3. cleanup (sync)

I agree. Considering the particular case of scsi disks, you want to do
the long INQUIRY command to probe each target in phase 2, so that this
time consuming step can be done in parallel between multiple scsi
busses. After the INQUIRY though, you have to register with the scsi
layer and assign it a minor number and so on, and this you want to do
with some predictability, so you need to do this in phase 3.

Take a hypothetical system containing a PCI bus with a video card and
two scsi cards in it, and each scsi bus has a hard disk on it. The
sequence would go:

1) PCI Bus invokes phase 1 on video card, scsi a, scsi b, in that order
2) PCI Bus invokes phase 2 on video card, scsi a, scsi b, each in their
own thread
2.1) video card uploads firmware
2.2) scsi bus A issues INQUERY commands to each target to detect devices
2.3) scsi bus B issues INQUERY commands to each target to detect devices
3) PCI Bus invokes phase 3 on video card, scsi a, scsi b, in that order
3.1) video card registers with the frame buffer layer
3.2) scsi bus A registers its detected disk with the scsi layer
3.3) scsi bus B registers its detected disk with the scsi layer


Because 3.2 and 3.3 always happen in order, /dev/sda will stay /dev/sda
even if 2.3 finishes before 2.2. Likewise if you add a new video card
to PCI slot 4 which does not require a firmware upload, it will not
displace the existing video card from /dev/fb0 because phase 3 will take
place for the first card first, even if phase 2 completed first on the
new card.
Stefan Richter
2007-05-09 22:37:56 UTC
Permalink
Phillip Susi wrote:
> Considering the particular case of scsi disks, you want to do
> the long INQUIRY command to probe each target in phase 2, so that this
> time consuming step can be done in parallel between multiple scsi
> busses.

The SCSI stack already has infrastructure for multi-threaded discovery
and probing.
--
Stefan Richter
-=====-=-=== -=-= -=-=-
http://arcgraph.de/sr/
Phillip Susi
2007-05-10 14:23:52 UTC
Permalink
Stefan Richter wrote:
> The SCSI stack already has infrastructure for multi-threaded discovery
> and probing.

So? It would still benefit from using a generic framework that other
buses can use as well. Not to mention that the current scsi specific
framework tends to cause unstable names doesn't it?
Stefan Richter
2007-05-10 14:55:54 UTC
Permalink
Phillip Susi wrote:
> Stefan Richter wrote:
>> The SCSI stack already has infrastructure for multi-threaded discovery
>> and probing.
>
> So? It would still benefit from using a generic framework that other
> buses can use as well.

Perhaps, perhaps not. Many details of the If and How of asynchronous,
parallelized probing rest with the low-level drivers.

[BTW, which ever team attempts to design this generic framework please
brings in detailed knowledge of a variety of bus architectures. I for
one would like to contribute with what I know about IEEE 1394, but
before that I still have to experiment on my own before I have a good
understanding of how to parallelize IEEE 1394 scanning and probing, and
the IEEE 1394 core is being radically reworked at the moment anyway.]

> Not to mention that the current scsi specific
> framework tends to cause unstable names doesn't it?

No. Reality causes "unstable" names. Or more precisely, we ultimately
can't get persistent names from dumb enumerations according to probing
order. We get persistent names by reading persistent device properties
in userspace and by letting userspace rename or give aliases to devices.

Names based on probing order are fine for really simplistic setups like
the famous Aunt Tilly's home computer --- but not for the general case
that you are trying to cover.

(Sorry for repeating what has been said before in this thread.)
--
Stefan Richter
-=====-=-=== -=-= -=-=-
http://arcgraph.de/sr/
Cornelia Huck
2007-05-11 07:22:20 UTC
Permalink
On Thu, 10 May 2007 16:55:54 +0200,
Stefan Richter <***@s5r6.in-berlin.de> wrote:

> Phillip Susi wrote:
> > Stefan Richter wrote:
> >> The SCSI stack already has infrastructure for multi-threaded discovery
> >> and probing.
> >
> > So? It would still benefit from using a generic framework that other
> > buses can use as well.
>
> Perhaps, perhaps not. Many details of the If and How of asynchronous,
> parallelized probing rest with the low-level drivers.

A mix of bus/driver parallelism would probably be the most flexible
approach.

> [BTW, which ever team attempts to design this generic framework please
> brings in detailed knowledge of a variety of bus architectures. I for
> one would like to contribute with what I know about IEEE 1394, but
> before that I still have to experiment on my own before I have a good
> understanding of how to parallelize IEEE 1394 scanning and probing, and
> the IEEE 1394 core is being radically reworked at the moment anyway.]

I guess I'm too tainted by s390 :/ (which in comparison provides a
quite unified way at probing), but I'd be happy to contibute my
experiences as well.
David Miller
2007-05-08 21:15:54 UTC
Permalink
From: Linus Torvalds <***@linux-foundation.org>
Date: Tue, 8 May 2007 08:27:34 -0700 (PDT)

> Threading at the bus level just inevitably means things like random
> numbers for devices depending on some timing/scheduling issue. That's
> nasty.

I hadn't considered this issue, so ignore the other reply I made to
this thread. Although as an aside I'm starting to become of an
opinion that device numbering doesn't matter. Every device should
have a unique ID of sorts, or a unique physical location, and that
should factor into the thing users use to refer to the object.

Whether that's an ethernet MAC address, a PCI domain:bus:devfn tuple,
or some unique ID written in the disk label or partition header, it
doesn't matter.

If that gets translated into "user_happy_sunny_day_disk0" to make
references simpler for the user, that's fine. But probe ordering
should never break these mappings even if "user_happy_sunny_day_disk1"
is spotted by probing before "user_happy_sunny_day_disk0". If the
system is designed properly, that should never matter, but for us
it still does.

Those mappings should exist statically somewhere, not depend upon
something as frivolous as probe ordering.

Anyways, it would be nice, however, to really deal with the case like
when the IDE layer is waiting for a probe to port X to timeout,
meanwhile we could be initializing the networking card.

Another bad case is, as you mentioned, SCSI bus resets and SAS/FC
fabric scans. Those take several seconds if not longer and it's
really stupid to not be able to do other things during that time.
Stefan Richter
2007-05-08 22:19:22 UTC
Permalink
David Miller wrote:
[persistent unique device property -> persistent unique device name]
> Those mappings should exist statically somewhere, not depend upon
> something as frivolous as probe ordering.

We do this all the time for hotpluggable hardware. And bus types
without hotplugging capability become increasingly rare nowadays. :-)
--
Stefan Richter
-=====-=-=== -=-= -=---
http://arcgraph.de/sr/
Greg KH
2007-05-09 09:46:47 UTC
Permalink
On Tue, May 08, 2007 at 02:15:54PM -0700, David Miller wrote:
> From: Linus Torvalds <***@linux-foundation.org>
> Date: Tue, 8 May 2007 08:27:34 -0700 (PDT)
>
> > Threading at the bus level just inevitably means things like random
> > numbers for devices depending on some timing/scheduling issue. That's
> > nasty.
>
> I hadn't considered this issue, so ignore the other reply I made to
> this thread. Although as an aside I'm starting to become of an
> opinion that device numbering doesn't matter. Every device should
> have a unique ID of sorts, or a unique physical location, and that
> should factor into the thing users use to refer to the object.

We pretty much already do this today.

For block devices, as an example, look at /dev/disk/ which udev creates
so that you can handle block devices being discovered in any order
possible.

> Anyways, it would be nice, however, to really deal with the case like
> when the IDE layer is waiting for a probe to port X to timeout,
> meanwhile we could be initializing the networking card.
>
> Another bad case is, as you mentioned, SCSI bus resets and SAS/FC
> fabric scans. Those take several seconds if not longer and it's
> really stupid to not be able to do other things during that time.

Yes, because of that, I think this kind of multi-probe stuff should be
done in the IDE/SATA/SCSI bus code, not in the PCI code, as PCI
"normally" does not have any speed issues.

thanks,

greg k-h
Linus Torvalds
2007-05-08 15:15:22 UTC
Permalink
On Tue, 8 May 2007, Cornelia Huck wrote:
>
> As I wanted to add multithreaded probing for some s390 busses, I
> discovered that the commit above removed the multithreaded probing
> infrastructure again, while just some days before some of my patches
> reworking it had been merged...

No.

I am *not* reverting that patch. The whole PCI bus-level multi-threaded
stuff was an incredibly broken idea.

If you have problems with probing, it's not going to be about the host
bus. It might be things like SCSI buses etc, and if somebody makes *those*
be more threaded, I'm ok with it. But I'm not going to accept another
broken PCI bus probe threading patch.

> (OK, PCI_MULTITHREAD_PROBE should not depend on BROKEN, but on
> EXPERIMENTAL with the reworked probing infrastructure. This got mixed
> up, I can send a patch that changes it.)

No. No. No.

Probing the core bus was broken. Not "experimental". Broken. It not only
threaded the wrogn area entirely, it also totally screwed up several
driver subsystems (like IDE).

Linus
Continue reading on narkive:
Loading...