Regression from 2.6.26: Hibernation (possibly suspend) broken on Toshiba R500 (bisected)

Rafael J. Wysocki

2008-12-02 02:20:31 UTC

Hi Linus,

For some time I've been having problems with resume from hibernation and
suspend on the Toshiba Portege R500 I'm currently testing. Initially I thought
that was a regression from 2.6.27, because some 2.6.27-based kernels appeared
to work correctly on this box, but today I realized that in fact 2.6.27-rc6
failed too and then I confirmed that the problem was also present in 2.6.27
and in all of the -stable 2.6.27.y kernels. Still, I was unable to reproduce
the problem with the 2.6.27-rc3 kernel and that made me carry out bisection
between 2.6.27-rc3 and 2.6.27-rc6 that turned up the following commit of yours:

commit 5f17cfce5776c566d64430f543a289e5cfa4538b
Author: Linus Torvalds <***@linux-foundation.org>
Date: Thu Sep 4 01:33:59 2008 -0700

PCI: fix pbus_size_mem() resource alignment for CardBus controllers

Following this, I applied the appended patch on top of the current mainline
and it appears to have fixed my hibernation/resume problems on this box
(at least, with the patch applied the box has survived ~20 hibernation/resume
and suspend/resume cycles in a row, which was not achievable with the mainline
without the patch).

The symptoms of the breakage are that sometimes the box hangs solid during
resume, sometimes it hangs but can be rebooted by pressing Alt-SysRq-b, and
sometimes it just powers off while resuming. Still, it resumes correctly in
about 75% of cases and that made the issue very hard to debug. [Interestingly
enough, it was not reproducible with snd_hda_intel unloaded, which made me
think it was related to the driver, but evidently it wasn't.] Also, I'm sure
hibernation is affected, but recently there have been some other sources of
breakage of resume from suspend to RAM, so I'm not so sure to what extent it
is affected too.

Please let me know if you need debug information from the affected box.

Thanks,
Rafael

Signed-off-by: Rafael J. Wysocki <***@sisk.pl>
---
drivers/pci/setup-bus.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -352,7 +352,7 @@ static int pbus_size_mem(struct pci_bus
continue;
r_size = resource_size(r);
/* For bridges size != alignment */
- align = resource_alignment(r);
+ align = (i < PCI_BRIDGE_RESOURCES) ? r_size : r->start;
order = __ffs(align) - 20;
if (order > 11) {
dev_warn(&dev->dev, "BAR %d bad alignment %llx: "

Linus Torvalds

2008-12-02 03:32:02 UTC

Post by Rafael J. Wysocki
r_size = resource_size(r);
/* For bridges size != alignment */
- align = resource_alignment(r);
+ align = (i < PCI_BRIDGE_RESOURCES) ? r_size : r->start;

Hmm. This means that something set the alignment flags incorrectly. The
resource _should_ have IORESOURCE_SIZEALIGN set for a resource with size
alignment, and IORESOURCE_STARTALIGN for one that has start alignment.
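
As an illustration of how the two flags are meant to select the alignment, here is a minimal user-space sketch. It is not the kernel source: `struct res`, `res_size()` and `res_alignment()` are hypothetical stand-ins, and the flag values are assumptions chosen to be consistent with the flags value 21200 printed later in this thread.

```c
#include <stdint.h>

/* Assumed flag values for illustration; not copied from a kernel header. */
#define SIZEALIGN  0x00020000u  /* alignment equals the resource size */
#define STARTALIGN 0x00040000u  /* alignment is stored in ->start */

/* Hypothetical stand-in for the kernel's struct resource. */
struct res {
	uint64_t start;
	uint64_t end;           /* inclusive end address */
	unsigned long flags;
};

static uint64_t res_size(const struct res *r)
{
	return r->end - r->start + 1;
}

/* Pick the alignment according to whichever flag is set; 0 means "unknown". */
static uint64_t res_alignment(const struct res *r)
{
	switch (r->flags & (SIZEALIGN | STARTALIGN)) {
	case SIZEALIGN:
		return res_size(r);
	case STARTALIGN:
		return r->start;
	default:
		return 0;
	}
}
```

Under these assumptions, a resource 0-3ffffff with flags 21200 is size-aligned and reports an alignment of 4000000, whereas reading the alignment from `start` would give 0.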

Your patch doesn't fix anything, it just hides the bug.

It would be good to hear what resource this is, and where it got set. So
instead of that broken patch that just hides the problem, please try to
debug it with something like

	resource_size_t expected_align;

	expected_align = (i < PCI_BRIDGE_RESOURCES) ? r_size : r->start;
	align = resource_alignment(r);
	if (align != expected_align) {
		dev_warn(&dev->dev,
			"BAR %d %llx-%llx wrong alignment flags %lx %llx (%llx)\n",
			i,
			(unsigned long long) r->start,
			(unsigned long long) r->end,
			r->flags,
			(unsigned long long) align,
			(unsigned long long) expected_align);
		/* Hacky and wrong, but trying to keep things working */
		align = expected_align;
	}

or something like that. And then we just need to figure out which setup
routine sets the wrong alignment flag.

Linus

Linus Torvalds

2008-12-02 03:42:48 UTC

Post by Linus Torvalds
or something like that. And then we just need to figure out which setup
routine sets the wrong alignment flag.

Oh, btw, one more thing: since it apparently sometimes _does_ resume
from hibernation without all this, I'd also like to see the actual
differences in /proc/ioports and /proc/iomem that happen as a result of
the different alignment.

I also really suspect we should add a whole "alignment" field to "struct
resource", instead of the size-vs-start flags. The fact is, some PCI
devices have alignment that is tied neither to size nor to anything else: I
think some PCI bus resources are really always 4kB-aligned, for example
(and aligning them by size will give a bigger alignment than actually
required).
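
To make the size-versus-explicit-alignment point concrete, a small sketch follows. Everything in it is hypothetical (`struct window_req`, `align_up()` and `place()` are illustrative names, not kernel interfaces); it only shows how a request that carries its own alignment could be placed.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Hypothetical request that carries an explicit alignment next to its size. */
struct window_req {
	uint64_t size;
	uint64_t align;
};

/* First address at or above 'from' where the request may start. */
static uint64_t place(const struct window_req *w, uint64_t from)
{
	return align_up(from, w->align);
}
```

With a 64 MB window and the first free address at 0x84001000, aligning by size pushes the start to 0x88000000, while an explicit 4 kB alignment lets it start at 0x84001000.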

Linus

Frans Pop

2008-12-02 04:31:50 UTC

Post by Linus Torvalds
Oh, btw, one more thing: since it apparently sometimes _does_ resume
from hibernation without all this, I'd also like to see the actual
differences in /proc/ioports and /proc/iomem that happen as a result of
the different alignment.

You're in luck. I still had /proc/io* contents from .28-rc3 lying around
from working on some other issue.

Here's the relevant diff for iomem; there's no diff for ioports.

--- iomem_2.6.28-rc3 2008-11-03 10:59:37.000000000 +0100
+++ iomem_2.6.28-rc6_linus 2008-12-02 05:20:31.000000000 +0100
@@ -10,10 +10,9 @@
7e7b0000-7e7c53ff : reserved
7e7c5400-7e7e7fb7 : ACPI Non-volatile Storage
7e7e7fb8-7effffff : reserved
-80000000-83ffffff : PCI Bus 0000:02
- 80000000-83ffffff : PCI CardBus 0000:03
-84000000-87ffffff : PCI CardBus 0000:03
-88000000-88000fff : Intel Flush Page
+80000000-83ffffff : PCI CardBus 0000:03
+84000000-84000fff : Intel Flush Page
+84400000-847fffff : PCI CardBus 0000:03
d0000000-dfffffff : 0000:00:02.0
d0000000-d076ffff : vesafb
e0000000-e00fffff : PCI Bus 0000:10

I've tried a few quick suspend/resume cycles and no failures so far, but
that's not really conclusive yet.

Besides snd_hda_intel I've also been unloading e1000e before suspend
because I thought it contributed to resume failures. I'm now keeping that
loaded as well. Will report results.

Cheers,
FJP

Linus Torvalds

2008-12-02 04:46:20 UTC

Post by Frans Pop
You're in luck. I still had /proc/io* contents from .28-rc3 lying around
from working on some other issue.
Here's the relevant diff for iomem; there's no diff for ioports.
--- iomem_2.6.28-rc3 2008-11-03 10:59:37.000000000 +0100
+++ iomem_2.6.28-rc6_linus 2008-12-02 05:20:31.000000000 +0100
@@ -10,10 +10,9 @@
7e7b0000-7e7c53ff : reserved
7e7c5400-7e7e7fb7 : ACPI Non-volatile Storage
7e7e7fb8-7effffff : reserved
-80000000-83ffffff : PCI Bus 0000:02
- 80000000-83ffffff : PCI CardBus 0000:03
-84000000-87ffffff : PCI CardBus 0000:03
-88000000-88000fff : Intel Flush Page
+80000000-83ffffff : PCI CardBus 0000:03
+84000000-84000fff : Intel Flush Page
+84400000-847fffff : PCI CardBus 0000:03
d0000000-dfffffff : 0000:00:02.0
d0000000-d076ffff : vesafb
e0000000-e00fffff : PCI Bus 0000:10

I'm not seeing how this could matter. In the latter one, we apparently
don't set up any PCI bus memory window, but I bet it's a transparent
bridge, and it shouldn't matter. IOW, your dmesg probably has a line like
this somewhere:

pci 0000:00:1e.0: transparent bridge

and whether there is an explicit bus window or not is simply immaterial.

That said, if you can show the differences in dmesg from the two cases, it
would probably be interesting to see why it happens that way. Why did we
bother setting up that PCI bus window in -rc3 at all? Was it there from
the beginning?

Post by Frans Pop
I've tried a few quick suspend/resume cycles and no failures so far, but
that's not really conclusive yet.
Besides snd_hda_intel I've also been unloading e1000e before suspend
because I thought it contributed to resume failures. I'm now keeping that
loaded as well. Will report results.

There were some HDA fixes recently, although I don't think they should
matter for suspend/resume (well, at least a couple of them should fix the
case of sound being _silent_ on resume, but shouldn't have caused any
other issues).

Linus

Frans Pop

2008-12-02 05:29:24 UTC

Post by Linus Torvalds
That said, if you can show the differences in dmesg from the two cases,
it would probably be interesting to see why it happens that way. Why
did we bother setting up that PCI bus window in -rc3 at all? Was it
there from the beginning?

Attached is a full diff between dmesg from -rc3 and -rc6 with your debug
patch.

I've cleaned up the diff a bit to make it more readable (mostly removal of
changes that I always get due to random USB load order changes - UHCI
still frequently loads before EHCI).

The most interesting points are probably at lines 298-346 and 639-649.

At the bottom there's a fairly long addition from the few suspend/resume
cycles I did (again, running with the debug patch).

Frans Pop

2008-12-02 05:56:59 UTC

+yenta_cardbus 0000:02:06.0: CardBus bridge, secondary bus 0000:03
+yenta_cardbus 0000:02:06.0: IO window: 0x003000-0x0030ff
+yenta_cardbus 0000:02:06.0: IO window: 0x003400-0x0034ff
+yenta_cardbus 0000:02:06.0: PREFETCH window: 0x84400000-0x847fffff
+yenta_cardbus 0000:02:06.0: MEM window: 0x80000000-0x83ffffff

Wild speculation, but could all this possibly also be related to occasional
"irq 19: nobody cared" errors I'm seeing on resume for ohci1394?

My FireWire controller is sitting behind this bridge after all.

Here's an example:
Dec 2 04:57:32 aragorn kernel: irq 19: nobody cared (try booting with the "irqpoll" option)
Dec 2 04:57:32 aragorn kernel: Pid: 0, comm: swapper Not tainted 2.6.28-rc6 #57
Dec 2 04:57:32 aragorn kernel: Call Trace:
Dec 2 04:57:32 aragorn kernel: <IRQ> [<ffffffffa01009e1>] ? ohci_irq_handler+0x60/0x7e9 [ohci1394]
Dec 2 04:57:32 aragorn kernel: [<ffffffff8026aa29>] __report_bad_irq+0x38/0x87
Dec 2 04:57:32 aragorn kernel: [<ffffffff8026ab86>] note_interrupt+0x10e/0x174
Dec 2 04:57:32 aragorn kernel: [<ffffffff8026b23e>] handle_fasteoi_irq+0xa7/0xd1
Dec 2 04:57:32 aragorn kernel: [<ffffffff8020eb87>] do_IRQ+0x73/0xe4
Dec 2 04:57:32 aragorn kernel: [<ffffffff8020c626>] ret_from_intr+0x0/0xa
Dec 2 04:57:32 aragorn kernel: <EOI> [<ffffffffa0012606>] ? acpi_idle_enter_bm+0x26b/0x2b2 [processor]
Dec 2 04:57:32 aragorn kernel: [<ffffffffa00125fc>] ? acpi_idle_enter_bm+0x261/0x2b2 [processor]
Dec 2 04:57:32 aragorn kernel: [<ffffffff8024f33b>] ? notifier_call_chain+0x33/0x5b
Dec 2 04:57:32 aragorn kernel: [<ffffffff803b9ca4>] ? cpuidle_idle_call+0x8c/0xc4
Dec 2 04:57:32 aragorn kernel: [<ffffffff8020b312>] ? cpu_idle+0x4a/0x9a
Dec 2 04:57:32 aragorn kernel: [<ffffffff8042c528>] ? rest_init+0x5c/0x5e
Dec 2 04:57:32 aragorn kernel: handlers:
Dec 2 04:57:32 aragorn kernel: [<ffffffffa0100981>] (ohci_irq_handler+0x0/0x7e9 [ohci1394])
Dec 2 04:57:32 aragorn kernel: Disabling IRQ #19

I can send full kernel log from the start of that boot if desired.

Cheers,
FJP

Linus Torvalds

2008-12-02 15:46:38 UTC

Post by Frans Pop
Attached is a full diff between dmesg from -rc3 and -rc6 with your debug
patch.
I've cleaned up the diff a bit to make it more readable (mostly removal of
changes that I always get due to random USB load order changes - UHCI
still frequently loads before EHCI).
The most interesting points are probably at lines 298-346 and 639-649.

So, it looks like you have MSI enabled in -rc6, and not in -rc3. And yes,
for some reason -rc3 will create the prefetchable memory range windows,
but -rc6 won't.

I have to admit that I'm not seeing the _why_ of that latter one. I don't
think we've done any resource allocation changes since -rc3 (the "clean up
late e820 resource allocation" thing happened just _before_ -rc3), so I'm
really not seeing why -rc3 would act differently from -rc6..

Post by Frans Pop
At the bottom there's a fairly long addition from the few suspend/resume
cycles I did (again, running with the debug patch).

Sure. Quite frankly, from these messages, I'm not seeing anything really
even remotely wrong. And apparently it does actually work for you.

It would perhaps be more interesting to see if there is some dmesg
difference in a boot that then ends up _not_ able to resume from
hibernation? But apparently that hasn't happened to you lately?

I don't like not knowing why you have prefetchable windows in one, and not
in the other, but it is indeed a transparent bridge and so that difference
really shouldn't even matter.

Do you perhaps dual-boot that laptop? What can sometimes happen is that
PCI resources do not get totally reset over a warm-boot.

We've (very occasionally) had situations where PCI resource bugs only
happen when you warm-boot from another OS (generally Windows), or when you
warm-boot from an earlier version of Linux. Exactly because some firmware
didn't fully re-initialize the state of the PCI bus, and because Linux
will try to honor everything that the firmware set up..

Linus

Frans Pop

2008-12-02 17:46:48 UTC

Post by Linus Torvalds
So, it looks like you have MSI enabled in -rc6, and not in -rc3. And
yes, for some reason -rc3 will create the prefetchable memory range
windows, but -rc6 won't.

I have no changes in my config that would explain that (checked the diff).
The only changes are simple 'make oldconfig' updates.

Post by Linus Torvalds
I have to admit that I'm not seeing the _why_ of that latter one. I don't
think we've done any resource allocation changes since -rc3 (the "clean
up late e820 resource allocation" thing happened just _before_ -rc3),
so I'm really not seeing why -rc3 would act differently from -rc6..

I could reinstall some intermediate versions and check when that change
got introduced if that would help, or revert to the kernel I was using
before I did the pull today and applied the debug patch (which was plain
rc6 + a few selected bug fix patches that had not yet been merged).

Post by Linus Torvalds

Post by Frans Pop
At the bottom there's a fairly long addition from the few
suspend/resume cycles I did (again, running with the debug patch).

Sure. Quite frankly, from these messages, I'm not seeing anything
really even remotely wrong. And apparently it does actually work for
you.

Right. All these resumes were perfect.

Post by Linus Torvalds
It would perhaps be more interesting to see if there is some dmesg
difference in a boot that then ends up _not_ able to resume from
hibernation? But apparently that hasn't happened to you lately?

It did happen once a few days ago, but with the workarounds I had resume
failures were extremely rare. I'll see if I can work out which boot it
was, but that could well be very tricky as a failed resume leaves no
trace.

The way I "see" the failure is that the wireless LED does not come on.

Post by Linus Torvalds
I don't like not knowing why you have prefetchable windows in one, and
not in the other, but it is indeed a transparent bridge and so that
difference really shouldn't even matter.

/me always likes the use of "should" in such situations ;-)

Post by Linus Torvalds
Do you perhaps dual-boot that laptop? What can sometimes happen is that
PCI resources do not get totally reset over a warm-boot.

No. I only run Debian testing on that machine. As Debian Lenny is getting
close to release there are very few changes in userland ATM. For kernels
I essentially follow git starting with -rc2/3 or so of each new minor.

Post by Linus Torvalds
We've (very occasionally) had situations where PCI resource bugs only
happen when you warm-boot from another OS (generally Windows), or when
you warm-boot from an earlier version of Linux. Exactly because some
firmware didn't fully re-initialize the state of the PCI bus, and
because Linux will try to honor everything that the firmware set up..

Given the above that does not seem relevant.

Linus Torvalds

2008-12-02 18:17:32 UTC

Post by Frans Pop
I could reinstall some intermediate versions and check when that change
got introduced if that would help, or revert to the kernel I was using
before I did the pull today and applied the debug patch (which was plain
rc6 + a few selected bug fix patches that had not yet been merged).

It might be interesting, but probably not very relevant. I don't really
think that the resume problem is related to this.

Post by Frans Pop
It did happen once a few days ago, but with the workarounds I had resume
failures were extremely rare. I'll see if I can work out which boot it
was, but that could well be very tricky as a failed resume leaves no
trace.

If you just save the dmesg before each resume try, you'd have at least the
dmesg of the bootup that leads to failure. It's _likely_ exactly the same
as the ones that don't lead to failures, but..

Post by Frans Pop
The way I "see" the failure is that the wireless LED does not come on.

Quite frankly, from everything I hear, I personally strongly suspect that
it's something timing-related to some driver, and most likely totally
unrelated to any PCI resource allocation in any other way. IOW, _if_ the
resource allocation makes a difference, it's more likely through just
mattering from a timing standpoint.

For example, the bad alignment you get from picking the alignment from the
wrong place (with Rafael's patch or with my debugging patch) results in
this:

- PCI init time:

+pci 0000:02:06.0: BAR 9 0-3ffffff wrong alignment flags 21200 4000000 (0)
+pci 0000:02:06.0: BAR 9 bad alignment 0: [0x000000-0x3ffffff]

- CardBus init time:

+yenta_cardbus 0000:02:06.0: CardBus bridge, secondary bus 0000:03
+yenta_cardbus 0000:02:06.0: IO window: 0x003000-0x0030ff
+yenta_cardbus 0000:02:06.0: IO window: 0x003400-0x0034ff
+yenta_cardbus 0000:02:06.0: PREFETCH window: 0x84400000-0x847fffff
+yenta_cardbus 0000:02:06.0: MEM window: 0x80000000-0x83ffffff

I.e., what happened is that when using the wrong alignment, the PCI bus setup
code would fail to set up the memory windows (BAR 9), but then the CardBus
init, which happens later, will fix that, since it re-does the setup (see
"yenta_allocate_resources()": it will re-do "pci_setup_cardbus()" if the
resources didn't get set up earlier).
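
Why an alignment of 0 ends up as "bad alignment" can be sketched from the check visible in the patch context above ("order = __ffs(align) - 20; if (order > 11)"). The function below is a user-space approximation, not the kernel code: `mem_align_order()` is a hypothetical name, `__builtin_ctzll` (a GCC/Clang builtin) stands in for `__ffs`, and both the explicit zero test and the clamp of small alignments to order 0 are assumptions.

```c
#include <stdint.h>

/*
 * Return the alignment order relative to 1 MB (0 for 1 MB, 1 for 2 MB, ...),
 * or -1 if the alignment cannot be used for sizing a bridge memory window.
 */
static int mem_align_order(uint64_t align)
{
	int order;

	if (align == 0)         /* find-first-set is undefined for 0 */
		return -1;
	order = __builtin_ctzll(align) - 20;
	if (order > 11)         /* more than 2 GB: rejected as bad */
		return -1;
	if (order < 0)          /* assumption: below 1 MB counts as 1 MB */
		order = 0;
	return order;
}
```

In this sketch an alignment of 0x4000000 gives order 6, while 0 is rejected, which corresponds to the "bad alignment 0" message quoted above.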

End result? There should be no real difference, but timing does change.

Now, there are _some_ subtle issues that change depending on whether we do
the "failure case" in yenta_allocate_resources() or not: for example, the
code in yenta_allocate_res() actually knows about the fact that CardBus
memory resources have a 4kB granularity, while the generic PCI code uses
the resource size as the alignment.

The yenta code also tries to potentially find a smaller resource than the
generic PCI resource code did (the generic PCI code just tries to use
2*pci_cardbus_io_size and 2*pci_cardbus_mem_size for sizing, while the
Yenta code tries to aim for BRIDGE_IO/MEM_MAX but knows to try to make
them smaller if it cannot fit).
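
The "aim for the maximum, then shrink until it fits" idea can be sketched as below. This is not the yenta driver code: `pick_window_size()`, `fits_fn` and `fits_in_gap()` are hypothetical names, and halving the size on each failed attempt is an assumption made for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: can a window of 'size' bytes be placed? */
typedef bool (*fits_fn)(uint64_t size, void *ctx);

/*
 * Try 'max' first and halve the size on every failure.
 * Returns the size that fit, or 0 if nothing down to 'min' fits.
 */
static uint64_t pick_window_size(uint64_t max, uint64_t min,
				 fits_fn fits, void *ctx)
{
	uint64_t size;

	for (size = max; size >= min && size > 0; size /= 2)
		if (fits(size, ctx))
			return size;
	return 0;
}

/* Example predicate: the window fits if it is no larger than the free gap. */
static bool fits_in_gap(uint64_t size, void *ctx)
{
	return size <= *(const uint64_t *)ctx;
}
```

For example, with a 5 MB gap, a 64 MB maximum and a 4 kB minimum, the sketch settles on a 4 MB window.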

So the resources can end up being laid out at different points as a
result, and again, that can lead to totally independent bugs showing up
(overlap with random unknown motherboard resources set up by firmware).
But more relevantly, it can simply just cause some timing differences.

Anyway, I'm not at all sure that you and Rafael even necessarily see
exactly the same thing. It doesn't look like my debug patch (which tries
to emulate Rafael's patch in behavior, in addition to adding the debug
printouts) really even matters for you. Because you seemed to be
hibernating ok even without that "break alignment in PCI layer" thing.

Linus

Frans Pop

2008-12-05 08:53:10 UTC

Post by Linus Torvalds
So, it looks like you have MSI enabled in -rc6, and not in -rc3.

I've just compared dmesg for my *desktop* and there I see the following
between 2.6.28-rc6 and current 2.6.28-rc7:

pcieport-driver 0000:00:1c.0: setting latency timer to 64
pcieport-driver 0000:00:1c.0: found MSI capability
-pcieport-driver 0000:00:1c.0: irq 47 for MSI/MSI-X
+pcieport-driver 0000:00:1c.0: irq 511 for MSI/MSI-X
pci_express 0000:00:1c.0:pcie00: allocate port service
pci_express 0000:00:1c.0:pcie02: allocate port service
pcieport-driver 0000:00:1c.2: setting latency timer to 64
pcieport-driver 0000:00:1c.2: found MSI capability
-pcieport-driver 0000:00:1c.2: irq 46 for MSI/MSI-X
+pcieport-driver 0000:00:1c.2: irq 510 for MSI/MSI-X
pci_express 0000:00:1c.2:pcie00: allocate port service
pci_express 0000:00:1c.2:pcie02: allocate port service
pcieport-driver 0000:00:1c.3: setting latency timer to 64
pcieport-driver 0000:00:1c.3: found MSI capability
-pcieport-driver 0000:00:1c.3: irq 45 for MSI/MSI-X
+pcieport-driver 0000:00:1c.3: irq 509 for MSI/MSI-X
pci_express 0000:00:1c.3:pcie00: allocate port service
pci_express 0000:00:1c.3:pcie02: allocate port service
pcieport-driver 0000:00:1c.4: setting latency timer to 64
pcieport-driver 0000:00:1c.4: found MSI capability
-pcieport-driver 0000:00:1c.4: irq 44 for MSI/MSI-X
+pcieport-driver 0000:00:1c.4: irq 508 for MSI/MSI-X
pci_express 0000:00:1c.4:pcie00: allocate port service
pci_express 0000:00:1c.4:pcie02: allocate port service
pcieport-driver 0000:00:1c.5: setting latency timer to 64
pcieport-driver 0000:00:1c.5: found MSI capability
-pcieport-driver 0000:00:1c.5: irq 43 for MSI/MSI-X
+pcieport-driver 0000:00:1c.5: irq 507 for MSI/MSI-X
[...]
ahci 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
-ahci 0000:00:1f.2: irq 42 for MSI/MSI-X
+ahci 0000:00:1f.2: irq 506 for MSI/MSI-X
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
[...]
e1000e 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
e1000e 0000:01:00.0: setting latency timer to 64
-e1000e 0000:01:00.0: irq 41 for MSI/MSI-X
+e1000e 0000:01:00.0: irq 505 for MSI/MSI-X

I.e. very similar to the change between rc-3 and rc-7 you commented on
for my notebook.

Is this really MSI being (not) enabled, or just it using much higher
IRQ numbers?

Yinghai Lu

2008-12-05 09:09:03 UTC

Post by Frans Pop

Post by Linus Torvalds
So, it looks like you have MSI enabled in -rc6, and not in -rc3.

I've just compared dmesg for my *desktop* and there I see the following
pcieport-driver 0000:00:1c.0: setting latency timer to 64
pcieport-driver 0000:00:1c.0: found MSI capability
-pcieport-driver 0000:00:1c.0: irq 47 for MSI/MSI-X
+pcieport-driver 0000:00:1c.0: irq 511 for MSI/MSI-X
pci_express 0000:00:1c.0:pcie00: allocate port service
pci_express 0000:00:1c.0:pcie02: allocate port service
pcieport-driver 0000:00:1c.2: setting latency timer to 64
pcieport-driver 0000:00:1c.2: found MSI capability
-pcieport-driver 0000:00:1c.2: irq 46 for MSI/MSI-X
+pcieport-driver 0000:00:1c.2: irq 510 for MSI/MSI-X
pci_express 0000:00:1c.2:pcie00: allocate port service
pci_express 0000:00:1c.2:pcie02: allocate port service
pcieport-driver 0000:00:1c.3: setting latency timer to 64
pcieport-driver 0000:00:1c.3: found MSI capability
-pcieport-driver 0000:00:1c.3: irq 45 for MSI/MSI-X
+pcieport-driver 0000:00:1c.3: irq 509 for MSI/MSI-X
pci_express 0000:00:1c.3:pcie00: allocate port service
pci_express 0000:00:1c.3:pcie02: allocate port service
pcieport-driver 0000:00:1c.4: setting latency timer to 64
pcieport-driver 0000:00:1c.4: found MSI capability
-pcieport-driver 0000:00:1c.4: irq 44 for MSI/MSI-X
+pcieport-driver 0000:00:1c.4: irq 508 for MSI/MSI-X
pci_express 0000:00:1c.4:pcie00: allocate port service
pci_express 0000:00:1c.4:pcie02: allocate port service
pcieport-driver 0000:00:1c.5: setting latency timer to 64
pcieport-driver 0000:00:1c.5: found MSI capability
-pcieport-driver 0000:00:1c.5: irq 43 for MSI/MSI-X
+pcieport-driver 0000:00:1c.5: irq 507 for MSI/MSI-X
[...]
ahci 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
-ahci 0000:00:1f.2: irq 42 for MSI/MSI-X
+ahci 0000:00:1f.2: irq 506 for MSI/MSI-X
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
[...]
e1000e 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
e1000e 0000:01:00.0: setting latency timer to 64
-e1000e 0000:01:00.0: irq 41 for MSI/MSI-X
+e1000e 0000:01:00.0: irq 505 for MSI/MSI-X
I.e. very similar to the change between rc-3 and rc-7 you commented on
for my notebook.
Is this really MSI being (not) enabled, or just it using much higher
IRQ numbers?

That is because NR_IRQS was increased, and MSI IRQs are counted back from
NR_IRQS - 1.

We have a patch in sparse_irq that will start dynamic IRQs for MSI from
nr_irqs_gsi (the highest IRQ the IOAPIC will use), according to Thomas.

Until that reaches mainline, the effect for you is some wasted space in a
big irq_desc[].
But it is only 450 or so, which is not bad.

YH

Ingo Molnar

2008-12-05 12:20:06 UTC

Post by Frans Pop

Post by Linus Torvalds
So, it looks like you have MSI enabled in -rc6, and not in -rc3.

I've just compared dmesg for my *desktop* and there I see the following
pcieport-driver 0000:00:1c.0: setting latency timer to 64
pcieport-driver 0000:00:1c.0: found MSI capability
-pcieport-driver 0000:00:1c.0: irq 47 for MSI/MSI-X
+pcieport-driver 0000:00:1c.0: irq 511 for MSI/MSI-X
pci_express 0000:00:1c.0:pcie00: allocate port service
pci_express 0000:00:1c.0:pcie02: allocate port service
pcieport-driver 0000:00:1c.2: setting latency timer to 64
pcieport-driver 0000:00:1c.2: found MSI capability
-pcieport-driver 0000:00:1c.2: irq 46 for MSI/MSI-X
+pcieport-driver 0000:00:1c.2: irq 510 for MSI/MSI-X
pci_express 0000:00:1c.2:pcie00: allocate port service
pci_express 0000:00:1c.2:pcie02: allocate port service
pcieport-driver 0000:00:1c.3: setting latency timer to 64
pcieport-driver 0000:00:1c.3: found MSI capability
-pcieport-driver 0000:00:1c.3: irq 45 for MSI/MSI-X
+pcieport-driver 0000:00:1c.3: irq 509 for MSI/MSI-X
pci_express 0000:00:1c.3:pcie00: allocate port service
pci_express 0000:00:1c.3:pcie02: allocate port service
pcieport-driver 0000:00:1c.4: setting latency timer to 64
pcieport-driver 0000:00:1c.4: found MSI capability
-pcieport-driver 0000:00:1c.4: irq 44 for MSI/MSI-X
+pcieport-driver 0000:00:1c.4: irq 508 for MSI/MSI-X
pci_express 0000:00:1c.4:pcie00: allocate port service
pci_express 0000:00:1c.4:pcie02: allocate port service
pcieport-driver 0000:00:1c.5: setting latency timer to 64
pcieport-driver 0000:00:1c.5: found MSI capability
-pcieport-driver 0000:00:1c.5: irq 43 for MSI/MSI-X
+pcieport-driver 0000:00:1c.5: irq 507 for MSI/MSI-X
[...]
ahci 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
-ahci 0000:00:1f.2: irq 42 for MSI/MSI-X
+ahci 0000:00:1f.2: irq 506 for MSI/MSI-X
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
[...]
e1000e 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
e1000e 0000:01:00.0: setting latency timer to 64
-e1000e 0000:01:00.0: irq 41 for MSI/MSI-X
+e1000e 0000:01:00.0: irq 505 for MSI/MSI-X
I.e. very similar to the change between rc-3 and rc-7 you commented on
for my notebook.
Is this really MSI being (not) enabled, or just it using much higher
IRQ numbers?

The MSI IRQ number is a pure software detail that was already non-stable
due to device detection ordering, etc. It's counted back from NR_IRQS,
and the sizing of NR_IRQS changed (upwards) in 2.6.28 - that's what you
see.

The fact that MSI numbers count back from NR_IRQS I consider a (minor)
misbehavior: it should not count down but should count up - and it should
not go to unreasonably high numbers if possible - that is confusing to
users when the sizing of NR_IRQS changes due to a higher NR_CPUS for
example.

A better (because more human-compatible) numbering scheme is to start
counting upwards from the high end of physical interrupt lines. We've
done that in the for-.29 sparseirq tree - there we'll start counting from
256 upwards in essence. (the first 256 IRQs are GSI interrupts)
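
The two numbering schemes can be compared with a small sketch. It is purely illustrative: `struct msi_alloc` and its helpers are hypothetical and not kernel interfaces, and the limits 48 and 512 used in the note below are assumptions picked only because they reproduce the IRQ numbers in the quoted dmesg diff.

```c
/* Hypothetical allocator state that hands out IRQ numbers for MSI. */
struct msi_alloc {
	int next;   /* number returned by the next allocation */
	int step;   /* -1 counts down, +1 counts up */
};

/* Count down from the top of the IRQ space. */
static struct msi_alloc msi_count_down(int nr_irqs)
{
	struct msi_alloc a = { nr_irqs - 1, -1 };
	return a;
}

/* Count up from the first number above the physical interrupt lines. */
static struct msi_alloc msi_count_up(int first_free)
{
	struct msi_alloc a = { first_free, 1 };
	return a;
}

static int msi_next(struct msi_alloc *a)
{
	int irq = a->next;

	a->next += a->step;
	return irq;
}
```

Counting down from an assumed limit of 48 yields 47, 46, 45, ...; raising the limit to 512 shifts the same devices to 511, 510, 509, .... Counting up from 256 gives 256, 257, ... regardless of how large the IRQ space is.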

Maybe it should be counted starting at 1000? That might be an even more
human-friendly numbering scheme.

Ingo

Eric Dumazet

2008-12-05 13:04:21 UTC

Post by Ingo Molnar
The MSI IRQ number is a pure software detail that was already non-stable
due to device detection ordering, etc. It's counted back from NR_IRQS,
and the sizing of NR_IRQS changed (upwards) in 2.6.28 - that's what you
see.
The fact that MSI numbers goes back from NR_IRQs i consider a (minor)
misbehavior: it should not count down but should count up - and it should
not go to unreasonably high numbers if possible - that is confusing to
users when the sizing of NR_IRQS changes to to a higher NR_CPUS for
example.
A better (because more human-compatible) numbering scheme is to start
counting upwards from the high end of physical interrupt lines. We've
done that in the for-.29 sparseirq tree - there we'll start counting from
256 upwards in essence. (the first 256 IRQs are GSI interrupts)
Maybe it should be counted starting at 1000? That might be an even more
human-friendly numbering scheme.

I agree a better scheme would be good, given some distros use NR_CPUS=256

I had a (legacy & buggy) program that segfaulted when parsing /proc/stat
on new kernel (NR_CPUS=64)

$ cat /proc/stat
cpu 386 0 534 38332 1193 0 7 0 0
cpu0 26 0 101 4757 195 0 0 0 0
cpu1 32 0 29 4884 142 0 1 0 0
cpu2 32 0 93 4727 218 0 1 0 0
cpu3 35 0 55 4829 141 0 1 0 0
cpu4 58 0 53 4810 135 0 0 0 0
cpu5 112 0 47 4755 137 0 0 0 0
cpu6 47 0 95 4754 100 0 0 0 0
cpu7 40 0 57 4813 123 0 0 0 0
intr 38370 123 4 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 93 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 =
771 39 4602 0 0 0 0 0 0 0 0 0
ctxt 110659
btime 1228482025
processes 5163
procs_running 1
procs_blocked 1
$ cat /proc/stat|wc
15 2406 4980
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 122 0 1 0 0 0 0 0 IO-APIC-edge timer
1: 1 1 0 0 0 0 1 1 IO-APIC-edge i8042
9: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
12: 0 0 1 1 1 2 0 0 IO-APIC-edge i8042
14: 0 0 0 0 0 0 0 0 IO-APIC-edge ide0
15: 0 0 0 0 0 0 0 0 IO-APIC-edge ide1
16: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb1
17: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb2
18: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3
19: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb4
22: 11 13 11 10 13 13 9 13 IO-APIC-fasteoi uhci_hcd:usb5
2292: 49 44 119 121 111 110 140 142 PCI-MSI-edge eth1
2293: 6 6 4 9 7 10 7 5 PCI-MSI-edge eth0
2294: 637 638 564 557 570 563 549 539 PCI-MSI-edge cciss0
NMI: 0 0 0 0 0 0 0 0 Non-maskable interrupts
LOC: 5053 4982 3564 3552 3069 3256 3541 2363 Local timer interrupts
RES: 62 83 33 78 150 204 75 179 Rescheduling interrupts
CAL: 55 69 66 64 60 60 63 60 Function call interrupts
TLB: 267 192 253 194 440 354 435 387 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
ERR: 0
MIS: 0

H. Peter Anvin

2008-12-05 17:49:41 UTC

Post by Ingo Molnar
The MSI IRQ number is a pure software detail that was already non-stable
due to device detection ordering, etc. It's counted back from NR_IRQS,
and the sizing of NR_IRQS changed (upwards) in 2.6.28 - that's what you
see.
The fact that MSI numbers go back from NR_IRQS I consider a (minor)
misbehavior: it should not count down but should count up - and it should
not go to unreasonably high numbers if possible - that is confusing to
users when the sizing of NR_IRQS changes due to a higher NR_CPUS for
example.
A better (because more human-compatible) numbering scheme is to start
counting upwards from the high end of physical interrupt lines. We've
done that in the for-.29 sparseirq tree - there we'll start counting from
256 upwards in essence. (the first 256 IRQs are GSI interrupts)
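
To make the two numbering schemes concrete, here is a minimal Python sketch. It is not the kernel's allocator: the function names and the NR_IRQS values (512 and 4096) are hypothetical, picked only to show why numbers counted down from the top of the table move when NR_IRQS is resized, while numbers counted up from a fixed base of 256 stay put.

```python
def msi_irqs_counting_down(nr_irqs, count):
    """Hand out `count` IRQ numbers from the top of the table downwards."""
    return [nr_irqs - 1 - i for i in range(count)]


def msi_irqs_counting_up(base, count):
    """Hand out `count` IRQ numbers upwards from a fixed base."""
    return [base + i for i in range(count)]


# Hypothetical table sizes: the same three MSI devices get different
# numbers as soon as NR_IRQS changes.
small = msi_irqs_counting_down(nr_irqs=512, count=3)
large = msi_irqs_counting_down(nr_irqs=4096, count=3)

# Counting up from 256 (just past the GSI range) does not depend on the
# table size at all.
stable = msi_irqs_counting_up(base=256, count=3)

print(small, large, stable)
```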

Although it's kind of broken to limit the number of GSI interrupts to
256, at least for HyperTransport systems where you can have MSIs that
look like IOAPICs (although discovered using a slightly different
mechanism.)

This also means, at least in theory, that IOAPICs could be discovered at
runtime!

Post by Ingo Molnar
Maybe it should be counted starting at 1000? That might be an even more
human-friendly numbering scheme.

Personally I think we should just fix the legacy PIC and southbridge
IOAPIC at zero, and let everything grow up from there. We'll assign
IOAPIC numbers first just by virtue of them being discovered first, but
I don't really see a huge need to partition the space.

Assigning them numbers starting with 1000 does stand out, of course;
however, if we do that then we really need to make sure we don't run
into ugly surprises on HT systems. On the other hand, perhaps none will
ever be built, since newer HT silicon is likely to use MSI-X rather than
MSI-HT just for compatibility.

-hpa

Frans Pop

2008-12-02 04:13:39 UTC

Hi Linus,

Post by Linus Torvalds
Your patch doesn't fix anything, it just hides the bug.
It would be good to hear what resource this is, and where it got set. So
instead of that broken patch that just hides the problem, please try to
debug it with something like

Your debug patch gives me on boot (hp 2510p; x86_64; current git head):

system 00:00: iomem range 0x0-0x9ffff could not be reserved
system 00:00: iomem range 0xe0000-0xfffff could not be reserved
system 00:00: iomem range 0x100000-0x7e7fffff could not be reserved
system 00:0a: ioport range 0x500-0x55f has been reserved
system 00:0a: ioport range 0x800-0x80f has been reserved
system 00:0a: iomem range 0xffb00000-0xffbfffff has been reserved
system 00:0a: iomem range 0xfff00000-0xffffffff has been reserved
system 00:0c: ioport range 0x4d0-0x4d1 has been reserved
system 00:0c: ioport range 0x2f8-0x2ff has been reserved
system 00:0c: ioport range 0x3f8-0x3ff has been reserved
system 00:0c: ioport range 0x1000-0x107f has been reserved
system 00:0c: ioport range 0x1100-0x113f has been reserved
system 00:0c: ioport range 0x1200-0x121f has been reserved
system 00:0c: iomem range 0xf8000000-0xfbffffff has been reserved
system 00:0c: iomem range 0xfec00000-0xfec000ff has been reserved
system 00:0c: iomem range 0xfed20000-0xfed3ffff has been reserved
system 00:0c: iomem range 0xfed45000-0xfed8ffff has been reserved
system 00:0c: iomem range 0xfed90000-0xfed99fff has been reserved
system 00:0d: iomem range 0xcee00-0xcffff has been reserved
system 00:0d: iomem range 0xd2000-0xd3fff has been reserved
system 00:0d: iomem range 0xfeda0000-0xfedbffff has been reserved
system 00:0d: iomem range 0xfee00000-0xfee00fff has been reserved
! pci 0000:02:06.0: BAR 9 0-3ffffff wrong alignment flags 21200 4000000 (0)
! pci 0000:02:06.0: BAR 9 bad alignment 0: [0x000000-0x3ffffff]
pci 0000:00:1c.0: PCI bridge, secondary bus 0000:08
pci 0000:00:1c.0: IO window: disabled
pci 0000:00:1c.0: MEM window: disabled
pci 0000:00:1c.0: PREFETCH window: disabled
pci 0000:00:1c.1: PCI bridge, secondary bus 0000:10
pci 0000:00:1c.1: IO window: disabled
pci 0000:00:1c.1: MEM window: 0xe0000000-0xe00fffff
pci 0000:00:1c.1: PREFETCH window: disabled
pci 0000:02:06.0: CardBus bridge, secondary bus 0000:03
pci 0000:02:06.0: IO window: 0x003000-0x0030ff
pci 0000:02:06.0: IO window: 0x003400-0x0034ff
pci 0000:02:06.0: MEM window: 0x80000000-0x83ffffff
pci 0000:00:1e.0: PCI bridge, secondary bus 0000:02
pci 0000:00:1e.0: IO window: 0x3000-0x3fff
pci 0000:00:1e.0: MEM window: 0xe0100000-0xe03fffff
pci 0000:00:1e.0: PREFETCH window: disabled
pci 0000:00:1c.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
pci 0000:00:1c.0: setting latency timer to 64
pci 0000:00:1c.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
pci 0000:00:1c.1: setting latency timer to 64
pci 0000:00:1e.0: setting latency timer to 64
pci 0000:02:06.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
02:06.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev ba)
Subsystem: Hewlett-Packard Company Device 30c9
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 168
Interrupt: pin A routed to IRQ 18
Region 0: Memory at e0100000 (32-bit, non-prefetchable) [size=4K]
Bus: primary=02, secondary=03, subordinate=06, sec-latency=176
Memory window 0: 84400000-847ff000 (prefetchable)
Memory window 1: 80000000-83fff000
I/O window 0: 00003000-000030ff
I/O window 1: 00003400-000034ff
BridgeCtl: Parity- SERR- ISA- VGA- MAbort- >Reset- 16bInt+ PostWrite+
16-bit legacy interface ports at 0001
Kernel driver in use: yenta_cardbus
Kernel modules: yenta_socket

02:06.1 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 04) (prog-if 10 [OHCI])
Subsystem: Hewlett-Packard Company Device 30c9
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 64 (500ns min, 1000ns max), Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 19
Region 0: Memory at e0101000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=2 PME+
Kernel driver in use: ohci1394
Kernel modules: ohci1394

02:06.2 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 21)
Subsystem: Hewlett-Packard Company Device 30c9
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 64, Cache Line Size: 64 bytes
Interrupt: pin C routed to IRQ 20
Region 0: Memory at e0102000 (32-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=2 PME-
Kernel driver in use: sdhci-pci
Kernel modules: sdhci-pci

02:06.3 System peripheral: Ricoh Co Ltd R5C843 MMC Host Controller (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: ricoh-mmc
Kernel modules: ricoh_mmc

Linus Torvalds

2008-12-02 04:36:40 UTC

Post by Frans Pop
! pci 0000:02:06.0: BAR 9 0-3ffffff wrong alignment flags 21200 4000000 (0)
! pci 0000:02:06.0: BAR 9 bad alignment 0: [0x000000-0x3ffffff]

Hmm. flags 21200 means that IORESOURCE_SIZEALIGN is set, and 'align' is
_correct_ (0x4000000==size), while 'expected_align' is total crap (0).
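
For anyone who wants to check that reading of the debug line, a small Python sketch follows. The bit values are an assumption: they are what I remember from include/linux/ioport.h in kernels of that era, and they are different in newer trees, so verify them against your own source. The helper names are made up for this example.

```python
# Assumed 2.6.28-era bit values from include/linux/ioport.h; newer kernels
# use different values, so check your own tree.
IORESOURCE_FLAGS = {
    0x00000100: "IORESOURCE_IO",
    0x00000200: "IORESOURCE_MEM",
    0x00001000: "IORESOURCE_PREFETCH",
    0x00020000: "IORESOURCE_SIZEALIGN",
    0x00040000: "IORESOURCE_STARTALIGN",
}


def decode_flags(flags):
    """Return the names of the known bits that are set in `flags`."""
    return [name for bit, name in sorted(IORESOURCE_FLAGS.items()) if flags & bit]


def resource_size(start, end):
    """Resource ranges are inclusive, hence the +1."""
    return end - start + 1


# Values taken from the "BAR 9 0-3ffffff ... flags 21200 4000000 (0)" line.
names = decode_flags(0x21200)
size = resource_size(0x0, 0x3FFFFFF)
align = 0x4000000

print(names, hex(size), size == align)
```

Under those assumed bit values, 21200 is a prefetchable memory resource with size alignment, and the reported align equals the resource size.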

So at least on your machine, using the expected_align value (which is
effectively what Rafael's patch does) would definitely be the wrong thing.

Of course, it might then happen to work (because the thing doesn't
actually need that big alignment at all).

Post by Frans Pop
02:06.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev ba)
Subsystem: Hewlett-Packard Company Device 30c9
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 168
Interrupt: pin A routed to IRQ 18
Region 0: Memory at e0100000 (32-bit, non-prefetchable) [size=4K]
Bus: primary=02, secondary=03, subordinate=06, sec-latency=176
Memory window 0: 84400000-847ff000 (prefetchable)
Memory window 1: 80000000-83fff000
I/O window 0: 00003000-000030ff
I/O window 1: 00003400-000034ff

the above all looks fine, and it apparently works for you.

I'd like to see what Rafael's machine does.

Of course, the thing to keep in mind here is that resource alignment is
one of those things that changes how PCI resources get laid out, and two
different layouts may well _both_ be correct - but then one of them may
not work, because there is some hidden SMI resource that we don't know
about, or some other stupid BIOS issue where the BIOS is unhappy about how
we laid things out.

We've had many of those before. And they can easily result in "innocent"
changes (including real fixes) just then exposing problems that were
hidden before due to just a subtly different layout.

That's why I'd like to see what the layout differences are for Rafael with
and without his patch (and also both before and after hibernate/resume).
Maybe both layouts are "correct", but the non-working one can give us a
clue about what may be triggering the problem.

Linus

Rafael J. Wysocki

2008-12-02 22:38:50 UTC

Post by Linus Torvalds
That's why I'd like to see what the layout differences are for Rafael with
and without his patch (and also both before and after hibernate/resume).
Maybe both layouts are "correct", but the non-working one can give us a
clue about what may be triggering the problem.

OK

Sorry for the delay, it took me more time than expected to generate all the
data.

* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 with the
debug patch (appended for completeness):

http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-patched-prep.log

* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 without
the debug patch:

http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-nopatch-prep.log

* diff between the two:

http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7_nopatch-rc7_patched.diff

This part of the diff (+ is the patched one) seems to be particularly
interesting to me, especially the overlapping MEM windows for 0000:00:1e.0 and
0000:03:0b.0 (may that be the reason for the observed failures?):

@@ -276,0 +277,4 @@
+bad alignment flags 21200 4000000 (0)
+pci 0000:03:0b.0: BAR 9 bad alignment 0: [0x000000-0x3ffffff]
+bad alignment flags 20200 4000000 (0)
+pci 0000:03:0b.0: BAR 10 bad alignment 0: [0x000000-0x3ffffff]
@@ -288,2 +291,0 @@
-pci 0000:03:0b.0: PREFETCH window: 0x88000000-0x8bffffff
-pci 0000:03:0b.0: MEM window: 0x8c000000-0x8fffffff
@@ -292,2 +294,2 @@
-pci 0000:00:1e.0: MEM window: 0x8c000000-0x91ffffff
-pci 0000:00:1e.0: PREFETCH window: 0x00000088000000-0x0000008bffffff
+pci 0000:00:1e.0: MEM window: 0x88000000-0x880fffff
+pci 0000:00:1e.0: PREFETCH window: disabled
@@ -312,2 +314,2 @@
-bus: 03 index 1 mmio: [0x8c000000-0x91ffffff]
-bus: 03 index 2 mmio: [0x88000000-0x8bffffff]
+bus: 03 index 1 mmio: [0x88000000-0x880fffff]
+bus: 03 index 2 mmio: [0x0-0x0]
@@ -318,2 +320,2 @@
-bus: 04 index 2 mmio: [0x88000000-0x8bffffff]
-bus: 04 index 3 mmio: [0x8c000000-0x8fffffff]
+bus: 04 index 2 mmio: [0x0-0x3ffffff]
+bus: 04 index 3 mmio: [0x0-0x3ffffff]

* the output of 'lspci -vv'

http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/lspci-vv.txt

* diff between /proc/iomem between 2.6.28-rc7 without and with the patch
(+ is from the patched kernel):

diff -U 0 -r rc7-nopatch//iomem rc7-patched//iomem
--- rc7-nopatch//iomem 2008-12-02 22:57:34.000000000 +0100
+++ rc7-patched//iomem 2008-12-02 23:07:02.000000000 +0100
@@ -7,2 +7,2 @@
- 00200000-00450496 : Kernel code
- 00450497-00603c37 : Kernel data
+ 00200000-00450526 : Kernel code
+ 00450527-00603c37 : Kernel data
@@ -14,15 +14,14 @@
-88000000-8bffffff : PCI Bus 0000:03
- 88000000-8bffffff : PCI CardBus 0000:04
-8c000000-91ffffff : PCI Bus 0000:03
- 8c000000-8fffffff : PCI CardBus 0000:04
- 90000000-90003fff : 0000:03:0b.1
- 90004000-90004fff : 0000:03:0b.0
- 90004000-90004fff : yenta_socket
- 90005000-900057ff : 0000:03:0b.1
- 90005000-900057ff : firewire_ohci
- 90005800-900058ff : 0000:03:0b.3
- 90005800-900058ff : mmc0
-92000000-9207ffff : 0000:00:02.1
-92080000-92083fff : 0000:00:1b.0
- 92080000-92083fff : ICH HD audio
-92084000-92084fff : Intel Flush Page
+88000000-880fffff : PCI Bus 0000:03
+ 88000000-88003fff : 0000:03:0b.1
+ 88004000-88004fff : 0000:03:0b.0
+ 88004000-88004fff : yenta_socket
+ 88005000-880057ff : 0000:03:0b.1
+ 88005000-880057ff : firewire_ohci
+ 88005800-880058ff : 0000:03:0b.3
+ 88005800-880058ff : mmc0
+88100000-8817ffff : 0000:00:02.1
+88180000-88183fff : 0000:00:1b.0
+ 88180000-88183fff : ICH HD audio
+88184000-88184fff : Intel Flush Page
+88400000-887fffff : PCI CardBus 0000:04
+88800000-88bfffff : PCI CardBus 0000:04

All of the files above plus some more data are available from
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/

HTH

Thanks,
Rafael

Linus Torvalds

2008-12-02 23:37:23 UTC

Post by Rafael J. Wysocki
* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 with the
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-patched-prep.log
* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 without
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-nopatch-prep.log

As with Frans, the debug patch seems to make no difference what-so-ever.
Yes, the cardbus regions get allocated differently, but they're fine in
either case, and arguably (exactly as with Frans) the debug patch actually
makes things uglier by actively getting the alignment wrong, and skipping
cardbus setup until later.

That's what your patch (without debugging) should have resulted in too,
except you'd not have seen the "bad alignment flags" printout, of course
(but you probably would have seen the "bad alignment 0: [...]" one).

In fact, I'm starting to think I know why we set up the prefetch window
without the patch, and why we don't with it - because with the patch, the
PCI code ends up never seeing any valid prefetchable region for the
cardbus controller at all, so it never even bothers to try to set up a
prefetchable window.

So in many ways, the debug patch that gets the alignment wrong (on
purpose) is really the inferior one. Plain -rc7 seems to do everything
right.

Post by Rafael J. Wysocki
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7_nopatch-rc7_patched.diff

Gaah. Using "-U 0" is likely the least readable form of diffs there
exists, even if it makes the diff smaller.

Post by Rafael J. Wysocki
This part of the diff (+ is the patched one) seems to be particularly
interesting to me, especially the overlapping MEM windows for 0000:00:1e.0 and

No, those are very much on purpose.

Device 0000:00:1e.0 is the PCI bridge that bridges to PCI bus#3, so the
MEM window is very much intentional - exactly because MMIO goes through
that PCI bridge bus to get to bus#3, which is where the cardbus controller
is.

IOW, the topology is as follows:
- CPU is on the root bus (bus #0)
- device 00:1e.0 is the PCI bridge to bus #3
- device 03:0b.0 is the CardBus bridge (to bus #4)
and any actual cardbus cards (if you had any) would be on that bus #4, so
they'd be named "04:xx.y".
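
The nesting implied by that topology can be checked mechanically. The sketch below is only an illustration in Python: the window values are copied from the dmesg and /proc/iomem excerpts quoted earlier in this thread, while the `contains` helper and the dictionary layout are invented for the example.

```python
def contains(parent, child):
    """True if the inclusive range `child` lies entirely inside `parent`."""
    p_start, p_end = parent
    c_start, c_end = child
    return p_start <= c_start and c_end <= p_end


# Unpatched kernel: windows of the PCI bridge 00:1e.0 and of the CardBus
# bridge 03:0b.0 behind it.
bridge_1e = {"mem": (0x8C000000, 0x91FFFFFF), "prefetch": (0x88000000, 0x8BFFFFFF)}
cardbus_0b = {"mem": (0x8C000000, 0x8FFFFFFF), "prefetch": (0x88000000, 0x8BFFFFFF)}

nested = {kind: contains(bridge_1e[kind], cardbus_0b[kind])
          for kind in ("mem", "prefetch")}

# Patched kernel: the bridge window is 0x88000000-0x880fffff, and one of the
# CardBus ranges from /proc/iomem sits outside it.
patched_nested = contains((0x88000000, 0x880FFFFF), (0x88400000, 0x887FFFFF))

print(nested, patched_nested)
```

In the unpatched layout both CardBus windows sit inside the upstream bridge windows; in the patched layout they do not.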

Now, that PCI bridge 00:1e.0 is a transparent bridge (aka "[Subtractive
decode]" in your lspci output - as compared to the other bridges that say
"[Normal decode]"), which means that you don't actually _have_ to set up
any MMIO window on them, since the bridge will forward _any_ PCI cycles
that don't get responded to by any other PCI device.

But having an explicit window is still generally a good idea, since it
should allow the PCI bridge to pick up the PCI cycles earlier (no need to
wait to see if others respond to it), and possibly allows for better
prefetching behavior. So again, the dmesg and the PCI layout actually
looks _better_ without the hacky patch.

So are you saying that the unpatched kernel still reliably doesn't
hibernate for you, while the (arguably _incorrect_) patched kernel
reliably does hibernate?

Linus

Rafael J. Wysocki

2008-12-03 00:00:17 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 with the
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-patched-prep.log
* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 without
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-nopatch-prep.log

As with Frans, the debug patch seems to make no difference what-so-ever.
Yes, the cardbus regions get allocated differently, but they're fine in
either case, and arguably (exactly as with Frans) the debug patch actually
makes things uglier by actively getting the alignment wrong, and skipping
cardbus setup until later.

Hm, what about (from the copy of /proc/iomem without the patch at
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/rc7-nopatch/iomem):

88000000-8bffffff : PCI Bus 0000:03
88000000-8bffffff : PCI CardBus 0000:04
8c000000-91ffffff : PCI Bus 0000:03
8c000000-8fffffff : PCI CardBus 0000:04

(1) Why are two ranges allocated for 0000:03 without the patch while there is
only one range with the patch:

88000000-880fffff : PCI Bus 0000:03

(copy of the file at
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/rc7-patched/iomem)?
That seems to look like a difference to me.

(2) Why are they so large without the patch while with the patch they are much
smaller (O(2^28) vs O(2^21) if I'm not mistaken)?

(3) Why are they overlapping with the ranges for CardBus 0000:04, although
with the patch they aren't? Is that actually correct at all?
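
The orders of magnitude in (2) can be checked directly from the /proc/iomem ranges quoted above. This is just a Python sketch for the arithmetic; the labels and the helper name are made up, the ranges are the ones from the two iomem files.

```python
import math


def range_size(text):
    """Size in bytes of an inclusive 'start-end' hex range from /proc/iomem."""
    start, end = (int(part, 16) for part in text.split("-"))
    return end - start + 1


ranges = {
    "unpatched PCI Bus 0000:03 (first range)": "88000000-8bffffff",
    "unpatched PCI Bus 0000:03 (second range)": "8c000000-91ffffff",
    "patched PCI Bus 0000:03": "88000000-880fffff",
}

sizes = {label: range_size(text) for label, text in ranges.items()}

for label, size in sizes.items():
    print(f"{label}: {size // (1024 * 1024)} MiB, about 2^{math.log2(size):.1f}")
```

That works out to 64 MiB and 96 MiB without the patch versus 1 MiB with it, so roughly 2^26 against 2^20.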

Post by Linus Torvalds
That's what your patch (without debugging) should have resulted in too,
except you'd not have seen the "bad alignment flags" printout, of course
(but you probably would have seen the "bad alignment 0: [...]" one).

Yes, I saw that:

bad alignment flags 21200 4000000 (0)
pci 0000:03:0b.0: BAR 9 bad alignment 0: [0x000000-0x3ffffff]
bad alignment flags 20200 4000000 (0)
pci 0000:03:0b.0: BAR 10 bad alignment 0: [0x000000-0x3ffffff]

Post by Linus Torvalds
In fact, I'm starting to think I know why we set up the prefetch window
without the patch, and why we don't with it - because with the patch, the
PCI code ends up never seeing any valid prefetchable region for the
cardbus controller at all, so it never even bothers to try to set up a
prefetchable window.
So in many ways, the debug patch that gets the alignment wrong (on
purpose) is really the inferior one. Plain -rc7 seems to do everything
right.

Well, I'm not sure ...

Post by Linus Torvalds

Post by Rafael J. Wysocki
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7_nopatch-rc7_patched.diff

Gaah. Using "-U 0" is likely the least readable form of diffs there
exists, even if it makes the diff smaller.

Sorry.

To me it's more readable this way, but well.

Post by Linus Torvalds

Post by Rafael J. Wysocki
This part of the diff (+ is the patched one) seems to be particularly
interesting to me, especially the overlapping MEM windows for 0000:00:1e.0 and

No, those are very much on purpose.
Device 0000:00:1e.0 is the PCI bridge that bridges to PCI bus#3, so the
MEM window is very much intentional - exactly because MMIO goes through
that PCI bridge bus to get to bus#3, which is where the cardbus controller
is.
- CPU is on the root bus (bus #0)
- device 00:1e.0 is the PCI bridge to bus #3
- device 03:0b.0 is the CardBus bridge (to bus #4)
and any actual cardbus cards (if you had any) would be on that bus #4, so
they'd be named "04:xx.y".
Now, that PCI bridge 00:1e.0 is a transparent bridge (aka "[Subtractive
decode]" in your lspci output - as compared to the other bridges that say
"[Normal decode]"), which means that you don't actually _have_ to set up
any MMIO window on them, since the bridge will forward _any_ PCI cycles
that don't get responded to by any other PCI device.
But having an explicit window is still generally a good idea, since it
should allow the PCI bridge to pick up the PCI cycles earlier (no need to
wait to see if others respond to it), and possibly allows for better
prefetching behavior. So again, the dmesg and the PCI layout actually
looks _better_ without the hacky patch.
So are you saying that the unpatched kernel still reliably doesn't
hibernate for you, while the (arguably _incorrect_) patched kernel
reliably does hibernate?

Yes, I am.

Thanks,
Rafael

Rafael J. Wysocki

2008-12-03 00:05:53 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds
So are you saying that the unpatched kernel still reliably doesn't
hibernate for you, while the (arguably _incorrect_) patched kernel
reliably does hibernate?

Yes, I am.

To be more precise, the patched kernel resumes reliably (100% of the time)
while the plain -rc7 doesn't. Both hibernate (ie. create the image and save
it) reliably.

Thanks,
Rafael

Rafael J. Wysocki

2008-12-03 00:31:41 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds

Post by Rafael J. Wysocki
* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 with the
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-patched-prep.log
* dmesg output including one hibernation-resume cycle from 2.6.28-rc7 without
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/dmesg-rc7-nopatch-prep.log

As with Frans, the debug patch seems to make no difference what-so-ever.
Yes, the cardbus regions get allocated differently, but they're fine in
either case, and arguably (exactly as with Frans) the debug patch actually
makes things uglier by actively getting the alignment wrong, and skipping
cardbus setup until later.

Hm, what about (from the copy of /proc/iomem without the patch at
88000000-8bffffff : PCI Bus 0000:03
88000000-8bffffff : PCI CardBus 0000:04
8c000000-91ffffff : PCI Bus 0000:03
8c000000-8fffffff : PCI CardBus 0000:04
(1) Why two ranges are allocated for 0000:03 without the patch while there is
88000000-880fffff : PCI Bus 0000:03
(copy of the file at
http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc7/rc7-patched/iomem)?
That seems to look like a difference to me.

OK, I see why this happens.

Post by Rafael J. Wysocki
(2) Why are they so large without the patch while with the patch they are much
smaller (O(2^28) vs O(2^21) if I'm not mistaken)?

I don't see why this should happen, though. Even if the prefetch window is
discarded, the MEM window seems to be much wider without the patch.

Post by Rafael J. Wysocki
(3) Why are they overlapping with the ranges for CardBus 0000:04, although
with the patch they aren't? Is that actually correct at all?

OK, I see why this happens too.

Sorry for the noise,
Rafael

Linus Torvalds

2008-12-03 00:41:33 UTC

Post by Rafael J. Wysocki
Hm, what about (from the copy of /proc/iomem without the patch at
88000000-8bffffff : PCI Bus 0000:03
88000000-8bffffff : PCI CardBus 0000:04
8c000000-91ffffff : PCI Bus 0000:03
8c000000-8fffffff : PCI CardBus 0000:04
(1) Why two ranges are allocated for 0000:03 without the patch while there is
88000000-880fffff : PCI Bus 0000:03

So look at the lspci information that has more details for info on this.

Basically, a Cardbus controller (as well as a PCI bridge) has _two_ very
different MMIO windows - one for prefetchable MMIO, and one for
non-prefetchable.

And the windows should match up, although it's always ok to map a
prefetchable region in a non-prefetchable window (although you may lose
some performance, since the upstream will now not be able to prefetch).

And it's not really different, exactly because 00:1e.0 is a transparent
bridge. That "transparency" means that there is an implied decoding window
DESPITE a lack of an explicit one - it may just be a few PCI cycles slower
(or not, depending on how the bridge is implemented. It's probably
technically quite ok for a transparent bridge to ignore its IO/MEM window
contents entirely).

So those "PCI Bus 0000:03" windows are technically irrelevant, except as a
possible performance improvement. The bridge will forward PCI cycles
whether they are there or not. That is what "transparent" means.

Post by Rafael J. Wysocki
(2) Why are they so large without the patch while with the patch they are much
smaller (O(2^28) vs O(2^21) if I'm not mistaken)?

Because without the patch, we'll size the PCI bridge windows according to
the default setup of the CardBus controller, namely:

- two IO regions of size 'pci_cardbus_io_size' (256 bytes)

- one prefetchable and one non-prefetchable MMIO region of size
'pci_cardbus_mem_size' (64MB by default, you can change it with
the 'pci=cbmemsize=xyz' kernel command line parameter but nobody ever
does)

See pci_bus_size_cardbus() in drivers/pci/setup-bus.c for details.

HOWEVER - when the alignment is wrong, we skip all that, and don't set up
the CardBus regions there at all, because the whole "pbus_size_mem()"
thing will just give up.

And then what happens is that we won't set up the CardBus bridge during
PCI bus setup AT ALL, but later, by the Yenta code. And that one will try
to do something else entirely.

[ And yes, if you think that it might be a good idea to try to share the
code and not have two different cardbus sizing logics, I'm not going to
argue against that AT ALL! ]

In the yenta code, we don't try to just have a fixed maximum size,
because that code is literally designed to be a fallback case for when the
PCI sizing fails (ie literally "oops, we had so little space that we
couldn't use the default max size!" case).

For the yenta code, see drivers/pcmcia/yenta_socket.c ("Yenta" is just the
name for the CardBus standard programming model), and notice all the logic
there, see:

- BRIDGE_MEM_MAX/ACC/MIN and BRIDGE_IO_MAX/ACC/MIN for "max",
"acceptable" and "minimum" values respectively.

Realize that the "MAX" value is just 4MB, which is obviously smaller
than the 64MB mentioned above, and that's exactly because this is
assumed to be a "uhhuh, we ran out of space" case.

- See yenta_allocate_res() and the helper functions above it that in
addition try to shrink the window further if it doesn't fit.

- Notice how the Yenta code - unlike the generic PCI code - does _not_
try to allocate parent PCI bridge memory window resources, so if we
fall into this case, we're going to depend on previously set up windows
and/or the fact that a transparent bridge doesn't need them.

So this explains why in one case you'd see a 64M resource, and in another
just a 4M resource.
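
As a rough model of the difference between the two sizing paths, here is a Python sketch. It is a simplification and not the code in drivers/pcmcia/yenta_socket.c: the halving loop, the `free_bytes` stand-in for "what the allocator can give us", and the 16KB minimum are assumptions made for illustration. Only the 64MB default and the 4MB "MAX" come from the explanation above.

```python
KB = 1024
MB = 1024 * KB

PCI_CARDBUS_MEM_SIZE = 64 * MB  # default used by the generic PCI sizing
BRIDGE_MEM_MAX = 4 * MB         # Yenta fallback "MAX"
BRIDGE_MEM_MIN = 16 * KB        # assumed minimum, for illustration only


def pick_window_size(free_bytes, largest, smallest):
    """Halve the requested size until it fits, or give up below `smallest`."""
    size = largest
    while size >= smallest:
        if size <= free_bytes:
            return size
        size //= 2
    return None


plenty = pick_window_size(256 * MB, BRIDGE_MEM_MAX, BRIDGE_MEM_MIN)
tight = pick_window_size(3 * MB, BRIDGE_MEM_MAX, BRIDGE_MEM_MIN)
hopeless = pick_window_size(8 * KB, BRIDGE_MEM_MAX, BRIDGE_MEM_MIN)

print(plenty // MB, tight // MB, hopeless)
```

Even with plenty of room the fallback never asks for more than 4MB, which is why the window is so much smaller than the 64MB one set up by the PCI code. A real allocation also has to satisfy alignment and placement constraints that this sketch ignores.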

_Most_ cardbus cards by far only need a few kB of IOMEM resources, but
there are some crazy people who do CardBus graphics cards and/or video
capture cards, and they really do want 32MB+ of MMIO, which is why we try
to get such a big CardBus MMIO window by default.

As mentioned, you could try to just use

pci=cbmemsize=4M

on the kernel command line, and see if that also hides the bug. It's
entirely possible that your suspend/resume problem is not so much about
the cardbus allocation itself, as about some other memory area that wants
to use it (eg hidden RAM used by the bios SMM code, whatever).
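
For reference, this is roughly how a value such as "4M" turns into a byte count. The Python below is only a sketch of the idea, not the kernel's parser: the function names and the sample command lines (including root=/dev/sda1) are hypothetical, and the 64MB default is the one mentioned above.

```python
def parse_size(text):
    """Parse a size such as '4M' or '64M' into bytes (K, M and G suffixes)."""
    suffixes = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    text = text.strip()
    if text and text[-1].upper() in suffixes:
        return int(text[:-1], 0) * suffixes[text[-1].upper()]
    return int(text, 0)


def cbmemsize_from_cmdline(cmdline, default=64 << 20):
    """Return the CardBus memory window size requested via pci=cbmemsize=."""
    for word in cmdline.split():
        if word.startswith("pci="):
            # pci= takes a comma-separated list of options.
            for opt in word[len("pci="):].split(","):
                if opt.startswith("cbmemsize="):
                    return parse_size(opt[len("cbmemsize="):])
    return default


requested = cbmemsize_from_cmdline("root=/dev/sda1 pci=cbmemsize=4M")
untouched = cbmemsize_from_cmdline("quiet")

print(requested, untouched)
```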

Post by Rafael J. Wysocki
(3) Why are they overlapping with the ranges for CardBus 0000:04, although
with the patch they aren't? Is that actually correct at all?

See the earlier explanation of the topology, and realize the
overlappingness is actually _required_ in order for a PCI bridge to work,
and forward the PCI cycles to the right lower bus.

But then, with a transparent bridge, if nobody else overlaps, it will
do what's called "negative decode", which just means that "if nobody else
said they wanted this PCI cycle, I'll decode it and forward it to the
lower bus", so a transparent bridge doesn't _need_ the overlap.
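
The two ways a bridge can end up forwarding a cycle can be modelled in a few lines. This Python sketch is a toy model of the decode decision only, with invented function names; the address ranges are the ones from the patched /proc/iomem quoted earlier, and real hardware obviously does this with bus timing rather than a lookup.

```python
def claims(window, addr):
    """True if `addr` falls inside the inclusive `window` (or False if none)."""
    return window is not None and window[0] <= addr <= window[1]


def who_decodes(addr, devices, bridge_window, bridge_is_transparent):
    """Return which agent on the bus picks up a cycle for `addr`."""
    if claims(bridge_window, addr):
        return "bridge (positive decode)"
    for name, window in devices.items():
        if claims(window, addr):
            return name
    if bridge_is_transparent:
        # Nobody else wanted the cycle, so the transparent bridge takes it.
        return "bridge (subtractive decode)"
    return None


# Other devices on bus 0 and the 00:1e.0 window, from the patched layout.
bus0 = {"00:02.1": (0x88100000, 0x8817FFFF), "00:1b.0": (0x88180000, 0x88183FFF)}
window_1e = (0x88000000, 0x880FFFFF)

inside = who_decodes(0x88004000, bus0, window_1e, True)
cardbus = who_decodes(0x88400000, bus0, window_1e, True)
other = who_decodes(0x88100000, bus0, window_1e, True)
lost = who_decodes(0x88400000, bus0, window_1e, False)

print(inside, cardbus, other, lost)
```

With the small patched window, an access to the CardBus range at 0x88400000 only reaches bus #3 because the bridge is transparent; a non-transparent bridge with the same window would not forward it.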

Do

/sbin/lspci -t

to understand the relationship, in particular see how device 0:1e.0 is the
one that bridges _to_ that CardBus controller.

See the comments about "topology" in my previous email.

Post by Rafael J. Wysocki

Post by Linus Torvalds
So in many ways, the debug patch that gets the alignment wrong (on
purpose) is really the inferior one. Plain -rc7 seems to do everything
right.

Well, I'm not sure ...

I'm pretty sure.

The fact that you then have hibernation issues is almost certainly due to
something else. Most likely something else that we don't know about from a
resource angle that now got "hidden" by the fact that we programmed the
0:1e.0 bridge to forward to the cardbus controller rather than to some
insane power management chip.

It's why we have all those quirks in drivers/pci/quirks.c. It's very
possible that your chipset is missing some quirk.

Post by Rafael J. Wysocki

Post by Linus Torvalds
So are you saying that the unpatched kernel still reliably doesn't
hibernate for you, while the (arguably _incorrect_) patched kernel
reliably does hibernate?

Yes, I am.

So see what happens if you add

pci=cbmemsize=4M

to make the cardbus allocations smaller. Perhaps that will just change the
allocations enough that now it doesn't cover something else.
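
To put a number on that size, here is a small Python sketch. Both helpers are
hypothetical: `parse_size` only assumes the usual K/M/G suffix convention
(powers of 1024) for values like "4M", and `cardbus_window_size` relies on
CardBus memory limit registers having 4 KiB granularity. The window addresses
are the "Memory window" lines from the lspci output posted later in this
thread.

```python
def parse_size(text):
    """Parse a size such as '4M' (optional K/M/G suffix, powers of 1024)."""
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in units:
        return int(text[:-1], 0) * units[suffix]
    return int(text, 0)


def cardbus_window_size(base, limit):
    """Size in bytes of a CardBus memory window.

    The limit register names the last 4 KiB page of the window, so the
    window runs to the end of that page.
    """
    return (limit | 0xFFF) - base + 1


if __name__ == "__main__":
    requested = parse_size("4M")
    print(f"cbmemsize=4M -> {requested:#x} bytes")
    # "Memory window 0/1" of 03:0b.0 as printed by lspci in this thread.
    for base, limit in ((0x88400000, 0x887FF000), (0x88800000, 0x88BFF000)):
        print(f"{base:#010x}-{limit:#010x}: {cardbus_window_size(base, limit):#x} bytes")
```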

Oh, and please do a "lspci -vvxxx" (as root) too so that I can see the
actual values in your PCI config space. We have two quirks already
triggering for you:

pci 0000:00:1f.0: quirk: region d800-d87f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region eec0-eeff claimed by ICH6 GPIO

but maybe there's another one missing.
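
For reference, the two quirk ranges above can be reproduced from the raw
config space of 00:1f.0. The Python sketch below is an illustration under
stated assumptions: it assumes the ACPI/GPIO/TCO I/O base sits at config
offset 0x40 with a 128-byte region and the GPIO I/O base at 0x48 with a
64-byte region, and the helper name `io_region` is made up. The byte row is
the "40:" line of the 00:1f.0 dump in the lspci -vvxxx output posted later in
this thread.

```python
import struct

# Row "40:" of the 00:1f.0 (LPC bridge) config space dump in this thread.
ROW_40 = bytes.fromhex("01 d8 00 00 80 00 00 00 c1 ee 00 00 10 00 00 00")


def io_region(row, offset_in_row, size):
    """Decode an I/O base register into an inclusive (start, end) port range.

    The register is read as a little-endian dword; the low bits (I/O space
    indicator and reserved bits) are masked off by aligning to `size`.
    """
    (raw,) = struct.unpack_from("<I", row, offset_in_row)
    base = raw & 0xFFFF & ~(size - 1)
    return base, base + size - 1


if __name__ == "__main__":
    # Assumed layout: 0x40 = ACPI/GPIO/TCO base (128 bytes),
    #                 0x48 = GPIO base (64 bytes).
    for label, offset, size in (("ACPI/GPIO/TCO", 0x0, 128), ("GPIO", 0x8, 64)):
        start, end = io_region(ROW_40, offset, size)
        print(f"{label}: {start:04x}-{end:04x}")
```

Under those assumptions the decoded ranges come out as d800-d87f and
eec0-eeff, matching the two quirk messages.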

Linus

Rafael J. Wysocki

2008-12-03 01:22:28 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki

Post by Linus Torvalds
So are you saying that the unpatched kernel still reliably doesn't
hibernate for you, while the (arguably _incorrect_) patched kernel
reliably does hibernate?

Yes, I am.

So see what happens if you add

pci=cbmemsize=4M

to make the cardbus allocations smaller. Perhaps that will just change the
allocations enough that now it doesn't cover something else.

Oh, and please do a "lspci -vvxxx" (as root) too so that I can see the
actual values in your PCI config space. We have two quirks already
triggering for you:

pci 0000:00:1f.0: quirk: region d800-d87f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region eec0-eeff claimed by ICH6 GPIO

but maybe there's another one missing.

Here's the output of 'lspci -vvxxx':

00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML and 945GT Express Memory Controller Hub (rev 03)
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
Latency: 0
Capabilities: [e0] Vendor Specific Information <?>
Kernel driver in use: agpgart-intel
Kernel modules: intel-agp
00: 86 80 a0 27 06 00 90 20 03 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
40: 01 90 d1 fe 01 40 d1 fe 05 00 00 f0 01 80 d1 fe
50: 00 00 30 00 19 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 10 11 11 00 00 11 33 00 ff 03 00 00 80 4a b8 00
a0: 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 09 00 09 71 23 25 0a a1 0e 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 05 00 10 00 00 00

00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
Subsystem: Toshiba America Info Systems Device 0022
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 16
Region 0: Memory at ffc80000 (32-bit, non-prefetchable) [size=512K]
Region 1: I/O ports at cff8 [size=8]
Region 2: Memory at e0000000 (32-bit, prefetchable) [size=256M]
Region 3: Memory at ffc40000 (32-bit, non-prefetchable) [size=256K]
Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
Address: 00000000 Data: 0000
Capabilities: [d0] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Kernel modules: intelfb
00: 86 80 a2 27 07 00 90 00 03 00 00 03 00 00 80 00
10: 00 00 c8 ff f9 cf 00 00 08 00 00 e0 00 00 c4 ff
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 22 00
30: 00 00 00 00 90 00 00 00 00 00 00 00 0a 01 00 00
40: 00 00 00 00 48 00 00 00 09 00 09 71 23 25 0a a1
50: 0e 00 30 00 19 00 00 00 00 00 00 00 00 00 80 7f
60: 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 05 d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 01 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 64 34 00 00 00 00 86 0f 05 00 00 00 00 00

00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
Subsystem: Toshiba America Info Systems Device 0022
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Region 0: Memory at 88100000 (32-bit, non-prefetchable) [disabled] [size=512K]
Capabilities: [d0] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: 86 80 a6 27 00 00 90 00 03 00 80 03 00 00 80 00
10: 00 00 10 88 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 22 00
30: 00 00 00 00 d0 00 00 00 00 00 00 00 00 00 00 00
40: 00 00 00 00 48 00 00 00 09 00 09 71 23 25 0a a1
50: 0e 00 30 00 19 00 00 00 00 00 00 00 00 00 80 7f
60: 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 01 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 64 34 00 00 00 00 86 0f 05 00 00 00 00 00

00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 02)
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 22
Region 0: Memory at 88180000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v1) Root Complex Integrated Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed unknown, Width x0, ASPM unknown, Latency L0 <64ns, L1 <1us
ClockPM- Suprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
Capabilities: [100] Virtual Channel <?>
Capabilities: [130] Root Complex Link <?>
Kernel driver in use: HDA Intel
Kernel modules: snd-hda-intel
00: 86 80 d8 27 06 00 10 00 02 00 03 04 08 00 00 00
10: 04 00 18 88 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 50 00 00 00 00 00 00 00 ff 01 00 00
40: 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 60 42 c8 00 00 00 00 00 00 00 00 00 00 00 00
60: 05 70 80 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 10 00 91 00 00 00 00 00 00 08 10 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 0000b000-0000bfff
Memory behind bridge: ff900000-ff9fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Express (v1) Root Port (Slot-), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 <4us
ClockPM- Suprise- LLActRep+ BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0300c Data: 4161
Capabilities: [90] Subsystem: Toshiba America Info Systems Device 0001
Capabilities: [a0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Virtual Channel <?>
Capabilities: [180] Root Complex Link <?>
Kernel driver in use: pcieport-driver
Kernel modules: shpchp
00: 86 80 d0 27 07 04 10 00 02 00 04 06 08 00 81 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 b0 b0 00 00
20: 90 ff 90 ff f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 00 00
40: 10 80 41 00 c0 0f 00 00 00 00 10 00 11 2c 11 01
50: 40 00 11 30 60 05 00 00 00 00 40 01 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 05 90 01 00 0c 30 e0 fe 61 41 00 00 00 00 00 00
90: 0d a0 00 00 79 11 01 00 00 00 00 00 00 00 00 00
a0: 01 00 02 c8 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 80 00 11 80 00 00 00 00
e0: 00 0f c7 00 06 07 08 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1c.2 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 3 (rev 02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
Memory behind bridge: ff800000-ff8fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Express (v1) Root Port (Slot-), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 <4us
ClockPM- Suprise- LLActRep+ BwNot-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0300c Data: 4169
Capabilities: [90] Subsystem: Toshiba America Info Systems Device 0001
Capabilities: [a0] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Virtual Channel <?>
Capabilities: [180] Root Complex Link <?>
Kernel driver in use: pcieport-driver
Kernel modules: shpchp
00: 86 80 d4 27 07 04 10 00 02 00 04 06 08 00 81 00
10: 00 00 00 00 00 00 00 00 00 02 02 00 f0 00 00 20
20: 80 ff 80 ff f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 03 00 00
40: 10 80 41 00 c0 0f 00 00 00 00 11 00 11 2c 11 03
50: 43 00 11 30 60 05 00 00 00 00 40 01 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 05 90 01 00 0c 30 e0 fe 69 41 00 00 00 00 00 00
90: 0d a0 00 00 79 11 01 00 00 00 00 00 00 00 00 00
a0: 01 00 02 c8 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 80 00 11 80 00 00 00 00
e0: 00 0f c7 00 06 07 08 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI])
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 23
Region 4: I/O ports at afe0 [size=32]
Kernel driver in use: uhci_hcd
Kernel modules: uhci-hcd
00: 86 80 c8 27 05 00 80 02 02 00 03 0c 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: e1 af 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 2f 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 02) (prog-if 00 [UHCI])
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 19
Region 4: I/O ports at af80 [size=32]
Kernel driver in use: uhci_hcd
Kernel modules: uhci-hcd
00: 86 80 c9 27 05 00 80 02 02 00 03 0c 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 81 af 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 02 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 2f 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 02) (prog-if 00 [UHCI])
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin C routed to IRQ 18
Region 4: I/O ports at af60 [size=32]
Kernel driver in use: uhci_hcd
Kernel modules: uhci-hcd
00: 86 80 ca 27 05 00 80 02 02 00 03 0c 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 61 af 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0b 03 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 2f 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 02) (prog-if 00 [UHCI])
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin D routed to IRQ 16
Region 4: I/O ports at af40 [size=32]
Kernel driver in use: uhci_hcd
Kernel modules: uhci-hcd
00: 86 80 cb 27 05 00 80 02 02 00 03 0c 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 41 af 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 04 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 2f 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 02) (prog-if 20 [EHCI])
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 23
Region 0: Memory at ffc3fc00 (32-bit, non-prefetchable) [size=1K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Debug port: BAR=1 offset=00a0
Kernel driver in use: ehci_hcd
Kernel modules: ehci-hcd
00: 86 80 cc 27 06 00 90 02 02 20 03 0c 00 00 00 00
10: 00 fc c3 ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 50 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 58 c2 c9 00 00 00 00 0a 00 a0 20 00 00 00 00
60: 20 20 ff 01 00 00 00 00 01 00 00 00 00 00 08 c0
70: 00 00 d7 3f 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 11 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 aa ff 00 ff 00 ff 00 20 00 00 88
e0: 00 00 00 00 db b6 6d 00 00 00 00 00 00 00 00 00
f0: 00 80 00 09 88 85 40 00 86 0f 02 00 06 17 02 20

00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e2) (prog-if 01 [Subtractive decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Bus: primary=00, secondary=03, subordinate=07, sec-latency=32
I/O behind bridge: 00001000-00001fff
Memory behind bridge: 88000000-880fffff
Secondary status: 66MHz- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [50] Subsystem: Toshiba America Info Systems Device 0001
00: 86 80 48 24 07 00 10 00 e2 01 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 03 07 20 10 10 80 22
20: 00 88 00 88 f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 50 00 00 00 00 00 00 00 ff 00 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 12 00 00
50: 0d 00 00 00 79 11 01 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface Bridge (rev 02)
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Capabilities: [e0] Vendor Specific Information <?>
Kernel modules: iTCO_wdt
00: 86 80 b9 27 07 00 10 02 02 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
40: 01 d8 00 00 80 00 00 00 c1 ee 00 00 10 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 80 80 80 80 91 00 00 00 80 80 80 80 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 2c 81 06 7c 00 00 00 00 00 00 00 00 00
90: e1 01 0c 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 80 06 00 00 00 00 00 00 13 1c 0a 00 00 03 00 00
b0: 00 00 f0 00 00 00 00 00 00 00 01 04 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 33 22 11 00 67 45 00 00 c0 f0 00 00 00 00 00 00
e0: 09 00 0c 10 b4 02 24 17 00 00 00 00 00 00 00 00
f0: 01 c0 d1 fe 00 00 00 00 86 0f 02 00 00 00 00 00

00:1f.2 SATA controller: Intel Corporation 82801GBM/GHM (ICH7 Family) SATA AHCI Controller (rev 02) (prog-if 01 [AHCI 1.0])
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin B routed to IRQ 317
Region 0: I/O ports at af38 [size=8]
Region 1: I/O ports at af34 [size=4]
Region 2: I/O ports at af28 [size=8]
Region 3: I/O ports at af24 [size=4]
Region 4: I/O ports at af10 [size=16]
Region 5: Memory at ffc3f800 (32-bit, non-prefetchable) [size=1K]
Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable+
Address: fee0300c Data: 4179
Capabilities: [70] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: ahci
Kernel modules: ahci
00: 86 80 c5 27 07 04 b0 02 02 01 06 01 00 00 00 00
10: 39 af 00 00 35 af 00 00 29 af 00 00 25 af 00 00
20: 11 af 00 00 00 f8 c3 ff 00 00 00 00 79 11 01 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 0b 02 00 00
40: 07 a3 00 00 00 00 00 00 01 00 01 00 00 00 00 00
50: 00 00 00 00 10 10 04 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 01 00 02 40 00 00 00 00 00 00 00 00 00 00 00 00
80: 05 70 01 00 0c 30 e0 fe 79 41 00 00 00 00 00 00
90: 40 00 11 10 80 01 00 5a 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 86 0f 02 00 00 00 00 00

01:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 316
Region 0: Memory at ff9e0000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at bfe0 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee0100c Data: 41d1
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
ClockPM+ Suprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100] Advanced Error Reporting <?>
Capabilities: [140] Device Serial Number ac-dd-af-ff-ff-7e-1c-00
Kernel driver in use: e1000e
Kernel modules: e1000e
00: 86 80 9a 10 07 04 10 00 00 00 00 02 10 00 00 00
10: 00 00 9e ff 00 00 00 00 e1 bf 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 c8 00 00 00 00 00 00 00 0a 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 01 d0 22 c8 00 20 00 0f
d0: 05 e0 81 00 0c 10 e0 fe 00 00 00 00 d1 41 00 00
e0: 10 00 01 00 c1 0c 00 00 10 28 1a 00 11 1c 07 00
f0: 40 01 11 10 00 00 00 00 00 00 00 00 00 00 00 00

02:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN Network Connection (rev 61)
Subsystem: Intel Corporation Device 1101
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 315
Region 0: Memory at ff8fe000 (64-bit, non-prefetchable) [size=8K]
Capabilities: [c8] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee0100c Data: 41d9
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
ClockPM+ Suprise- LLActRep- BwNot-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100] Advanced Error Reporting <?>
Capabilities: [140] Device Serial Number e7-38-c3-ff-ff-3b-1f-00
Kernel driver in use: iwlagn
Kernel modules: iwlagn
00: 86 80 29 42 06 00 10 00 61 00 80 02 08 00 00 00
10: 04 e0 8f ff 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 01 11
30: 00 00 00 00 c8 00 00 00 00 00 00 00 0b 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 01 d0 23 c8 00 00 00 0d
d0: 05 e0 81 00 0c 10 e0 fe 00 00 00 00 d9 41 00 00
e0: 10 00 01 00 c0 8e 00 00 10 08 19 00 11 1c 07 00
f0: 43 01 11 10 00 00 00 00 00 00 00 00 00 00 00 00

03:0b.0 CardBus bridge: Texas Instruments PCIxx12 Cardbus Controller
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 168, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 21
Region 0: Memory at 88004000 (32-bit, non-prefetchable) [size=4K]
Bus: primary=03, secondary=04, subordinate=07, sec-latency=176
Memory window 0: 88400000-887ff000 (prefetchable)
Memory window 1: 88800000-88bff000
I/O window 0: 00001000-000010ff
I/O window 1: 00001400-000014ff
BridgeCtl: Parity- SERR- ISA- VGA- MAbort- >Reset+ 16bInt+ PostWrite+
16-bit legacy interface ports at 0001
Kernel driver in use: yenta_cardbus
Kernel modules: yenta_socket
00: 4c 10 39 80 07 00 10 02 00 00 07 06 10 a8 82 00
10: 00 40 00 88 a0 00 00 02 03 04 07 b0 00 00 40 88
20: 00 f0 7f 88 00 00 80 88 00 f0 bf 88 00 10 00 00
30: fc 10 00 00 00 14 00 00 fc 14 00 00 ff 01 c0 05
40: 79 11 01 00 01 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 60 d0 00 08 19 00 b3 02 00 00 0f 00 22 10 aa 01
90: c0 04 64 60 00 00 00 00 00 00 00 00 00 00 00 00
a0: 01 00 02 7e 00 00 c0 00 00 00 00 00 00 00 00 00
b0: 00 00 00 08 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 08 3e c5 ef a8 9e 01 c2 00 00 00 00 00 00 00 00

03:0b.1 FireWire (IEEE 1394): Texas Instruments PCIxx12 OHCI Compliant IEEE 1394 Host Controller (prog-if 10 [OHCI])
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 64 (500ns min, 1000ns max), Cache Line Size: 32 bytes
Interrupt: pin B routed to IRQ 20
Region 0: Memory at 88005000 (32-bit, non-prefetchable) [size=2K]
Region 1: Memory at 88000000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [44] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME+
Kernel driver in use: firewire_ohci
Kernel modules: firewire-ohci
00: 4c 10 3a 80 06 00 10 02 00 10 00 0c 08 40 80 00
10: 00 50 00 88 00 00 00 88 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 44 00 00 00 00 00 00 00 ff 02 02 04
40: 00 00 00 00 01 00 02 7e 00 80 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 00
f0: 10 08 00 00 82 00 00 00 79 11 01 00 00 00 00 00

03:0b.3 SD Host controller: Texas Instruments PCIxx12 SDA Standard Compliant SD Host Controller (prog-if 01)
Subsystem: Toshiba America Info Systems Device 0001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 64 (1750ns min, 1000ns max), Cache Line Size: 32 bytes
Interrupt: pin D routed to IRQ 23
Region 0: Memory at 88005800 (32-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: sdhci-pci
Kernel modules: sdhci-pci
00: 4c 10 3c 80 06 00 10 02 00 01 05 08 08 40 80 00
10: 00 58 00 88 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 79 11 01 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 ff 04 07 04
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 01 00 02 7e 00 00 00 00 63 00 00 00 79 11 01 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

I'll run the 'pci=cbmemsize=4M' test tomorrow (need to have some sleep).

Thanks,
Rafael

Linus Torvalds

2008-12-03 02:02:24 UTC

Ok, I'm not finding any documented quirks that would be memory regions,
and in fact it doesn't look like there is even any remotely likely 32-bit
values that might be pointers in your PCI config space that look remotely
like they might be conflicting in the area of MMIO space that we allocate
PCI resources from (ie 0x88000000-0x92000000).

Of course, any odd MMIO regions might be described by some insane model
that doesn't look like an aligned 32-bit value, but that's unlikely.

So I'm still not seeing anything wrong in there.

Post by Rafael J. Wysocki
I'll run the 'pci=cbmemsize=4M' test tomorrow (need to have some sleep).

Sure. It will be interesting to see if it makes any difference.

Linus

Rafael J. Wysocki

2008-12-02 15:49:26 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
r_size = resource_size(r);
/* For bridges size != alignment */
- align = resource_alignment(r);
+ align = (i < PCI_BRIDGE_RESOURCES) ? r_size : r->start;

Hmm. This means that something set the alignment flags incorrectly. The
resource _should_ have IORESOURCE_SIZEALIGN set for a resource with size
alignment, and IORESOURCE_STARTALIGN for one that has start alignment.
Your patch doesn't fix anything, it just hides the bug.

Well, it's just a partial revert of commit
5f17cfce5776c566d64430f543a289e5cfa4538b ("PCI: fix pbus_size_mem() resource
alignment for CardBus controllers").

Post by Linus Torvalds
It would be good to hear what resource this is, and where it got set. So
instead of that broken patch that just hides the problem, please try to
debug it with something like
    resource_size_t expected_align;
    expected_align = (i < PCI_BRIDGE_RESOURCES) ? r_size : r->start;
    align = resource_alignment(r);
    if (align != expected_align) {
        dev_warn(&dev->dev,
            "BAR %d %llx-%llx wrong alignment flags %lx %llx (%llx)\n",
            i,
            (unsigned long long) r->start,
            (unsigned long long) r->end,
            r->flags,
            (unsigned long long) align,
            (unsigned long long) expected_align);
        /* Hacky and wrong, but trying to keep things working */
        align = expected_align;
    }
or something like that. And then we just need to figure out which setup
routine sets the wrong alignment flag.
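
The difference between the two alignment flags discussed above can be sketched as a small model. This is only an illustration, assuming the alignment is derived from the resource flags in the way the quoted text describes; the flag bit values are placeholders rather than the kernel's real constants, and `Resource` and `check_alignment` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Placeholder bit values for illustration; NOT the kernel's real constants.
IORESOURCE_SIZEALIGN = 0x1
IORESOURCE_STARTALIGN = 0x2


@dataclass
class Resource:
    """Hypothetical stand-in for the kernel's struct resource."""
    start: int
    end: int
    flags: int


def resource_size(r: Resource) -> int:
    # The range is inclusive, so a 4K resource has end == start + 0xfff.
    return r.end - r.start + 1


def resource_alignment(r: Resource) -> int:
    """Model: the flags decide where the alignment requirement is stored."""
    kind = r.flags & (IORESOURCE_SIZEALIGN | IORESOURCE_STARTALIGN)
    if kind == IORESOURCE_SIZEALIGN:
        return resource_size(r)   # alignment equals the size
    if kind == IORESOURCE_STARTALIGN:
        return r.start            # alignment is kept in the 'start' field
    return 0                      # no alignment information at all


def check_alignment(r: Resource, is_bridge_resource: bool) -> bool:
    """Mirror of the debug check: flag-derived vs. expected alignment."""
    expected = r.start if is_bridge_resource else resource_size(r)
    return resource_alignment(r) == expected
```

A resource whose flags carry neither bit yields an alignment of 0 in this model, which is the kind of mismatch the proposed `dev_warn()` is meant to expose.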

Yeah, I'll give it a try later today, when I get back from the Uni.

Thanks,
Rafael

Frans Pop

2008-12-02 07:53:03 UTC

(resending to list only as original attempt did not make it)

Post by Rafael J. Wysocki
The symptoms of the breakage are that sometimes the box hangs solid
during resume, sometimes it hangs but can be rebooted by pressing
Alt-SysRq-b, and sometimes it just powers off while resuming. Still,
it resumes correctly in about 75% of cases and that made the issue very
hard to debug.
[Interestingly enough, it was not reproducible with snd_hda_intel
unloaded, which made me think it was related to the driver, but
evidently it wasn't.]

This almost exactly describes problems I've been seeing on my HP 2510p. I
never tried Alt-SysRq-b and have not seen the spontaneous power-off, but
the percentage of correct boots and the relationship with the sound driver are
spot on.

I worked around the sound driver relationship by using a very low power
save setting so I can virtually count on it being disabled when I
suspend.

I'll give your patch a go and report the results. Nice work!

Cheers,
FJP

Rafael J. Wysocki

2008-12-04 01:23:53 UTC

Then, voila!, I'm not able to reproduce the hibernation-resume failure.
(1) The sizes of the allocations and the locations of devices in the memory
address space don't matter here.
(2) The presence and size of the prefetchable memory window don't matter here.
(3) What matters is the presence of non-prefetchable memory window on the
supposedly transparent bridge. Namely, if the window is there, resume from
hibernation occasionally fails (again, the size of the window and the
location of it in the memory address space doesn't seem to matter).

That is indeed rather odd, and very interesting.

So, apparently, on this box (and I guess on Frans' too) we could avoid the
problem if we didn't allocate the non-prefetchable memory window in
pci_bus_size_cardbus(), but I guess that wouldn't be generally correct.

Well, I think that what _would_ be generally correct, and actually pretty
simple, is a rather different approach: just not sizing things behind a
transparent bridge AT ALL, since it really shouldn't matter.
So if the appended patch fixes things for you, I think this might
potentially be the right approach.
NOTE NOTE NOTE! This patch is entirely untested, as usual. I didn't check
that this is necessarily the correct place to test for this, but it
does make sense. IOW, it _feels_ like the right thing.

Yes, it _looks_ sane.

However, even if it fixes things for you, I think we're too late in the
2.6.28 cycle to really apply this.

Absolutely.

So it really does make sense to consider a root bridge and a transparent
one to be equivalent here, since in both cases any bridge windows should
be irrelevant. Which is why I think this patch is interesting.

Also, I would be happy to actually understand _why_ this happens.

100% agreed. I do _not_ see why it should ever matter how we set up a PCI
bridging window - whether prefetchable or not - on a bridge that should be
transparent. It sounds really odd. I'm wondering if there is something
we're missing here.
But apart from the existence of the bridging window, the only thing that
it seems to affect is really just a minor layout issue. So it does seem
like it matters. Odd.

Well, in principle it may be related to the way we handle bridges during
resume, but I really need to read some docs and compare them with the code
before I can say anything more about that. Surely, nothing like this issue has
ever been reported before.

Anyway, thanks for the patch, I'm going to try it tomorrow.

Best,
Rafael

Linus Torvalds

2008-12-04 04:40:58 UTC

Post by Rafael J. Wysocki
Well, in principle it may be related to the way we handle bridges during
resume

Ahh. Yes, that's possible. It's quite possible that the problem isn't
resource allocation per se, but just the bigger complexity at resume time.

This is a hibernate-only issue for you, right? Or is it about regular
suspend-to-ram too?

Post by Rafael J. Wysocki
but I really need to read some docs and compare them with the code
before I can say anything more about that. Surely, nothing like this
issue has ever been reported before.

Well, how stable has hibernate been on that particular machine
historically?

Because the half-revert alignment patch (ie reverting part of 5f17cf) that
made it work for you would actually have been a non-issue in the original
code that was pre-PCI-resource-alignment cleanup (commit 88452565).

So the patch you partially reverted was literally the one that made the
Cardbus allocation work the _same_ way as it did historically, before
88452565. So if the new code breaks for you, then so should the "old" code
(ie 2.6.25 and earlier).

So the "hasn't been reported before" case may well be just another way of
saying "hibernate has never been very reliable".

Linus

Frans Pop

2008-12-04 08:21:03 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
Well, in principle it may be related to the way we handle bridges
during resume

This is a hibernate-only issue for you, right? Or is it about regular
suspend-to-ram too?

It is regular suspend to ram.

I now feel confident in saying that with the debug patch resume from STR
is 100% reliable. And the two workarounds I was using to improve resume
reliability are no longer needed:
- unloading e1000e before suspend
- using aggressive powersave setting on snd_hda_intel to ensure that
sound controller was already sleeping before entering suspend

I don't think we have any theory yet on how those workarounds were helping
to improve things, right?

Post by Linus Torvalds
Well, how stable has hibernate been on that particular machine
historically?

I cannot comment on this as I have not owned this laptop long enough.

One other thing: the ohci1394 "irq 19: nobody cared" issue is definitely
unrelated as I just got one during a resume with the debug patch.

Cheers,
FJP

Rafael J. Wysocki

2008-12-04 22:01:26 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
Well, in principle it may be related to the way we handle bridges during
resume

Ahh. Yes, that's possible. It's quite possible that the problem isn't
resource allocation per se, but just the bigger complexity at resume time.
This is a hibernate-only issue for you, right? Or is it about regular
suspend-to-ram too?

It is suspend to RAM too, from what I can tell.

Post by Linus Torvalds

Post by Rafael J. Wysocki
but I really need to read some docs and compare them with the code
before I can say anything more about that. Surely, nothing like this
issue has ever been reported before.

Well, how stable has hibernate been on that particular machine
historically?
Because the half-revert alignment patch (ie reverting part of 5f17cf) that
made it work for you would actually have been a non-issue in the original
code that was pre-PCI-resource-alignment cleanup (commit 88452565).
So the patch you partially reverted was literally the one that made the
Cardbus allocation work the _same_ way as it did historically, before
88452565. So if the new code breaks for you, then so should the "old" code
(ie 2.6.25 and earlier).
So the "hasn't been reported before" case may well be just another way of
saying "hibernate has never been very reliable".

This is a new box and the kernels earlier than 2.6.27-rc3 have not been tested
on it. So, in fact, it's quite possible that hibernation would fail on it with
earlier kernels as well.

Thanks,
Rafael

Frans Pop

2008-12-04 11:29:43 UTC

Well, I think that what _would_ be generally correct, and actually
pretty simple, is a rather different approach: just not sizing things
behind a transparent bridge AT ALL, since it really shouldn't matter.

I've given your patch a try and the few resumes from STR I've done were
all successful. That's not 100% conclusive yet, but a nice start.
Some info from logs etc. below.

Also, I would be happy to actually understand _why_ this happens.

100% agreed. I do _not_ see why it should ever matter how we set up a
PCI bridging window - whether prefetchable or not - on a bridge that
should be transparent. It sounds really odd. I'm wondering if there is
something we're missing here.

The theory that it is really a resume issue and not a device layout issue
sounds logical. Especially as everything always works correctly after a
normal boot.

Cheers,
FJP

Below info from 3 kernels, all based on 2.6.28-rc7-91:
A) unpatched
B) with the revert/debug patch
C) with the oneliner "ignore transparent bridges" patch

AFAICT all results are probably as expected.

From lspci -vvxxx:
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge
- for A)
I/O behind bridge: 00003000-00003fff
Memory behind bridge: e0100000-e03fffff
Prefetchable memory behind bridge: 0000000080000000-0000000083ffffff
- for B)
I/O behind bridge: 00003000-00003fff
Memory behind bridge: e0100000-e03fffff
- for C)
Memory behind bridge: e0100000-e03fffff
02:06.0 CardBus bridge: Ricoh Co Ltd RL5c476 II
- for A)
Memory window 0: 80000000-83fff000 (prefetchable)
Memory window 1: 84000000-87fff000
I/O window 0: 00003000-000030ff
I/O window 1: 00003400-000034ff
- for B)
Memory window 0: 84400000-847ff000 (prefetchable)
Memory window 1: 80000000-83fff000
I/O window 0: 00003000-000030ff
I/O window 1: 00003400-000034ff
- for C)
Memory window 0: 80000000-83fff000 (prefetchable)
Memory window 1: 84000000-87fff000
I/O window 0: 00001400-000014ff
I/O window 1: 00001800-000018ff

From /proc/iomem:
- for A)
80000000-83ffffff : PCI Bus 0000:02
80000000-83ffffff : PCI CardBus 0000:03
84000000-87ffffff : PCI CardBus 0000:03
88000000-88000fff : Intel Flush Page
- for B)
80000000-83ffffff : PCI CardBus 0000:03
84000000-84000fff : Intel Flush Page
84400000-847fffff : PCI CardBus 0000:03
- for C)
80000000-83ffffff : PCI CardBus 0000:03
84000000-87ffffff : PCI CardBus 0000:03
88000000-88000fff : Intel Flush Page

Attached a tarball with dmesg for all 3 kernels, including a successful
STR/resume cycle for each (not cleaned up this time).
A) 2.6.28-rc7_nofix
B) 2.6.28-rc7_revert
C) 2.6.28-rc7_resumefix

From the last one:
pci 0000:02:06.0: CardBus bridge, secondary bus 0000:03
pci 0000:02:06.0: IO window: 0x001400-0x0014ff
pci 0000:02:06.0: IO window: 0x001800-0x0018ff
pci 0000:02:06.0: PREFETCH window: 0x80000000-0x83ffffff
pci 0000:02:06.0: MEM window: 0x84000000-0x87ffffff
pci 0000:00:1e.0: PCI bridge, secondary bus 0000:02
pci 0000:00:1e.0: IO window: disabled
pci 0000:00:1e.0: MEM window: 0xe0100000-0xe03fffff
pci 0000:00:1e.0: PREFETCH window: disabled
[...]
bus: 02 index 0 mmio: [0x0-0x0]
bus: 02 index 1 mmio: [0xe0100000-0xe03fffff]
bus: 02 index 2 mmio: [0x0-0x0]
bus: 02 index 3 io port: [0x00-0xffff]
bus: 02 index 4 mmio: [0x000000-0xffffffffffffffff]
bus: 03 index 0 io port: [0x1400-0x14ff]
bus: 03 index 1 io port: [0x1800-0x18ff]
bus: 03 index 2 mmio: [0x80000000-0x83ffffff]
bus: 03 index 3 mmio: [0x84000000-0x87ffffff]

Linus Torvalds

2008-12-04 16:17:26 UTC

Post by Frans Pop

Well, I think that what _would_ be generally correct, and actually
pretty simple, is a rather different approach: just not sizing things
behind a transparent bridge AT ALL, since it really shouldn't matter.

I've given your patch a try and the few resumes from STR I've done were
all successful. That's not 100% conclusive yet, but a nice start.
Some info from logs etc. below.

Ok, but I thought you had a hard time reproducing this _anyway_, even with
just plain -rc7. No?

That said, of the various patches posted, the "don't bother allocating
bridging windows for transparent bridges" one is not just the simplest,
but the only one that actually makes sense so far.

So I'm happy it's apparently working for you, I'm just wondering about
whether your success means a lot. It seems that Rafael is the one who had
more failures?

Post by Frans Pop

Also, I would be happy to actually understand _why_ this happens.

100% agreed. I do _not_ see why it should ever matter how we set up a
PCI bridging window - whether prefetchable or not - on a bridge that
should be transparent. It sounds really odd. I'm wondering if there is
something we're missing here.

The theory that it is really a resume issue and not a device layout issue
sounds logical. Especially as everything always works correctly after a
normal boot.

Yes, that does sound like a convincing argument. Usually real PCI resource
clashes result in some kind of run-time problems, and wouldn't necessarily
be suspend-specific per se.

That said, suspend/resume does a lot of unusual things, so it could still
be some odd PCI resource clash that only triggers problems in the
suspend/resume case. But since the exact layouts and the sizing of the
resources doesn't really seem to matter, a simple PCI resource clash seems
rather unlikely.

So some kind of resume-time ordering or timing issue does seem like the
most likely thing. But that still leaves us not knowing what the real
_root_ cause of this all is - very irritating. Even if not allocating the
unnecessary bridging windows "fixes" things, it would be really really
good to know exactly what it is that causes problems.

Post by Frans Pop
A) unpatched
B) with the revert/debug patch
C) with the oneliner "ignore transparent bridges" patch
AFAICT all results are probably as expected.
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge
- for A)
I/O behind bridge: 00003000-00003fff
Memory behind bridge: e0100000-e03fffff
Prefetchable memory behind bridge: 0000000080000000-0000000083ffffff
- for B)
I/O behind bridge: 00003000-00003fff
Memory behind bridge: e0100000-e03fffff
- for C)
Memory behind bridge: e0100000-e03fffff

And this all makes total sense. The e0100000-e03fffff MMIO bridge range is
apparently set up by the firmware, which is why it shows up in all cases.
And the (A) case has that prefetchable memory range, because that's the
only case that finds - and cares about - the prefetch window for the
CardBus controller.

And both (A) and (B) have the IO bridging window, because regardless of
whether we see a valid CardBus prefetchable memory window with good
alignment, we'll always see the IO ports, so we'll try to allocate that
bridging window, except in (C) when we decide that due to the transparent
nature, we simply don't care.

So the PCI resources make sense in all three cases, and we understand
those. The differences in the actual Cardbus ranges also all make sense.
So it all still boils down to the PCI layer doing everything right in
_all_ cases, just making slightly different - but all valid - choices
depending on essentially random details (eg the revert/debug patch case
the "random detail" is just enabling a small incorrect alignment).

IOW, it really doesn't look like a PCI resource allocator bug. Quite the
reverse, I'd say that in the end this whole thread points out just how
robust the whole PCI and cardbus resource allocation is, with the code
really very gracefully just adjusting in a sane manner to all these
different cases.

Of course, none of that helps us with any kind of idea of what the real
problem is. Device ordering bug in setting up PCI resources at resume?
Perhaps just a plain bug in PCI bridge resume code (even when you resume
things in the right order)?

And I still worry that perhaps it's just a timing bug, where having a PCI
bridging window changes timing of various PCI accesses, and the _real_ bug
is actually in the sound card or ethernet driver resume, which happens to
work with one timing and not with another.

Since it's apparently STR, has anybody gotten _anything_ sane out of
trying to enable PM_TRACE_RTC, and then doing that

echo 1 > /sys/power/pm_trace

because even with the (very limited) set of standard trace-points, it
should still be able to tell which device we were trying to resume last in
the failure case. Maybe that gives some hint?

Linus

Frans Pop

2008-12-04 18:00:25 UTC

Post by Linus Torvalds

Post by Frans Pop
I've given your patch a try and the few resumes from STR I've done
were all successful. That's not 100% conclusive yet, but a nice
start. Some info from logs etc. below.

Ok, but I thought you had a hard time reproducing this _anyway_, even
with just plain -rc7. No?

Well, I had a failure rate of about 1 in 5-10 resumes originally.
See: http://bugzilla.kernel.org/show_bug.cgi?id=11545

Then I found the 2 workarounds and *with those in place* I got almost 100%
reliable resumes. Now I've removed those workarounds and with either the
revert or your oneliner I still get 100% success.
From my PoV that is a very definite improvement: the machine now "feels" a
hell of a lot more reliable for critical use.

So I _could_ reproduce it reliably given enough suspend/resume cycles.
But I guess this does support your suspicion that it may be a timing
issue: if the timing happens to be right, the resume succeeds; if it's
wrong I get a dead box.

Post by Linus Torvalds
Since it's apparently STR, has anybody gotten _anything_ sane out of
trying to enable PM_TRACE_RTC, and then doing that
echo 1 > /sys/power/pm_trace

I did try that at the beginning. That's how I ended up removing e1000e
before suspend. See http://bugzilla.kernel.org/show_bug.cgi?id=11545.

My next hint was that Matthew Garrett, who has the same notebook, was
surprised at my resume problems as he did not see them. So I did a
comparison of our kernel configs and made some changes to mine. From
that I found that a very low value for SND_HDA_POWER_SAVE_DEFAULT (5)
reduced the failure rate to practically zero.

At some point I tried keeping e1000e loaded for a bit, but that quickly
gave me a failure again, so I started removing it again during suspend.

So I did have some data, but as I got no response on my BR I had no idea
where to go from there. I was really very happy to see Rafael's mail as
his description almost exactly matched what I had been seeing.

I'd be happy to run with unpatched kernels for a while and do some more
pm_traces, but only if someone is going to follow up and interpret the
results for me or provide suggestions for targeted additional debugging.

Cheers,
FJP

Linus Torvalds

2008-12-04 20:03:47 UTC

Ingo, Len, can you check the end of the email about the apparent
very-early interrupt issue? Can we get into acpi_ec_gpe_handler() without
interrupts being enabled some way?

Greg, Jesse, can you think about and look at the USB PCI resume ordering?

Post by Frans Pop
Well, I had a failure rate of about 1 in 5-10 resumes originally.
See: http://bugzilla.kernel.org/show_bug.cgi?id=11545

Ok, very interesting. Thanks for the pointer.

Post by Frans Pop
Then I found the 2 workarounds and *with those in place* I got almost 100%
reliable resumes. Now I've removed those workarounds and with either the
revert or your oneliner I still get 100% success.
From my PoV that is a very definite improvement: the machine now "feels" a
hell of a lot more reliable for critical use.

Sure. I'd love to apply the "transparency fix" (the last patch), but my
main worry is that while it feels really right, and it fixes things for
you and Rafael, these kinds of changes historically _always_ end up biting
us. Because even if it's 100% the correct thing to do, it will show up
some problem for somebody else just because we're really unlucky, and it
just ends up exposing some totally unrelated bug.

Exactly the same way that this whole PCI resource setting thing was 100%
correct in the first place, but exposed some other bug.

Post by Frans Pop
So I _could_ reproduce it reliably given enough suspend/resume cycles.
But I guess this does support your suspicion that it may be a timing
issue: if the timing happens to be right, the resume succeeds; if it's
wrong I get a dead box.

Yes.

Post by Frans Pop
I did try that at the beginning. That's how I ended up removing e1000e
before suspend. See http://bugzilla.kernel.org/show_bug.cgi?id=11545.

What is interesting is that it's apparently not reliably that e1000e thing
that is being resumed when it fails. You have another report there that
says that it's a match on PNP0C0A.

Of course, the way that hash works, we only have a few bits to create it,
and sometimes you just get false positives (there's not a whole lot you
can reliably do with about 24 bits of information ;(
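
The false-positive problem mentioned above can be illustrated with a toy model. This is a sketch under stated assumptions: the device name is reduced to a small bucket number, the modulus 1009 is chosen purely for illustration, and CRC32 is a stand-in hash rather than the function the kernel uses. `device_hash` and `find_collisions` are hypothetical helpers.

```python
import zlib
from collections import defaultdict

DEVHASH_BUCKETS = 1009   # assumed modulus, for illustration only


def device_hash(name: str) -> int:
    """Stand-in hash (CRC32 modulo a small prime), NOT the kernel's."""
    return zlib.crc32(name.encode()) % DEVHASH_BUCKETS


def find_collisions(names):
    """Group names by hash value and keep only buckets with several names."""
    buckets = defaultdict(list)
    for name in names:
        buckets[device_hash(name)].append(name)
    return {h: ns for h, ns in buckets.items() if len(ns) > 1}


if __name__ == "__main__":
    # More names than buckets, so at least one collision must exist.
    names = [f"dev{i}" for i in range(1100)]
    collisions = find_collisions(names)
    print(f"{len(collisions)} hash values are shared by several device names")
```

With so few distinct hash values, a "match" reported after a failed resume only narrows the culprit down to the set of devices sharing that value, which is why several traces of the same lockup are more convincing than one.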

So it would be interesting to get a few more debug traces of that lockup.

HOWEVER. Having now looked through your fuller dmesg output even for the
_successful_ case, I actually find a few things that are a bit worrying.

Looking at the unpatched dmesg, since that's the most interesting one
(since it's the one that should hopefully show behaviour that is
potentially triggering the problem), I see two worrying things:

pci 0000:00:1e.0: restoring config space at offset 0x9 (was 0x10001, writing 0x83f18001)
pci 0000:00:1e.0: restoring config space at offset 0x8 (was 0x0, writing 0xe030e010)
pci 0000:00:1e.0: restoring config space at offset 0x7 (was 0x228000f0, writing 0x22803030)
pci 0000:00:1e.0: restoring config space at offset 0x1 (was 0x100007, writing 0x100107)
pci 0000:00:1e.0: setting latency timer to 64

Those "offset 0x9/0x8/0x7" entries are the PCI bridge window prefetchable
memory, non-prefetchable memory, and IO bases respectively (when it says
'0x9', it's counting in 32-bit dwords, so it's really config space offset
0x24: PCI_PREF_MEMORY_BASE).

Now, that really means:
- 0x9 prefetchable window: was disabled, is now 0x80000000-0x83ffffff
- 0x8 nonprefetch window: was disabled, is now 0xe0100000-0xe03fffff
- 0x7 IO window: was disabled, is now 0x3000-0x3fff
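
The decoding in the list above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the standard PCI-to-PCI bridge register layout (16-bit base and limit fields packed into one dword for the memory windows, 8-bit fields for the IO window) and ignoring the upper 32 bits of a 64-bit prefetchable window and 32-bit IO decoding; the function names are made up for this example.

```python
from typing import Tuple


def dword_index_to_offset(index: int) -> int:
    """The resume log counts 32-bit dwords; config offsets are in bytes."""
    return index * 4


def decode_mem_window(value: int) -> Tuple[int, int, bool]:
    """Decode a memory base/limit dword (byte offset 0x20 or 0x24).

    The low 16 bits hold the base and the high 16 bits the limit; in each
    half the top 12 bits are address bits 31..20.
    Returns (start, end, enabled).
    """
    base = (value & 0xFFF0) << 16
    limit = (((value >> 16) & 0xFFF0) << 16) | 0xFFFFF
    return base, limit, base <= limit


def decode_io_window(value: int) -> Tuple[int, int, bool]:
    """Decode the 16-bit IO window from the dword at byte offset 0x1c.

    Byte 0 is the IO base and byte 1 the IO limit; the top nibble of each
    gives address bits 15..12. Returns (start, end, enabled).
    """
    base = (value & 0xF0) << 8
    limit = (((value >> 8) & 0xF0) << 8) | 0xFFF
    return base, limit, base <= limit


if __name__ == "__main__":
    for idx, val in ((0x9, 0x83F18001), (0x8, 0xE030E010)):
        start, end, on = decode_mem_window(val)
        print(f"offset {dword_index_to_offset(idx):#x}: {start:#x}-{end:#x} enabled={on}")
    start, end, on = decode_io_window(0x22803030)
    print(f"offset {dword_index_to_offset(0x7):#x}: {start:#x}-{end:#x} enabled={on}")
```

A window counts as disabled in this model when the decoded base lies above the decoded limit, as with the IO value 0x228000f0 that the log shows before the restore.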

That all looks correct, BUT the IO base reprogramming actually worries me.
It's correct only because it's a 16-bit range. For a 32-bit range (which
is not supported on an x86 platform, since IO ports are always just 16
bits), the ordering would be very different, and we'd have to make sure
that we write the upper bits in a special order to avoid problems.

With the "revert fix", the sequence is essentially the same, just
different values:

pci 0000:00:1e.0: restoring config space at offset 0x9 (was 0x10001, writing 0x1fff1)
pci 0000:00:1e.0: restoring config space at offset 0x8 (was 0x0, writing 0xe030e010)
pci 0000:00:1e.0: restoring config space at offset 0x7 (was 0x228000f0, writing 0x22803030)
pci 0000:00:1e.0: restoring config space at offset 0x1 (was 0x100007, writing 0x100107)
pci 0000:00:1e.0: setting latency timer to 64

the difference is that we now disable the prefetchable window (since we
never allocated it), we just disable it with a different value than the
one the BIOS used (0x10001 _could_ be imagined to mean "bridge the range
0x00000000-0x0000ffff", while the kernel disables the IO region by setting
the lower range higher than the high range, which is why you see those
fff's there).

But the "revert fix" still has the IO range restore. It's still correct in
this case (no 32-bit IO bits set), but still has the 32-bit range worry
for non-x86.

With the "fix transparent bridges", the sequence is different:

pci 0000:00:1e.0: restoring config space at offset 0x9 (was 0x10001, writing 0x1fff1)
pci 0000:00:1e.0: restoring config space at offset 0x8 (was 0x0, writing 0xe030e010)
pci 0000:00:1e.0: restoring config space at offset 0x1 (was 0x100007, writing 0x100107)
pci 0000:00:1e.0: setting latency timer to 64

ie now we don't even touch the IO window, since we agree with the BIOS on
how to disable it (ie the kernel also disables it by writing 0x00f0 to the
low 16 bits).

Anyway, the bridge reprogramming itself all looks correct, and the only
worry really is that I'm not sure our PCI resume code really strictly
speaking does the right thing for 32-bit IO resources for other non-x86
architectures.

The "transparent bridge" fix results in the simplest resume sequence for
that bridge, but the "revert" fix really makes almost no difference at
all, and again should not matter in the _least_ from a resume standpoint!

So there is a _small_ worry there, but it's not relevant for PC platforms,
and in no case does it look like the programming of the transparent bridge
should matter in any way what-so-ever for the resume code.

In many ways the bigger worry is actually in the totally unrelated USB
UHCI and EHCI drivers that resume _before_ the bridge does:

uhci_hcd 0000:00:1d.2: enabling device (0000 -> 0001)
uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
uhci_hcd 0000:00:1d.2: setting latency timer to 64
uhci_hcd 0000:00:1d.2: restoring config space at offset 0xf (was 0x300, writing 0x30b)
uhci_hcd 0000:00:1d.2: restoring config space at offset 0x8 (was 0x1, writing 0x2101)
usb usb7: root hub lost power or was reset
ehci_hcd 0000:00:1d.7: enabling device (0000 -> 0002)
ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 20 (level, low) -> IRQ 20
ehci_hcd 0000:00:1d.7: setting latency timer to 64
ehci_hcd 0000:00:1d.7: restoring config space at offset 0xf (was 0x100, writing 0x10a)
ehci_hcd 0000:00:1d.7: restoring config space at offset 0x4 (was 0x0, writing 0xe0648000)

and the worry I have here is that we actually enable the device _before_
we've restored the BAR information. That sounds very iffy. It sounds
doubly iffy in the 'resume from hibernate' case, where we are going to
have an already-set-up PCI bus and the config space values are going to
all be live as we reprogram them.

That "restoring config space at offset 0x8" thing is where we restore
the BAR (dword 0x8 = offset 0x20 = PCI_BASE_ADDRESS_4), and we're changing
it from 0x1 to 0x2101, with the IO BAR enabled. In this case, the old
value meant that the BAR started out disabled, but hibernate would have
been different.
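
For readers following the numbers, the BAR values in the log can be decoded as follows. This is a small sketch assuming the usual 32-bit BAR encoding, where bit 0 distinguishes I/O space from memory space and the remaining low bits carry type information; `decode_bar` is a hypothetical helper, not a kernel function.

```python
def decode_bar(value: int) -> dict:
    """Decode a 32-bit PCI base address register value."""
    if value & 0x1:
        # Bit 0 set: I/O space BAR, address in bits 31..2.
        return {"space": "io", "address": value & 0xFFFFFFFC}
    return {
        "space": "mem",
        "address": value & 0xFFFFFFF0,            # address in bits 31..4
        "is_64bit": ((value >> 1) & 0x3) == 0x2,  # type field, bits 2..1
        "prefetchable": bool(value & 0x8),        # bit 3
    }


if __name__ == "__main__":
    # UHCI: dword 0x8 is byte offset 0x20, i.e. the fifth BAR.
    print(hex(0x8 * 4), decode_bar(0x2101))
    # EHCI: dword 0x4 is byte offset 0x10, i.e. the first BAR.
    print(hex(0x4 * 4), decode_bar(0xE0648000))
```

Read this way, the UHCI write of 0x2101 places an I/O BAR at port 0x2100, and the EHCI write of 0xe0648000 places a 32-bit non-prefetchable memory BAR at that address.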

So I'd _much_ rather have seen the sequence have the BAR restore sequence
be something like

uhci_hcd 0000:00:1d.2: restoring config space at offset 0xf (was 0x300, writing 0x30b)
uhci_hcd 0000:00:1d.2: restoring config space at offset 0x8 (was 0x1, writing 0x2101)
uhci_hcd 0000:00:1d.2: enabling device (0000 -> 0001)
uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
uhci_hcd 0000:00:1d.2: setting latency timer to 64

instead. Possibly even with an explicit disable of the memory/IO/busmaster
bits before the whole sequence.

That said, I don't think this is the cause of the problem either. For one
thing, the USB resume happens after the e1000e resume, so since you've
apparently seen it hang in the e1000e driver, the real problem must have
occurred earlier. And e1000e is resumed not just before USB on your
machine, but even before the PCI bridge is (since it's on the root bus).

For another, in your case, the BAR really was disabled, so there was
nothing "live" going on here anyway.

The third thing that worries me is the _very_ early occurrence of

ACPI: Waking up from system sleep state S3
APIC error on CPU1: 00(40)
ACPI: EC: non-query interrupt received, switching to interrupt mode

Now, that "APIC error" thing is worrisome. It's worrisome for multiple
reasons:

- errors are never good (0x40 means "received illegal vector", whatever
caused _that_)

- more importantly, it seems to imply that interrupts are enabled on
CPU1, and they sure as hell shouldn't be enabled at this stage!

Do we perhaps have a SMP resume bug where we resume the other CPU's
with interrupts enabled?

- the "ACPI: EC: non-query interrupt received, switching to interrupt
mode" thing is from ACPI, and _also_ implies that interrupts are on.

Why are interrupts enabled that early? I really don't like seeing
interrupts enabled before we've even done the basic PCI resume.
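
For reference, the 0x40 in that log line can be decoded with a small table of the local APIC Error Status Register bits. This is only a sketch: the bit names follow the usual x86 local APIC documentation, and the two helper functions are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Meaning of each bit in the low byte of the local APIC ESR. */
static const char *const apic_esr_bit_names[8] = {
	"send checksum error",
	"receive checksum error",
	"send accept error",
	"receive accept error",
	"redirectable IPI",
	"send illegal vector",
	"received illegal vector",
	"illegal register address",
};

/* Hypothetical helper: index of the lowest set ESR bit, or -1 if none. */
static int apic_esr_lowest_bit(uint8_t esr)
{
	for (int bit = 0; bit < 8; bit++)
		if (esr & (1u << bit))
			return bit;
	return -1;
}

/* Hypothetical helper: name for a bit index in the range 0..7. */
static const char *apic_esr_bit_name(int bit)
{
	assert(bit >= 0 && bit < 8);
	return apic_esr_bit_names[bit];
}
```

Feeding it 0x40 yields bit 6, "received illegal vector", which matches the reading given in the first bullet.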

I'd really like to resume the other CPU's much later (last in the whole
sequence, long after we've set up devices), but the f'ing ACPI rules seem
to be against that. And maybe some setup actually needs the CPU's alive to
act as a bridge for IO (eg with HT or CSI).

And interrupts happening at random times could certainly cause
"interesting" and timing-dependent resume problems. Hmm...

The problem with the whole interrupt issue is that it seems to have
nothing what-so-ever to do with the programming of that bridge in any way,
shape or form. The timing issues/problems it could introduce should be
totally irrelevant to anything else.

Post by Frans Pop
I'd be happy to run with unpatched kernels for a while and do some more
pm_traces, but only if someone is going to follow up and interpret the
results for me or provide suggestions for targeted additional debugging.

I don't really have any better patches to try right now. But as usual,
from everything I can see, the actual bridge setup itself should be
totally irrelevant to the problem you see. Which is really irritating,
since the only patches we _do_ have that seem to matter are purely about
that bridge resource that shouldn't matter at all!

Linus

Linus Torvalds

2008-12-05 21:26:53 UTC

Post by Linus Torvalds
The third thing that worries me is the _very_ early occurrence of
ACPI: Waking up from system sleep state S3
APIC error on CPU1: 00(40)
ACPI: EC: non-query interrupt received, switching to interrupt mode
Now, that "APIC error" thing is worrisome. It's worrisome for multiple
reasons:
- errors are never good (0x40 means "received illegal vector", whatever
caused _that_)
- more importantly, it seems to imply that interrupts are enabled on
CPU1, and they sure as hell shouldn't be enabled at this stage!
Do we perhaps have a SMP resume bug where we resume the other CPU's
with interrupts enabled?
- the "ACPI: EC: non-query interrupt received, switching to interrupt
mode" thing is from ACPI, and _also_ implies that interrupts are on.
Why are interrupts enabled that early? I really don't like seeing
interrupts enabled before we've even done the basic PCI resume.

Oh, I finally started looking more at this.

It's because the PCI layer uses the late resume for resuming. Which is
horrid. It really shouldn't. Resuming your device after interrupts were
enabled really sounds like a disaster. *Especially* if the device was
active before, either because of hibernation or simply because firmware
pre-initialized it to some "live" state (which could easily happen with
ethernet in particular).

I do wonder why the PCI layer wants to resume things so late. It sounds
totally insane to do things like resume the PCI bridge setup long *after*
you have resumed other devices early. So by doing the default resume late,
it just means that nobody can possibly use the early resume, and now
everybody needs to resume everything with interrupts already going full
blast.

IOW, the _sane_ thing would be to do something like the following patch
does, namely:

- if the driver has a suspend or suspend_late function, _only_ call that
(whether legacy or not)

- otherwise, do the default suspend/resume late/early with interrupts
disabled.

which means that by default, we'll do all save-restore of the PCI state
as close to the actual CPU suspend event as possible.
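
The rule in the two bullets above can be modelled in userspace as follows. This is a sketch of the decision logic in the legacy paths of the patch below, not kernel code: struct model_pci_driver, model_noop() and the two predicate functions are names made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the legacy PM callbacks of a PCI driver (illustrative only). */
struct model_pci_driver {
	int (*suspend)(void);
	int (*suspend_late)(void);
	int (*resume)(void);
	int (*resume_early)(void);
};

/* Placeholder callback used to mark "the driver provides this hook". */
static int model_noop(void)
{
	return 0;
}

/*
 * Should the core's default state save run in the late (interrupts off)
 * suspend phase? Only if the driver supplies neither suspend hook.
 */
static bool default_suspend_runs_late(const struct model_pci_driver *drv)
{
	if (drv && drv->suspend_late)
		return false;
	return !drv || !drv->suspend;
}

/* Mirror image for resume: default restore runs early if no resume hooks. */
static bool default_resume_runs_early(const struct model_pci_driver *drv)
{
	if (drv && drv->resume_early)
		return false;
	return !drv || !drv->resume;
}
```

A device with no driver at all (drv == NULL) gets the default handling with interrupts disabled, while a driver that provides any one of the hooks for a given direction is left entirely to its own code for that direction.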

Of course, hibernate probably depends on ->suspend() saving state, which
it won't. Again, if that's the case, then that's just hibernate (again)
being totally fundamentally broken, and messing with STR functions.

Rafael?

I have neither tested the patch nor even tried to compile it - it's meant
to be an example and get people thinking about this, rather than anything
else.

Linus
---
drivers/pci/pci-driver.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index b4cdd69..6395983 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -346,8 +346,6 @@ static int pci_legacy_suspend(struct device *dev, pm_message_t state)
if (drv && drv->suspend) {
i = drv->suspend(pci_dev, state);
suspend_report_result(drv->suspend, i);
- } else {
- pci_default_pm_suspend(pci_dev);
}
return i;
}
@@ -361,7 +359,8 @@ static int pci_legacy_suspend_late(struct device *dev, pm_message_t state)
if (drv && drv->suspend_late) {
i = drv->suspend_late(pci_dev, state);
suspend_report_result(drv->suspend_late, i);
- }
+ } else if (!drv || !drv->suspend)
+ pci_default_pm_suspend(pci_dev);
return i;
}

@@ -373,8 +372,6 @@ static int pci_legacy_resume(struct device *dev)

if (drv && drv->resume)
error = drv->resume(pci_dev);
- else
- error = pci_default_pm_resume(pci_dev);
return error;
}

@@ -386,6 +383,8 @@ static int pci_legacy_resume_early(struct device *dev)

if (drv && drv->resume_early)
error = drv->resume_early(pci_dev);
+ else if (!drv || !drv->resume)
+ error = pci_default_pm_resume(pci_dev);
return error;
}

@@ -420,8 +419,6 @@ static int pci_pm_suspend(struct device *dev)
if (drv->pm->suspend) {
error = drv->pm->suspend(dev);
suspend_report_result(drv->pm->suspend, error);
- } else {
- pci_default_pm_suspend(pci_dev);
}
} else {
error = pci_legacy_suspend(dev, PMSG_SUSPEND);
@@ -441,7 +438,8 @@ static int pci_pm_suspend_noirq(struct device *dev)
if (drv->pm->suspend_noirq) {
error = drv->pm->suspend_noirq(dev);
suspend_report_result(drv->pm->suspend_noirq, error);
- }
+ } else if (!drv->pm->suspend)
+ pci_default_pm_suspend(pci_dev);
} else {
error = pci_legacy_suspend_late(dev, PMSG_SUSPEND);
}
@@ -458,8 +456,7 @@ static int pci_pm_resume(struct device *dev)
pci_fixup_device(pci_fixup_resume, pci_dev);

if (drv && drv->pm) {
- error = drv->pm->resume ? drv->pm->resume(dev) :
- pci_default_pm_resume(pci_dev);
+ error = drv->pm->resume ? drv->pm->resume(dev) : 0;
} else {
error = pci_legacy_resume(dev);
}
@@ -467,6 +464,7 @@ static int pci_pm_resume(struct device *dev)
return error;
}

+
static int pci_pm_resume_noirq(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
@@ -478,6 +476,8 @@ static int pci_pm_resume_noirq(struct device *dev)
if (drv && drv->pm) {
if (drv->pm->resume_noirq)
error = drv->pm->resume_noirq(dev);
+ else if (!drv->pm->resume)
+ error = pci_default_pm_resume(pci_dev);
} else {
error = pci_legacy_resume_early(dev);
}

Rafael J. Wysocki

2008-12-05 22:01:44 UTC

Post by Linus Torvalds

Post by Linus Torvalds
The third thing that worries me is the _very_ early occurrence of
ACPI: Waking up from system sleep state S3
APIC error on CPU1: 00(40)
ACPI: EC: non-query interrupt received, switching to interrupt mode
Now, that "APIC error" thing is worrisome. It's worrisome for multiple
reasons:
- errors are never good (0x40 means "received illegal vector", whatever
caused _that_)
- more importantly, it seems to imply that interrupts are enabled on
CPU1, and they sure as hell shouldn't be enabled at this stage!
Do we perhaps have a SMP resume bug where we resume the other CPU's
with interrupts enabled?
- the "ACPI: EC: non-query interrupt received, switching to interrupt
mode" thing is from ACPI, and _also_ implies that interrupts are on.
Why are interrupts enabled that early? I really don't like seeing
interrupts enabled before we've even done the basic PCI resume.

Oh, I finally started looking more at this.
It's because the PCI layer uses the late resume for resuming. Which is
horrid. It really shouldn't. Resuming your device after interrupts were
enabled really sounds like a disaster. *Especially* if the device was
active before, either because of hibernation or simply because firmware
pre-initialized it to some "live" state (which could easily happen with
ethernet in particular).
I do wonder why the PCI layer wants to resume things so late. It sounds
totally insane to do things like resume the PCI bridge setup long *after*
you have resumed other devices early. So by doing the default resume late,
it just means that nobody can possibly use the early resume, and now
everybody needs to resume everything with interrupts already going full
blast.
IOW, the _sane_ thing would be to do something like the following patch
does, namely:
- if the driver has a suspend or suspend_early function, _only_ call that
(whether legacy or not)
- otherwise, do the default suspend/resume late/early with interrupts
disabled.
which means that by default, we'll do all save-restore of the PCI state
as close to the actual CPU suspend event as possible.
Of course, hibernate probably depends on ->suspend() saving state, which
it won't. Again, if that's the case, then that's just hibernate (again)
being totally fundamentally broken, and messing with STR functions.

No, hibernate doesn't care whether ->suspend() or ->suspend_late() saves
the state, if that's what you mean. Also, we have the hibernation-specific
callbacks in the new framework anyway.

Post by Linus Torvalds
Rafael?

Well, actually I think we should go further and save the standard config
registers of _all_ PCI devices in the _late() callbacks (ie. with interrupts
disabled) and restore them in the _early() callbacks.

I don't really understand why pci_restore_state() is not called by the core
and every single driver calls it by itself. Moreover, many of them
call pci_set_power_state(dev, PCI_D0) before calling pci_restore_state(),
although this is not really necessary, because they subsequently call
pci_enable_device() which calls pci_set_power_state(dev, PCI_D0) again.

IOW, I would split the resume of PCI devices into two parts, the first of
which will call pci_restore_state() with interrupts disabled and the second
will do the remaining stuff.

Post by Linus Torvalds
I have neither tested the patch nor even tried to compile it - it's meant
to be an example and get people thinking about this, rather than anything
else.

I'm going to try it, though, and see what happens. ;-)

Thanks,
Rafael

Post by Linus Torvalds
---
drivers/pci/pci-driver.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index b4cdd69..6395983 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -346,8 +346,6 @@ static int pci_legacy_suspend(struct device *dev, pm_message_t state)
if (drv && drv->suspend) {
i = drv->suspend(pci_dev, state);
suspend_report_result(drv->suspend, i);
- } else {
- pci_default_pm_suspend(pci_dev);
}
return i;
}
@@ -361,7 +359,8 @@ static int pci_legacy_suspend_late(struct device *dev, pm_message_t state)
if (drv && drv->suspend_late) {
i = drv->suspend_late(pci_dev, state);
suspend_report_result(drv->suspend_late, i);
- }
+ } else if (!drv || !drv->suspend)
+ pci_default_pm_suspend(pci_dev);
return i;
}
@@ -373,8 +372,6 @@ static int pci_legacy_resume(struct device *dev)
if (drv && drv->resume)
error = drv->resume(pci_dev);
- else
- error = pci_default_pm_resume(pci_dev);
return error;
}
@@ -386,6 +383,8 @@ static int pci_legacy_resume_early(struct device *dev)
if (drv && drv->resume_early)
error = drv->resume_early(pci_dev);
+ else if (!drv || !drv->resume)
+ error = pci_default_pm_resume(pci_dev);
return error;
}
@@ -420,8 +419,6 @@ static int pci_pm_suspend(struct device *dev)
if (drv->pm->suspend) {
error = drv->pm->suspend(dev);
suspend_report_result(drv->pm->suspend, error);
- } else {
- pci_default_pm_suspend(pci_dev);
}
} else {
error = pci_legacy_suspend(dev, PMSG_SUSPEND);
@@ -441,7 +438,8 @@ static int pci_pm_suspend_noirq(struct device *dev)
if (drv->pm->suspend_noirq) {
error = drv->pm->suspend_noirq(dev);
suspend_report_result(drv->pm->suspend_noirq, error);
- }
+ } else if (!drv->pm->suspend)
+ pci_default_pm_suspend(pci_dev);
} else {
error = pci_legacy_suspend_late(dev, PMSG_SUSPEND);
}
@@ -458,8 +456,7 @@ static int pci_pm_resume(struct device *dev)
pci_fixup_device(pci_fixup_resume, pci_dev);
if (drv && drv->pm) {
- error = drv->pm->resume ? drv->pm->resume(dev) :
- pci_default_pm_resume(pci_dev);
+ error = drv->pm->resume ? drv->pm->resume(dev) : 0;
} else {
error = pci_legacy_resume(dev);
}
@@ -467,6 +464,7 @@ static int pci_pm_resume(struct device *dev)
return error;
}
+
static int pci_pm_resume_noirq(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
@@ -478,6 +476,8 @@ static int pci_pm_resume_noirq(struct device *dev)
if (drv && drv->pm) {
if (drv->pm->resume_noirq)
error = drv->pm->resume_noirq(dev);
+ else if (!drv->pm->resume)
+ error = pci_default_pm_resume(pci_dev);
} else {
error = pci_legacy_resume_early(dev);
}

--
Everyone knows that debugging is twice as hard as writing a program
in the first place. So if you're as clever as you can be when you write it,
how will you ever debug it? --- Brian Kernighan

Linus Torvalds

2008-12-05 22:14:23 UTC

Post by Rafael J. Wysocki
Well, actually I think we should go further and save the standard config
registers of _all_ PCI devices in the _late() callbacks (ie. with interrupts
disabled) and restore them in the _early() callbacks.

That would certainly simplify the code.

Post by Rafael J. Wysocki
I don't really understand why pci_restore_state() is not called by the core
and every single driver calls it by itself.

The idea was to allow PCI drivers to override it if they wanted to.

That said, the ones that do their own state restore generally do it wrong
(eg the USB host controllers doing things in the wrong order and enabling
the device before having actually written back the BAR values), so it's
arguably broken to let drivers override it.

Post by Rafael J. Wysocki
IOW, I would split the resume of PCI devices into two parts, the first of
which will call pci_restore_state() with interrupts disabled and the second
will do the remaining stuff.

I would definitely not disagree with that - leave the regular
"suspend/resume" callbacks for purely higher-level actions. It would
be interesting to hear what it does for you.

Linus

Rafael J. Wysocki

2008-12-06 00:04:49 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
Well, actually I think we should go further and save the standard config
registers of _all_ PCI devices in the _late() callbacks (ie. with interrupts
disabled) and restore them in the _early() callbacks.

That would certainly simplify the code.

Post by Rafael J. Wysocki
I don't really understand why pci_restore_state() is not called by the core
and every single driver calls it by itself.

The idea was to allow PCI drivers to override it if they wanted to.
That said, the ones that do their own state restore generally do it wrong
(eg the USB host controllers doing things in the wrong order and enabling
the device before having actually written back the BAR values), so it's
arguably broken to let drivers override it.

Post by Rafael J. Wysocki
IOW, I would split the resume of PCI devices into two parts, the first of
which will call pci_restore_state() with interrupts disabled and the second
will do the remaining stuff.

I would definitely not disagree with that - leave the regular
"suspend/resume" callbacks for purely higher-level actions. It would
be interesting to hear what it does for you.

I tested the appended patch with suspend-to-RAM and it just hangs during
resume.

Next, I'll try to do that only for devices whose drivers don't define their
own suspend-resume callbacks at all.

Thanks,
Rafael

---
drivers/pci/pci-driver.c | 50 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/pci/pci-driver.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-driver.c
+++ linux-2.6/drivers/pci/pci-driver.c
@@ -300,6 +300,46 @@ static void pci_device_shutdown(struct d

#ifdef CONFIG_PM_SLEEP

+static void pci_default_suspend_noirq(struct pci_dev *pci_dev)
+{
+ dev_info(&pci_dev->dev, "saving standard PCI config registers\n");
+
+ /* save the PCI config space */
+ pci_save_state(pci_dev);
+ /*
+ * mark its power state as "unknown", since we don't know if
+ * e.g. the BIOS will change its device state when we suspend.
+ */
+ if (pci_dev->current_state == PCI_D0)
+ pci_dev->current_state = PCI_UNKNOWN;
+}
+
+static void pci_default_resume_noirq(struct pci_dev *pci_dev)
+{
+ dev_info(&pci_dev->dev, "restoring standard PCI config registers\n");
+
+ /* restore the PCI config space */
+ pci_restore_state(pci_dev);
+}
+
+static int pci_default_resume(struct pci_dev *pci_dev)
+{
+ int retval = 0;
+
+ dev_info(&pci_dev->dev, "trying to reenable device\n");
+
+ /* if the device was enabled before suspend, reenable */
+ retval = pci_reenable_device(pci_dev);
+ /*
+ * if the device was busmaster before the suspend, make it busmaster
+ * again
+ */
+ if (pci_dev->is_busmaster)
+ pci_set_master(pci_dev);
+
+ return retval;
+}
+
/*
* Default "suspend" method for devices that have no driver provided suspend,
* or not even a driver at all.
@@ -346,8 +386,6 @@ static int pci_legacy_suspend(struct dev
if (drv && drv->suspend) {
i = drv->suspend(pci_dev, state);
suspend_report_result(drv->suspend, i);
- } else {
- pci_default_pm_suspend(pci_dev);
}
return i;
}
@@ -362,6 +400,9 @@ static int pci_legacy_suspend_late(struc
i = drv->suspend_late(pci_dev, state);
suspend_report_result(drv->suspend_late, i);
}
+
+ pci_default_suspend_noirq(pci_dev);
+
return i;
}

@@ -374,7 +415,7 @@ static int pci_legacy_resume(struct devi
if (drv && drv->resume)
error = drv->resume(pci_dev);
else
- error = pci_default_pm_resume(pci_dev);
+ error = pci_default_resume(pci_dev);
return error;
}

@@ -384,8 +425,11 @@ static int pci_legacy_resume_early(struc
struct pci_dev * pci_dev = to_pci_dev(dev);
struct pci_driver * drv = pci_dev->driver;

+ pci_default_resume_noirq(pci_dev);
+
if (drv && drv->resume_early)
error = drv->resume_early(pci_dev);
+
return error;
}

Linus Torvalds

2008-12-06 00:50:34 UTC

Post by Rafael J. Wysocki
I tested the appended patch with suspend-to-RAM and it just hangs during
resume.

That patch looks bogus. It only changes the "legacy" cases as far as I can
tell, so anybody who has drv->pm set will now not do any state save at
all.

Or am I misreading it?

Linus

Rafael J. Wysocki

2008-12-06 01:18:12 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
I tested the appended patch with suspend-to-RAM and it just hangs during
resume.

That patch looks bogus. It only changes the "legacy" cases as far as I can
tell, so anybody who has drv->pm set will now not do any state save at
all.
Or am I misreading it?

It only affects the legacy handling, but the non-legacy handling was left
untouched. IOW, the old "default" functions are still there and are being
called by the "non-legacy" code (it's only used by USB at the moment, AFAICS).

Anyway, I did the test doing it only to the devices which don't have any
non-default suspend-resume handling at all and _that_ apparently fixed the
problem on my box. :-)

Appended is a very crude version of the patch (it duplicates some code),
tomorrow I'll post a cleaned-up version.

I'm still thinking it would be reasonable to save standard config registers for
all devices with interrupts disabled, but that appears to be more complicated
than I thought it would be.

Thanks,
Rafael

---
drivers/pci/pci-driver.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)

Index: linux-2.6/drivers/pci/pci-driver.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-driver.c
+++ linux-2.6/drivers/pci/pci-driver.c
@@ -300,6 +300,54 @@ static void pci_device_shutdown(struct d

#ifdef CONFIG_PM_SLEEP

+static void pci_default_suspend_noirq(struct pci_dev *pci_dev)
+{
+ dev_info(&pci_dev->dev, "saving standard PCI config registers\n");
+
+ /* save the PCI config space */
+ pci_save_state(pci_dev);
+ /*
+ * mark its power state as "unknown", since we don't know if
+ * e.g. the BIOS will change its device state when we suspend.
+ */
+ if (pci_dev->current_state == PCI_D0)
+ pci_dev->current_state = PCI_UNKNOWN;
+}
+
+static void pci_default_resume_noirq(struct pci_dev *pci_dev)
+{
+ dev_info(&pci_dev->dev, "restoring standard PCI config registers\n");
+
+ /* restore the PCI config space */
+ pci_restore_state(pci_dev);
+}
+
+static int pci_default_resume(struct pci_dev *pci_dev)
+{
+ int retval = 0;
+
+ dev_info(&pci_dev->dev, "trying to reenable device\n");
+
+ /* if the device was enabled before suspend, reenable */
+ retval = pci_reenable_device(pci_dev);
+ /*
+ * if the device was busmaster before the suspend, make it busmaster
+ * again
+ */
+ if (pci_dev->is_busmaster)
+ pci_set_master(pci_dev);
+
+ return retval;
+}
+
+static bool pci_has_legacy_pm_handling(struct pci_dev *pci_dev)
+{
+ struct pci_driver *drv = pci_dev->driver;
+
+ return drv && (drv->suspend || drv->suspend_late || drv->resume
+ || drv->resume_early);
+}
+
/*
* Default "suspend" method for devices that have no driver provided suspend,
* or not even a driver at all.
@@ -343,6 +391,9 @@ static int pci_legacy_suspend(struct dev
struct pci_driver * drv = pci_dev->driver;
int i = 0;

+ if (!pci_has_legacy_pm_handling(pci_dev))
+ return 0;
+
if (drv && drv->suspend) {
i = drv->suspend(pci_dev, state);
suspend_report_result(drv->suspend, i);
@@ -358,10 +409,16 @@ static int pci_legacy_suspend_late(struc
struct pci_driver * drv = pci_dev->driver;
int i = 0;

+ if (!pci_has_legacy_pm_handling(pci_dev)) {
+ pci_default_suspend_noirq(pci_dev);
+ return 0;
+ }
+
if (drv && drv->suspend_late) {
i = drv->suspend_late(pci_dev, state);
suspend_report_result(drv->suspend_late, i);
}
+
return i;
}

@@ -371,6 +428,9 @@ static int pci_legacy_resume(struct devi
struct pci_dev * pci_dev = to_pci_dev(dev);
struct pci_driver * drv = pci_dev->driver;

+ if (!pci_has_legacy_pm_handling(pci_dev))
+ return pci_default_resume(pci_dev);
+
if (drv && drv->resume)
error = drv->resume(pci_dev);
else
@@ -384,8 +444,14 @@ static int pci_legacy_resume_early(struc
struct pci_dev * pci_dev = to_pci_dev(dev);
struct pci_driver * drv = pci_dev->driver;

+ if (!pci_has_legacy_pm_handling(pci_dev)) {
+ pci_default_resume_noirq(pci_dev);
+ return 0;
+ }
+
if (drv && drv->resume_early)
error = drv->resume_early(pci_dev);
+
return error;
}

Linus Torvalds

2008-12-06 01:55:16 UTC

Post by Rafael J. Wysocki
It only affects the legacy handling, but the non-legacy handling was left
untouched. IOW, the old "default" functions are still there and are being
called by the "non-legacy" code (it's only used by USB at the moment, AFAICS).

Ok.

Post by Rafael J. Wysocki
Anyway, I did the test doing it only to the devices which don't have any
non-default suspend-resume handling at all and _that_ apparently fixed the
problem on my box. :-)

Which makes sense, btw. Because if you do the pci_save_state() on a device
that _does_ have a suspend function, you'll be saving the post-suspend
state - ie the device turned off.

So yeah, we really can only do the default suspend if the device has no
pre-existing suspend function - or we'd need to make sure that all PCI
drivers that do have suspend functions would only do the higher-level
functionality.
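
The point about saving the post-suspend state can be shown with a toy model. Everything here is a stand-in invented for illustration: struct model_dev and the model_* functions are hypothetical, and only the two command-bit values are borrowed from the real PCI command register layout.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_CMD_IO     0x1u	/* same bit as PCI_COMMAND_IO */
#define MODEL_CMD_MASTER 0x4u	/* same bit as PCI_COMMAND_MASTER */

struct model_dev {
	uint16_t command;	/* live command register */
	uint16_t saved_command;	/* copy taken by the core's default save */
};

/* Stand-in for a driver ->suspend() that turns the device off. */
static void model_driver_suspend(struct model_dev *d)
{
	assert(d);
	d->command = 0;
}

/* Stand-in for the core saving config space in the late phase. */
static void model_core_save(struct model_dev *d)
{
	assert(d);
	d->saved_command = d->command;
}

/* Stand-in for the core restoring config space in the early phase. */
static void model_core_restore(struct model_dev *d)
{
	assert(d);
	d->command = d->saved_command;
}
```

If model_driver_suspend() runs before model_core_save(), the restore brings back a command register of 0, i.e. a disabled device. If the core save is the only thing that touches the device before power is lost, the restore brings back the original enabled state.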

Anyway, what I'm most interested in hearing is whether this actually
improves your situation. I can _easily_ see that your resume problem could
be due to interrupt timing. That's especially true if there are shared
interrupts, but even in the absence of that, I'm not at all sure that the
e1000e resume code is interrupt-safe, for example.

Linus

Rafael J. Wysocki

2008-12-06 02:18:07 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
It only affects the legacy handling, but the non-legacy handling was left
untouched. IOW, the old "default" functions are still there and are being
called by the "non-legacy" code (it's only used by USB at the moment, AFAICS).

Ok.

Post by Rafael J. Wysocki
Anyway, I did the test doing it only to the devices which don't have any
non-default suspend-resume handling at all and _that_ apparently fixed the
problem on my box. :-)

Which makes sense, btw. Because if you do the pci_save_state() on a device
that _does_ have a suspend function, you'll be saving the post-suspend
state - ie the device turned off.
So yeah, we really can only do the default suspend if the device has no
pre-existing suspend function - or we'd need to make sure that all PCI
drivers that do have suspend functions would only do the higher-level
functionality.
Anyway, what I'm most interested in hearing is whether this actually
improves your situation.

Yes, it does, from what I can tell at the moment. :-)

Tomorrow I'll do more testing to (hopefully) confirm that.

Post by Linus Torvalds
I can _easily_ see that your resume problem could be due to interrupt
timing. That's especially true if there are shared interrupts, but even in
the absence of that, I'm not at all sure that the e1000e resume code is
interrupt-safe, for example.

Agreed.

Thanks,
Rafael

Rafael J. Wysocki

2008-12-06 13:53:23 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds

Post by Rafael J. Wysocki
It only affects the legacy handling, but the non-legacy handling was left
untouched. IOW, the old "default" functions are still there and are being
called by the "non-legacy" code (it's only used by USB at the moment, AFAICS).

Ok.

Post by Rafael J. Wysocki
Anyway, I did the test doing it only to the devices which don't have any
non-default suspend-resume handling at all and _that_ apparently fixed the
problem on my box. :-)

Which makes sense, btw. Because if you do the pci_save_state() on a device
that _does_ have a suspend function, you'll be saving the post-suspend
state - ie the device turned off.
So yeah, we really can only do the default suspend if the device has no
pre-existing suspend function - or we'd need to make sure that all PCI
drivers that do have suspend functions would only do the higher-level
functionality.
Anyway, what I'm most interested in hearing is whether this actually
improves your situation.

Yes, it does, from what I can tell at the moment. :-)

OK, this patch alone doesn't fix the problem, ie. I was able to reproduce it
with this patch applied, but it decreases the probability of a failure.

_However_, when I added two more patches to the mix:
- a patch that moves the PCI Express port suspend and resume to functions
executed with interrupts disabled
- a patch that moves the restoration of the PCI config space in snd_hda_intel
to a ->resume_early() callback
I'm not able to reproduce the problem any more (I did over 20
hibernation-resume cycles with this combination of patches applied with
occasional suspend-to-RAM-resume cycles in between and there were no problems
resuming).

I'm going to post the three patches in a separate thread for discussion.

Thanks,
Rafael

Greg KH

2008-12-06 02:45:08 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
It only affects the legacy handling, but the non-legacy handling was left
untouched. IOW, the old "default" functions are still there and are being
called by the "non-legacy" code (it's only used by USB at the moment, AFAICS).

Ok.

Post by Rafael J. Wysocki
Anyway, I did the test doing it only to the devices which don't have any
non-default suspend-resume handling at all and _that_ apparently fixed the
problem on my box. :-)

Which makes sense, btw. Because if you do the pci_save_state() on a device
that _does_ have a suspend function, you'll be saving the post-suspend
state - ie the device turned off.

I think that is why we did not do it for every device, we didn't want to
touch drivers that already had working suspend calls.

Post by Linus Torvalds
So yeah, we really can only do the default suspend if the device has no
pre-existing suspend function - or we'd need to make sure that all PCI
drivers that do have suspend functions would only do the higher-level
functionality.

Agreed.

thanks,

greg k-h

Frans Pop

2008-12-06 09:20:17 UTC

Post by Linus Torvalds
Greg, Jesse, can you think about and look at the USB PCI resume ordering?

[...]

Post by Linus Torvalds
In many ways the bigger worry is actually in the totally unrelated USB
UHCI and EHCI drivers that resume _before_ the bridge does:
uhci_hcd 0000:00:1d.2: enabling device (0000 -> 0001)
uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
uhci_hcd 0000:00:1d.2: setting latency timer to 64
uhci_hcd 0000:00:1d.2: restoring config space at offset 0xf (was 0x300, writing 0x30b)
uhci_hcd 0000:00:1d.2: restoring config space at offset 0x8 (was 0x1, writing 0x2101)
usb usb7: root hub lost power or was reset
ehci_hcd 0000:00:1d.7: enabling device (0000 -> 0002)
ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 20 (level, low) -> IRQ 20
ehci_hcd 0000:00:1d.7: setting latency timer to 64
ehci_hcd 0000:00:1d.7: restoring config space at offset 0xf (was 0x100, writing 0x10a)
ehci_hcd 0000:00:1d.7: restoring config space at offset 0x4 (was 0x0, writing 0xe0648000)
and the worry I have here is that we actually enable the device
_before_ we've restored the BAR information. That sounds very iffy. It
sounds doubly iffy in the 'resume from hibernate' case, where we are
going to have an already-set-up PCI bus and the config space values are
going to all be live as we reprogram them.
That "restoring config space at offset 0x8" thing is where we restore
the BAR (dword 0x8 = offset 0x20 = PCI_BASE_ADDRESS_4), and we're
changing it from 0x1 to 0x2101, with the IO BAR enabled. In this case,
the old value meant that the BAR started out disabled, but hibernate
would have been different.
So I'd _much_ rather have seen the sequence have the BAR restore
sequence be something like
uhci_hcd 0000:00:1d.2: restoring config space at offset 0xf (was 0x300, writing 0x30b)
uhci_hcd 0000:00:1d.2: restoring config space at offset 0x8 (was 0x1, writing 0x2101)
uhci_hcd 0000:00:1d.2: enabling device (0000 -> 0001)
uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
uhci_hcd 0000:00:1d.2: setting latency timer to 64
instead. Possibly even with an explicit disable of the
memory/IO/busmaster bits before the whole sequence.

I've taken a very naive look at this, basically by comparing what usb
(usb/core/hcd-pci.c) does with what other drivers do.

I used the following command to get an overview:
$ git grep -E -n -C5 "pci_(enable_device|set_master|restore_state|set_power_state.*D0)"
(The line numbers give some indication whether work is split over functions.)

Most drivers seem to do some variation of the following, which looks
logical and is in line with Documentation/power/pci.txt:
pci_set_power_state(dev, PCI_D0);
pci_restore_state(dev);
pci_enable_device(dev);
pci_set_master(dev);

But quite a lot of drivers (including usb and e.g. ide/setup-pci.c) do
something like:
pci_enable_device(dev);
pci_set_master(dev);
pci_restore_state(dev);

Maybe the whole tree should get a review for this?
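
To make the difference between the two call orders concrete, here is a toy userspace model. It is only a sketch: the mock_* functions and struct mock_dev are hypothetical stand-ins, not the kernel's pci_restore_state() or pci_enable_device(), and the model tracks nothing except whether address decoding was ever switched on while the BAR still held its pre-restore value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a PCI function; NOT the kernel's struct pci_dev. */
struct mock_dev {
    uint32_t saved_bar;    /* BAR value captured at suspend time */
    uint32_t live_bar;     /* what the register holds right after wakeup */
    bool decoding;         /* I/O or memory decode enabled */
    bool decoded_stale;    /* decode was on while live_bar != saved_bar */
};

static void mock_restore_state(struct mock_dev *d)
{
    d->live_bar = d->saved_bar;
}

static void mock_enable_device(struct mock_dev *d)
{
    d->decoding = true;
    if (d->live_bar != d->saved_bar)
        d->decoded_stale = true;
}

/* Order seen in the uhci_hcd log: enable first, restore afterwards. */
static bool stale_with_enable_then_restore(uint32_t saved, uint32_t after_wakeup)
{
    struct mock_dev d = { .saved_bar = saved, .live_bar = after_wakeup };

    mock_enable_device(&d);
    mock_restore_state(&d);
    return d.decoded_stale;
}

/* Order most drivers use: restore first, then enable. */
static bool stale_with_restore_then_enable(uint32_t saved, uint32_t after_wakeup)
{
    struct mock_dev d = { .saved_bar = saved, .live_bar = after_wakeup };

    mock_restore_state(&d);
    mock_enable_device(&d);
    return d.decoded_stale;
}
```

Fed the values from the log (saved 0x2101, live 0x1 after wakeup), the first order reports a window in which the device decodes with a stale BAR and the second does not. Whether that window is what actually hurts on real hardware is exactly the open question in this thread.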

Anyway, I gave the patch below a try on both my notebook and desktop.
My desktop has USB keyboard and mouse and the notebook has wireless and
a fingerprint scanner on USB. Everything still worked after resume.

Diff of the resume dmesg of my notebook attached. Looks better I think?

Cheers,
FJP

Rafael J. Wysocki

2008-12-06 13:48:30 UTC

Post by Frans Pop

Post by Linus Torvalds
Greg, Jesse, can you think about and look at the USB PCI resume ordering?

[...]

Post by Linus Torvalds
In many ways the bigger worry is actually in the totally unrelated USB
uhci_hcd 0000:00:1d.2: enabling device (0000 -> 0001)
uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
uhci_hcd 0000:00:1d.2: setting latency timer to 64
uhci_hcd 0000:00:1d.2: restoring config space at offset 0xf (was 0x300, writing 0x30b)
uhci_hcd 0000:00:1d.2: restoring config space at offset 0x8 (was 0x1, writing 0x2101)
usb usb7: root hub lost power or was reset
ehci_hcd 0000:00:1d.7: enabling device (0000 -> 0002)
ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 20 (level, low) -> IRQ 20
ehci_hcd 0000:00:1d.7: setting latency timer to 64
ehci_hcd 0000:00:1d.7: restoring config space at offset 0xf (was 0x100, writing 0x10a)
ehci_hcd 0000:00:1d.7: restoring config space at offset 0x4 (was 0x0, writing 0xe0648000)
and the worry I have here is that we actually enable the device
_before_ we've restored the BAR information. That sounds very iffy. It
sounds doubly iffy in the 'resume from hibernate' case, where we are
going to have an already-set-up PCI bus and the config space values are
going to all be live as we reprogram them.
That "restoring config space at offset 0x8" thing is where we restore
the BAR (dword 0x8 = offset 0x20 = PCI_BASE_ADDRESS_4), and we're
changing it from 0x1 to 0x2101, with the IO BAR enabled. In this case,
the old value meant that the BAR started out disabled, but hibernate
would have been different.
So I'd _much_ rather have seen the sequence have the BAR restore
sequence be something like
uhci_hcd 0000:00:1d.2: restoring config space at offset 0xf (was 0x300, writing 0x30b)
uhci_hcd 0000:00:1d.2: restoring config space at offset 0x8 (was 0x1, writing 0x2101)
uhci_hcd 0000:00:1d.2: enabling device (0000 -> 0001)
uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
uhci_hcd 0000:00:1d.2: setting latency timer to 64
instead. Possibly even with an explicit disable of the
memory/IO/busmaster bits before the whole sequence.

I've taken a very naive look at this, basically by comparing what usb
(usb/core/hcd-pci.c) is doing compared to other drivers.
$ git grep -E -n -C5 "pci_(enable_device|set_master|restore_state|power_state.*D0)"
(The line numbers give some indication whether work is split over functions.)
Most drivers seem to do some variation of the following, which looks
pci_set_power_state(dev, PCI_D0);
pci_restore_state(dev);
pci_enable_device(dev);
pci_set_master(dev);
But quite a lot of drivers (including usb and e.g. ide/setup-pci.c) do
pci_enable_device(dev);
pci_set_master(dev);
pci_restore_state(dev);
Maybe the whole tree should get a review for this?

I think so.

Post by Frans Pop
Anyway, I gave the patch below a try on both my notebook and desktop.
My desktop has USB keyboard and mouse and the notebook has wireless and
a fingerprint scanner on USB. Everything still worked after resume.
Diff of the resume dmesg of my notebook attached. Looks better I think?

Yes, to me it does.

In fact, I think you could even move the pci_restore_state(dev) into
usb_hcd_pci_resume_early() that would be executed with interrupts off and
drop the pci_set_power_state(dev, PCI_D0); entirely (the
pci_enable_device(dev); would invoke it anyway).

Thanks,
Rafael

PS
Please append patches instead of attaching them, they are a lot easier to
discuss this way.

Frans Pop

2008-12-06 15:02:23 UTC

Post by Rafael J. Wysocki
In fact, I think you could even move the pci_restore_state(dev) into
usb_hcd_pci_resume_early() that would be executed with interrupts off
and drop the pci_set_power_state(dev, PCI_D0); entirely (the
pci_enable_device(dev); would invoke it anyway).

There is quite a lot going on in the resume function that happens before
pci_restore_state is called and I'm absolutely not confident with messing
with that, so I'll leave anything more invasive to the USB maintainers.

Cheers,
FJP

Rafael J. Wysocki

2008-12-04 22:46:21 UTC

Post by Frans Pop

Post by Linus Torvalds

Post by Frans Pop
I've given your patch a try and the few resumes from STR I've done
were all successful. That's not 100% conclusive yet, but a nice
start. Some info from logs etc. below.

Ok, but I thought you had a hard time reproducing this _anyway_, even
with just plain -rc7. No?

Well, I had a failure rate of about 1 in 5-10 resumes originally.
See: http://bugzilla.kernel.org/show_bug.cgi?id=11545
Then I found the 2 workarounds and *with those in place* I got almost 100%
reliable resumes. Now I've removed those workarounds and with either the
revert or your oneliner I still get 100% success.
From my PoV that is a very definite improvement: the machine now "feels" a
hell of a lot more reliable for critical use.
So I _could_ reproduce it reliably given enough suspend/resume cycles.
But I guess this does support your suspicion that it may be a timing
issue: if the timing happens to be right, the resume succeeds; if it's
wrong I get a dead box.
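
For a sense of how much a short run of clean resumes proves at that failure rate, a back-of-the-envelope sketch follows. It assumes each resume fails independently with a fixed probability, which is a simplification and may well not hold for a timing bug; the function name is invented for this example.

```c
#include <assert.h>

/*
 * Probability that n resumes in a row all succeed, if each one fails
 * independently with probability p_fail (an assumption, see above).
 */
static double clean_streak_probability(double p_fail, unsigned int n)
{
    double p = 1.0;

    assert(p_fail >= 0.0 && p_fail <= 1.0);
    for (unsigned int i = 0; i < n; i++)
        p *= 1.0 - p_fail;
    return p;
}
```

At a failure rate of 1 in 5, five clean resumes in a row still happen about a third of the time by luck alone, and at 1 in 10 well over half the time. Twenty clean cycles bring that down to roughly 1% and 12% respectively, so a handful of successful resumes says little about whether a patch fixed anything.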

Post by Linus Torvalds
Since it's apparently STR, has anybody gotten _anything_ sane out of
trying to enable PM_TRACE_RTC, and then doing that
echo 1 > /sys/power/pm_trace

I did try that at the beginning. That's how I ended up removing e1000e
before suspend. See http://bugzilla.kernel.org/show_bug.cgi?id=11545.
My next hint was that Matthew Garrett, who has the same notebook, was
surprised at my resume problems as he did not see them. So I did a
comparison of our kernel configs and made some changes to mine. From
that I found that a very low value for SND_HDA_POWER_SAVE_DEFAULT (5)
reduced the failure rate to practically zero.
At some point I tried keeping e1000e loaded for a bit, but that quickly
gave me a failure again, so I started removing it again during suspend.
So I did have some data, but as I got no response on my BR I had no idea
where to go from there. I was really very happy to see Rafael's mail as
his description almost exactly matched what I had been seeing.
I'd be happy to run with unpatched kernels for a while and do some more
pm_traces, but only if someone is going to follow up and interpret the
results for me or provide suggestions for targeted additional debugging.

Please go for it, I'm very interested in understanding the underlying problem.

Thanks,
Rafael

Rafael J. Wysocki

2008-12-04 22:40:58 UTC

Post by Linus Torvalds

Post by Frans Pop

Well, I think that what _would_ be generally correct, and actually
pretty simple, is a rather different approach: just not sizing things
behind a transparent bridge AT ALL, since it really shouldn't matter.

I've given your patch a try and the few resumes from STR I've done were
all successful. That's not 100% conclusive yet, but a nice start.
Some info from logs etc. below.

Ok, but I thought you had a hard time reproducing this _anyway_, even with
just plain -rc7. No?
That said, of the various patches posted, the "don't bother allocating
bridging windows for transparent bridges" one is not just the simplest,
but the only one that actually makes sense so far.
So I'm happy it's apparently working for you, I'm just wondering about
whether your success means a lot. It seems that Rafael is the one who had
more failures?

This most probably is correct and I got a resume failure with that patch
applied, so it evidently doesn't fix the problem. :-(

Post by Linus Torvalds

Post by Frans Pop

Also, I would be happy to actually understand _why_ this happens.

100% agreed. I do _not_ see why it should ever matter how we set up a
PCI bridging window - whether prefetchable or not - on a bridge that
should be transparent. It sounds really odd. I'm wondering if there is
something we're missing here.

The theory that it is really a resume issue and not a device layout issue
sounds logical. Especially as everything always works correctly after a
normal boot.

Yes, that does sound like a convincing argument. Usually real PCI resource
clashes result in some kind of run-time problems, and wouldn't necessarily
be suspend-specific per se.
That said, suspend/resume does a lot of unusual things, so it could still
be some odd PCI resource clash that only triggers problems in the
suspend/resume case. But since the exact layouts and the sizing of the
resources doesn't really seem to matter, a simple PCI resource clash seems
rather unlikely.

I agree.

That said the "don't bother allocating bridging windows for transparent
bridges" patch resulted in the following layout on my box (from /proc/iomem):

88000000-8807ffff : 0000:00:02.1
88080000-88083fff : 0000:00:1b.0
88080000-88083fff : ICH HD audio
88084000-88087fff : 0000:03:0b.1
88088000-88088fff : 0000:03:0b.0
88088000-88088fff : yenta_socket
88089000-880897ff : 0000:03:0b.1
88089000-880897ff : firewire_ohci
88089800-880898ff : 0000:03:0b.3
88089800-880898ff : mmc0
8808a000-8808afff : Intel Flush Page
8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04

while my "don't allocate bridging windows for cardbus bridges behind
transparent bridges" patch I've just sent (appended for easier reference)
results in the layout:

88000000-880fffff : PCI Bus 0000:03
88000000-88003fff : 0000:03:0b.1
88004000-88004fff : 0000:03:0b.0
88004000-88004fff : yenta_socket
88005000-880057ff : 0000:03:0b.1
88005000-880057ff : firewire_ohci
88005800-880058ff : 0000:03:0b.3
88005800-880058ff : mmc0
88100000-8817ffff : 0000:00:02.1
88180000-88183fff : 0000:00:1b.0
88180000-88183fff : ICH HD audio
88184000-88184fff : Intel Flush Page
8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04

where devices behind the transparent bridge (PCI Bus 0000:03) are located
_before_ ICH HD audio in the memory address space, and this one appears to
work. So there _may_ be an effect of the layout too.

Post by Linus Torvalds
So some kind of resume-time ordering or timing issue does seem like the
most likely thing. But that still leaves us not knowing what the real
_root_ cause of this all is - very irritating. Even if not allocating the
unnecessary bridging windows "fixes" things, it would be really really
good to know exactly what it is that causes problems.

Well, given that both affected boxes have the same chipset (945GM), I seriously
suspect a nastiness in that chipset we're not aware of. Especially that
the problem is not reproducible without snd_hda_intel (at least on my box).

Post by Linus Torvalds

Post by Frans Pop
A) unpatched
B) with the revert/debug patch
C) with the oneliner "ignore transparent bridges" patch
AFAICT all results are probably as expected.
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge
- for A)
I/O behind bridge: 00003000-00003fff
Memory behind bridge: e0100000-e03fffff
Prefetchable memory behind bridge: 0000000080000000-0000000083ffffff
- for B)
I/O behind bridge: 00003000-00003fff
Memory behind bridge: e0100000-e03fffff
- for C)
Memory behind bridge: e0100000-e03fffff

And this all makes total sense. The e0100000-e03fffff MMIO bridge range is
apparently set up by the firmware, which is why it shows up in all cases.
And the (A) case has that prefetchable memory range, because that's the
only case that finds - and cares about - the prefetch window for the
CardBus controller.
And both (A) and (B) have the IO bridging window, because regardless of
whether we see a valid CardBus prefetchable memory window with good
alignment, we'll always see the IO ports, so we'll try to allocate that
bridging window, except in (C) when we decide that due to the transparent
nature, we simply don't care.
So the PCI resources make sense in all three cases, and we understand
those. The differences in the actual Cardbus ranges also all make sense.
So it all still boils down to the PCI layer doing everything right in
_all_ cases, just making slightly different - but all valid - choices
depending on essentially random details (eg the revert/debug patch case
the "random detail" is just enabling a small incorrect alignment).
IOW, it really doesn't look like a PCI resource allocator bug. Quite the
reverse, I'd say that in the end this whole thread points out just how
robust the whole PCI and cardbus resource allocation is, with the code
really very gracefully just adjusting in a sane manner to all these
different cases.
Of course, none of that helps us with any kind of idea of what the real
problem is. Device ordering bug in setting up PCI resources at resume?
Perhaps just a plain bug in PCI bridge resume code (even when you resume
things in the right order)?
And I still worry that perhaps it's just a timing bug, where having a PCI
bridging window changes timing of various PCI accesses, and the _real_ bug
is actually in the sound card or ethernet driver resume, which happens to
work with one timing and not with another.
Since it's apparently STR, has anybody gotten _anything_ sane out of
trying to enable PM_TRACE_RTC, and then doing that
echo 1 > /sys/power/pm_trace
because even with the (very limited) set of standard trace-points, it
should still be able to tell which device we were trying to resume last in
the failure case. Maybe that gives some hint?

Well, I think more fine-grained debugging will be necessary.

I've already checked the resume ordering of PCI devices on my box and it
is the following:

pci:0000:00:00.0
pci:0000:00:02.0 <- graphics
pci:0000:00:02.1 <- graphics
pci:0000:00:1b.0 <- snd_hda_intel
pci:0000:00:1c.0 <- PCI Express port 1
pci:0000:00:1c.2 <- PCI Express port 3
pci:0000:00:1d.0 <- USB UHCI
pci:0000:00:1d.1 <- USB UHCI
pci:0000:00:1d.2 <- USB UHCI
pci:0000:00:1d.3 <- USB UHCI
pci:0000:00:1d.7 <- USB EHCI
pci:0000:00:1e.0 <- transparent bridge (Intel Corporation 82801 Mobile PCI Bridge)
pci:0000:00:1f.0 <- ISA bridge
pci:0000:00:1f.2 <- SATA (ahci)
pci:0000:01:00.0 <- e1000e
No Bus:0000:01
pci:0000:02:00.0 <- wireless (iwlagn)
No Bus:0000:02
pci:0000:03:0b.0 <- cardbus bridge
pci:0000:03:0b.1 <- FireWire
pci:0000:03:0b.3 <- SD Host controller (Texas Instruments)
No Bus:0000:04
No Bus:0000:03

So, snd_hda_intel resumes before all of the bridges and the layout of devices
_behind_ the transparent bridge shouldn't affect it at all.

Thanks,
Rafael

Linus Torvalds

2008-12-04 23:22:30 UTC

Post by Rafael J. Wysocki
That said the "don't bother allocating bridging windows for transparent
88000000-8807ffff : 0000:00:02.1
88080000-88083fff : 0000:00:1b.0
88080000-88083fff : ICH HD audio
88084000-88087fff : 0000:03:0b.1
88088000-88088fff : 0000:03:0b.0
88088000-88088fff : yenta_socket
88089000-880897ff : 0000:03:0b.1
88089000-880897ff : firewire_ohci
88089800-880898ff : 0000:03:0b.3
88089800-880898ff : mmc0
8808a000-8808afff : Intel Flush Page
8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04
while my "don't allocate bridging windows for cardbus bridges behind
transparent bridges" patch I've just sent (appended for easier reference)
88000000-880fffff : PCI Bus 0000:03
88000000-88003fff : 0000:03:0b.1
88004000-88004fff : 0000:03:0b.0
88004000-88004fff : yenta_socket
88005000-880057ff : 0000:03:0b.1
88005000-880057ff : firewire_ohci
88005800-880058ff : 0000:03:0b.3
88005800-880058ff : mmc0
88100000-8817ffff : 0000:00:02.1
88180000-88183fff : 0000:00:1b.0
88180000-88183fff : ICH HD audio
88184000-88184fff : Intel Flush Page
8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04

Well, this happens because in the second case, you still will allocate a
non-prefetchable window due to the _other_ devices behind the bridge (ie
the firewire and mmc devices).

So with your "ignore cardbus device resources", you still do end up with a
bridge window allocated, but now it will depend entirely on what other
devices exist on that PCI bus. Which is why I really dislike that patch,
because it really makes no sense at all.

And once you've allocated the PCI bridge window, then the alignment
requirements are that you end up having at least 1MB of window, and now
because a window exists, and will fit, it will then happily put the yenta
control mappings into that window even though it didn't size it for them.

See?
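
The "at least 1MB of window" point can be checked with a few lines of arithmetic. This is a sketch assuming the usual 1 MB granularity of PCI-to-PCI bridge memory windows; the macro and function names are illustrative only, and the resource sizes are read off the /proc/iomem listing quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* PCI-to-PCI bridge memory windows are programmed in 1 MB units. */
#define BRIDGE_MEM_GRANULE 0x100000ull

/* Round the space needed behind the bridge up to a whole number of granules. */
static uint64_t bridge_window_size(uint64_t bytes_needed)
{
    assert(bytes_needed > 0);
    return (bytes_needed + BRIDGE_MEM_GRANULE - 1) & ~(BRIDGE_MEM_GRANULE - 1);
}
```

The firewire and mmc resources the window is sized for add up to 0x4000 + 0x800 + 0x100 = 0x4900 bytes, which rounds up to 0x100000. That matches the 88000000-880fffff window in the listing, and leaves plenty of room for the 0x1000-byte yenta mapping that was never counted.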

Also, notice how it doesn't put the actual cardbus bridge windows
themselves into that PCI bridge window, because they won't fit. So these
resources:

8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04

are outside the window, even though that bus is topologically inside the
same bus, and they both end up depending on the fact that the PCI bus
controller is transparent.

So the above results are fairly easy to explain.

The _ordering_ difference comes simply from the allocation order. We'll do
bus allocations first (if we do them), which is why _if_ we allocate a
window for that PCI bus 03 (ie the one that is bridged to by 0:1e.0, that
transparent bridge), then we'll end up allocating it first. So that's why
the PCI bridge window shows up at 0x88000000, if it shows up at all
(because that's the starting address for PCI allocations).

Then, we'll do the regular devices in the order we found them, so then
we'll allocate the resources for device 0:02.1 and then 0:1b.0. Now, _if_
we did a bus window allocation, they'll end up being after the bus window
we allocated. Otherwise they'll end up being the first allocations.

So then, by the time we actually get to bus#3, and devices 3:0b.*, where
they end up is going to depend on whether we allocated that bus window (in
which case they'll preferentially get allocated inside the window - ie
starting at 0x88000000), or they'll just get allocated inside the parent
(root) PCI bus. In the latter case they'll be allocated after the 0:02.1
etc devices.

So the differences in ordering really do make sense, and are a direct
consequence of whether we decided to need to allocate a bridging window
for that PCI-PCI bridge at 0:1e.0.
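
The ordering argument above can be reproduced with a toy allocator. This is a sketch and not the kernel's resource code: it is a plain bump allocator that aligns each block to its own size, fed the sizes and the order visible in the two /proc/iomem listings. All names are invented for the example, and the "Intel Flush Page" entry is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_ALLOC_START 0x88000000u /* first PCI allocation on this box, per the thread */

/* Place a power-of-two sized block at the next address aligned to its size. */
static uint32_t bump_alloc(uint32_t *cursor, uint32_t size)
{
    uint32_t base;

    assert(size != 0 && (size & (size - 1)) == 0);
    base = (*cursor + size - 1) & ~(size - 1);
    *cursor = base + size;
    return base;
}

struct toy_layout {
    uint32_t window;    /* bridge window base, 0 if none was allocated */
    uint32_t gfx;       /* 0000:00:02.1, 512 KB */
    uint32_t hda;       /* 0000:00:1b.0, 16 KB */
    uint32_t behind[4]; /* 03:0b.1 16 KB, 03:0b.0 4 KB, 03:0b.1 2 KB, 03:0b.3 256 B */
};

static const uint32_t behind_sizes[4] = { 0x4000, 0x1000, 0x800, 0x100 };

static struct toy_layout toy_allocate(int with_bridge_window)
{
    struct toy_layout l = { 0 };
    uint32_t root = PCI_ALLOC_START;
    uint32_t inner = 0;
    uint32_t *cur;

    /* Bus windows go first, if there are any. */
    if (with_bridge_window) {
        l.window = bump_alloc(&root, 0x100000);
        inner = l.window;
    }

    /* Then the root-bus devices, in discovery order. */
    l.gfx = bump_alloc(&root, 0x80000);
    l.hda = bump_alloc(&root, 0x4000);

    /* Devices on bus 3 land inside the window if one exists, else on the root bus. */
    cur = with_bridge_window ? &inner : &root;
    for (int i = 0; i < 4; i++)
        l.behind[i] = bump_alloc(cur, behind_sizes[i]);
    return l;
}
```

Calling toy_allocate(0) gives the first listing (graphics at 88000000, HD audio at 88080000, bus 3 devices from 88084000), and toy_allocate(1) gives the second (window at 88000000 with the bus 3 devices inside it, graphics pushed to 88100000, HD audio to 88180000).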

Also note that _if_ PCI bus #3 had had other devices with prefetchable
memory resources, then we'd have allocated a prefetchable window for
those even with your patch, and then we'd have been back to the original
allocation again (although sizing might cause changes, since your patch
will make us ignore the cardbus controller for sizing).

Post by Rafael J. Wysocki
where devices behind the transparent bridge (PCI Bus 0000:03) are located
_before_ ICH HD audio in the memory address space, and this one appears to
work. So there _may_ be an effect of the layout too.

I do agree that your patch will affect layout. I just don't think it makes
any real amount of sense because of how it essentially does so "randomly"
depending on what other devices you'd have behind the transparent bridge.

Which is why I think we should either not size transparent bridges at all,
or we should size _all_ the devices behind them.

Oh, btw, I thought your and Frans' laptops were identical? They don't seem
to be: in Frans' case, the firmware does seem to set up one memory window,
so he gets that one even with my "don't size anything" patch, just because
we'll still honor allocations done by firmware.

Post by Rafael J. Wysocki
Well, given that both affected boxes have the same chipset (945GM), I seriously
suspect a nastiness in that chipset we're not aware of. Especially that
the problem is not reproducible without snd_hda_intel (at least on my box).

Well, the counter-argument to that is that the 945GM should be a _very_
common chipset. So I'd not expect a chipset nastiness.

I'd be more wary about the BIOS doing something odd. Or an ACPI thing.

Post by Rafael J. Wysocki
I've already checked the resume ordering of PCI devices on my box and it
[snip]
So, snd_hda_intel resumes before all of the bridges and the layout of devices
_behind_ the transparent bridge shouldn't affect it at all.

Yes. And in the case of Frans' machine, the e1000e controller was before
all the bridges too.

Linus

Rafael J. Wysocki

2008-12-04 23:45:40 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
That said the "don't bother allocating bridging windows for transparent
88000000-8807ffff : 0000:00:02.1
88080000-88083fff : 0000:00:1b.0
88080000-88083fff : ICH HD audio
88084000-88087fff : 0000:03:0b.1
88088000-88088fff : 0000:03:0b.0
88088000-88088fff : yenta_socket
88089000-880897ff : 0000:03:0b.1
88089000-880897ff : firewire_ohci
88089800-880898ff : 0000:03:0b.3
88089800-880898ff : mmc0
8808a000-8808afff : Intel Flush Page
8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04
while my "don't allocate bridging windows for cardbus bridges behind
transparent bridges" patch I've just sent (appended for easier reference)
88000000-880fffff : PCI Bus 0000:03
88000000-88003fff : 0000:03:0b.1
88004000-88004fff : 0000:03:0b.0
88004000-88004fff : yenta_socket
88005000-880057ff : 0000:03:0b.1
88005000-880057ff : firewire_ohci
88005800-880058ff : 0000:03:0b.3
88005800-880058ff : mmc0
88100000-8817ffff : 0000:00:02.1
88180000-88183fff : 0000:00:1b.0
88180000-88183fff : ICH HD audio
88184000-88184fff : Intel Flush Page
8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04

Well, this happens because in the second case, you still will allocate a
non-prefetchable window due to the _other_ devices behind the bridge (ie
the firewire and mmc devices).
So with your "ignore cardbus device resources", you still do end up with a
bridge window allocated, but now it will depend entirely on what other
devices exist on that PCI bus. Which is why I really dislike that patch,
because it really makes no sense at all.
And once you've allocated the PCI bridge window, then the alignment
requirements are that you end up having at least 1MB of window, and now
because a window exists, and will fit, it will then happily put the yenta
control mappings into that window even though it didn't size it for them.
See?
Also, notice how it doesn't put the actual cardbus bridge windows
themselves into that PCI bridge window, because they won't fit. So these
8c000000-8fffffff : PCI CardBus 0000:04
90000000-93ffffff : PCI CardBus 0000:04
are outside the window, even though that bus is topologically inside the
same bus, and they both end up depending on the fact that the PCI bus
controller is transparent.
So the above results are fairly easy to explain.
The _ordering_ difference comes simply from the allocation order. We'll do
bus allocations first (if we do them), which is why _if_ we allocate a
window for that PCI bus 03 (ie the one that is bridged to by 0:1e.0, that
transparent bridge), then we'll end up allocating it first. So that's why
the PCI bridge window shows up at 0x88000000, if it shows up at all
(because that's the starting address for PCI allocations).
Then, we'll do the regular devices in the order we found them, so then
we'll allocate the resources for device 0:02.1 and then 0:1b.0. Now, _if_
we did a bus window allocation, they'll end up being after the bus window
we allocated. Otherwise they'll end up being the first allocations.
So then, by the time we actually get to bus#3, and devices 3:0b.*, where
they end up is going to depend on whether we allocated that bus window (in
which case they'll preferentially get allocated inside the window - ie
starting at 0x88000000), or they'll just get allocated inside the parent
(root) PCI bus. In the latter case they'll be allocated after the 0:02.1
etc devices.
So the differences in ordering really do make sense, and are a direct
consequence of whether we decided to need to allocate a bridging window
for that PCI-PCI bridge at 0:1e.0.
Also note that _if_ PCI bus #3 had had other devices with prefetchable
memory resources, then we'd have allocated a prefetchable window for
those even with your patch, and then we'd have been back to the original
allocation again (although sizing might cause changes, since your patch
will make us ignore the cardbus controller for sizing).

Yes, I do realize all of what you said above. I only wanted to note that the
layouts of device's memory ranges are different with both patches.

Post by Linus Torvalds

Post by Rafael J. Wysocki
where devices behind the transparent bridge (PCI Bus 0000:03) are located
_before_ ICH HD audio in the memory address space, and this one appears to
work. So there _may_ be an effect of the layout too.

I do agree that your patch will affect layout. I just don't think it makes
any real amount of sense because of how it essentially does so "randomly"
depending on what other devices you'd have behind the transparent bridge.
Which is why I think we should either not size transparent bridges at all,
or we should size _all_ the devices behind them.

I'm not saying it's unreasonable to do this in general. Still, on this
particular box it appears to break things in a very subtle way.

Also, I do realize that most probably the root cause of that is something
else, but we can make it show up or hide by changing the layout of
the memory space. What it is, though, I don't know.

Post by Linus Torvalds
Oh, btw, I thought your and Frans' laptops were identical?

No, they are from different vendors. :-) Mine is a Toshiba one and Frans'
one is from HP.

Post by Linus Torvalds
They don't seem to be: in Frans' case, the firmware does seem to set up one
memory window, so he gets that one even with my "don't size anything" patch,
just because we'll still honor allocations done by firmware.

Post by Rafael J. Wysocki
Well, given that both affected boxes have the same chipset (945GM), I seriously
suspect a nastiness in that chipset we're not aware of. Especially that
the problem is not reproducible without snd_hda_intel (at least on my box).

Well, the counter-argument to that is that the 945GM should be a _very_
common chipset. So I'd not expect a chipset nastiness.

Unless people don't report problems with it, because they are so rare that
they are regarded as "random" and "unreproducible".

Also, since the layout of devices in the PCI config space seems to matter,
the problem need not appear on other systems with that chipset at all.

Post by Linus Torvalds
I'd be more wary about the BIOS doing something odd. Or an ACPI thing.

Well, unlikely, _unless_ Toshiba and HP both use the same Intel reference
BIOS or something.

Post by Linus Torvalds

Post by Rafael J. Wysocki
I've already checked the resume ordering of PCI devices on my box and it
[snip]
So, snd_hda_intel resumes before all of the bridges and the layout of devices
_behind_ the transparent bridge shouldn't affect it at all.

Yes. And in the case of Frans' machine, the e1000e controller was before
all the bridges too.

Hm. And unloading it before suspend made things work? Interesting.

Rafael

Linus Torvalds

2008-12-05 00:07:47 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds
Yes. And in the case of Frans' machine, the e1000e controller was before
all the bridges too.

Hm. And unloading it before suspend made things work? Interesting.

Yeah. Frans' workaround was

- unloading e1000e before suspend
- using aggressive powersave setting on snd_hda_intel to ensure that
sound controller was already sleeping before entering suspend

and both of those devices are on the root PCI bus and are enumerated (and
thus resumed) before the transparent bridge.

So yeah, the whole "resource allocation for that bridge" saga should
_really_ not matter. But it clearly does seem to.

Linus

Rafael J. Wysocki

2008-12-05 00:20:01 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki

Post by Linus Torvalds
Yes. And in the case of Frans' machine, the e1000e controller was before
all the bridges too.

Hm. And unloading it before suspend made things work? Interesting.

Yeah. Frans' workaround was
- unloading e1000e before suspend
- using aggressive powersave setting on snd_hda_intel to ensure that
sound controller was already sleeping before entering suspend
and both of those devices are on the root PCI bus and are enumerated (and
thus resumed) before the transparent bridge.
So yeah, the whole "resource allocation for that bridge" saga should
_really_ not matter. But it clearly does seem to.

Well, I'm going to have a closer look at what we're doing to PCI bridges in the
resume code path, as this _feels_ relevant in this case.

Perhaps we're not doing something we're supposed to do (that already happened
for regular devices in the past) or we're doing something we're not supposed
to do. Unfortunately, I'd have to dig into the PCI bridge spec for this
purpose and that will take time. Still, I suspect that's worth doing, as
potentially the problem may affect a wide range of systems.

The fact that I have a box on which I can reproduce the problem should help
here. ;-)

Thanks,
Rafael

Frans Pop

2008-12-05 06:55:20 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds
Oh, btw, I thought your and Frans' laptops were identical?

No, they are from different vendors. :-) Mine is a Toshiba one and
Frans' one is from HP.

Also, the CardBus bridges and what's behind them are quite different
(and we use different firewire stacks).

Mine is:
02:06.0 CardBus bridge [0607]: Ricoh Co Ltd RL5c476 II [1180:0476] (rev ba)
Kernel driver in use: yenta_cardbus
Kernel modules: yenta_socket
02:06.1 FireWire (IEEE 1394) [0c00]: Ricoh Co Ltd R5C832 IEEE 1394 Controller [1180:0832] (rev 04)
Kernel driver in use: ohci1394
Kernel modules: ohci1394
02:06.2 SD Host controller [0805]: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter [1180:0822] (rev 21)
Kernel driver in use: sdhci-pci
Kernel modules: sdhci-pci
02:06.3 System peripheral [0880]: Ricoh Co Ltd R5C843 MMC Host Controller [1180:0843] (rev ff)
Kernel driver in use: ricoh-mmc
Kernel modules: ricoh_mmc
(The ricoh_mmc module disables the last one.)

While Rafael has:
03:0b.0 CardBus bridge: Texas Instruments PCIxx12 Cardbus Controller
Kernel driver in use: yenta_cardbus
Kernel modules: yenta_socket
03:0b.1 FireWire (IEEE 1394): Texas Instruments PCIxx12 OHCI Compliant IEEE 1394 Host Controller (prog-if 10 [OHCI])
Kernel driver in use: firewire_ohci
Kernel modules: firewire-ohci
03:0b.3 SD Host controller: Texas Instruments PCIxx12 SDA Standard Compliant SD Host Controller (prog-if 01)
Kernel driver in use: sdhci-pci
Kernel modules: sdhci-pci

Rafael J. Wysocki

2008-12-04 22:09:11 UTC

Post by Frans Pop

Well, I think that what _would_ be generally correct, and actually
pretty simple, is a rather different approach: just not sizing things
behind a transparent bridge AT ALL, since it really shouldn't matter.

I've given your patch a try and the few resumes from STR I've done were
all successful. That's not 100% conclusive yet, but a nice start.
Some info from logs etc. below.

It doesn't help on my box, though. I've got a failure to resume from
hibernation on the first attempt.

However, this one appears to work reliably for me (on top of vanilla current
mainline):

--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -350,6 +350,11 @@ static int pbus_size_mem(struct pci_bus

 			if (r->parent || (r->flags & mask) != type)
 				continue;
+
+			if ((dev->class >> 8) == PCI_CLASS_BRIDGE_CARDBUS
+			    && bus->self->transparent)
+				continue;
+
 			r_size = resource_size(r);
 			/* For bridges size != alignment */
 			align = resource_alignment(r);
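
For readers following the condition in the patch above, here is a minimal userspace sketch of the test it performs. It assumes the usual PCI class-code layout (base class, subclass and prog-if packed into 24 bits), which is why the value is shifted right by 8 before the comparison. `struct toy_dev` and `skipped_for_sizing()` are hypothetical stand-ins for illustration, not kernel structures; only the constant value 0x0607 matches the "CardBus bridge [0607]" class shown in the lspci output earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Class/subclass for a CardBus bridge, as shown by lspci ("[0607]"). */
#define PCI_CLASS_BRIDGE_CARDBUS 0x0607

/* Hypothetical stand-in for the two fields the patch looks at. */
struct toy_dev {
	uint32_t class_code;      /* base class, subclass, prog-if */
	bool parent_transparent;  /* models bus->self->transparent */
};

/* True if the patched loop would "continue" past this device. */
static bool skipped_for_sizing(const struct toy_dev *dev)
{
	assert(dev != NULL);
	return (dev->class_code >> 8) == PCI_CLASS_BRIDGE_CARDBUS &&
	       dev->parent_transparent;
}
```

With these assumptions, a device with class code 0x060700 behind a transparent bridge is skipped, while the FireWire function (class 0x0c00, prog-if 10) is still sized as before.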

Post by Frans Pop

Also, I would be happy to actually understand _why_ this happens.

100% agreed. I do _not_ see why it should ever matter how we set up a
PCI bridging window - whether prefetchable or not - on a bridge that
should be transparent. It sounds really odd. I'm wondering if there is
something we're missing here.

The theory that it is really a resume issue and not a device layout issue
sounds logical. Especially as everything always works correctly after a
normal boot.

Well, in fact I'm pretty sure this is the case. By changing memory address
space layout we effectively change conditions during suspend-resume and
apparently we can choose one for which the failure condition doesn't trigger
(or, IOW, the probability of it is _so_ small that we just can't see it).

There seems to be a race of some kind or a missing delay or something similar.

Thanks,
Rafael

Linus Torvalds

2008-12-04 22:20:21 UTC

Post by Rafael J. Wysocki
However, this one appears to work reliably for me (on top of vanilla current

Not very interesting. It just does the same thing your previous patches
have done - ignores the cardbus slot for sizing. It just does it
differently and more explicitly.

Your original patch did it by simply giving the resources invalid
alignments (in a very non-obvious way). This one does it by being explicit
and saying "we won't care about cardbus resources behind transparent
bridges". But it's still a very hacky thing, and thus not really
interesting at all as a patch.

IOW, it's not a patch that makes sense - it's just a patch that ON YOUR
PARTICULAR MACHINE causes us to get the layout you want in order to hide
the bug. And it doesn't really even do anything new - it's just doing the
same thing in an old way.

But it's interesting that the "don't size _anything_ behind a transparent
bridge" apparently made no difference for you.

Can you send "lspci -vv" and "dmesg" output for that kernel? Even if it
failed the suspend/resume, it's interesting, because I would actually have
expected that one to have the same layout as the successful ones.

Linus

Rafael J. Wysocki

2008-12-04 23:31:26 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
However, this one appears to work reliably for me (on top of vanilla current

Not very interesting. It just does the same thing your previous patches
have done - ignores the cardbus slot for sizing. It just does it
differently and more explicitly.

There's a difference, though. It doesn't cause the resources flags to be
cleared for the cardbus bridge and the cardbus bridge gets the correct sizes
of both prefetchable and non-prefetchable windows (64 MB).

Post by Linus Torvalds
Your original patch did it by simply giving the resources invalid
alignments (in a very non-obvious way). This one does it by being explicit
and saying "we won't care about cardbus resources behind transparent
bridges". But it's still a very hacky thing, and thus not really
interesting at all as a patch.
IOW, it's not a patch that makes sense - it's just a patch that ON YOUR
PARTICULAR MACHINE causes us to get the layout you want in order to hide
the bug.

I know that. :-) Still, I find it important to notice that the memory windows
of the cardbus bridge can be 64 MB-wide and things work in that case too.
Also, I like it more than the previous patch. ;-)

Moreover, I _think_ it would work for Frans too, because I _suspect_ the
problem is related to a cardbus bridge being located behind that "transparent"
thing somehow.

Post by Linus Torvalds
And it doesn't really even do anything new - it's just doing the
same thing in an old way.
But it's interesting that the "don't size _anything_ behind a transparent
bridge" apparently made no difference for you.
Can you send "lspci -vv" and "dmesg" output for that kernel?

No prob, both attached along with the contents of /proc/iomem .

Post by Linus Torvalds
Even if it failed the suspend/resume, it's interesting, because I would
actually have expected that one to have the same layout as the successful
ones.

Well, not exactly. Actually, with this patch the graphics and "ICH HD audio"
get their memory ranges before the ranges of _all_ devices behind the
transparent bridge, while in the "working" case their memory ranges are located
_after_ the memory ranges of devices behind the transparent bridge _except_
for the cardbus bridge's memory windows.

Thanks,
Rafael

Linus Torvalds

2008-12-05 00:03:03 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds
Not very interesting. It just does the same thing your previous patches
have done - ignores the cardbus slot for sizing. It just does it
differently and more explicitly.

There's a difference, though. It doesn't cause the resources flags to be
cleared for the cardbus bridge and the cardbus bridge gets the correct sizes
of both prefetchable and non-prefetchable windows (64 MB).

Yes, true. In that sense, it minimizes the differences between the
"working" and "nonworking" case.

Which is interesting in the sense that it makes it even less likely that
it's an actual resource clash.

Post by Rafael J. Wysocki
I know that. :-) Still, I find it important to notice that the memory windows
of the cardbus bridge can be 64 MB-wide and things work in that case too.
Also, I like it more than the previous patch. ;-)
Moreover, I _think_ it would work for Frans too, because I _suspect_ the
problem is related to a cardbus bridge being located behind that "transparent"
thing somehow.

Well, I suspect that on Frans' machine, there will be no difference at all
between your patch and my previous patch, since he already had a bridge
window allocated by the BIOS. And he also had just that firewire and mmc
thing that only needed that memory window, so he'd end up with the exact
same resource allocation, methinks.

Post by Rafael J. Wysocki

Post by Linus Torvalds
Can you send "lspci -vv" and "dmesg" output for that kernel?

No prob, both attached along with the contents of /proc/iomem .

It all looks very sane. Too bad it apparently doesn't work.

Post by Rafael J. Wysocki

Post by Linus Torvalds
Even if it failed the suspend/resume, it's interesting, because I would
actually have expected that one to have the same layout as the successful
ones.

Well, not exactly. Actually, with this patch the graphics and "ICH HD audio"
get their memory ranges before the ranges of _all_ devices behind the
transparent bridge, while in the "working" case their memory ranges are located
_after_ the memory ranges of devices behind the transparent bridge _except_
for the cardbus bridge's memory windows.

Yes, however, it shouldn't matter.

Except in case the audio driver (for example) were to access past its MMIO
window, and we'd have a situation where we care what was just before it or
after it. That doesn't seem very likely, though.

Linus

Linus Torvalds

2008-12-05 00:45:47 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
There's a difference, though. It doesn't cause the resources flags to be
cleared for the cardbus bridge and the cardbus bridge gets the correct sizes
of both prefetchable and non-prefetchable windows (64 MB).

Yes, true. In that sense, it minimizes the differences between the
"working" and "nonworking" case.

Hmm.

One other issue: we've been looking mostly at MMIO, but another thing that
differs here is the PIO part.

Your patch only changes pbus_size_mem(), so what happens is that it avoids
allocating the prefetch window. But it still allocates the PIO window,
because pbus_size_io() is still run.

Maybe the PIO window matters? Any magic suspend registers are usually in
PIO space, not in MMIO space. Did /proc/ioports change, and if so, how?

Linus

Rafael J. Wysocki

2008-12-05 01:08:33 UTC

Post by Linus Torvalds

Post by Linus Torvalds

Post by Rafael J. Wysocki
There's a difference, though. It doesn't cause the resources flags to be
cleared for the cardbus bridge and the cardbus bridge gets the correct sizes
of both prefetchable and non-prefetchable windows (64 MB).

Yes, true. In that sense, it minimizes the differences between the
"working" and "nonworking" case.

Hmm.
One other issue: we've been looking mostly at MMIO, but another thing that
differs here is the PIO part.
Your patch only changes pbus_size_mem(), so what happens is that it avoids
allocating the prefetch window. But it still allocates the PIO window,
because pbus_size_io() is still run.
Maybe the PIO window matters? Any magic suspend registers are usually in
PIO space, not in MMIO space. Did /proc/ioports change, and if so, how?

|18,20c18,19
|< 1000-1fff : PCI Bus 0000:03
|< 1000-10ff : PCI CardBus 0000:04
|< 1400-14ff : PCI CardBus 0000:04
|---
|> 1000-10ff : PCI CardBus 0000:04
|> 1400-14ff : PCI CardBus 0000:04

where the first one is with my patch and the second one is with the "no sizing
for transparent bridges" patch. No difference to my eyes, if the "transparent"
bridge is really transparent. :-)

Thanks,
Rafael

Linus Torvalds

2008-12-05 01:45:12 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds
Maybe the PIO window matters? Any magic suspend registers are usually in
PIO space, not in MMIO space. Did /proc/ioports change, and if so, how?

|18,20c18,19
|< 1000-1fff : PCI Bus 0000:03
|< 1000-10ff : PCI CardBus 0000:04
|< 1400-14ff : PCI CardBus 0000:04
|---
|> 1000-10ff : PCI CardBus 0000:04
|> 1400-14ff : PCI CardBus 0000:04
where the first one is with my patch and the second one is with the "no sizing
for transparent bridges" patch. No difference to my eyes, if the "transparent"
bridge is really transparent. :-)

Well, there _is_ a difference, although a subtle one. Not for the Cardbus
card itself, but for decode of _non-cardbus_ IO ranges.

IOW, if there is a secret magic IO port that we don't know about at (say)
address 0x1100, then the difference is that now there would be a fight
over who would take it.
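
A small sketch may make the difference concrete. It takes the two window layouts from the /proc/ioports diff quoted above and asks whether any listed window positively covers a given port. The port 0x1100 is the hypothetical "secret" address from the paragraph above; `struct io_window` and `window_claims()` are illustrative names, and real bridge decode has more cases than this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One I/O window, both ends inclusive. */
struct io_window {
	uint16_t start;
	uint16_t end;
};

/* True if any of the n windows covers the given port. */
static bool window_claims(const struct io_window *w, size_t n, uint16_t port)
{
	assert(n == 0 || w != NULL);
	for (size_t i = 0; i < n; i++) {
		if (port >= w[i].start && port <= w[i].end)
			return true;
	}
	return false;
}
```

Under this model, the layout that has the "1000-1fff : PCI Bus 0000:03" window covers port 0x1100, so the bridge and a hidden device at that port would both respond. The layout with only the two CardBus windows (1000-10ff and 1400-14ff) does not cover it.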

And I think I actually have found some secret IO decoding in ICH7. Damn. I
_hate_ it when Intel does that. I asked (long ago) that Intel double-check
that we have quirks for all their idiotic magic IO addresses, but they
clearly never did that.

Or maybe I'm reading the ICH7 docs wrong. But I don't think I am.

NOTE NOTE NOTE! I didn't check which all LPC bridges there are out there
that have these magic registers. But it shows up in the ICH7 docs. It
might exist in ICH[5-9] for all I know. But at least for ICH7, the only
LPC bridge ID's I find in the spec update are 27b8, 27b9 and 27bd, which
are those three devices that I list in the quirks.

Somebody should double-check this, and also check whether ICH6-9 have the
same LPC decode logic..

Anyway, does this cause anything to be printed out for you? I have _not_
checked whether this is the only such programmable hidden range register
set.

That said, the "no IO window" case should be the safe one (if there are
hidden IO ports that clash), and it's the one that does _not_ work for
you, so this may not be it either. But these kinds of quirks were very
common reasons for ACPI power states not working (because we would just
allocate something else on top of the damn hidden things that we didn't
know about).

Linus

---
drivers/pci/quirks.c | 36 ++++++++++++++++++++++++++++++++++++
1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 5f4f85f..355ccf2 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -474,6 +474,42 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_4, quirk_
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_7, quirk_ich6_lpc_acpi);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_8, quirk_ich6_lpc_acpi);

+static void __devinit ich7_lpc_generic_decode(struct pci_dev *dev, unsigned reg, const char *name)
+{
+ u32 val;
+ u32 mask, base;
+
+ pci_read_config_dword(dev, reg, &val);
+
+ /* Enabled? */
+ if (!(val & 1))
+ return;
+
+ /* Base in bits 15:2 */
+ base = val & 0xfffc;
+
+ /* Decode mask in bits 23:18 */
+ mask = (val >> 16) & 0xfc;
+
+ /* The mask is only on a dword address, the word/byte is always matched */
+ mask |= 3;
+
+ /* Just print it out for now. We should reserve it after debugging */
+ dev_info(&dev->dev, "%s PIO at %04x (mask %04x)\n", name, base, mask);
+}
+
+static void __devinit quirk_ich7_lpc_decode(struct pci_dev *dev)
+{
+ ich7_lpc_generic_decode(dev, 0x84, "ICH7 LPC Generic IO decode 1");
+ ich7_lpc_generic_decode(dev, 0x88, "ICH7 LPC Generic IO decode 2");
+ ich7_lpc_generic_decode(dev, 0x8c, "ICH7 LPC Generic IO decode 3");
+ ich7_lpc_generic_decode(dev, 0x90, "ICH7 LPC Generic IO decode 4");
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_0, quirk_ich7_lpc_decode);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1, quirk_ich7_lpc_decode);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31, quirk_ich7_lpc_decode);
+
+
/*
* VIA ACPI: One IO region pointed to by longword at
* 0x48 or 0x20 (256 bytes of ACPI registers)
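
The bit manipulation in ich7_lpc_generic_decode() can be tried outside the kernel. The sketch below repeats the same field extraction on a raw register value and additionally turns base and mask into a port range, on the assumption that a set mask bit means "this address bit is ignored when matching". That reading is an assumption, not something the patch states. `struct lpc_decode` and `lpc_generic_decode()` are hypothetical names, and the register value in the usage note is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of one LPC generic IO decode register. */
struct lpc_decode {
	uint16_t base;   /* bits 15:2 of the register */
	uint16_t mask;   /* bits 23:18, plus the two low address bits */
	uint16_t first;  /* first port matched (assumed interpretation) */
	uint16_t last;   /* last port matched (assumed interpretation) */
};

/* Returns false if the range is disabled (bit 0 clear). */
static bool lpc_generic_decode(uint32_t val, struct lpc_decode *out)
{
	assert(out != NULL);
	if (!(val & 1))
		return false;

	out->base = (uint16_t)(val & 0xfffc);
	out->mask = (uint16_t)(((val >> 16) & 0xfc) | 3);

	/* Assumption: masked address bits are "don't care" bits. */
	out->first = (uint16_t)(out->base & ~out->mask);
	out->last = (uint16_t)(out->base | out->mask);
	return true;
}
```

For a made-up register value of 0x007c0681, this yields base 0680 and mask 007f, i.e. ports 0680-06ff under the stated assumption.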

Linus Torvalds

2008-12-05 02:55:18 UTC

Post by Linus Torvalds
NOTE NOTE NOTE! I didn't check which all LPC bridges there are out there
that have these magic registers. But it shows up in the ICH7 docs. It
might exist in ICH[5-9] for all I know. But at least for ICH7, the only
LPC bridge ID's I find in the spec update are 27b8, 27b9 and 27bd, which
are those three devices that I list in the quirks.

Ok, the ICH6 LPC side has something similar, but not the same. Just two
ranges, and slightly less flexible wrt sizing.

And ICH8/9/10 seems to have the same thing as ICH7.

And looking at my own machine (ICH10) it actually appears like my ICH10
setup has two of the magic IO ranges that it decodes, and only one of them
is covered by the BIOS PnP tables.

They're both below 0x1000, though, so Linux shouldn't ever allocate
anything on top of them. And that's the norm - when firmware sets up
hidden magic system IO ranges, they do tend to be low IO ports.

But if some incompetent firmware person screws that up...

Linus

Linus Torvalds

2008-12-05 03:25:56 UTC

Post by Linus Torvalds
Ok, the ICH6 LPC side has something similar, but not the same. Just two
ranges, and slightly less flexible wrt sizing.
And ICH8/9/10 seems to have the same thing as ICH7.

Here's a patch that implements what I think is the correct quirks (apart
from the commented ICH6 lazy detail I didn't do).

It would be very interesting to see if people affected get any printouts
about IO decodes that don't show up in /proc/ioports...

And I know I've looked for these kinds of things before in the Intel ICH
docs, and apparently always missed these things (or been too lazy to
react), so can somebody else see if they can find any other ranges like
this? Maybe in non-LPC controllers?

Jesse, are there any Intel chipset people who could once and for all say
"these are the things we decode in our chipset" for _all_ chipsets and
_all_ dynamic ranges? I've asked for that before. There must be people who
know this, without having to wade through many thousands of pages of
boring datasheets?

The ICH datasheets tend to be 850 pages each, and there is more than one
of them. And they _do_ differ in the details, even if there is a lot of
sharing going on. So reading the docs is a huge effort, when there's bound
to be somebody who just knows the answer.

NOTE! This patch will just add a _printout_ of the IO regions it finds. It
won't actually register them as known resources. So it won't make the
kernel know to avoid them if they were to clash!

Also, see the "This is not correct" for the ICH6 dynamically sized case.

Linus

---
drivers/pci/quirks.c | 105 ++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 90 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 5f4f85f..1b64b28 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -449,7 +449,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801DB_12,
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801EB_0, quirk_ich4_lpc_acpi);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ESB_1, quirk_ich4_lpc_acpi);

-static void __devinit quirk_ich6_lpc_acpi(struct pci_dev *dev)
+static void __devinit ich6_lpc_acpi_gpio(struct pci_dev *dev)
{
u32 region;

@@ -459,20 +459,95 @@ static void __devinit quirk_ich6_lpc_acpi(struct pci_dev *dev)
pci_read_config_dword(dev, 0x48, &region);
quirk_io_region(dev, region, 64, PCI_BRIDGE_RESOURCES+1, "ICH6 GPIO");
}
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_0, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_1, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_0, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_0, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_2, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_3, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_1, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_4, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_2, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_4, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_7, quirk_ich6_lpc_acpi);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_8, quirk_ich6_lpc_acpi);
+
+static void __devinit ich6_lpc_generic_decode(struct pci_dev *dev, unsigned reg, const char *name, int dynsize)
+{
+ u32 val;
+ u32 size, base;
+
+ pci_read_config_dword(dev, reg, &val);
+
+ /* Enabled? */
+ if (!(val & 1))
+ return;
+ base = val & 0xfffc;
+ if (dynsize) {
+ /*
+ * This is not correct. It is 16, 32 or 64 bytes depending on
+ * register D31:F0:ADh bits 5:4.
+ *
+ * But this gets us at least _part_ of it.
+ */
+ size = 16;
+ } else {
+ size = 128;
+ }
+ base &= ~(size-1);
+
+ /* Just print it out for now. We should reserve it after more debugging */
+ dev_info(&dev->dev, "%s PIO at %04x-%04x\n", name, base, base+size-1);
+}
+
+static void __devinit quirk_ich6_lpc(struct pci_dev *dev)
+{
+ /* Shared ACPI/GPIO decode with all ICH6+ */
+ ich6_lpc_acpi_gpio(dev);
+
+ /* ICH6-specific generic IO decode */
+ ich6_lpc_generic_decode(dev, 0x84, "LPC Generic IO decode 1", 0);
+ ich6_lpc_generic_decode(dev, 0x88, "LPC Generic IO decode 2", 1);
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_0, quirk_ich6_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_1, quirk_ich6_lpc);
+
+static void __devinit ich7_lpc_generic_decode(struct pci_dev *dev, unsigned reg, const char *name)
+{
+ u32 val;
+ u32 mask, base;
+
+ pci_read_config_dword(dev, reg, &val);
+
+ /* Enabled? */
+ if (!(val & 1))
+ return;
+
+ /*
+ * IO base in bits 15:2, mask in bits 23:18, both
+ * are dword-based
+ */
+ base = val & 0xfffc;
+ mask = (val >> 16) & 0xfc;
+ mask |= 3;
+
+ /* Just print it out for now. We should reserve it after more debugging */
+ dev_info(&dev->dev, "%s PIO at %04x (mask %04x)\n", name, base, mask);
+}
+
+/* ICH7-10 has the same common LPC generic IO decode registers */
+static void __devinit quirk_ich7_lpc(struct pci_dev *dev)
+{
+ /* We share the common ACPI/GPIO decode with ICH6 */
+ ich6_lpc_acpi_gpio(dev);
+
+ /* And have 4 ICH7+ generic decodes */
+ ich7_lpc_generic_decode(dev, 0x84, "ICH7 LPC Generic IO decode 1");
+ ich7_lpc_generic_decode(dev, 0x88, "ICH7 LPC Generic IO decode 2");
+ ich7_lpc_generic_decode(dev, 0x8c, "ICH7 LPC Generic IO decode 3");
+ ich7_lpc_generic_decode(dev, 0x90, "ICH7 LPC Generic IO decode 4");
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_0, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_0, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_2, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_3, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_1, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_4, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_2, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_4, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_7, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_8, quirk_ich7_lpc);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH10_1, quirk_ich7_lpc);

/*
* VIA ACPI: One IO region pointed to by longword at

Frans Pop

2008-12-05 06:44:54 UTC

Post by Linus Torvalds
Here's a patch that implements what I think is the correct quirks
(apart from the commented ICH6 lazy detail I didn't do).

I get:
pci 0000:00:1f.0: quirk: region 1000-107f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region 1100-113f claimed by ICH6 GPIO
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at 0500 (mask 007f)
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 4 PIO at 02e8 (mask 0007)

The ICH6 and ICH7 in those messages is a bit weird given that my system is
ICH8...

Hardware info (JFYI):
00:1f.0 ISA bridge [0601]: Intel Corporation 82801HBM (ICH8M-E) LPC
Interface Controller [8086:2811] (rev 03)

System Information
Manufacturer: Hewlett-Packard
Product Name: HP Compaq 2510p Notebook PC
Version: F.0C
Base Board Information
Manufacturer: Hewlett-Packard
Product Name: 30C9
Version: KBC Version 75.28
BIOS Information
Vendor: Hewlett-Packard
Version: 68MSP Ver. F.0C
Release Date: 06/18/2008

Frans Pop

2008-12-05 08:27:34 UTC

Post by Frans Pop

Post by Linus Torvalds
Here's a patch that implements what I think is the correct quirks
(apart from the commented ICH6 lazy detail I didn't do).

pci 0000:00:1f.0: quirk: region 1000-107f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region 1100-113f claimed by ICH6 GPIO
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at 0500 (mask 007f)
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 4 PIO at 02e8 (mask 0007)
It would be very interesting to see if people affected get any
printouts about IO decodes that don't show up in /proc/ioports...

Looks like 02e8 is missing (see /proc/ioports below).

I also tried the patch on my ICH7 desktop which gives:
pci 0000:00:1f.0: quirk: region 0400-047f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region 0500-053f claimed by ICH6 GPIO
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at 0680 (mask 007f)
But 0680 is accounted for in /proc/ioports:
0680-06ff : pnp 00:06

Cheers,
FJP

/proc/ioports (for notebook):
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-0060 : keyboard
0064-0064 : keyboard
0070-0071 : rtc0
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : 0000:00:1f.1
0170-0177 : ata_piix
01f0-01f7 : 0000:00:1f.1
01f0-01f7 : ata_piix
02f8-02ff : pnp 00:0c
0376-0376 : 0000:00:1f.1
0376-0376 : ata_piix
03c0-03df : vesafb
03f6-03f6 : 0000:00:1f.1
03f6-03f6 : ata_piix
03f8-03ff : pnp 00:0c
04d0-04d1 : pnp 00:0c
0500-055f : pnp 00:0a
0800-080f : pnp 00:0a
0cf8-0cff : PCI conf1
1000-107f : 0000:00:1f.0
1000-107f : pnp 00:0c
1000-1003 : ACPI PM1a_EVT_BLK
1004-1005 : ACPI PM1a_CNT_BLK
1008-100b : ACPI PM_TMR
1010-1015 : ACPI CPU throttle
1020-1020 : ACPI PM2_CNT_BLK
1028-102f : ACPI GPE0_BLK
1060-107f : iTCO_wdt
1100-113f : 0000:00:1f.0
1100-113f : pnp 00:0c
1200-121f : pnp 00:0c
2000-2007 : 0000:00:02.0
2008-200f : 0000:00:03.2
2010-2013 : 0000:00:03.2
2018-201f : 0000:00:03.2
2020-2023 : 0000:00:03.2
2030-203f : 0000:00:03.2
2040-2047 : 0000:00:03.3
2040-2047 : serial
2060-207f : 0000:00:19.0
2080-209f : 0000:00:1a.0
2080-209f : uhci_hcd
20a0-20bf : 0000:00:1a.1
20a0-20bf : uhci_hcd
20c0-20df : 0000:00:1d.0
20c0-20df : uhci_hcd
20e0-20ff : 0000:00:1d.1
20e0-20ff : uhci_hcd
2100-211f : 0000:00:1d.2
2100-211f : uhci_hcd
2120-212f : 0000:00:1f.1
2120-212f : ata_piix
3000-3fff : PCI Bus 0000:02
3000-30ff : PCI CardBus 0000:03
3400-34ff : PCI CardBus 0000:03
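
The manual comparison above (a printed decode range against the /proc/ioports listing) can be sketched as a small helper. It assumes a decode printed as "PIO at 02e8 (mask 0007)" covers ports 02e8-02ef, which is an interpretation of the mask rather than something the patch output states. `struct io_range`, `enum coverage` and `range_coverage()` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One port range, both ends inclusive. */
struct io_range {
	uint16_t start;
	uint16_t end;
};

enum coverage { COVER_NONE, COVER_PARTIAL, COVER_FULL };

/* How much of "want" is covered by the n known ranges? */
static enum coverage range_coverage(const struct io_range *known, size_t n,
				    struct io_range want)
{
	unsigned int covered = 0, total = 0;

	assert(want.start <= want.end);
	assert(n == 0 || known != NULL);

	/* Port ranges are small, so a per-port scan is good enough here. */
	for (unsigned int p = want.start; p <= want.end; p++) {
		total++;
		for (size_t i = 0; i < n; i++) {
			if (p >= known[i].start && p <= known[i].end) {
				covered++;
				break;
			}
		}
	}

	if (covered == 0)
		return COVER_NONE;
	return covered == total ? COVER_FULL : COVER_PARTIAL;
}
```

Fed with the nearby notebook entries (02f8-02ff, 03f8-03ff), the range 02e8-02ef comes back as not covered, matching the "02e8 is missing" observation. The desktop's 0680-06ff is fully covered by the "pnp 00:06" entry.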

Rafael J. Wysocki

2008-12-05 12:00:16 UTC

Post by Linus Torvalds

Post by Linus Torvalds
Ok, the ICH6 LPC side has something similar, but not the same. Just two
ranges, and slightly less flexible wrt sizing.
And ICH8/9/10 seems to have the same thing as ICH7.

Here's a patch that implements what I think is the correct quirks (apart
from the commented ICH6 lazy detail I didn't do).
It would be very interesting to see if people affected get any printouts
about IO decodes that don't show up in /proc/ioports...

pci 0000:00:1f.0: quirk: region d800-d87f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region eec0-eeff claimed by ICH6 GPIO
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at 0680 (mask 007f)
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 4 PIO at 01e0 (mask 000f)

The second one shows up in /proc/ioports as "01e0-01ef : pnp 00:09", but the
first one (at 680) doesn't.

Thanks,
Rafael

Linus Torvalds

2008-12-05 15:57:59 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds
It would be very interesting to see if people affected get any printouts
about IO decodes that don't show up in /proc/ioports...

pci 0000:00:1f.0: quirk: region d800-d87f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region eec0-eeff claimed by ICH6 GPIO
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at 0680 (mask 007f)
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 4 PIO at 01e0 (mask 000f)
The second one shows up in /proc/ioports as "01e0-01ef : pnp 00:09", but the
first one (at 680) doesn't.

Ok, so the patch is interesting and probably worth expanding on (to
actually allocate the regions), but at the same time it too doesn't
actually explain your problems.

While the kernel doesn't know about that magic 0x680 allocation, it also
won't be allocating anything over it, since we define PCIBIOS_MIN_IO to
0x1000 on x86, and will never allocate new resources under that.
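
To illustrate why a hidden range at 0x680 cannot be overlapped by a new allocation, here is a deliberately simplified first-fit allocator that never looks below a floor of 0x1000. It is a toy model under stated assumptions (power-of-two sizes, natural alignment, 16-bit port space) and not the kernel's resource allocator. `TOY_MIN_IO` mirrors the PCIBIOS_MIN_IO value mentioned above; every other name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the x86 PCIBIOS_MIN_IO value quoted in the thread. */
#define TOY_MIN_IO 0x1000u

/* One busy port range, both ends inclusive. */
struct io_range {
	uint16_t start;
	uint16_t end;
};

/*
 * First-fit search for a naturally aligned block of "size" ports at or
 * above TOY_MIN_IO. Returns the start port, or 0 if nothing fits.
 */
static uint32_t toy_alloc_io(uint32_t size, const struct io_range *busy,
			     size_t n)
{
	/* Toy restriction: power-of-two size no larger than the floor. */
	assert(size != 0 && (size & (size - 1)) == 0 && size <= TOY_MIN_IO);
	assert(n == 0 || busy != NULL);

	for (uint32_t start = TOY_MIN_IO; start + size - 1 <= 0xffffu;
	     start += size) {
		bool clash = false;

		for (size_t i = 0; i < n; i++) {
			if (start <= busy[i].end &&
			    start + size - 1 >= busy[i].start)
				clash = true;
		}
		if (!clash)
			return start;
	}
	return 0;
}
```

Whatever the busy list contains, the result is either 0 or at least 0x1000, so an unknown decode at 0680-06ff is never handed out by this model.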

Linus

Rafael J. Wysocki

2008-12-05 21:32:12 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki

Post by Linus Torvalds
It would be very interesting to see if people affected get any printouts
about IO decodes that don't show up in /proc/ioports...

pci 0000:00:1f.0: quirk: region d800-d87f claimed by ICH6 ACPI/GPIO/TCO
pci 0000:00:1f.0: quirk: region eec0-eeff claimed by ICH6 GPIO
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 1 PIO at 0680 (mask 007f)
pci 0000:00:1f.0: ICH7 LPC Generic IO decode 4 PIO at 01e0 (mask 000f)
The second one shows up in /proc/ioports as "01e0-01ef : pnp 00:09", but the
first one (at 680) doesn't.

Ok, so the patch is interesting and probably worth expanding on (to
actually allocate the regions), but at the same time it too doesn't
actually explain your problems.
While the kernel doesn't know about that magic 0x680 allocation, it also
won't be allocating anything over it, since we define PCIBIOS_MIN_IO to
0x1000 on x86, and will never allocate new resources under that.

In the meantime I did some more debugging with unpatched mainline and found
that if resume from hibernation fails, it usually fails immediately after
resuming the SATA controller (once it apparently failed right after resuming
EHCI, but then it just might be a problem with printing more messages), where
the resume sequence is (again, for easier reference):

pci:0000:00:00.0
pci:0000:00:02.0 <- graphics
pci:0000:00:02.1 <- graphics
pci:0000:00:1b.0 <- snd_hda_intel
pci:0000:00:1c.0 <- PCI Express port 1
pci:0000:00:1c.2 <- PCI Express port 3
pci:0000:00:1d.0 <- USB UHCI
pci:0000:00:1d.1 <- USB UHCI
pci:0000:00:1d.2 <- USB UHCI
pci:0000:00:1d.3 <- USB UHCI
pci:0000:00:1d.7 <- USB EHCI
pci:0000:00:1e.0 <- transparent bridge (Intel Corporation 82801 Mobile PCI Bridge)
pci:0000:00:1f.0 <- ISA bridge
pci:0000:00:1f.2 <- SATA (ahci)

--> so it usually hangs here or during the e1000e resume (I don't
get any messages from e1000e in the failing cycles, though).

pci:0000:01:00.0 <- e1000e
No Bus:0000:01
pci:0000:02:00.0 <- wireless (iwlagn)
No Bus:0000:02
pci:0000:03:0b.0 <- cardbus bridge
pci:0000:03:0b.1 <- FireWire
pci:0000:03:0b.3 <- SD Host controller (Texas Instruments)
No Bus:0000:04
No Bus:0000:03

Interestingly enough, usually after a failure some messages still get printed
to the screen (e.g. messages from the ACPI battery driver) and the keyboard
sort of works, although the keys are not decoded correctly.

Next, since I was unable to get anything with the help of magic sysrq, I tried
to boot the kernel with nmi_watchdog=1, and in this configuration I could not
reproduce the problem. This clearly indicates that this really is a timing
issue.

I also noticed two things that may or may not be relevant.

First, the snd_hda_intel device is a PCI Express endpoint integrated into the
root complex which is the host bridge in this case. This may be relevant since
unloading the snd_hda_intel driver makes things work 100% of the time.

Second, the transparent bridge 0000:00:1e.0 supports subtractive
decoding, so if there is a device doing subtractive decode behind it (the
cardbus bridge may do that, for example) it will claim any transaction not
claimed by any other device on bus 0.
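
The subtractive-decode behaviour described above can be modelled in a few lines. The sketch assumes a single bus on which some devices claim fixed address ranges (positive decode), and anything left unclaimed falls through to the subtractive bridge. `struct mem_range` and `positive_claimant()` are hypothetical names, and the addresses in the usage note are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address range positively decoded by one device, ends inclusive. */
struct mem_range {
	uint32_t start;
	uint32_t end;
};

/*
 * Returns the index of the device that positively claims addr, or -1
 * if nobody does, in which case a subtractive bridge would take it.
 */
static int positive_claimant(const struct mem_range *positive, size_t n,
			     uint32_t addr)
{
	assert(n == 0 || positive != NULL);
	for (size_t i = 0; i < n; i++) {
		if (addr >= positive[i].start && addr <= positive[i].end)
			return (int)i;
	}
	return -1;
}
```

With one device claiming f0000000-f00fffff, an access to f0000010 goes to that device, while an access to e0000000 returns -1 and would be forwarded behind the bridge in this model.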

Next, I'm going to hack the magic sysrq so that it will allow me to get a stack
dump after a resume failure and I will add some debug printks to the PCI
resume code path.

Thanks,
Rafael

Jesse Barnes

2008-12-05 17:25:30 UTC

Post by Linus Torvalds

Post by Linus Torvalds
Ok, the ICH6 LPC side has something similar, but not the same. Just two
ranges, and slightly less flexible wrt sizing.
And ICH8/9/10 seems to have the same thing as ICH7.

Here's a patch that implements what I think is the correct quirks (apart
from the commented ICH6 lazy detail I didn't do).
It would be very interesting to see if people affected get any printouts
about IO decodes that don't show up in /proc/ioports...
And I know I've looked for these kinds of things before in the Intel ICH
docs, and apparently always missed these things (or been too lazy to
react), so can somebody else see if they can find any other ranges like
this? Maybe in non-LPC controllers?
Jesse, are there any Intel chipset people who could once and for all say
"these are the things we decode in our chipset" for _all_ chipsets and
_all_ dynamic ranges? I've asked for that before. There must be people who
know this, without having to wade through many thousands of pages of
boring datasheets?

Yeah, I can get that info. Sorry I haven't spent more time on this bug so
far, I've been on vacation this week and very selective about which mails I
reply to. :)

Post by Linus Torvalds
The ICH datasheets tend to be 850 pages each, and there is more than one
of them. And they _do_ differ in the details, even if there is a lot of
sharing going on. So reading the docs is a huge effort, when there's bound
to be somebody who just knows the answer.
NOTE! This patch will just add a _printout_ of the IO regions it finds. It
won't actually register them as known resources. So it won't make the
kernel know to avoid them if they were to clash!
Also, see the "This is not correct" for the ICH6 dynamically sized case.

Some of these may be listed as ACPI PNP ranges too...

Jesse

Rafael J. Wysocki

2008-12-06 14:05:33 UTC

Hi,

The following three patches address the hibernation/suspend issue described in
http://bugzilla.kernel.org/show_bug.cgi?id=12121 and in the very long thread at
http://lkml.org/lkml/2008/12/1/382.

In short, the problem is that resume (from hibernation and/or suspend-to-RAM)
occasionally fails (approximately 20-25% of attempts) in the middle of resuming
PCI devices. We were able to find a specific layout of devices within the
memory address space in which the failure appeared to be extremely unlikely,
but this layout was not really valid for other reasons. We also found out that
using the NMI watchdog decreased the probability of failure, which indicated
that the problem could be timing-related.

Next, we started to look at the PCI resume code and we generally agreed that
it would be a good idea to restore the standard PCI configuration registers
with interrupts disabled. Also, we thought we could move the saving of those
registers for some devices into functions executed with interrupts disabled.

I have followed these observations and created the three following patches.
With all of these patches applied, I'm not able to reproduce the problem.
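
To make the reasoning about interrupts concrete, here is a small userspace model of one possible race. It is an assumption for illustration only: the thread does not pin down the exact failure mechanism, and `struct model_dev`, `resume_single_phase()` and `resume_two_phase()` are hypothetical names. The model assumes an interrupt handler that touches a device whose configuration space may or may not have been restored yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model device: only tracks whether config space is restored. */
struct model_dev {
	bool config_restored;
};

/*
 * Single-phase resume: devices are restored one by one with interrupts
 * enabled. An interrupt whose handler touches devs[irq_dev] is delivered
 * just before device number fire_at is restored.
 * Returns true if the handler saw a restored device.
 */
static bool resume_single_phase(struct model_dev *devs, size_t n,
				size_t irq_dev, size_t fire_at)
{
	bool handler_safe = true;
	size_t i;

	assert(irq_dev < n && fire_at < n);

	for (i = 0; i < n; i++) {
		if (i == fire_at)
			handler_safe = devs[irq_dev].config_restored;
		devs[i].config_restored = true;
	}
	return handler_safe;
}

/*
 * Two-phase resume: every device is restored while interrupts are off,
 * so the pending interrupt is only delivered afterwards.
 */
static bool resume_two_phase(struct model_dev *devs, size_t n, size_t irq_dev)
{
	size_t i;

	assert(irq_dev < n);

	for (i = 0; i < n; i++)		/* phase 1: interrupts disabled */
		devs[i].config_restored = true;

	/* phase 2: interrupts enabled, the handler runs only now */
	return devs[irq_dev].config_restored;
}
```

In the single-phase variant the outcome depends on when the interrupt arrives, which would be consistent with an intermittent, timing-sensitive failure; in the two-phase variant the handler always sees restored devices.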

Thanks,
Rafael

Rafael J. Wysocki

2008-12-06 14:07:59 UTC

From: Rafael J. Wysocki <***@sisk.pl>
Subject: PCI: Suspend and resume PCI Express ports with interrupts disabled

I don't see why the suspend and resume of PCI Express ports should be
handled with interrupts enabled and it may even lead to problems in
some situations. For this reason, move the suspending and resuming
of PCI Express ports into ->suspend_late() and ->resume_early()
callbacks executed with interrupts disabled.

This patch addresses the regression from 2.6.26 tracked as
http://bugzilla.kernel.org/show_bug.cgi?id=12121 .

Signed-off-by: Rafael J. Wysocki <***@sisk.pl>
---
drivers/pci/pcie/portdrv_pci.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/drivers/pci/pcie/portdrv_pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/portdrv_pci.c
+++ linux-2.6/drivers/pci/pcie/portdrv_pci.c
@@ -50,7 +50,7 @@ static int pcie_portdrv_restore_config(s
}

#ifdef CONFIG_PM
-static int pcie_portdrv_suspend(struct pci_dev *dev, pm_message_t state)
+static int pcie_portdrv_suspend_late(struct pci_dev *dev, pm_message_t state)
{
int ret = pcie_port_device_suspend(dev, state);

@@ -59,14 +59,14 @@ static int pcie_portdrv_suspend(struct p
return ret;
}

-static int pcie_portdrv_resume(struct pci_dev *dev)
+static int pcie_portdrv_resume_early(struct pci_dev *dev)
{
pcie_portdrv_restore_config(dev);
return pcie_port_device_resume(dev);
}
#else
-#define pcie_portdrv_suspend NULL
-#define pcie_portdrv_resume NULL
+#define pcie_portdrv_suspend_late NULL
+#define pcie_portdrv_resume_early NULL
#endif

/*
@@ -282,8 +282,8 @@ static struct pci_driver pcie_portdriver
.probe = pcie_portdrv_probe,
.remove = pcie_portdrv_remove,

- .suspend = pcie_portdrv_suspend,
- .resume = pcie_portdrv_resume,
+ .suspend_late = pcie_portdrv_suspend_late,
+ .resume_early = pcie_portdrv_resume_early,

.err_handler = &pcie_portdrv_err_handler,
};

Linus Torvalds

2008-12-06 17:15:22 UTC

Post by Rafael J. Wysocki
I don't see why the suspend and resume of PCI Express ports should be
handled with interrupts enabled and it may even lead to problems in
some situations.

Absolutely. A PCI Express port is really just a PCI bridge, with some odd
rules. We need to enable them early, exactly like regular PCI bridges, or
we cannot walk the PCI bus hierarchy correctly.

Anyway, ack, ack, ack for the whole series.

Linus

Rafael J. Wysocki

2008-12-06 17:25:59 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
I don't see why the suspend and resume of PCI Express ports should be
handled with interrupts enabled and it may even lead to problems in
some situations.

Absolutely. A PCI Express port is really just a PCI bridge, with some odd
rules. We need to enable them early, exactly like regular PCI bridges, or
we cannot walk the PCI bus hierarchy correctly.
Anyway, ack, ack, ack for the whole series.

Thanks! :-)

I think it should go through Jesse?

Rafael

Linus Torvalds

2008-12-06 17:38:36 UTC

Post by Rafael J. Wysocki
I think it should go through Jesse?

Probably correct. And we want it in -next, so that it can get some testing
even before I open the merge window. Because I hope everybody realizes
that there's no way we're doing this in 2.6.28, and we'll leave the broken
and unreliable suspend.

Because afaik this is not a new bug (I tried to push a patch to do
suspend_late/resume_early for the PCI code a _loong_ time ago, but it
never got merged), and the only reason it showed up as a regression was
almost certainly simply that we've always had this.

IOW, suspend/resume has always been dodgy wrt interrupts, and there's some
luck involved. And your machine just happened to get unlucky.

I'd love to fix this in 2.6.28, but it's just not reasonable - it needs
widespread testing with an early -rc merge. And if it turns out to fix a
lot of machines, and there are no regressions, we can always back-port it
later.

Jesse?

Linus

Rafael J. Wysocki

2008-12-06 17:46:11 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
I think it should go through Jesse?

Probably correct. And we want it in -next, so that it can get some testing
even before I open the merge window. Because I hope everybody realizes
that there's no way we're doing this in 2.6.28, and we'll leave the broken
and unreliable suspend.
Because afaik this is not a new bug (I tried to push a patch to do
suspend_late/resume_early for the PCI code a _loong_ time ago, but it
never got merged), and the only reason it showed up as a regression was
almost certainly simply that we've always had this.
IOW, suspend/resume has always been dodgy wrt interrupts, and there's some
luck involved. And your machine just happened to get unlucky.
I'd love to fix this in 2.6.28, but it's just not reasonable - it needs
widespread testing with an early -rc merge. And if it turns out to fix a
lot of machines, and there are no regressions, we can always back-port it
later.

I agree.

Thanks,
Rafael

Jesse Barnes

2008-12-07 02:18:54 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds

Post by Rafael J. Wysocki
I think it should go through Jesse?

Probably correct. And we want it in -next, so that it can get some
testing even before I open the merge window. Because I hope everybody
realizes that there's no way we're doing this in 2.6.28, and we'll leave
the broken and unreliable suspend.
Because afaik this is not a new bug (I tried to push a patch to do
suspend_late/resume_early for the PCI code a _loong_ time ago, but it
never got merged), and the only reason it showed up as a regression was
almost certainly simply that we've always had this.
IOW, suspend/resume has always been dodgy wrt interrupts, and there's
some luck involved. And your machine just happened to get unlucky.
I'd love to fix this in 2.6.28, but it's just not reasonable - it needs
widespread testing with an early -rc merge. And if it turns out to fix a
lot of machines, and there are no regressions, we can always back-port it
later.

I agree.

I'll stuff it into my -next branch tonight.

--
Jesse Barnes, Intel Open Source Technology Center

Rafael J. Wysocki

2008-12-06 14:09:08 UTC

From: Rafael J. Wysocki <***@sisk.pl>
Subject: Sound (HDA Intel): Restore PCI configuration space with interrupts off

Move the restoration of the standard PCI configuration registers
in the snd_hda_intel driver to a ->resume_early() callback executed
with interrupts disabled, since doing that with interrupts enabled
may lead to problems in some cases.

This patch addresses the regression from 2.6.26 tracked as
http://bugzilla.kernel.org/show_bug.cgi?id=12121 .

Signed-off-by: Rafael J. Wysocki <***@sisk.pl>
---
sound/pci/hda/hda_intel.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

Index: linux-2.6/sound/pci/hda/hda_intel.c
===================================================================
--- linux-2.6.orig/sound/pci/hda/hda_intel.c
+++ linux-2.6/sound/pci/hda/hda_intel.c
@@ -1951,13 +1951,16 @@ static int azx_suspend(struct pci_dev *p
return 0;
}

+static int azx_resume_early(struct pci_dev *pci)
+{
+ return pci_restore_state(pci);
+}
+
static int azx_resume(struct pci_dev *pci)
{
struct snd_card *card = pci_get_drvdata(pci);
struct azx *chip = card->private_data;

- pci_set_power_state(pci, PCI_D0);
- pci_restore_state(pci);
if (pci_enable_device(pci) < 0) {
printk(KERN_ERR "hda-intel: pci_enable_device failed, "
"disabling device\n");
@@ -2465,6 +2468,7 @@ static struct pci_driver driver = {
.remove = __devexit_p(azx_remove),
#ifdef CONFIG_PM
.suspend = azx_suspend,
+ .resume_early = azx_resume_early,
.resume = azx_resume,
#endif
};

Jesse Barnes

2008-12-07 04:45:35 UTC

Post by Rafael J. Wysocki
Subject: Sound (HDA Intel): Restore PCI configuration space with interrupts off
Move the restoration of the standard PCI configuration registers
in the snd_hda_intel driver to a ->resume_early() callback executed
with interrupts disabled, since doing that with interrupts enabled
may lead to problems in some cases.
This patch addresses the regression from 2.6.26 tracked as
http://bugzilla.kernel.org/show_bug.cgi?id=12121 .

Since I only applied 1 and 2 you'll need to send this one through Takashi.

Thanks,

--
Jesse Barnes, Intel Open Source Technology Center

Rafael J. Wysocki

2008-12-06 14:07:05 UTC

From: Rafael J. Wysocki <***@sisk.pl>
Subject: PCI: Rework default handling of suspend and resume

Rework the handling of suspend and resume of PCI devices which have
no drivers or the drivers of which do not provide any suspend-resume
callbacks in such a way that their standard PCI configuration
registers will be saved and restored with interrupts disabled. This
should prevent such devices, including PCI bridges, from being
resumed too late to be able to function correctly during the resume
of the other PCI devices that may depend on them.

Also, to remove one possible source of future confusion, drop the
default handling of suspend and resume for PCI devices with drivers
providing the 'pm' object introduced by the new suspend-resume
framework (there are no such PCI drivers at the moment).

This patch addresses the regression from 2.6.26 tracked as
http://bugzilla.kernel.org/show_bug.cgi?id=12121 .

Signed-off-by: Rafael J. Wysocki <***@sisk.pl>
---
drivers/pci/pci-driver.c | 90 ++++++++++++++++++++++++++++++-----------------
1 file changed, 59 insertions(+), 31 deletions(-)

Index: linux-2.6/drivers/pci/pci-driver.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-driver.c
+++ linux-2.6/drivers/pci/pci-driver.c
@@ -300,6 +300,14 @@ static void pci_device_shutdown(struct d

#ifdef CONFIG_PM_SLEEP

+static bool pci_has_legacy_pm_support(struct pci_dev *pci_dev)
+{
+ struct pci_driver *drv = pci_dev->driver;
+
+ return drv && (drv->suspend || drv->suspend_late || drv->resume
+ || drv->resume_early);
+}
+
/*
* Default "suspend" method for devices that have no driver provided suspend,
* or not even a driver at all.
@@ -317,14 +325,22 @@ static void pci_default_pm_suspend(struc

/*
* Default "resume" method for devices that have no driver provided resume,
- * or not even a driver at all.
+ * or not even a driver at all (first part).
*/
-static int pci_default_pm_resume(struct pci_dev *pci_dev)
+static void pci_default_pm_resume_early(struct pci_dev *pci_dev)
{
- int retval = 0;
-
/* restore the PCI config space */
pci_restore_state(pci_dev);
+}
+
+/*
+ * Default "resume" method for devices that have no driver provided resume,
+ * or not even a driver at all (second part).
+ */
+static int pci_default_pm_resume_late(struct pci_dev *pci_dev)
+{
+ int retval;
+
/* if the device was enabled before suspend, reenable */
retval = pci_reenable_device(pci_dev);
/*
@@ -371,10 +387,12 @@ static int pci_legacy_resume(struct devi
struct pci_dev * pci_dev = to_pci_dev(dev);
struct pci_driver * drv = pci_dev->driver;

- if (drv && drv->resume)
+ if (drv && drv->resume) {
error = drv->resume(pci_dev);
- else
- error = pci_default_pm_resume(pci_dev);
+ } else {
+ pci_default_pm_resume_early(pci_dev);
+ error = pci_default_pm_resume_late(pci_dev);
+ }
return error;
}

@@ -420,10 +438,8 @@ static int pci_pm_suspend(struct device
if (drv->pm->suspend) {
error = drv->pm->suspend(dev);
suspend_report_result(drv->pm->suspend, error);
- } else {
- pci_default_pm_suspend(pci_dev);
}
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_suspend(dev, PMSG_SUSPEND);
}
pci_fixup_device(pci_fixup_suspend, pci_dev);
@@ -442,8 +458,10 @@ static int pci_pm_suspend_noirq(struct d
error = drv->pm->suspend_noirq(dev);
suspend_report_result(drv->pm->suspend_noirq, error);
}
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_suspend_late(dev, PMSG_SUSPEND);
+ } else {
+ pci_default_pm_suspend(pci_dev);
}

return error;
@@ -453,15 +471,17 @@ static int pci_pm_resume(struct device *
{
struct pci_dev *pci_dev = to_pci_dev(dev);
struct device_driver *drv = dev->driver;
- int error;
+ int error = 0;

pci_fixup_device(pci_fixup_resume, pci_dev);

if (drv && drv->pm) {
- error = drv->pm->resume ? drv->pm->resume(dev) :
- pci_default_pm_resume(pci_dev);
- } else {
+ if (drv->pm->resume)
+ error = drv->pm->resume(dev);
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_resume(dev);
+ } else {
+ error = pci_default_pm_resume_late(pci_dev);
}

return error;
@@ -478,8 +498,10 @@ static int pci_pm_resume_noirq(struct de
if (drv && drv->pm) {
if (drv->pm->resume_noirq)
error = drv->pm->resume_noirq(dev);
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_resume_early(dev);
+ } else {
+ pci_default_pm_resume_early(pci_dev);
}

return error;
@@ -506,10 +528,8 @@ static int pci_pm_freeze(struct device *
if (drv->pm->freeze) {
error = drv->pm->freeze(dev);
suspend_report_result(drv->pm->freeze, error);
- } else {
- pci_default_pm_suspend(pci_dev);
}
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_suspend(dev, PMSG_FREEZE);
pci_fixup_device(pci_fixup_suspend, pci_dev);
}
@@ -528,8 +548,10 @@ static int pci_pm_freeze_noirq(struct de
error = drv->pm->freeze_noirq(dev);
suspend_report_result(drv->pm->freeze_noirq, error);
}
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_suspend_late(dev, PMSG_FREEZE);
+ } else {
+ pci_default_pm_suspend(pci_dev);
}

return error;
@@ -537,14 +559,15 @@ static int pci_pm_freeze_noirq(struct de

static int pci_pm_thaw(struct device *dev)
{
+ struct pci_dev *pci_dev = to_pci_dev(dev);
struct device_driver *drv = dev->driver;
int error = 0;

if (drv && drv->pm) {
if (drv->pm->thaw)
error = drv->pm->thaw(dev);
- } else {
- pci_fixup_device(pci_fixup_resume, to_pci_dev(dev));
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
+ pci_fixup_device(pci_fixup_resume, pci_dev);
error = pci_legacy_resume(dev);
}

@@ -560,7 +583,7 @@ static int pci_pm_thaw_noirq(struct devi
if (drv && drv->pm) {
if (drv->pm->thaw_noirq)
error = drv->pm->thaw_noirq(dev);
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
pci_fixup_device(pci_fixup_resume_early, pci_dev);
error = pci_legacy_resume_early(dev);
}
@@ -570,17 +593,18 @@ static int pci_pm_thaw_noirq(struct devi

static int pci_pm_poweroff(struct device *dev)
{
+ struct pci_dev *pci_dev = to_pci_dev(dev);
struct device_driver *drv = dev->driver;
int error = 0;

- pci_fixup_device(pci_fixup_suspend, to_pci_dev(dev));
+ pci_fixup_device(pci_fixup_suspend, pci_dev);

if (drv && drv->pm) {
if (drv->pm->poweroff) {
error = drv->pm->poweroff(dev);
suspend_report_result(drv->pm->poweroff, error);
}
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_suspend(dev, PMSG_HIBERNATE);
}

@@ -598,7 +622,7 @@ static int pci_pm_poweroff_noirq(struct
error = drv->pm->poweroff_noirq(dev);
suspend_report_result(drv->pm->poweroff_noirq, error);
}
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_suspend_late(dev, PMSG_HIBERNATE);
}

@@ -609,13 +633,15 @@ static int pci_pm_restore(struct device
{
struct pci_dev *pci_dev = to_pci_dev(dev);
struct device_driver *drv = dev->driver;
- int error;
+ int error = 0;

if (drv && drv->pm) {
- error = drv->pm->restore ? drv->pm->restore(dev) :
- pci_default_pm_resume(pci_dev);
- } else {
+ if (drv->pm->restore)
+ error = drv->pm->restore(dev);
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_resume(dev);
+ } else {
+ error = pci_default_pm_resume_late(pci_dev);
}
pci_fixup_device(pci_fixup_resume, pci_dev);

@@ -633,8 +659,10 @@ static int pci_pm_restore_noirq(struct d
if (drv && drv->pm) {
if (drv->pm->restore_noirq)
error = drv->pm->restore_noirq(dev);
- } else {
+ } else if (pci_has_legacy_pm_support(pci_dev)) {
error = pci_legacy_resume_early(dev);
+ } else {
+ pci_default_pm_resume_early(pci_dev);
}
pci_fixup_device(pci_fixup_resume_early, pci_dev);

Linus Torvalds

2008-12-06 17:07:26 UTC

Post by Rafael J. Wysocki
Rework the handling of suspend and resume of PCI devices which have
no drivers or the drivers of which do not provide any suspend-resume
callbacks in such a way that their standard PCI configuration
registers will be saved and restored with interrupts disabled.

Ok, I think this is good, but I _also_ think that we should do one more
fix:

- if a device uses the new-format suspend/resume structure, we should do
the low-level save-restore _unconditionally_ in the PCI layer.

Because apparently there is only a single user of the new format, and that
single user got it wrong. So wouldn't it be much nicer to just _remove_
the code from the USB host controllers that does the save/restore thing.

Quite frankly, the USB code really does look wrong. Not just in that it
enables the BAR's before restoring them, but on the suspend side it
actually puts the device into D3_hot _before_ it then does the whole
"pci_enable_wake()", which I'm not at all sure will necessarily work. I'm
pretty sure that you should enable wakeup events _before_ going to sleep.

If the generic PCI layer unconditionally did the suspend as the last thing
it does (and the resume as the first thing), then drivers couldn't do
insane things like that, even by mistake.
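
A rough userspace sketch of that ordering follows. The step names only stand in for the real calls (`pci_save_state()`, `pci_enable_wake()`, `pci_set_power_state()`); `model_bus_suspend()` and `model_driver_suspend()` are hypothetical, and placing the state save before the wake-up enable is an assumption, not something settled in this thread.

```c
#include <assert.h>
#include <stddef.h>

enum suspend_step {
	STEP_DRIVER_QUIESCE,	/* driver's own higher-level work */
	STEP_SAVE_STATE,	/* stands in for pci_save_state() */
	STEP_ENABLE_WAKE,	/* stands in for pci_enable_wake() */
	STEP_ENTER_D3HOT	/* stands in for pci_set_power_state(PCI_D3hot) */
};

#define MAX_STEPS 8

struct step_log {
	enum suspend_step steps[MAX_STEPS];
	size_t count;
};

static void log_step(struct step_log *log, enum suspend_step step)
{
	assert(log->count < MAX_STEPS);
	log->steps[log->count++] = step;
}

/* Hypothetical driver callback: only stops the hardware's activity. */
static void model_driver_suspend(struct step_log *log)
{
	log_step(log, STEP_DRIVER_QUIESCE);
}

/* Hypothetical bus-level suspend: the generic steps always come last. */
static void model_bus_suspend(struct step_log *log,
			      void (*driver_suspend)(struct step_log *))
{
	if (driver_suspend)
		driver_suspend(log);
	log_step(log, STEP_SAVE_STATE);
	log_step(log, STEP_ENABLE_WAKE);	/* before the device sleeps */
	log_step(log, STEP_ENTER_D3HOT);
}
```

Because the driver callback never issues the generic steps itself, it cannot put the device into D3hot before wake-up is armed.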

Hmm?

Linus

Rafael J. Wysocki

2008-12-06 17:22:35 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
Rework the handling of suspend and resume of PCI devices which have
no drivers or the drivers of which do not provide any suspend-resume
callbacks in such a way that their standard PCI configuration
registers will be saved and restored with interrupts disabled.

Ok, I think this is good, but I _also_ think that we should do one more fix:
- if a device uses the new-format suspend/resume structure, we should do
the low-level save-restore _unconditionally_ in the PCI layer.
Because apparently there is only a single user of the new format, and that
single user got it wrong. So wouldn't it be much nicer to just _remove_
the code from the USB host controllers that does the save/restore thing.

USB doesn't use that for PCI suspend-resume, it uses it for suspend-resume of
USB devices behind the controller.

Post by Linus Torvalds
Quite frankly, the USB code really does look wrong. Not just in that it
enables the BAR's before restoring them, but on the suspend side it
actually puts the device into D3_hot _before_ it then does the whole
"pci_enable_wake()", which I'm not at all sure will necessarily work. I'm
pretty sure that you should enable wakeup events _before_ going to sleep.

Yeah. Or simply use pci_prepare_to_sleep() and be done with it.

Post by Linus Torvalds
If the generic PCI layer unconditionally did the suspend as the last thing
it does (and the resume as the first thing), then drivers couldn't do
insane things like that, even by mistake.
Hmm?

OK

But then we will save the device's registers in the "sleeping" state. Is this
going to be entirely correct in all possible cases? [pci_save_state() doesn't
save the PM registers, so that _should_ be correct, but I don't have _that_
much experience with these things.]

Rafael

Linus Torvalds

2008-12-06 17:33:37 UTC

Post by Rafael J. Wysocki
USB doesn't use that for PCI suspend-resume, it uses it for suspend-resume of
USB devices behind the controller.

Oh, in that case there are no PCI users of this at all, and what the PCI
driver does is immaterial ;)

Post by Rafael J. Wysocki
But then we will save the device's registers in the "sleeping" state.

No no. The rule would be that a PCI driver - if it uses the new
infrastructure, which apparently nobody does _as_ a PCI driver - simply
would never do the whole "pci_set_power_state(PCI_D3hot)" etc crud AT ALL.

So a PCI driver would only do higher-level stuff in its suspend/resume
code. For example, a USB host controller would initiate the USB bus level
stuff, and likely just stop the controller (not suspend it - just stop
it).

Linus

Rafael J. Wysocki

2008-12-06 17:43:58 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
USB doesn't use that for PCI suspend-resume, it uses it for suspend-resume of
USB devices behind the controller.

Oh, in that case there are no PCI users of this at all, and what the PCI
driver does is immaterial ;)

Post by Rafael J. Wysocki
But then we will save the device's registers in the "sleeping" state.

No no. The rule would be that a PCI driver - if it uses the new
infrastructure, which apparently nobody does _as_ a PCI driver - simply
would never do the whole "pci_set_power_state(PCI_D3hot)" etc crud AT ALL.

Now _that_ sounds good. :-)

Post by Linus Torvalds
So a PCI driver would only do higher-level stuff in its suspend/resume
code. For example, a USB host controller would initiate the USB bus level
stuff, and likely just stop the controller (not suspend it - just stop
it).

I like this idea very much.

So, to fix the issue at hand, I'd like the $subject patch to go first. Then,
there is a major update of the new framework waiting for .29 in the Greg's
tree (that's the main reason why nobody uses it so far, BTW) and I'd really
prefer it to go next. After it's been merged, I'm going to add the mandatory
suspend-resume things (save state and go to a low power state on suspend,
restore state on resume) to the new framework in a separate patch.

Is this plan acceptable?

Rafael

Linus Torvalds

2008-12-06 18:00:35 UTC

Post by Rafael J. Wysocki
So, to fix the issue at hand, I'd like the $subject patch to go first. Then,
there is a major update of the new framework waiting for .29 in the Greg's
tree (that's the main reason why nobody uses it so far, BTW) and I'd really
prefer it to go next. After it's been merged, I'm going to add the mandatory
suspend-resume things (save state and go to a low power state on suspend,
restore state on resume) to the new framework in a separate patch.
Is this plan acceptable?

Sounds good to me. And assuming Jesse/Greg are all aboard, I'll just wait
for the pull requests from Jesse and Greg.

The only thing I'll do right now is to send off my "print out ICH6+
LPC resources" patch again to Jesse, with a changelog etc. It can probably
go in as-is (it really just adds printk's), but since it didn't matter
anyway we might as well just do it as a PCI thing for 2.6.29 too.

On a similar note, I wonder what we should do about the whole "transparent
bridge resource allocation" thing. It also didn't end up really mattering,
even if it apparently made a difference for Frans. The question is just
whether we would be better off with IO windows for transparent buses (the
way we try to set things up now), or with a simpler PCI resource tree that
just takes advantage of the transparency.

The bridge windows _may_ result in better PCI throughput behind such a
bridge, so there is some argument for keeping them. On the other hand,
transparent bridges aren't generally for high-performance stuff anyway,
and one advantage of the transparency is the flexibility it allows (ie we
don't _need_ to set up the static bridging windows).

I dunno. I wonder what Windows does. Following Windows in areas like this
tends to have the advantage that it's what the firmware and the hardware
has generally been tested with most. At the same time, I'm not sure this
is necessarily a very bug-prone area for either firmware or hardware. If
there's actual bridge bugs wrt the windows, I suspect such a bridge would
be broken enough to be unusable regardless.

Linus

Rafael J. Wysocki

2008-12-06 21:24:15 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
So, to fix the issue at hand, I'd like the $subject patch to go first. Then,
there is a major update of the new framework waiting for .29 in the Greg's
tree (that's the main reason why nobody uses it so far, BTW) and I'd really
prefer it to go next. After it's been merged, I'm going to add the mandatory
suspend-resume things (save state and go to a low power state on suspend,
restore state on resume) to the new framework in a separate patch.
Is this plan acceptable?

Sounds good to me. And assuming Jesse/Greg are all aboard, I'll just wait
for the pull requests from Jesse and Greg.
The only thing I'll do right now is to send off my "print out ICH6+
LPC resources" patch again to Jesse, with a changelog etc. It can probably
go in as-is (it really just adds printk's), but since it didn't matter
anyway we might as well just do it as a PCI thing for 2.6.29 too.
On a similar note, I wonder what we should do about the whole "transparent
bridge resource allocation" thing. It also didn't end up really mattering,
even if it apparently made a difference for Frans. The question is just
whether we would be better off with IO windows for transparent buses (the
way we try to set things up now), or with a simpler PCI resource tree that
just takes advantage of the transparency.
The bridge windows _may_ result in better PCI throughput behind such a
bridge, so there is some argument for keeping them. On the other hand,
transparent bridges aren't generally for high-performance stuff anyway,
and one advantage of the transparency is the flexibility it allows (ie we
don't _need_ to set up the static bridging windows).

The static bridging windows help understand the system topology a bit IMO,
because you can just look at /proc/iomem and see what resources are
behind the bridge.

Post by Linus Torvalds
I dunno. I wonder what Windows does. Following Windows in areas like this
tends to have the advantage that it's what the firmware and the hardware
has generally been tested with most. At the same time, I'm not sure this
is necessarily a very bug-prone area for either firmware or hardware. If
there's actual bridge bugs wrt the windows, I suspect such a bridge would
be broken enough to be unusable regardless.

I think Intel people should be able to find out what Windows does in this
area.

Thanks,
Rafael

Jesse Barnes

2008-12-07 04:44:57 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
So, to fix the issue at hand, I'd like the $subject patch to go first.
Then, there is a major update of the new framework waiting for .29 in the
Greg's tree (that's the main reason why nobody uses it so far, BTW) and
I'd really prefer it to go next. After it's been merged, I'm going to
add the mandatory suspend-resume things (save state and go to a low power
state on suspend, restore state on resume) to the new framework in a
separate patch.
Is this plan acceptable?

Sounds good to me. And assuming Jesse/Greg are all aboard, I'll just wait
for the pull requests from Jesse and Greg.
The only thing I'll do right now is to send off my "print out ICH6+
LPC resources" patch again to Jesse, with a changelog etc. It can probably
go in as-is (it really just adds printk's), but since it didn't matter
anyway we might as well just do it as a PCI thing for 2.6.29 too.

Ok, I applied the set (Rafael's 1-2 and your ICH patch) to my linux-next
branch. We should get a little build coverage this week at least, hopefully
nothing breaks too badly.

Post by Linus Torvalds
On a similar note, I wonder what we should do about the whole "transparent
bridge resource allocation" thing. It also didn't end up really mattering,
even if it apparently made a difference for Frans. The question is just
whether we would be better off with IO windows for transparent buses (the
way we try to set things up now), or with a simpler PCI resource tree that
just takes advantage of the transparency.
The bridge windows _may_ result in better PCI throughput behind such a
bridge, so there is some argument for keeping them. On the other hand,
transparent bridges aren't generally for high-performance stuff anyway,
and one advantage of the transparency is the flexibility it allows (ie we
don't _need_ to set up the static bridging windows).
I dunno. I wonder what Windows does. Following Windows in areas like this
tends to have the advantage that it's what the firmware and the hardware
has generally been tested with most. At the same time, I'm not sure this
is necessarily a very bug-prone area for either firmware or hardware. If
there's actual bridge bugs wrt the windows, I suspect such a bridge would
be broken enough to be unusable regardless.

Just so happens that I'm working with some people internally on transparent
bridge related issues atm, I'll see what I can dig up.

--
Jesse Barnes, Intel Open Source Technology Center

Greg KH

2008-12-07 05:41:49 UTC

Post by Linus Torvalds

Post by Rafael J. Wysocki
So, to fix the issue at hand, I'd like the $subject patch to go first. Then,
there is a major update of the new framework waiting for .29 in the Greg's
tree (that's the main reason why nobody uses it so far, BTW) and I'd really
prefer it to go next. After it's been merged, I'm going to add the mandatory
suspend-resume things (save state and go to a low power state on suspend,
restore state on resume) to the new framework in a separate patch.
Is this plan acceptable?

Sounds good to me. And assuming Jesse/Greg are all aboard, I'll just wait
for the pull requests from Jesse and Greg.

No objection from me, I'll wait for Jesse to "go first" in the .29 merge
window.

thanks,

greg k-h

Alan Stern

2008-12-06 18:30:37 UTC

Post by Rafael J. Wysocki

Post by Linus Torvalds

Post by Rafael J. Wysocki
USB doesn't use that for PCI suspend-resume, it uses it for suspend-resume of
USB devices behind the controller.

Oh, in that case there are no PCI users of this at all, and what the PCI
driver does is immaterial ;)

Post by Rafael J. Wysocki
But then we will save the device's registers in the "sleeping" state.

No no. The rule would be that a PCI driver - if it uses the new
infrastructure, which apparently nobody does _as_ a PCI driver - simply
would never do the whole "pci_set_power_state(PCI_D3hot)" etc crud AT ALL.

Now _that_ sounds good. :-)

Post by Linus Torvalds
So a PCI driver would only do higher-level stuff in its suspend/resume
code. For example, a USB host controller would initiate the USB bus level
stuff, and likely just stop the controller (not suspend it - just stop
it).

I like this idea very much.

Rafael, I'd be happy to help with fixing up the USB PCI PM code. At
this point I'm not sure exactly what's needed, though. For instance,
is there any compelling reason to switch over to the new dev_pm_ops
approach? And what should the correct sequence of calls be?

Alan Stern

Rafael J. Wysocki

2008-12-06 21:36:43 UTC

Hi Alan,

Post by Alan Stern

Post by Rafael J. Wysocki

Post by Linus Torvalds

Post by Rafael J. Wysocki
USB doesn't use that for PCI suspend-resume, it uses it for suspend-resume of
USB devices behind the controller.

Oh, in that case there are no PCI users of this at all, and what the PCI
driver does is immaterial ;)

Post by Rafael J. Wysocki
But then we will save the device's registers in the "sleeping" state.

No no. The rule would be that a PCI driver - if it uses the new
infrastructure, which apparently nobody does _as_ a PCI driver - simply
would never do the whole "pci_set_power_state(PCI_D3hot)" etc crud AT ALL.

Now _that_ sounds good. :-)

Post by Linus Torvalds
So a PCI driver would only do higher-level stuff in its suspend/resume
code. For example, a USB host controller would initiate the USB bus level
stuff, and likely just stop the controller (not suspend it - just stop
it).

I like this idea very much.

Rafael, I'd be happy to help with fixing up the USB PCI PM code. At
this point I'm not sure exactly what's needed, though. For instance,
is there any compelling reason to switch over to the new dev_pm_ops
approach?

Certainly not at the moment. There will be a reason some time after .29.

That said, it apparently is possible to clean up the resume callbacks of PCI
USB controllers, as mentioned here: http://lkml.org/lkml/2008/12/6/38

Post by Alan Stern
And what should the correct sequence of calls be?

Well, that's something I'm not exactly sure about myself. Surely it seems
reasonable to call pci_restore_state() with interrupts disabled and do the rest
of resume after that. Also, I think that the core could execute things like
pci_enable_device() during resume and pci_set_power_state()/pci_enable_wake()
on suspend so that the drivers didn't have to. This way we could reduce code
duplication quite a bit.

However, I'm not quite sure about the freeing and requesting IRQs during
suspend and resume. Many drivers do that, many others don't. Still,
apparently some drivers don't work correctly after resume if this is not done.
So, if that should generally be done, I also think that moving it to the core
might be a good idea.

Thanks,
Rafael
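
The division of labour proposed above (the core saves state, arms wake-up and enters the low-power state on suspend, then restores state and re-enables the device on resume, while the driver only does higher-level work) can be illustrated with a minimal user-space sketch. Everything in it is an assumption made for illustration: the `sketch_*` names, the single `config` register and the "registers are lost in D3hot" behaviour are hypothetical stand-ins, not the kernel's PCI API or its real data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PCI device; not a kernel structure. */
enum sketch_pstate { SKETCH_D0, SKETCH_D3HOT };

struct sketch_dev {
    enum sketch_pstate state;
    bool enabled;            /* device decoding enabled */
    bool may_wakeup;         /* policy: is wake-up wanted? */
    bool wake_armed;         /* wake-up actually armed */
    unsigned config;         /* one "live" config register */
    unsigned saved_config;   /* copy taken at suspend time */
    int (*drv_suspend)(struct sketch_dev *);  /* high-level quiesce only */
    int (*drv_resume)(struct sketch_dev *);   /* high-level restart only */
};

/* Core-side suspend: the driver callback runs first, then the core
 * performs the generic steps so the driver does not have to. */
static int sketch_core_suspend(struct sketch_dev *d)
{
    assert(d != NULL);
    int err = d->drv_suspend ? d->drv_suspend(d) : 0;
    if (err)
        return err;
    d->saved_config = d->config;    /* save while still in D0 */
    d->wake_armed = d->may_wakeup;  /* arm wake-up if wanted */
    d->enabled = false;
    d->state = SKETCH_D3HOT;        /* enter the low-power state */
    d->config = 0;                  /* model: register contents are lost */
    return 0;
}

/* Core-side resume: restore state first (the step suggested to run with
 * interrupts disabled), then hand over to the driver for the rest. */
static int sketch_core_resume(struct sketch_dev *d)
{
    assert(d != NULL);
    d->state = SKETCH_D0;
    d->config = d->saved_config;
    d->enabled = true;
    d->wake_armed = false;
    return d->drv_resume ? d->drv_resume(d) : 0;
}
```

In this model the state is saved before the power state changes, which sidesteps the concern about saving registers while the device is already "sleeping". A driver with no callbacks at all still round-trips its `config` value through `sketch_core_suspend()` and `sketch_core_resume()`.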

Linus Torvalds

2008-12-06 22:24:55 UTC

Post by Rafael J. Wysocki
However, I'm not quite sure about the freeing and requesting IRQs during
suspend and resume. Many drivers do that, many others don't. Still,
apparently some drivers don't work correctly after resume if this is not done.
So, if that should generally be done, I also think that moving it to the core
might be a good idea.

I'd suggest against it.

A lot of drivers that want to disable (or unregister) interrupts almost
certainly want to do it simply because they are not ready and willing to
handle any interrupts after having run their "suspend()" function.

So if the generic layer does it _after_ calling ->suspend() (or at
suspend_late() time), it's too late.

And the generic layer certainly must not disable/unregister interrupts
_before_ calling ->suspend(), since the driver may well need to handle
interrupts for suspending.

So there is no right time for the generic layer to do this. Not to mention
that the generic layer doesn't even know what kind of interrupt (if any -
or if perhaps even _multiple_) that the driver has registered.

I also suspect that a lot of drivers simply do not want or need to
unregister the interrupt handler. I'm personally pretty sure that the only
reason that drivers do this in the first place is exactly because they do
their suspend() thing with interrupts enabled in the first place, and
moving the core suspend routines to inside the irq-off region just means
that they don't even want/need to do anything about interrupts.

Linus
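
The ordering argument above can be made concrete with a small user-space model. It is only a sketch under stated assumptions: the `sketch_*` names are hypothetical, the driver is assumed to need exactly one interrupt to finish suspending, and exactly one stray interrupt is assumed to arrive right after its suspend callback returns. None of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical driver model; not a kernel structure. */
struct sketch_drv {
    bool irq_enabled;  /* can the handler run at all? */
    bool suspended;    /* has the suspend callback completed? */
    int late_irqs;     /* interrupts that hit an already-suspended driver */
};

enum sketch_order {
    SKETCH_CORE_MASKS_BEFORE,  /* generic layer masks before ->suspend() */
    SKETCH_CORE_MASKS_AFTER,   /* generic layer masks after ->suspend() */
    SKETCH_DRIVER_MASKS        /* driver masks inside its own suspend */
};

/* Deliver one interrupt; record it if the driver was not ready for it. */
static void sketch_raise_irq(struct sketch_drv *d)
{
    assert(d);
    if (!d->irq_enabled)
        return;
    if (d->suspended)
        d->late_irqs++;
}

/* Driver suspend: assumed to need one completion interrupt to quiesce. */
static int sketch_drv_suspend(struct sketch_drv *d, bool driver_masks_irq)
{
    assert(d);
    if (!d->irq_enabled)
        return -1;                 /* the completion interrupt never arrives */
    sketch_raise_irq(d);           /* completion arrives while still awake */
    if (driver_masks_irq)
        d->irq_enabled = false;    /* the driver knows when it is done */
    d->suspended = true;
    return 0;
}

/* Returns -1 if suspend failed, otherwise the number of late interrupts. */
static int sketch_run(enum sketch_order order)
{
    struct sketch_drv d = { .irq_enabled = true };

    if (order == SKETCH_CORE_MASKS_BEFORE)
        d.irq_enabled = false;
    if (sketch_drv_suspend(&d, order == SKETCH_DRIVER_MASKS) != 0)
        return -1;
    sketch_raise_irq(&d);          /* stray interrupt after ->suspend() */
    if (order == SKETCH_CORE_MASKS_AFTER)
        d.irq_enabled = false;     /* too late: the stray one already landed */
    return d.late_irqs;
}
```

In this model, masking in the generic layer before the callback makes suspend fail, masking after it lets one interrupt reach a suspended driver, and only the driver masking at its own chosen point avoids both problems.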

Arjan van de Ven

2008-12-06 23:25:45 UTC

On Sat, 6 Dec 2008 14:24:55 -0800 (PST)

Post by Linus Torvalds

Post by Rafael J. Wysocki
However, I'm not quite sure about the freeing and requesting IRQs
during suspend and resume. Many drivers do that, many others
don't. Still, apparently some drivers don't work correctly after
resume if this is not done. So, if that should generally be done, I
also think that moving it to the core might be a good idea.

I'd suggest against it.
A lot of drivers that want to disable (or unregister) interrupts
almost certainly want to do it simply because they are not ready and
willing to handle any interrupts after having run their "suspend()"
function.

the problem is that the system bios can have reassigned interrupts
after resume, and afaik we need to re-evaluate the ACPI methods to
get the new mapping.
So we need to unregister + re-register to make that happen

Alan Cox

2008-12-06 23:35:07 UTC

Post by Arjan van de Ven
So we need to unregister + re-register to make that happen

Agreed - and to cope with coming back up with some masked IRQs for those
lovely hardware vendors whose idea of amusement is handing the resumed
system a pending IRQ. To be fair, it's often hardware flagging things like
'device has become ready' from power up events...

Linus Torvalds

2008-12-07 06:00:59 UTC

Post by Arjan van de Ven
the problem is that the system bios can have reassigned interrupts
after resume, and afaik we need to re-evaluate the ACPI methods to
get the new mapping.
So we need to unregister + re-register to make that happen

Can you give actual examples of real life situations?

Because quite frankly, it sounds less and less likely for any relevant
hardware. It's a non-issue for MSI, for example. And it's a non-issue for
any sane interrupt source I can think of.

In other words, I've heard that claim before - and I just don't believe
it. I've never heard a realistic explanation of why it would happen for a
normal PCI driver. And I still claim that it's a very odd and special case
if it does.

And btw, I'm talking suspend, not hibernate.

Linus

Linus Torvalds

2008-12-07 06:03:43 UTC

Post by Linus Torvalds
And btw, I'm talking suspend, not hibernate.

And, btw, even if anybody actually does this, it should be up to the
interrupt controller logic to re-initialize the interrupts so that they
are back where they belong. IOW, we should never show such _idiotic_
brokenness to any actual driver, it should all be remapped and handled
below them.

And I still have never heard any valid reason to do it in the first place,
so until somebody actually gives a real example and an explanation, I
would suggest ignoring the whole issue as some insane rumblings from crazy
hw/firmware people doing idiotic things.

Linus

Takashi Iwai

2008-12-07 09:44:55 UTC

At Sat, 6 Dec 2008 22:00:59 -0800 (PST),

Post by Linus Torvalds

Post by Arjan van de Ven
the problem is that the system bios can have reassigned interrupts
after resume, and afaik we need to re-evaluate the ACPI methods to
get the new mapping.
So we need to unregister + re-register to make that happen

Can you give actual examples of real life situations?

There were such cases on intel8x0 and maestro3 on-board sound devices,
but they were all about hibernate, IIRC. Just through a quick git
log search, I found the following:
http://bugzilla.kernel.org/show_bug.cgi?id=4416

Takashi

Alan Stern

2008-12-07 00:02:20 UTC

Post by Rafael J. Wysocki

Post by Alan Stern
Rafael, I'd be happy to help with fixing up the USB PCI PM code. At
this point I'm not sure exactly what's needed, though. For instance,
is there any compelling reason to switch over to the new dev_pm_ops
approach?

Certainly not at the moment. There will be a reason some time after .29.
That said, it apparently is possible to clean up the resume callbacks of PCI
USB controllers, as mentioned here: http://lkml.org/lkml/2008/12/6/38

Post by Alan Stern
And what should the correct sequence of calls be?

Well, that's something I'm not exactly sure about myself. Surely it seems
reasonable to call pci_restore_state() with interrupts disabled and do the rest
of resume after that. Also, I think that the core could execute things like
pci_enable_device() during resume and pci_set_power_state()/pci_enable_wake()
on suspend so that the drivers didn't have to. This way we could reduce code
duplication quite a bit.

Do you plan to change the PCI core to do these things any time soon?
Wouldn't that require changing a whole bunch of PCI drivers too? I
tend to agree that having the core take care of these choreographed
activities would be good -- it would leave less room for drivers to
make mistakes.

So for now maybe it would be best just to rearrange the existing calls
in USB, and wait for the core changes before doing anything more
ambitious.

Post by Rafael J. Wysocki
However, I'm not quite sure about the freeing and requesting IRQs during
suspend and resume. Many drivers do that, many others don't. Still,
apparently some drivers don't work correctly after resume if this is not done.
So, if that should generally be done, I also think that moving it to the core
might be a good idea.

For USB this doesn't matter; we don't free the IRQs during suspend.

Alan Stern

Alan Cox

2008-12-06 21:09:27 UTC

Post by Rafael J. Wysocki
prefer it to go next. After it's been merged, I'm going to add the mandatory
suspend-resume things (save state and go to a low power state on suspend,
restore state on resume) to the new framework in a separate patch.
Is this plan acceptable?

I have at least two drivers I look after where if you put the device into
D3 you lose. We survive because on a successful suspend/resume sequence
the BIOS puts it back coming out of suspend, but that means we must not
put those devices into D3 ourselves ever - including during a suspend
before we are 100% committed to the suspend completing or reboot.

Rafael J. Wysocki

2008-12-06 21:50:42 UTC

Post by Alan Cox

Post by Rafael J. Wysocki
prefer it to go next. After it's been merged, I'm going to add the mandatory
suspend-resume things (save state and go to a low power state on suspend,
restore state on resume) to the new framework in a separate patch.
Is this plan acceptable?

I have at least two drivers I look after where if you put the device into
D3 you lose. We survive because on a successful suspend/resume sequence
the BIOS puts it back coming out of suspend, but that means we must not
put those devices into D3 ourselves ever - including during a suspend
before we are 100% committed to the suspend completing or reboot.

We can mark them as devices not to put into D3. There already is a
mechanism for that in place.

Thanks,
Rafael

Frans Pop

2008-12-06 19:30:55 UTC

Post by Rafael J. Wysocki
The following three patches address the hibernation/suspend issue
described in http://bugzilla.kernel.org/show_bug.cgi?id=12121 and in
the very long thread at http://lkml.org/lkml/2008/12/1/382.

I've just built a kernel with your three patches, and without the earlier
"revert/debug" and "ignore transparent bridge" patches.

It also includes my patch that somewhat improves USB resume and my fix for
the ohci1394: "irq 19: nobody cared" issue. (Although I understand that
it's likely we'll get a much more structural improvement for USB than my
naive patch.)

I'll run my notebook with this kernel for the next few days and let you
know the results. First suspend/resume cycle was fine and showed, as
expected, a lot of config restores moved up, including HDA intel and
pcieport-driver.

It's nice to see something of significance happening before the ricoh-mmc
controller gets disabled :-P

Cheers,
FJP
