Discussion:
[PATCH] kexec: force x86_64 arches to boot kdump kernels on boot cpu
Neil Horman
2007-11-27 01:47:40 UTC
Hey all-
I've been working on an issue lately involving multi-socket x86_64
systems connected via hypertransport bridges. It appears that some systems
disable the hypertransport connections during a kdump operation when all but the
crashing processor get halted in machine_crash_shutdown. This becomes a
problem when the ioapic attempts to route interrupts to the only remaining
processor. Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive results in
interrupts not getting delivered. The effective result is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay. I've found that I've been able to avoid
this hang by forcing a transition to the bios-defined boot cpu during the
crashing kernel shutdown. This patch accomplishes that. Tested by myself and
the original reporter with successful results.

Regards,
Neil

Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


arch/x86/kernel/crash.c | 46 ++++++++++++++++++++++++++++++++++++++--------
include/linux/kexec.h | 3 +++
init/main.c | 6 ++++++
kernel/kexec.c | 8 ++++++++
4 files changed, 55 insertions(+), 8 deletions(-)


diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 8bb482f..0682e60 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -67,13 +67,36 @@ static int crash_nmi_callback(struct notifier_block *self,
 	}
 #endif
 	crash_save_cpu(regs, cpu);
-	disable_local_APIC();
-	atomic_dec(&waiting_for_crash_ipi);
-	/* Assume hlt works */
-	halt();
-	for (;;)
-		cpu_relax();
-
+	if (smp_processor_id() == kexec_boot_cpu) {
+		/*
+		 * This is the boot cpu. We need to:
+		 * 1) Wait for the other processors to halt
+		 * 2) clear our nmi interrupt
+		 * 3) launch the new kernel
+		 */
+		unsigned long msecs = 1000;
+		while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
+			/*
+			 * Use udelay to avoid the warnings here
+			 * I know we shouldn't delay in an irq
+			 * but we're about to reboot the box during
+			 * a crash, a delay doesn't hurt here
+			 */
+			udelay(1000);
+			msecs--;
+		}
+		ack_APIC_irq();
+		disable_local_APIC();
+		disable_IO_APIC();
+		machine_kexec(kexec_crash_image);
+
+	} else {
+		disable_local_APIC();
+		atomic_dec(&waiting_for_crash_ipi);
+		/* Assume hlt works */
+		for(;;)
+			halt();
+	}
 	return 1;
 }

@@ -138,7 +161,14 @@ void machine_crash_shutdown(struct pt_regs *regs)
 	nmi_shootdown_cpus();
 	lapic_shutdown();
 #if defined(CONFIG_X86_IO_APIC)
-	disable_IO_APIC();
+	if (crashing_cpu == kexec_boot_cpu)
+		disable_IO_APIC();
 #endif
 	crash_save_cpu(regs, safe_smp_processor_id());
+	if (crashing_cpu != kexec_boot_cpu) {
+		atomic_dec(&waiting_for_crash_ipi);
+		for(;;)
+			halt();
+	}
+
 }
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 2d9c448..b5c12d6 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -187,6 +187,9 @@ extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
extern size_t vmcoreinfo_size;
extern size_t vmcoreinfo_max_size;

+extern int kexec_boot_cpu;
+extern void kexec_record_boot_cpu(void);
+
int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base);

diff --git a/init/main.c b/init/main.c
index 58f5a99..0f11ee0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -58,6 +58,9 @@
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/signal.h>
+#ifdef CONFIG_KEXEC
+#include <linux/kexec.h>
+#endif

#include <asm/io.h>
#include <asm/bugs.h>
@@ -538,6 +541,9 @@ asmlinkage void __init start_kernel(void)
 	unwind_setup();
 	setup_per_cpu_areas();
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
+#ifdef CONFIG_KEXEC
+	kexec_record_boot_cpu();
+#endif

/*
* Set up the scheduler prior starting any interrupts (such as the
diff --git a/kernel/kexec.c b/kernel/kexec.c
index aa74a1e..cb6b1f3 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -41,6 +41,14 @@ u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
size_t vmcoreinfo_size;
size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);

+int kexec_boot_cpu = 0;
+
+void __init kexec_record_boot_cpu(void)
+{
+	kexec_boot_cpu = smp_processor_id();
+	printk(KERN_CRIT "kexec records boot cpu as %d\n", kexec_boot_cpu);
+}
+
/* Location of the reserved area for the crash kernel */
struct resource crashk_res = {
.name = "Crash kernel",
Eric W. Biederman
2007-11-27 04:12:25 UTC
If you can get to calibrate_delay hypertransport is still routing traffic.
Your diagnosis of the problem is wrong. Most likely it is just an ioapic
programming error in restoring the system to PIC mode.

I agree that there is a problem.

The reliable fix is to totally skip the PIC interrupt mode and go directly
to apic mode.

To make the kexec on panic code path reliable we need to remove code,
not add it.

Frankly I think switching cpus is one of the least reliable things that
we can do in general.

Eric
Neil Horman
2007-11-27 13:13:55 UTC
Post by Eric W. Biederman
If you can get to calibrate_delay hypertransport is still routing traffic.
Your diagnosis of the problem is wrong. Most likely it is just an ioapic
programming error in restoring the system to PIC mode.
What makes you say this? I don't see any need for interrupts prior to
calibrate_delay()
Post by Eric W. Biederman
I agree that there is a problem.
The reliable fix is to totally skip the PIC interrupt mode and go directly
to apic mode.
To make the code kexec on panic code path reliable we need to remove code
not add it.
Frankly I think switching cpus is one of the least reliable things that
we can do in general.
I understand the sentiment here, but it's not like we're adding additional
functionality with this patch. We're already sending an IPI to all the
processors to halt them; we're just adding logic here so that we can detect the
boot cpu and use it to jump to the kexec image instead of halting. I don't
think this is any less reliable than what we have currently.

Regards
Neil
_______________________________________________
kexec mailing list
http://lists.infradead.org/mailman/listinfo/kexec
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman-H+wXaHxf7aLQT0dZR+***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-27 13:28:13 UTC
Post by Neil Horman
What makes you say this? I don't see any need for interrupts prior to
calibrate_delay()
Yes. calibrate_delay() is the first place we send interrupts over
hypertransport. However I/O still works. Thus hypertransport from
the first cpu is working, and hypertransport itself is working.

This is an interrupt specific problem not some generic hypertransport
problem.
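
For illustration, here is a minimal user-space model of why a missing timer interrupt shows up as a hang in calibrate_delay(): the calibration has to wait for the tick counter to advance before it can measure anything. This is a sketch, not the kernel code. The names fake_jiffies, tick_source and wait_for_tick are hypothetical, and the spin limit exists only so that the model terminates.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel tick counter. */
static unsigned long fake_jiffies;

/* Hypothetical timer model: the counter advances only if the irq arrives. */
void tick_source(bool irq_delivered)
{
	if (irq_delivered)
		fake_jiffies++;
}

/*
 * Wait for the tick counter to change, as a delay calibration must.
 * Returns true if a tick was seen, false if max_spins passed without one.
 * A loop without such a limit simply never returns when no tick arrives.
 */
bool wait_for_tick(bool irq_delivered, unsigned long max_spins)
{
	unsigned long start = fake_jiffies;
	unsigned long i;

	for (i = 0; i < max_spins; i++) {
		tick_source(irq_delivered);
		if (fake_jiffies != start)
			return true;
	}
	return false;
}
```

With irq_delivered set to false the function only returns because of max_spins, which is the modelled equivalent of the hang described in this thread.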
Post by Neil Horman
Post by Eric W. Biederman
I agree that there is a problem.
The reliable fix is to totally skip the PIC interrupt mode and go directly
to apic mode.
To make the code kexec on panic code path reliable we need to remove code
not add it.
Frankly I think switching cpus is one of the least reliable things that
we can do in general.
I understand the sentiment here, but its not like we're adding additional
functionality with this patch. We're already sending an IPI to all the
processors to halt them
And we don't care if they halt. If they don't get the IPI we time out.
Making the IPI mandatory is a _significant_ change.

The only reason that code is on the kexec on panic code path is that
there is no other possible place we could put it.
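
The "time out and carry on" behaviour can be sketched in user space roughly as follows. This is an illustrative model using C11 atomics, not the kernel implementation. The names waiting_for_ack, cpu_acknowledge and wait_for_acks are hypothetical, and the poll counter stands in for the roughly one second of delay the crash path allows.

```c
#include <stdatomic.h>

/* Hypothetical count of CPUs that have not yet acknowledged the request. */
static atomic_int waiting_for_ack;

/* Called by a (modelled) CPU that did receive the stop request. */
void cpu_acknowledge(void)
{
	atomic_fetch_sub(&waiting_for_ack, 1);
}

/*
 * Wait at most max_polls polls for every CPU to acknowledge.
 * Returns the number of CPUs that never answered. The caller
 * proceeds either way instead of blocking on them.
 */
int wait_for_acks(int ncpus, int responders, int max_polls)
{
	int polls = max_polls;
	int i;

	atomic_store(&waiting_for_ack, ncpus);
	for (i = 0; i < responders; i++)
		cpu_acknowledge();	/* stand-in for CPUs that saw the IPI */

	while (atomic_load(&waiting_for_ack) > 0 && polls > 0)
		polls--;	/* the kernel would delay about 1 ms per poll */

	return atomic_load(&waiting_for_ack);
}
```

The point of the design is that a non-zero return value is tolerated: the crashing cpu continues into the new kernel whether or not the others answered.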
Post by Neil Horman
, we're just adding logic here so that we can detect the
boot cpu and use it to jump to the kexec image instead of halting. I don't
think this is any less reliable that what we have currently.
It doesn't make things more reliable, and it adds code to a code path
that already has too much code to be solidly reliable (thus your
problem).

Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always delivering interrupts in apic mode.

If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode, let's remove the need for going into
legacy PIC mode and do what we should be able to do reliably. The
reward is much higher, as we kill all possibility of restoring PIC
mode wrong because we don't need to bother.

Eric
Andi Kleen
2007-11-27 13:45:56 UTC
Post by Neil Horman
I don't think this is any less reliable than what we have currently.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
It doesn't make things more reliable, and it adds code to a code path
that already has to much code to be solid reliable (thus your
problem).
Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always deliver interrupts in apic mode.
If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode
Probably legacy mode always routes to CPU #0. Makes sense and is
not really a misconfiguration of legacy mode.

But if CPU #0 has interrupts disabled no interrupts get delivered.

So choices are:
- Move to CPU #0
- Do not use legacy mode during shutdown.
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.

First and last option are probably unreliable for the kdump case.
Second or third sound best.

I suspect the real fix would be to enable IOAPIC mode really
early and never use the timers in legacy mode. Then the kdump
kernel wouldn't care about the legacy mode pointing to the wrong CPU.

IIRC Eric even had a patch for that a long time ago, but it broke some
things so it wasn't included. But perhaps it should be revisited.

-Andi
Neil Horman
2007-11-27 14:28:26 UTC
Post by Neil Horman
I don't think this is any less reliable than what we have currently.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
It doesn't make things more reliable, and it adds code to a code path
that already has to much code to be solid reliable (thus your
problem).
Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always deliver interrupts in apic mode.
If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode
Probably legacy mode always routes to CPU #0. Makes sense and is
not really a misconfiguration of legacy mode.
But if CPU #0 has interrupts disabled no interrupts get delivered.
- Move to CPU #0
- Do not use legacy mode during shutdown.
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.
First and last option are probably unreliable for the kdump case.
Second or third sound best.
Not sure if this is applicable, but I assume not relying on interrupts in legacy
mode would be equivalent to specifying irqpoll on the kdump kernel command line?
If so, there seems to be a problem with that solution, as doing so still results
in the same hang on the system in question.


As for solution 2, that brings me to my previous question. Is that really as
simple as just not moving the apic to legacy mode? It would seem some
additional programming would be in order to route the interrupt in question to
the proper cpu.

Regards
Neil
Andi Kleen
2007-11-27 14:43:11 UTC
Post by Neil Horman
As for solution 2, that brings me to my previous question. Is that really as
simple as just not moving the apic to legacy mode? It would seem some
additional programming would be in order to route the interrupt in question to
the proper cpu.
The Linux kernel right now relies on being in legacy mode at bootup.
So obviously kexec has to switch back to that.

Not relying on legacy mode would require moving APIC setup much earlier which
is difficult because that's quite fragile. Longer term it might be a good
idea though anyways -- at least the timer code was always fragile
and eliminating one failure case and only ever running it in true
APIC mode would be probably a good thing.

-Andi
Neil Horman
2007-11-27 14:48:56 UTC
Post by Andi Kleen
Post by Neil Horman
As for solution 2, that brings me to my previous question. Is that really as
simple as just not moving the apic to legacy mode? It would seem some
additional programming would be in order to route the interrupt in question to
the proper cpu.
The Linux kernel right now relies on being in legacy mode at bootup.
So obviously kexec has to switch back to that.
Not relying on legacy mode would require moving APIC setup much earlier which
is difficult because that's quite fragile. Longer term it might be a good
idea though anyways -- at least the timer code was always fragile
and eliminating one failure case and only ever running it in true
APIC mode would be probably a good thing.
So it sounds to me like, unless I'm willing to really rewrite the APIC
setup code (which I don't feel qualified to do quite yet), the immediate
solution would be to not rely on interrupts in legacy mode, which was, according
to my understanding, what the irqpoll command line option was
supposed to enable. Any thoughts as to why that might not be working in this
case, or suggested tests to determine a cause there?

Thanks & Regards
Neil
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-27 15:30:47 UTC
Post by Neil Horman
So, it sounds to me then, like unless I'm willing to really re-write the APIC
setup code (which I don't feel qualified to do quite yet), that the immediate
solution would be to not rely on interrupts in legacy mode, which was according
to my understanding, what the use of the irqpoll command line option was
supposed to enable. Any thoughts as to why that might not be working in this
case, or suggested tests to determine a cause there?
Hmm. It looks like irqpoll expects to have at least one irq working.
I wonder if we can fix that.

Looking at it from the other direction what does this line
from check_timer() look like when you boot a normal kernel?

apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
cfg->vector, apic1, pin1, apic2, pin2);

I'm curious what values we see at boot for the legacy mode the first
time it is setup, and possibly what chipset you are on.
I know we have had a few times where we have failed to reprogram
the ioapic properly at shutdown. So it can't hurt to look at
that possibility one last time.


For whatever it is worth my original attempt at using only the
apic mode was commit: f2b36db692b7ff6972320ad9839ae656a3b0ee3e
Looks like I didn't have an x86_64 version.

It looks like about half the cleanups I needed were to decouple
smp cpu startup and apic initialization. I think the hotplug cpu
case has done a more thorough version of that now.

There was a bit of reshuffling needed to get everything initialized
in the right order when we started apic mode sooner.

Anyway I might just take another look at this if I can find enough
moments to string together.

Eric
Neil Horman
2007-11-27 16:45:21 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
So, it sounds to me then, like unless I'm willing to really re-write the APIC
setup code (which I don't feel qualified to do quite yet), that the immediate
solution would be to not rely on interrupts in legacy mode, which was according
to my understanding, what the use of the irqpoll command line option was
supposed to enable. Any thoughts as to why that might not be working in this
case, or suggested tests to determine a cause there?
Hmm. It looks like irqpoll expects to have at least one irq working.
I wonder if we can fix that.
Looking at it from the other direction what does this line
from check_timer() look like when you boot a normal kernel?
apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
cfg->vector, apic1, pin1, apic2, pin2);
Sure, Ben, can you provide the output of that on the system in question? You'll need to boot with apic=verbose to see it.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
I'm curious what values we see at boot for the legacy mode the first
time it is setup, and possibly what chipset you are on.
I know we have had a few times where we have failed to reprogram
the ioapic properly at shutdown. So it can't hurt to look at
that possibility one last time.
For whatever it is worth my original attempt at using only the
apic mode was commit: f2b36db692b7ff6972320ad9839ae656a3b0ee3e
Looks like I didn't have an x86_64 version.
Thanks, I'll take a look at that.
Ben Woodard
2007-11-27 20:50:52 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
So, it sounds to me then, like unless I'm willing to really re-write the APIC
setup code (which I don't feel qualified to do quite yet), that the immediate
solution would be to not rely on interrupts in legacy mode, which was according
to my understanding, what the use of the irqpoll command line option was
supposed to enable. Any thoughts as to why that might not be working in this
case, or suggested tests to determine a cause there?
Hmm. It looks like irqpoll expects to have at least one irq working.
I wonder if we can fix that.
Looking at it from the other direction what does this line
from check_timer() look like when you boot a normal kernel?
apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
cfg->vector, apic1, pin1, apic2, pin2);
Here is what I get on a normal boot of the problem motherboard:

..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
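
A small sketch of how that line can be picked apart, for anyone comparing boot logs across boards. It assumes the reading used in this thread, namely that apic2 and pin2 report the IO-APIC pin the 8259 is wired to and are -1 when no such pin was found. The struct and function names are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdio.h>

struct timer_route {
	unsigned int vector;
	int apic1, pin1;	/* IO-APIC and pin carrying the timer interrupt */
	int apic2, pin2;	/* IO-APIC and pin for the 8259, or -1 (assumed) */
};

/* Parse one "..TIMER:" line; returns true when all five fields were read. */
bool parse_timer_line(const char *line, struct timer_route *r)
{
	return sscanf(line,
		      "..TIMER: vector=0x%x apic1=%d pin1=%d apic2=%d pin2=%d",
		      &r->vector, &r->apic1, &r->pin1,
		      &r->apic2, &r->pin2) == 5;
}

/* True when the line reports an IO-APIC pin for the 8259 input. */
bool has_8259_pin(const struct timer_route *r)
{
	return r->apic2 >= 0 && r->pin2 >= 0;
}
```

Fed the line above, the parser yields vector 0x31 on apic 0, pin 2, and has_8259_pin() returns false.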
--
-ben
-=-
Neil Horman
2007-11-27 21:05:58 UTC
Post by Ben Woodard
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
So, it sounds to me then, like unless I'm willing to really re-write the APIC
setup code (which I don't feel qualified to do quite yet), that the immediate
solution would be to not rely on interrupts in legacy mode, which was according
to my understanding, what the use of the irqpoll command line option was
supposed to enable. Any thoughts as to why that might not be working in this
case, or suggested tests to determine a cause there?
Hmm. It looks like irqpoll expects to have at least one irq working.
I wonder if we can fix that.
Looking at it from the other direction what does this line
from check_timer() look like when you boot a normal kernel?
apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
cfg->vector, apic1, pin1, apic2, pin2);
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Ok, I think from what I understand of what we're reading here, the apic2=-1
and pin2=-1 indicate that the 8259 has no direct connection to any cpu, which
means that on shutdown disable_IO_APIC should take us into virtual wire mode.
As such, enabling the APIC early in boot should fix us, but more concisely,
rewriting the entry in the IOAPIC to deliver int0 to the only running cpu should
accomplish the same goal for this problem. Does that sound reasonable (at least
as a test to ensure we understand the problem) to everyone?

Neil
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-27 22:38:52 UTC
Post by Neil Horman
Post by Ben Woodard
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Ben, what chipset is this?
Post by Neil Horman
Ok, I think from what I understand of what we're reading here, the apic2 = -1
and pin2 = -1 indicate that the 8259 has no direct connection to any cpu, which
means that on shutdown disable_IO_APIC should take us into virtual wire mode.
As such enabling the APIC early in boot should fix us, but more consisely,
rewriting the entry in the IOAPIC to deliver int0 to the only running cpu should
accomplish the same goal for this problem. Does that sound reasonable (at least
as a test to ensure we understand the problem) to everyone?
Close. There are two options with virtual wire mode.
- Either the local apic is in virtual wire mode, and somehow the
legacy interrupts make it to the local cpu.
- Or an ioapic is in virtual wire mode and the legacy interrupt
controller is connected to it.

So I guess fundamentally for any SMP system that only supports the
cpu being in local apic mode and only routes interrupts to the
bootstrap processor we could be in trouble. That is what our current
information about your system suggests.

However most systems actually connect the i8254 PIC interrupt
controller to the ioapic in virtual wire mode. As I recall the
standard mapping is to ioapic 0, pin 0. With ioapic 0, pin 2 being
the timer interrupt (Possibly it is the other way around).

So as a test we could feed those values into ioapic_8259 and see
if the kdump case works. I believe we prefer putting the ioapic
into virtual wire mode over putting the cpu into virtual wire
mode. We can only control which cpu receives the legacy interrupts if
we are putting the ioapic in virtual wire mode.
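
To make "putting the ioapic in virtual wire mode" concrete, here is a hedged sketch of the redirection table entry involved. The bit positions follow the 82093AA IO-APIC entry layout; make_extint_rte and rte_dest are hypothetical helpers written for this example and are not kernel functions. The entry uses ExtINT delivery with a physical destination, which is what lets software choose the receiving cpu.

```c
#include <stdint.h>

/* Field positions in a 64-bit IO-APIC redirection table entry. */
#define RTE_DELIVERY_SHIFT	8		/* bits 8-10: delivery mode */
#define RTE_DELIVERY_EXTINT	0x7u		/* 111b: ExtINT, vector comes from the 8259 */
#define RTE_DESTMODE_LOGICAL	(1u << 11)	/* 0 = physical, 1 = logical */
#define RTE_MASKED		(1u << 16)	/* 1 = interrupt masked */
#define RTE_DEST_SHIFT		56		/* destination field starts at bit 56 */

/*
 * Build an unmasked, edge-triggered, active-high ExtINT entry that
 * targets one cpu by its physical APIC ID.
 */
uint64_t make_extint_rte(uint8_t dest_apic_id)
{
	uint64_t rte = 0;

	rte |= (uint64_t)RTE_DELIVERY_EXTINT << RTE_DELIVERY_SHIFT;
	rte |= (uint64_t)dest_apic_id << RTE_DEST_SHIFT;
	/* destination mode, polarity, trigger and mask bits stay 0 */
	return rte;
}

/* Extract the destination field back out of an entry. */
uint8_t rte_dest(uint64_t rte)
{
	return (uint8_t)(rte >> RTE_DEST_SHIFT);
}
```

In this model, retargeting the legacy interrupts to the crashing cpu amounts to building the entry with that cpu's APIC ID instead of the boot cpu's.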

It may also be an interesting test to just enable the timer for the
ioapic in early boot, as you have suggested. I don't have a clue what
that will do.

Eric
Ben Woodard
2007-11-27 23:15:44 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by Ben Woodard
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Ben, what chipset is this?
nVidia MCP55 pro

It is the original version of
http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm

i.e. not the -2. However, they don't seem to advertise the original
version. Supermicro assures me that they are practically the same but I
haven't played with the -2 version yet.
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-28 00:15:06 UTC
Post by Ben Woodard
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Ben Woodard
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Ben, what chipset is this?
nVidia MCP55 pro
It is the original version of
http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm
i.e. not the -2. However, they don't seem to advertise the original
version. Supermicro assures me that they are practically the same but I haven't
played with the -2 version yet.
That is enough for an initial approximation. Unless something has
changed radically the Nvidia chipsets can put the ioapic instead of
the local apic in virtual wire mode so that is worth testing.

Eric
Neil Horman
2007-11-27 23:40:37 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by Ben Woodard
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Ben, what chipset is this?
Post by Neil Horman
Ok, I think from what I understand of what we're reading here, the apic2 = -1
and pin2 = -1 indicate that the 8259 has no direct connection to any cpu, which
means that on shutdown disable_IO_APIC should take us into virtual wire mode.
As such enabling the APIC early in boot should fix us, but more consisely,
rewriting the entry in the IOAPIC to deliver int0 to the only running cpu should
accomplish the same goal for this problem. Does that sound reasonable (at least
as a test to ensure we understand the problem) to everyone?
Close. There are two options with virtual wire mode.
- Either the local apic is in virtual wire mode, and somehow the
legacy interrupts make it to the local cpu.
I assume this is the case if the ioapic is also in virtual wire mode and the
destination field for the appropriate interrupt(s) (the timer interrupt in this
case) is set to either physical mode with a destination id of the lapic for the
running cpu, or if it is set to logical mode and the destination id has the
corresponding bit for the running cpu set. Is that right?
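
The two addressing cases in that question can be written down as a small sketch. It models only physical mode and the flat logical model, and ignores broadcast destinations and the cluster model. Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Physical mode: the destination field names exactly one APIC ID. */
bool matches_physical(uint8_t dest, uint8_t cpu_apic_id)
{
	return dest == cpu_apic_id;	/* broadcast IDs are ignored here */
}

/*
 * Logical mode, flat model: the destination is a bitmask and each cpu
 * owns one bit in its logical destination register.
 */
bool matches_logical_flat(uint8_t dest_mask, uint8_t cpu_logical_id)
{
	return (dest_mask & cpu_logical_id) != 0;
}
```

Under this model an entry still pointing at APIC ID 0 never matches a crashing cpu whose APIC ID is 2, which is the situation being asked about.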
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
- Or an ioapic is in virtual wire mode and the legacy interrupt
controller is connected to it.
I thought we only had one ioapic in this system (Ben correct me if I'm wrong on
that please). I thought the above printk told us that, because apic2 and pin2
are both -1, that means that the 8259 isn't physically connected to any cpu, and
instead is routed through apic 0, and asserts on pin 2 of that ioapic.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
So I guess fundamentally for any SMP system that only supports the
cpu being in local apic mode and only routes interrupts to the boot
strap processor we could be in trouble. That is what our current
information about your system suggests.
If that were the case, then we would need to support moving kexec boot to cpu0,
at least in some limited cases. I've got a patch together that enables the
handshaking I was brainstorming earlier, which should allow an attempted jump to
cpu0 on a crash, with a fallback to booting on the crashing processor. If we
wind up confirming the above case, then I'll post it.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
However most systems actually connect the i8254 PIC interrupt
Sorry to split hairs here, but you mean the 8259, right? Just want to be sure
I'm clear on what's going on. I thought the 8254 was the external timer.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
controller to the ioapic in virtual wire mode. As I recall the
standard mapping is to ioapic 0, pin 0. With ioapic 0, pin 2 being
the timer interrupt (Possibly it is the other way around).
So as a test we could feed those values into ioapic_8259 and see
if the kdump case works. I believe we prefer putting the ioapic
into virtual wire mode over putting the cpu into virtual wire
mode. We can only control which cpu receives the legacy interrupts if
we are putting the ioapic in virtual wire mode.
I'm sorry, I can't find ioapic_8259 defined anywhere. Where is that supposed to
be? Show me where it's defined and I'll happily write the patch.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
It may also be an interesting test to just enable the timer for the
ioapic in early boot, as you have suggested. I don't have a clue what
that will do.
Unfortunately nothing. We've tried using the local apic timer in a previous
test, and it resulted in no change, as did transitioning the cpu to the apic
timer via a call to switch_ipi_to_APIC_timer. It's possible I did something
wrong, however.

Currently I'm writing a patch that calls setup_ioapic_dest after we call
disable_IO_APIC. Looking at the implementation, it appears that calling this
function should rewrite the irq routing table in the ioapic to deliver
interrupts to the set of online cpus, as defined by the TARGET_CPUS macro. I
assume that if the ioapic is in virtual wire mode from the call to
disable_IO_APIC, then calling setup_ioapic_dest will force interrupts to be
delivered to the crashing cpu, as it should be the only bit set in the online
cpu mask. Please feel free to poke holes in this idea.
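
As a rough illustration of the reasoning above (a standalone model, not kernel code): if the IOAPIC destination is derived from the online cpu mask, and only the crashing cpu is still online, the destination collapses to that single cpu. The sketch assumes flat logical destination mode, where each of up to 8 cpus owns one bit of an 8-bit destination. The helper names flat_logical_dest and online_after_crash are hypothetical and do not exist in the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical model of a flat logical destination: the low 8 bits of
 * the online mask, one bit per cpu. Cpus 8 and above are not
 * representable in this mode.
 */
static uint8_t flat_logical_dest(uint64_t online_mask)
{
    return (uint8_t)(online_mask & 0xffu);
}

/*
 * Assumed online mask once every cpu except crashing_cpu has been
 * halted. Returns 0 for cpu numbers that do not fit in the 64-bit mask.
 */
static uint64_t online_after_crash(unsigned int crashing_cpu)
{
    if (crashing_cpu >= 64)
        return 0;
    return UINT64_C(1) << crashing_cpu;
}
```

With these definitions, flat_logical_dest(online_after_crash(3)) is 0x08. A crashing cpu numbered 8 or higher yields a destination of 0 in this model, which would be one way for the idea to fail.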


Thanks & Regards
Neil
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman-H+wXaHxf7aLQT0dZR+***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-28 00:43:15 UTC
Permalink
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Close. There are two options with virtual wire mode.
- Either the local apic is in virtual wire mode, and somehow the
legacy interrupts make it to the local cpu.
I assume this is the case if the ioapic is also in virtual wire
mode.
No. The ioapic is completely disabled in this case.
Post by Neil Horman
and the
destination field for the appropriate interrupt(s) (the timer interrupt in this
case) is set to either physical mode with a destination id of the lapic for the
running cpu, or if it is set to logical mode and the destination id has the
corresponding bit for the running cpu set. Is that right?
No. All of the ioapic routing entries are disabled.
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
- Or an ioapic is in virtual wire mode and the legacy interrupt
controller is connected to it.
I thought we only had one ioapic in this system (Ben correct me if I'm wrong on
that please). I thought the above printk told us that, because apic2 and pin2
are both -1, that means that the 8259 isn't physically connected to any cpu, and
instead is routed through apic 0, and asserts on pin 2 of that ioapic.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
So I guess fundamentally for any SMP system that only supports the
cpu being in local apic mode and only routes interrupts to the boot
strap processor we could be in trouble. That is what our current
information about your system suggests.
If that were the case, then we would need to support moving kexec boot to cpu0,
at least in some limited cases. I've got a patch together that enables the
handshaking I was brainstorming earlier, which should allow an attempted jump to
cpu0 on a crash, with a fallback to booting on the crashing processor. If we
wind up confirming the above case, then I'll post it.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
However most systems actually connect the i8254 PIC interrupt
Sorry, to split hairs here, but you mean the 8259 right?
Grr. Yes. The i8259. Got the timer and the PIC numbers confused. Oops.
The legacy configuration is the timer to the PIC to either the local
or the ioapic.
Post by Neil Horman
Just want to be sure I'm clear on whats going on. I thought the
8254 was the external timer.
Yep. You have it straight.
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
controller to the ioapic in virtual wire mode. As I recall the
standard mapping is to ioapic 0, pin 0. With ioapic 0, pin 2 being
the timer interrupt (Possibly it is the other way around).
So as a test we could feed those values into ioapic_8259 and see
if the kdump case works. I believe we prefer putting the ioapic
into virtual wire mode over putting the cpu into virtual wire
mode. We can only control which cpu receives the legacy interrupts if
we are putting the ioapic in virtual wire mode.
I'm sorry, I can't find ioapic_8259 defined anywhere. Where is that supposed to
be? Show me where its defined and I'll happily write the patch.
Grr. That should have been ioapic_i8259. The second value we print out
in the ..Timer printk.
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
It may also be an interesting test to just enable the timer for the
ioapic in early boot, as you have suggested. I don't have a clue what
that will do.
Unfortunately nothing. We've tried using the local apic timer in a previous
test, and it resulted in no change, as did transitioning the cpu to the apic
timer via a call to switch_ipi_to_APIC_timer. Its possible I did something
wrong however.
Well if you didn't have the local apic enabled beyond virtual wire
mode that could have caused problems.

Sure. I suspect the problem is that you were still in a legacy irq
mode.
Post by Neil Horman
Currently I'm writing a patch that calls setup_ioapic_dest after we call
disable_IO_APIC. Looking at the implementation, it appears that calling this
function should rewrite the irq routing table in the ioapic to deliver
interrupts to the set of online cpus, as defined by the TARGET_CPUS macro. I
assume that if the ioapic is in virtual wire mode from the call to
disable_IO_APIC, then calling setup_ioapic_dest will force interrupts to be
delivered to the crashing cpu, as it should be the only bit set in the online
cpu mask. Please feel free to poke holes in this idea.
Just try and make certain: ioapic_i8259.pin != -1
Which should cause disable_IO_APIC to put the ioapic and not the local
apic in virtual wire mode.

Anything else is likely to do strange things.
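
The rule Eric states here can be written down as a small standalone model. It is only a sketch of the decision as described in this thread, not the body of disable_IO_APIC. The struct i8259_route stands in for ioapic_i8259, and the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* Which component ends up forwarding legacy (8259) interrupts. */
enum vwire_mode { VWIRE_VIA_IOAPIC, VWIRE_VIA_LOCAL_APIC };

/* Stand-in for ioapic_i8259: where the 8259 is attached, or -1/-1. */
struct i8259_route {
    int apic;
    int pin;
};

/* pin != -1 means the 8259 is routed through an ioapic pin. */
static enum vwire_mode shutdown_vwire_mode(struct i8259_route r)
{
    return r.pin != -1 ? VWIRE_VIA_IOAPIC : VWIRE_VIA_LOCAL_APIC;
}

/*
 * Per the discussion, the receiving cpu can only be chosen when the
 * ioapic is the one placed in virtual wire mode.
 */
static bool can_choose_target_cpu(enum vwire_mode mode)
{
    return mode == VWIRE_VIA_IOAPIC;
}
```

For the values reported earlier (apic2=-1 pin2=-1) this model selects VWIRE_VIA_LOCAL_APIC, the case in which the target cpu cannot be chosen.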

Eric
Neil Horman
2007-11-28 15:54:35 UTC
Permalink
Post by Neil Horman
Post by Ben Woodard
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
So, it sounds to me then, like unless I'm willing to really re-write the APIC
setup code (which I don't feel qualified to do quite yet), that the immediate
solution would be to not rely on interrupts in legacy mode, which was according
to my understanding, what the use of the irqpoll command line option was
supposed to enable. Any thoughts as to why that might not be working in this
case, or suggested tests to determine a cause there?
Hmm. It looks like irqpoll expects to have at least one irq working.
I wonder if we can fix that.
Looking at it from the other direction what does this line
from check_timer() look like when you boot a normal kernel?
apic_printk(APIC_VERBOSE,KERN_INFO "..TIMER: vector=0x%02X apic1=%d
pin1=%d apic2=%d pin2=%d\n",
cfg->vector, apic1, pin1, apic2, pin2);
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Ok, I think from what I understand of what we're reading here, the apic2 = -1
and pin2 = -1 indicate that the 8259 has no direct connection to any cpu, which
means that on shutdown disable_IO_APIC should take us into virtual wire mode.
As such enabling the APIC early in boot should fix us, but more concisely,
rewriting the entry in the IOAPIC to deliver int0 to the only running cpu should
accomplish the same goal for this problem. Does that sound reasonable (at least
as a test to ensure we understand the problem) to everyone?
Neil
I'm sorry to reply to myself, but I think I actually had this backwards. The
fact that the 8259 does not get reported as being routed through the ioapic,
means that in disable_IO_APIC during a crash shutdown, we do _not_ go into
virtual wire mode. Instead we simply disable the LVT0 on the ioapic (which as I
understand it from above is the timer interrupt), and we trust the 8259 to
route the interrupt to the appropriate cpu. If the 8259 is hardwired to deliver
interrupts to cpu0 only, then we do in fact need to switch processors during
shutdown. Perhaps what we need to do is detect if the 8259 is not routed
through the ioapic on shutdown, and if so, only then implement my patch to
jump to the bsp.

Thoughts?

Neil
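
For anyone comparing boot logs across machines, the "..TIMER:" line quoted above can be pulled apart mechanically. The following is a minimal userspace sketch. The struct and function names are made up for illustration, and the format string simply mirrors the apic_printk shown in this thread.

```c
#include <stdio.h>

/* Fields of the "..TIMER:" line; names follow the printk arguments. */
struct timer_route {
    unsigned int vector;
    int apic1, pin1;   /* first apic/pin pair printed */
    int apic2, pin2;   /* second pair (ioapic_i8259), -1 if absent */
};

/* Returns 1 if all five fields were parsed, 0 otherwise. */
static int parse_timer_line(const char *line, struct timer_route *out)
{
    int n = sscanf(line,
                   "..TIMER: vector=0x%x apic1=%d pin1=%d apic2=%d pin2=%d",
                   &out->vector, &out->apic1, &out->pin1,
                   &out->apic2, &out->pin2);
    return n == 5;
}
```

Feeding it the line from Ben's system gives vector 0x31, apic1 0, pin1 2, and -1 for both apic2 and pin2.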
Post by Neil Horman
Post by Ben Woodard
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
I'm curious what values we see at boot for the legacy mode the first
time it is setup, and possibly what chipset you are on.
I know we have had a few times where we have failed to reprogram
the ioapic properly at shutdown. So it can't hurt to look at
that possibility one last time.
For whatever it is worth my original attempt at using only the
apic mode was commit: f2b36db692b7ff6972320ad9839ae656a3b0ee3e
Looks like I didn't have an x86_64 version.
It looks like about half the cleanups I needed was to decouple
smp cpu startup and apic initialization. I think the hotplug cpu
case has done a more thorough version of that now.
There was a bit of reshuffling needed to get the everything initialized
in the right order when we started apic mode sooner.
Anyway I might just take another look at this if I can find enough
moments to string together.
Eric
_______________________________________________
kexec mailing list
http://lists.infradead.org/mailman/listinfo/kexec
--
-ben
-=-
Andi Kleen
2007-11-27 15:24:11 UTC
Permalink
Post by Neil Horman
So, it sounds to me then, like unless I'm willing to really re-write the APIC
setup code (which I don't feel qualified to do quite yet), that the immediate
solution would be to not rely on interrupts in legacy mode, which was according
to my understanding, what the use of the irqpoll command line option was
supposed to enable.
irqpoll won't fix the timer interrupt.

-Andi
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-27 14:56:44 UTC
Permalink
Post by Neil Horman
this is any less reliable than what we have currently.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
It doesn't make things more reliable, and it adds code to a code path
that already has too much code to be solidly reliable (thus your
problem).
Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always delivering interrupts in apic mode.
If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode
Probably legacy mode always routes to CPU #0. Makes sense and is
not really a misconfiguration of legacy mode.
Possible. So far I have not seen a hardware setup that would force
interrupts to cpu #0 in legacy mode. But I would not be truly
surprised if it happened that there was hardware that only worked that
way.
Post by Neil Horman
But if CPU #0 has interrupts disabled no interrupts get delivered.
- Move to CPU #0
- Do not use legacy mode during shutdown.
(Do not use legacy mode in the kdump kernel. Removing it from shutdown
is just a minor optimization)
Post by Neil Horman
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.
First and last option are probably unreliable for the kdump case.
Second or third sound best.
I suspect the real fix would be to enable IOAPIC mode really
early and never use the timers in legacy mode. Then the kdump
kernel wouldn't care about the legacy mode pointing to the wrong CPU.
Exactly. If we can work out the details that should be a much more reliable
mode of operation.
Post by Neil Horman
IIRC Eric even had a patch for that a long time ago, but it broke some
things so it wasn't included. But perhaps it should be revisited.
My real problem was the failure case was obscure (a bad interaction
with ACPI on Linus's laptop) and I didn't have the time to track it
down when it showed up.

My patch had two parts. Some cleanups to enable the code to be enabled
early, and the actual early enable. I figure if we can get the
cleanups in one major kernel version and then in the next enable
the apic mode before we start getting interrupts we should be in good
shape.

I expect with x86 becoming an embedded platform with multiple cpus we
may start seeing systems that don't actually support legacy PIC mode
for interrupt delivery.

Eric
Neil Horman
2007-11-27 15:34:44 UTC
Permalink
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
this is any less reliable than what we have currently.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
It doesn't make things more reliable, and it adds code to a code path
that already has too much code to be solidly reliable (thus your
problem).
Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always delivering interrupts in apic mode.
If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode
Probably legacy mode always routes to CPU #0. Makes sense and is
not really a misconfiguration of legacy mode.
Possible. So far I have not seen a hardware setup that would force
interrupts to cpu #0 in legacy mode. But I would not be truly
surprised if it happened that there was hardware that only worked that
way.
That would certainly explain the behavior I am observing here.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
But if CPU #0 has interrupts disabled no interrupts get delivered.
- Move to CPU #0
- Do not use legacy mode during shutdown.
(Do not use legacy mode in the kdump kernel. removing it from shutdown
is just minor optimization)
Post by Neil Horman
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.
First and last option are probably unreliable for the kdump case.
Second or third sound best.
I suspect the real fix would be to enable IOAPIC mode really
early and never use the timers in legacy mode. Then the kdump
kernel wouldn't care about the legacy mode pointing to the wrong CPU.
Exactly. If we can work out the details that should be a much more reliable
mode of operation.
Post by Neil Horman
IIRC Eric even had a patch for that a long time ago, but it broke some
things so it wasn't included. But perhaps it should be revisited.
My real problem was the failure case was obscure (a bad interaction
with ACPI on Linus's laptop) and I didn't have the time to track it
down when it showed up.
My patch had two parts. Some cleanups to enable the code to be enabled
early, and the actual early enable. I figure if we can get the
cleanups in one major kernel version and then in the next enable
the apic mode before we start getting interrupts we should be in good
shape.
I expect with x86 becoming an embedded platform with multiple cpus we
may start seeing systems that don't actually support legacy PIC mode
for interrupt delivery.
Do you have a pointer to the old patch set? I'd like to try it out on the failing system here.

Regards
Neil
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
Ben Woodard
2007-11-27 18:41:15 UTC
Permalink
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
this is any less reliable than what we have currently.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
It doesn't make things more reliable, and it adds code to a code path
that already has too much code to be solidly reliable (thus your
problem).
Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always delivering interrupts in apic mode.
If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode
Probably legacy mode always routes to CPU #0. Makes sense and is
not really a misconfiguration of legacy mode.
The BIOS and kernel writer's guide for Opteron explicitly states that
the platform will boot on CPU0 kind of by definition. So this seems like
a fair statement. I can easily see BIOS writers or hardware designers
interpreting that to mean that they only have to make sure that
interrupts get to the CPU that the BIOS thinks of as CPU0 when the APIC
is in legacy mode.

I have a query out to some SuperMicro engineers to find out if it is a
hardware limitation or if it is APIC misconfiguration. Maybe we can
solve this problem with a BIOS update.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Possible. So far I have not seen a hardware setup that would force
interrupts to cpu #0 in legacy mode. But I would not be truly
surprised if it happened that there was hardware that only worked that
way.
Post by Neil Horman
But if CPU #0 has interrupts disabled no interrupts get delivered.
- Move to CPU #0
- Do not use legacy mode during shutdown.
(Do not use legacy mode in the kdump kernel. removing it from shutdown
is just minor optimization)
Post by Neil Horman
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.
First and last option are probably unreliable for the kdump case.
Second or third sound best.
I can agree with the fourth option being a very bad one but I really
haven't seen anything in this discussion which supports the assertion
that "Move to the CPU that the BIOS originally called CPU#0" is going to
be unreliable. Admittedly we haven't tried this on every single x86_64
platform that we have but on the handful that we have tried so far, it
hasn't been a problem. Why is everybody jumping to the assumption that
it will be less reliable?
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
I suspect the real fix would be to enable IOAPIC mode really
early and never use the timers in legacy mode. Then the kdump
kernel wouldn't care about the legacy mode pointing to the wrong CPU.
Exactly. If we can work out the details that should be a much more reliable
mode of operation.
Post by Neil Horman
IIRC Eric even had a patch for that a long time ago, but it broke some
things so it wasn't included. But perhaps it should be revisited.
My real problem was the failure case was obscure (a bad interaction
with ACPI on Linus's laptop) and I didn't have the time to track it
down when it showed up.
My patch had two parts. Some cleanups to enable the code to be enabled
early, and the actual early enable. I figure if we can get the
cleanups in one major kernel version and then in the next enable
the apic mode before we start getting interrupts we should be in good
shape.
I expect with x86 becoming an embedded platform with multiple cpus we
may start seeing systems that don't actually support legacy PIC mode
for interrupt delivery.
Eric
--
-ben
-=-
Neil Horman
2007-11-27 19:42:20 UTC
Permalink
Post by Ben Woodard
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Andi Kleen
But if CPU #0 has interrupts disabled no interrupts get delivered.
- Move to CPU #0
- Do not use legacy mode during shutdown.
(Do not use legacy mode in the kdump kernel. removing it from shutdown
is just minor optimization)
Post by Andi Kleen
- Or do not rely on interrupts after enabling legacy mode
- Or do not disable interrupts on the other CPUs when they're
halted.
First and last option are probably unreliable for the kdump case.
Second or third sound best.
I can agree with the fourth option being a very bad one but I really
haven't seen anything in this discussion which supports the assertion
that "Move to the CPU that the BIOS originally called CPU#0" is going to
be unreliable. Admittedly we haven't tried this on every single x86_64
platform that we have but on the handful that we have tried so far, it
hasn't been a problem. Why is everybody jumping to the assumption that
it will be less reliable?
Ben, I tend to agree. I think re-enabling the APIC early in the boot process
provides a greater degree of reliability in that it more quickly restores the
system to a state where booting on a cpu other than cpu0 will be more likely to
work, but I have to say that overall it seems like booting a secondary kernel on
cpu0, when possible, offers the highest degree of reliability.

Perhaps what we need is a 'both solution'. Re-enabling the apic to full smp
functionality early in the boot process is a good solution for the problems
which we are hypothesizing here, and would be a good thing to do in general, but
it doesn't preclude also attempting to switch back to cpu0 during a crash.

Perhaps it would be worthwhile to:

1) Investigate the early enablement of the ioapic for x86[_64]
2) implement my previously proposed patch with the addition of a handshake
element, such that:
a) when the boot cpu gets the ipi from machine_crash_shutdown it flags
the fact that it is going to boot the kexec kernel with a global
variable
b) the crashing cpu loops waiting for either:
I) a timeout of 1 second
II) a reduction of the halt count to zero
III) the setting of the flag mentioned in (a)
c) the crashing cpu, if it sees that it is not the boot cpu AND
that the flag in (III) is set, will halt itself, otherwise it
will set the flag and boot the kexec image itself.

With this modification, we can try to relocate to cpu0, and if we fail, we fall
back to booting on the crashing processor.

I'll work up a patch that implements (2), unless there are strong objections. I
see no reason why we can't implement this 'both' solution.
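
The handshake in (2) can be sketched in userspace as follows. This is a minimal model under stated assumptions, not the proposed kernel patch: C11 atomics stand in for kernel primitives, a poll budget stands in for the 1 second timeout, and the compare-and-exchange used to set the flag is an addition that the proposal above does not specify (it is there so that only one cpu can win). All names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum crash_action { ACTION_HALT, ACTION_BOOT_KEXEC };

/* Flag from step (a): set by whichever cpu commits to booting kexec. */
static atomic_bool kexec_boot_claimed;

/* Helper for simulations: clear the flag between modelled crashes. */
static void handshake_reset(void)
{
    atomic_store(&kexec_boot_claimed, false);
}

/* Atomically claim the right to boot; true only for the first caller. */
static bool try_claim_boot(void)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&kexec_boot_claimed,
                                          &expected, true);
}

/* Step (a): run on the boot cpu when the crash IPI arrives. */
static enum crash_action boot_cpu_on_ipi(void)
{
    return try_claim_boot() ? ACTION_BOOT_KEXEC : ACTION_HALT;
}

/*
 * Steps (b) and (c): run on the crashing cpu.
 * pending_halts models the number of cpus that have not halted yet;
 * max_polls stands in for the 1 second timeout.
 */
static enum crash_action crashing_cpu_decide(bool is_boot_cpu,
                                             atomic_int *pending_halts,
                                             unsigned int max_polls)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        if (atomic_load(&kexec_boot_claimed) ||
            atomic_load(pending_halts) == 0)
            break;
    }
    if (is_boot_cpu) {
        /* The crashing cpu is already the boot cpu: boot here. */
        atomic_store(&kexec_boot_claimed, true);
        return ACTION_BOOT_KEXEC;
    }
    /* Halt if the boot cpu claimed the boot, else fall back to us. */
    return try_claim_boot() ? ACTION_BOOT_KEXEC : ACTION_HALT;
}
```

If the boot cpu responds in time, the crashing cpu halts and the boot cpu launches the new kernel. If it never responds, the poll budget runs out and the crashing cpu boots the image itself, and a late-arriving boot cpu then halts.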

Regards
Neil
Vivek Goyal
2007-11-27 20:00:11 UTC
Permalink
On Tue, Nov 27, 2007 at 02:42:20PM -0500, Neil Horman wrote:

[..]
Post by Neil Horman
Ben I tend to agree. I think re-enabling the APIC early in the boot process
provides a greater degree of reliability in that it more quickly restores the
system to a state where booting on a cpu other than cpu0 will be more likely to
work, but I have to say that overall it seems like booting a secondary kernel on
cpu0, when possible offers the highest degree of reliability.
Perhaps what we need is a 'both solution'. Re-enabling the apic to full smp
functionality early in the boot process is a good solution for the problems
which we are hypothesizing here, and would be a good thing to do in general, but
it doesn't preclude also attempting to switch back to cpu0 during a crash.
1) Investigate the early enablement of the ioapic for x86[_64]
2) implement my previously proposed patch with the addition of a handshake
a) when the boot cpu gets the ipi from machine_crash_shutdown it flags
the fact that it is going to boot the kexec kernel with a global
variable
I) a timeout of 1 second
II) a reduction of the halt count to zero
III) the setting of the flag mentioned in (a)
c) the crashing cpu, if it sees that it is not the boot cpu AND
that the flag in (III) is set, will halt itself, otherwise it
will set the flag and boot the kexec image itself.
With this modification, we can try to relocate to cpu0, and if we fail, we fall
back to booting on the crashing processor.
I'll work up a patch that implements (2), unless there are strong objections. I
see no reason why we can't implement this 'both' solution.
Hi Neil,

If we implement first solution, we don't have to implement second. Problem
will automatically be solved.

In general adding more code in crash shutdown path is not good. We are
trying to make that code path as small as possible.

OTOH, I think we have not root-caused this problem yet. We don't know yet
why interrupts are not coming to the non-boot cpu. I think we can go a little
deeper and compare the system state in a normal boot and a kdump boot to see
what has changed. System state would include the LAPIC and IOAPIC entries,
etc.

Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All the latest systems seem
to have virtual wire mode. I think in the case of PIC mode, interrupts
can be delivered to cpu0 only. In virtual wire mode, one can program the IOAPIC
to deliver interrupts to any of the cpus, and that's what we have been
relying on unless there is something board specific.
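
To make such a comparison concrete, a dumped IOAPIC redirection table entry can be decoded into its routing fields. This is a minimal sketch that assumes the 64-bit entry layout of the 82093AA IOAPIC (vector in bits 0-7, delivery mode in bits 8-10, destination mode in bit 11, trigger mode in bit 15, mask in bit 16, destination in bits 56-63). The struct and function names are invented for illustration, and how the raw values are obtained from each kernel is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Routing-relevant fields of one redirection table entry. */
struct rte_fields {
    uint8_t vector;         /* bits 0-7 */
    uint8_t delivery_mode;  /* bits 8-10; 0 = Fixed, 7 = ExtINT */
    bool logical_dest;      /* bit 11: 0 = physical, 1 = logical */
    bool level_triggered;   /* bit 15 */
    bool masked;            /* bit 16 */
    uint8_t dest;           /* bits 56-63 */
};

static struct rte_fields rte_decode(uint64_t raw)
{
    struct rte_fields f;

    f.vector = (uint8_t)(raw & 0xff);
    f.delivery_mode = (uint8_t)((raw >> 8) & 0x7);
    f.logical_dest = (raw >> 11) & 1;
    f.level_triggered = (raw >> 15) & 1;
    f.masked = (raw >> 16) & 1;
    f.dest = (uint8_t)(raw >> 56);
    return f;
}

/*
 * Compare two raw entries while ignoring the read-only status bits
 * (delivery status, bit 12, and remote IRR, bit 14).
 */
static bool rte_same_routing(uint64_t a, uint64_t b)
{
    const uint64_t status_bits = (UINT64_C(1) << 12) | (UINT64_C(1) << 14);

    return (a & ~status_bits) == (b & ~status_bits);
}
```

Running rte_same_routing over the entries captured in a normal boot and in a kdump boot would flag any pin whose routing changed between the two.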

Thanks
Vivek
Neil Horman
2007-11-27 20:52:55 UTC
Permalink
Post by Vivek Goyal
[..]
Post by Neil Horman
Ben I tend to agree. I think re-enabling the APIC early in the boot process
provides a greater degree of reliability in that it more quickly restores the
system to a state where booting on a cpu other than cpu0 will be more likely to
work, but I have to say that overall it seems like booting a secondary kernel on
cpu0, when possible offers the highest degree of reliability.
Perhaps what we need is a 'both solution'. Re-enabling the apic to full smp
functionality early in the boot process is a good solution for the problems
which we are hypothesizing here, and would be a good thing to do in general, but
it doesn't preclude also attempting to switch back to cpu0 during a crash.
1) Investigate the early enablement of the ioapic for x86[_64]
2) implement my previously proposed patch with the addition of a handshake
a) when the boot cpu gets the ipi from machine_crash_shutdown it flags
the fact that it is going to boot the kexec kernel with a global
variable
I) a timeout of 1 second
II) a reduction of the halt count to zero
III) the setting of the flag mentioned in (a)
c) the crashing cpu, if it sees that it is not the boot cpu AND
that the flag in (III) is set, will halt itself, otherwise it
will set the flag and boot the kexec image itself.
With this modification, we can try to relocate to cpu0, and if we fail, we fall
back to booting on the crashing processor.
I'll work up a patch that implements (2), unless there are strong objections. I
see no reason why we can't implement this 'both' solution.
Hi Neil,
If we implement first solution, we don't have to implement second. Problem
will automatically be solved.
Agreed, assuming:
1) The problems we have been hypothesising are accurate. As you note below, Ben
and I have dug deep to find the problem, but we could try to go deeper.

2) There are no other issues with (re)booting a system on the non-boot cpu. It
seems to me that if it's possible to reboot on cpu0, we should try. I understand
that we're trying to keep that code small for obvious reasons, but if we have a
fallback method to the crashing cpu, it seems reasonably safe to me.
Post by Vivek Goyal
In general adding more code in crash shutdown path is not good. We are
trying to make that code path as small as possible.
OTOH, I think we have not root caused this problem yet. We don't know yet
why interrupts are not coming to non-boot cpu. I think we can go little
deeper to compare the system state in normal boot and kdump boot and see
what has changed. System state would include, LAPIC and IOAPIC entries
etc.
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
relying on until and unless there is something board specific.
This is actually a very interesting question. Looking at disable_IO_APIC in the
latest git tree, which is used to revert the APIC to a legacy mode, we do one of
two things. If the on board 8259 is routed through the IOAPIC, then we
configure the APIC to be in virtual wire mode, so that the interrupt is
delivered via the APIC to whatever processor you want to configure. If however,
the 8259 bypasses the IOAPIC, then we simply disable the LAPIC's LVT0 interrupt,
so that any legacy timer interrupts from the apic are ignored, ostensibly because
the 8259 will assert the interrupt pin on the processor it is wired to directly.
I wonder if most (almost all) modern systems use the former configuration, while
the supermicro board in question is a rare exception and uses the latter. If the
8259 was routed directly to cpu0, that would explain this hang.

Regards
Neil
Post by Vivek Goyal
Thanks
Vivek
Andi Kleen
2007-11-27 22:24:08 UTC
Permalink
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.

-Andi
Ben Woodard
2007-11-27 23:24:35 UTC
Permalink
Post by Andi Kleen
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could
be bringing this problem upon ourselves when we restore that state.

Would anyone have any problems with code that simply verified that the
state which we are restoring allowed interrupts to get to the processor
that we are currently crashing on and, if not, poked in a reasonable value?

Yes this would add some complexity to the code paths where we were
crashing but it could prevent the problem that we are seeing. It seems
like a small fairly safe change rather than a big disruptive change like
moving the initialization of the IOAPIC earlier in the boot process.
Post by Andi Kleen
-Andi
--
-ben
-=-
Andi Kleen
2007-11-27 23:56:10 UTC
Post by Ben Woodard
Would anyone have any problems with code that simply verified that the
state which we are restoring allowed interrupts to get to the processor
that we are currently crashing on and if not, poked in a reasonable value.
Sounds reasonable by itself.
Post by Ben Woodard
Yes this would add some complexity to the code paths where we were
crashing but it could prevent the problem that we are seeing. It seems
like a small fairly safe change rather than a big disruptive change like
moving the initialization of the IOAPIC earlier in the boot process.
But longer (or not so long) term moving the IOAPIC earlier is the better option,
simply because the short use of PIC mode traditionally was a source of problems
on a lot of boxes.

And it does not really make sense to keep this source of trouble just for a short
time during boot when we could as well go directly into IO-APIC mode. This would
probably also match what other OS are doing better and that is always a good idea
for stable operation.

-Andi
Vivek Goyal
2007-11-28 15:36:49 UTC
Post by Ben Woodard
Post by Andi Kleen
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could be
bringing this problem upon ourselves when we restore that state.
Hi Ben,

Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.

entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));

Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
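
As an illustration of the step described above, here is a minimal standalone sketch of how an IOAPIC redirection entry for virtual wire mode could be composed so that its destination is the APIC ID of the crashing cpu. It is not the kernel's disable_IO_APIC code. The names ioapic_make_extint_entry, ioapic_entry_dest and ioapic_entry_delivery_mode are hypothetical, and the bit positions assume the classic 64-bit redirection table entry layout (delivery mode in bits 8-10, mask in bit 16, physical destination in bits 56-63).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions of a 64-bit IOAPIC redirection table entry. */
#define IOAPIC_DELIVERY_SHIFT   8
#define IOAPIC_DELIVERY_EXTINT  0x7u   /* ExtINT delivery mode (111b) */
#define IOAPIC_MASK_BIT         16
#define IOAPIC_DEST_SHIFT       56

/*
 * Build an unmasked, edge-triggered, active-high, physical-destination
 * ExtINT entry (all of those fields are 0) aimed at dest_apic_id.
 * Hypothetical helper for illustration only.
 */
uint64_t ioapic_make_extint_entry(unsigned dest_apic_id)
{
	uint64_t entry = 0;

	assert(dest_apic_id <= 0xFFu);	/* destination field is 8 bits wide */
	entry |= (uint64_t)IOAPIC_DELIVERY_EXTINT << IOAPIC_DELIVERY_SHIFT;
	entry |= (uint64_t)dest_apic_id << IOAPIC_DEST_SHIFT;
	return entry;
}

/* Extract the physical destination APIC ID from an entry. */
unsigned ioapic_entry_dest(uint64_t entry)
{
	return (unsigned)(entry >> IOAPIC_DEST_SHIFT) & 0xFFu;
}

/* Extract the 3-bit delivery mode from an entry. */
unsigned ioapic_entry_delivery_mode(uint64_t entry)
{
	return (unsigned)(entry >> IOAPIC_DELIVERY_SHIFT) & 0x7u;
}

/* Return 1 if the entry is masked, 0 otherwise. */
int ioapic_entry_masked(uint64_t entry)
{
	return (int)((entry >> IOAPIC_MASK_BIT) & 1u);
}
```

In this model, passing the APIC ID read on the crashing cpu as dest_apic_id yields an entry whose destination is that cpu rather than cpu0, which is the property the snippet above relies on.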

I think we need to go deeper and compare the state of system (APICS,
timer etc) during normal boot and kdump boot and see where is the
difference. This is how I solved some of the timer interrupt related
issues in the past.

Thanks
Vivek
Neil Horman
2007-11-28 16:02:06 UTC
Post by Vivek Goyal
Post by Ben Woodard
Post by Andi Kleen
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could be
bringing this problem upon ourselves when we restore that state.
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.
entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));
Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
Yes, but according to Ben's last debug effort, the APIC printout regarding the
timer setup indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic. In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which, if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be. If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, don't you?
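
To make the hypothesis concrete, here is a small standalone model of the two cases being discussed. It is a sketch of the reasoning in this thread, not kernel code. The struct i8259_route_model, the enum, and both functions are hypothetical names, and the extint_wired_cpu parameter encodes the unverified assumption that a bypassing 8259 reaches exactly one cpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_IOAPIC_PIN (-1)	/* models ioapic_i8259.pin == -1 */

/* Hypothetical stand-in for the apic/pin pair describing the 8259 route. */
struct i8259_route_model {
	int apic;
	int pin;
};

enum timer_path_model {
	TIMER_VIA_IOAPIC,	/* virtual wire through the IOAPIC */
	TIMER_VIA_DIRECT_EXTINT	/* 8259 bypasses the IOAPIC */
};

/* Pick the path taken at crash shutdown, based on whether a pin was found. */
enum timer_path_model crash_timer_path(const struct i8259_route_model *route)
{
	assert(route != NULL);
	return route->pin != NO_IOAPIC_PIN ? TIMER_VIA_IOAPIC
					   : TIMER_VIA_DIRECT_EXTINT;
}

/*
 * Through the IOAPIC, the destination can be set to the crashing cpu, so
 * the timer always arrives. On the direct path, this model assumes the
 * interrupt only reaches the single cpu the 8259 is wired to.
 */
bool crashing_cpu_sees_timer(const struct i8259_route_model *route,
			     int crashing_cpu, int extint_wired_cpu)
{
	if (crash_timer_path(route) == TIMER_VIA_IOAPIC)
		return true;
	return crashing_cpu == extint_wired_cpu;
}
```

Under these assumptions, a crash on any cpu other than the wired one with pin == -1 is the only combination that loses the timer, which matches the hang being reported.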

Regards
Neil
Post by Vivek Goyal
I think we need to go deeper and compare the state of system (APICS,
timer etc) during normal boot and kdump boot and see where is the
difference. This is how I solved some of the timer interrupt related
issues in the past.
Thanks
Vivek
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman-H+wXaHxf7aLQT0dZR+***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-28 17:36:12 UTC
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
Post by Andi Kleen
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the
state
Post by Vivek Goyal
Post by Ben Woodard
Post by Andi Kleen
that was left over originally by the BIOS.
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could be
bringing this problem upon ourselves when we restore that state.
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.
entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));
Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
Yes, but according to Bens last debug effort, the APIC printout regarding the
timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic. In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be. If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, do you?
Exactly.

It is still interesting to test what happens if you plug the normal values
into ioapic_i8259 for .pin and .apic (.pin is 0 or 2 and .apic is 0).

Having a command line parameter that could do that would be a cheap temporary
solution.

But this is the most likely reason why the timer interrupt is not working.

Eric
Neil Horman
2007-11-28 18:16:53 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
Post by Andi Kleen
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the
state
Post by Vivek Goyal
Post by Ben Woodard
Post by Andi Kleen
that was left over originally by the BIOS.
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could be
bringing this problem upon ourselves when we restore that state.
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.
entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));
Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
Yes, but according to Bens last debug effort, the APIC printout regarding the
timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic. In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be. If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, do you?
Exactly.
It is still interesting to test to see what happens if you plugin the
normal values into ioapic_i8259 for .pin and .apic (.pin is 0 or 2 and .apic is 0)
and see what happens.
Having a command line parameter that could do that would be a cheap temporary
solution.
But this is the most likely reason why the timer interrupt is not working.
Ok, thank you for the explanation, this all makes a good deal more sense to me
now. Ben is near the machine, so hopefully we'll hear from him soon with the
results of this test.

Given that, do you think the cpu-switch test that I proposed would be a good
solution now (with the fallback mechanism I described), or would a command line
8259 solution be better? I tend to think the former would be better since it
would be transparent to the user, but I'd like to have that debate.

Regards
Neil
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman-H+wXaHxf7aLQT0dZR+***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
Vivek Goyal
2007-11-28 19:05:25 UTC
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
Post by Andi Kleen
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All latest systems seems
to be having virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could be
bringing this problem upon ourselves when we restore that state.
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.
entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));
Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
Yes, but according to Bens last debug effort, the APIC printout regarding the
timer setup, indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic. In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be. If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, do you?
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.

I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPUs should see the timer interrupts (which are coming
from the 8259)?
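
For reference, a simplified sketch of what "LVT0 in ExtInt mode" means at the register level. It only models the delivery mode field (bits 8-10) and the mask bit (bit 16) of a local APIC LVT LINT0 value, which is an assumption taken from the usual local APIC register layout. The helper names are hypothetical, and the real kernel code touches more fields than this.

```c
#include <assert.h>
#include <stdint.h>

#define LVT0_DELIVERY_SHIFT   8
#define LVT0_DELIVERY_EXTINT  0x7u		/* ExtINT delivery mode */
#define LVT0_MASKED_BIT       (1u << 16)	/* interrupt masked when set */

/* Replace the 3-bit delivery mode field, leaving other bits untouched. */
uint32_t lvt0_set_delivery_mode(uint32_t lvt0, unsigned mode)
{
	assert(mode <= 0x7u);
	lvt0 &= ~(0x7u << LVT0_DELIVERY_SHIFT);
	return lvt0 | ((uint32_t)mode << LVT0_DELIVERY_SHIFT);
}

/* Unmask LINT0 and select ExtINT, so 8259 interrupts are accepted. */
uint32_t lvt0_enable_extint(uint32_t lvt0)
{
	lvt0 &= ~LVT0_MASKED_BIT;
	return lvt0_set_delivery_mode(lvt0, LVT0_DELIVERY_EXTINT);
}

/* Return 1 if this LVT0 value would accept an ExtINT on LINT0. */
int lvt0_accepts_extint(uint32_t lvt0)
{
	unsigned mode = (lvt0 >> LVT0_DELIVERY_SHIFT) & 0x7u;

	return !(lvt0 & LVT0_MASKED_BIT) && mode == LVT0_DELIVERY_EXTINT;
}
```

Comparing the LVT0 value printed during a normal boot with the one printed during a kdump boot against a check like lvt0_accepts_extint is one way to tell whether the local APIC side is configured to take the 8259 interrupt at all.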

Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?

Thanks
Vivek
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-28 19:42:22 UTC
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
However things are implemented completely differently now. I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.

So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.

For current Intel systems I believe either scenario still works.
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
It's worth a look.

I still think we need to just use apic mode at kernel startup, and
be done with it.

Eric
Neil Horman
2007-11-28 21:09:44 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
However things are implemented completely differently now. I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.
http://www.hypertransport.org/docs/tech/HTC20051222-0046-0008-Final-4-21-06.pdf
Table 143 suggests to me that legacy interrupts should be routed to all cpus,
which certainly doesn't seem to be the case in this situation. Perhaps Nvidia
goofed on that part of the specification?
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.
For current Intel systems I believe either scenario still works.
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
It's worth a look.
I still think we need to just use apic mode at kernel startup, and
be done with it.
Certainly, this seems like the best solution long term.

So I'm looking at the implementation for the 64 bit system, and it seems a little
cleaner than the 32 bit setup. I'm wondering if we can just call setup_IO_APIC
immediately after init_IRQ in start_kernel? Could it be that straightforward?

Neil
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman-H+wXaHxf7aLQT0dZR+***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-28 23:27:43 UTC
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
However things are implemented completely differently now. I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.
http://www.hypertransport.org/docs/tech/HTC20051222-0046-0008-Final-4-21-06.pdf
Table 143 suggest to me that legacy interrupts should be routed to all cpus,
which certainly doesn't seem to be the case in this situation. Perhaps Nvidia
goofed on that part of the specification?
Once it hits the coherent hypertransport fabric only AMD has control.
The packet does not have a destination field that could specify which
cpu. So Nvidia could not have goofed. It is a magic property of
the coherent hypertransport domain as to which cpu gets it.
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.
For current Intel systems I believe either scenario still works.
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
It's worth a look.
I still think we need to just use apic mode at kernel startup, and
be done with it.
Certainly, this seems like the best solution long term.
So I'm looking at the implementation for 64 bit system, and it seems a little
cleaner than 32 bit setup. I'm wondering if we can just call setup_IO_APIC
immediately after init_IRQ in start_kernel? Could it be that straightforward?
Pretty much.

The essence of what I did last round was put at the end of
init_IRQ():


	if (!disable_apic && cpu_has_apic) {
		/*
		 * Switch to APIC mode.
		 */
		setup_local_APIC();

		/*
		 * Now start the IO-APICs
		 */
		if (smp_found_config && !skip_ioapic_setup && nr_ioapics)
			setup_IO_APIC();
		else
			nr_ioapics = 0;
	}

Beyond that it is just dotting i's and crossing t's.

Eric
Ben Woodard
2007-11-30 02:16:48 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
However things are implemented completely differently now. I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.
So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.
For current Intel systems I believe either scenario still works.
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
It's worth a look.
I still think we need to just use apic mode at kernel startup, and
be done with it.
Neil whipped up a patch to try this and evidently it worked on his test
boxes but it didn't work very well on our problem test box. It hung
after the kernel printed "Ready". i.e. on a normal boot I get:

<snip>
2007-11-29 13:48:29 Loading
vmlinuz-2.6.18-13chaos.ben.test................................
2007-11-29 13:48:29 Loading
initrd-2.6.18-13chaos.ben.test.........................................................
..............................................................................
2007-11-29 13:48:29 Ready.
2007-11-29 13:48:30 Linux version 2.6.18-13chaos.ben.test (***@wopri) (gcc
version 4.1.2 20070626 (Red Hat 4.1.2-14
)) #10 SMP Thu Nov 29 13:11:49 PST 2007
2007-11-29 13:48:30 Command line: initrd=initrd-2.6.18-13chaos.ben.test
loglevel=8 console=ttyS0,115200n8 crashkernel=***@16M elevator=deadline
swiotlb=65536 selinux=0 apic=debug
BOOT_IMAGE=vmlinuz-2.6.18-13chaos.ben.test BOOTIF=
01-00-30-48-57-91-56

With Neil's patch:
2007-11-29 17:12:55 PXELINUX 2.11 2004-08-16 Copyright (C) 1994-2004 H.
Peter Anvin
2007-11-29 17:12:55 Boot options [default: 2.6.18-54.el5.bz336371]:
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux
2007-11-29 17:12:55 linux-2.6.18-54.el5.bz336371
2007-11-29 17:12:55 linux-2.6.18-52.el5
2007-11-29 17:12:55 linux-2.6.18-13chaos.ben.test-2.6.18-13chaos.ben.test
2007-11-29 17:12:55 linux-2.6.23-0.214.rc8.git2.fc8
2007-11-29 17:12:55 linux-2.6.18-8.1.14.el5
2007-11-29 17:12:55 linux-2.6.18-7chaos
2007-11-29 17:12:55 boot:
2007-11-29 17:13:02 Loading
vmlinuz-2.6.18-13chaos.ben.test................................
2007-11-29 17:13:02 Loading
initrd-2.6.18-13chaos.ben.test.........................................................
..............................................................................
2007-11-29 17:13:02 Ready.
(END)
That's all she wrote. End of story. Had to reboot to another kernel to
get it back.

Neil's patch:

--- linux-2.6.18.noarch/arch/x86_64/kernel/i8259.c.orig 2007-11-28 18:00:31.000000000 -0500
+++ linux-2.6.18.noarch/arch/x86_64/kernel/i8259.c 2007-11-29 10:37:14.000000000 -0500
@@ -599,4 +599,30 @@

if (!acpi_ioapic)
setup_irq(2, &irq2);
+
+ /*
+ * Switch from PIC to APIC mode.
+ */
+ connect_bsp_APIC();
+ setup_local_APIC();
+
+ if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_id) {
+ panic("Boot APIC ID in local APIC unexpected (%d vs %d)",
+ GET_APIC_ID(apic_read(APIC_ID)), boot_cpu_id);
+ /* Or can we switch back to PIC here? */
+ }
+
+ /*
+ * Now start the IO-APICs
+ */
+ if (!skip_ioapic_setup && nr_ioapics)
+ setup_IO_APIC();
+ else
+ nr_ioapics = 0;
+
+ /*
+ * Disable local irqs here so start_kernel doesn't complain
+ */
+ local_irq_disable();
+
}
--- linux-2.6.18.noarch/arch/x86_64/kernel/smpboot.c.orig 2007-11-28 18:07:33.000000000 -0500
+++ linux-2.6.18.noarch/arch/x86_64/kernel/smpboot.c 2007-11-29 10:37:59.000000000 -0500
@@ -1088,26 +1088,6 @@


/*
- * Switch from PIC to APIC mode.
- */
- connect_bsp_APIC();
- setup_local_APIC();
-
- if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_id) {
- panic("Boot APIC ID in local APIC unexpected (%d vs %d)",
- GET_APIC_ID(apic_read(APIC_ID)), boot_cpu_id);
- /* Or can we switch back to PIC here? */
- }
-
- /*
- * Now start the IO-APICs
- */
- if (!skip_ioapic_setup && nr_ioapics)
- setup_IO_APIC();
- else
- nr_ioapics = 0;
-
- /*
* Set up local APIC timer on boot CPU.
*/
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
_______________________________________________
kexec mailing list
http://lists.infradead.org/mailman/listinfo/kexec
--
-ben
-=-
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-30 02:54:16 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
However things are implemented completely differently now. I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.
So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.
For current Intel systems I believe either scenario still works.
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
It's worth a look.
I still think we need to just use apic mode at kernel startup, and
be done with it.
Neil whipped up a patch to try this and evidently it worked on his test boxes
but it didn't work very well on our problem tests box. It hung after the kernel
Interesting. Can you please try an early_printk console?


I expect you made it a fair ways and it just didn't show up because you didn't
get as far as the normal serial port setup.

You don't have any output from your linux kernel.

Eric
Yinghai Lu
2007-11-30 08:59:26 UTC
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
There are two modes for mcp55. The bios should have an option for virtual
wire to LVT0 of the BSP or to IOAPIC pin 0,
or an option like hpet route to ioapic pin 2.

For the kdump fix, could we enable LVT0 on the CPU used for kdump and disable it on the BSP?

ben,
can you send out
lspci -vvxxx -s 00:1.0

YH
Vivek Goyal
2007-11-30 14:35:34 UTC
Post by Yinghai Lu
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
there is two mode for mcp55. bios should have one option about virtul
wired to LVT0 of BSP or IOAPIC pin 0.
or the option like hpet route to ioapic pin 2.
That's interesting. So with BIOS options I can force timer
interrupts to be routed through IOAPIC? That would enable us to get
timer interrupts on any of the cpus in legacy mode. Ben, can you give it
a try?
Post by Yinghai Lu
for kdump fix, could enable LVT0 of CPU for kdump and disable that for BSP?
We would not know the crashing cpu in advance hence can't set it.

Thanks
Vivek
Neil Horman
2007-11-30 14:32:53 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Vivek Goyal
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at Intel Multiprocessor specification v1.4 and as per figure
3-3 on page 3-9, 8259 is connected to LINTIN0 line, which in turn is
connected to LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPU should see the timer interrupts (which is coming
from 8259)?
However things are implemented completely differently now. I don't think
the coherent hypertransport domain of AMD processors actually routes
ExtINT interrupts to all cpus but instead one (the default route?) is
picked.
So I think for the kdump case we pretty much need to use an IOAPIC
in virtual wire mode for recent AMD systems.
For current Intel systems I believe either scenario still works.
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
It's worth a look.
I still think we need to just use apic mode at kernel startup, and
be done with it.
Neil whipped up a patch to try this and evidently it worked on his test boxes
but it didn't work very well on our problem tests box. It hung after the kernel
Interesting can you please try an early_printk console.
I expect you made it a fair ways and it just didn't show up because you didn't
get as far as the normal serial port setup.
You don't have any output from your linux kernel.
I've got a system here that now seems to be behaving in a way that is similar
to what Ben describes (although I'm not sure it's the same problem).
early_printk shows that we're panicking inside check_timer, because we fail to
find any way to route the timer interrupt to the cpu. Specifically, we're
hitting this panic:

panic("IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter\n");

This doesn't make much sense to me, as we clearly have managed to get timer
interrupts at this point (since we made it through calibrate_delay)....

Looking at it, I wonder if this isn't a backporting issue. Ben and I ran this
test on an older kernel (since that's what the production system under test is
based on). Currently, check_timer is called directly from within setup_IO_APIC,
but in the 2.6.18 kernel that RHEL5 is based on, it's part of an initcall that's
run from within init near the call site of the original io apic init code. I
wonder if this isn't just an 'old kernel' issue, and that I need to move
check_timer to inside setup_IO_APIC.

Ben, is it possible for you to run an upstream kernel on one of these systems so
we can see if the patch works in that case? In the interim, I'll update my
patch for 2.6.18 so that check_timer gets moved earlier.

Neil
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Ben Woodard
2007-11-30 02:12:07 UTC
Post by Vivek Goyal
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
Post by Andi Kleen
Post by Vivek Goyal
Are we putting the system back in PIC mode or virtual wire mode? I have
not seen systems which support PIC mode. All the latest systems seem
to use virtual wire mode. I think in case of PIC mode, interrupts
Yes it's probably virtual wire. For real PIC mode we would need really
old systems without APIC.
Post by Vivek Goyal
can be delivered to cpu0 only. In virt wire mode, one can program IOAPIC
to deliver interrupt to any of the cpus and that's what we have been
The code doesn't try to program anything specific, it just restores the state
that was left over originally by the BIOS.
So if the BIOS originally left the IOAPIC in a state where the timer
interrupts were only going to CPU0 then by restoring that state we could be
bringing this problem upon ourselves when we restore that state.
Hi Ben,
Apart from restoring the original state (Bring APICS back to virtual wire
mode), we also reprogram IOAPIC so that timer interrupt can go to crashing
cpu (and not necessarily cpu0). Look at following code in disable_IO_APIC.
entry.dest.physical.physical_dest =
GET_APIC_ID(apic_read(APIC_ID));
Here we read the apic id of crashing cpu and program IOAPIC accordingly.
This will make sure that even in virtual wire mode, timer interrupts
will be delivered to crashing cpu APIC.
Yes, but according to Ben's last debug effort, the APIC printout regarding the
timer setup indicates that ioapic_i8259.pin == -1, meaning that the 8259 is not
routed through the ioapic. In those cases, disable_IO_APIC does not take us
through the path you reference above, and does not revert to virtual wire mode.
Instead, it simply disables legacy vector 0, which, if I understand this
correctly, simply tells the ioapic to not handle timer interrupts, trusting that
the 8259 in the system will deliver that interrupt where it needs to be. If the
8259 is wired to deliver timer interrupts to cpu0 only, then you get the problem
that we have, don't you?
Ok. Got it. So in this case we route the interrupts directly through LAPIC
and put LVT0 in ExtInt mode and IOAPIC is bypassed.
I am looking at the Intel MultiProcessor Specification v1.4, and as per figure
3-3 on page 3-9, the 8259 is connected to the LINTIN0 line, which in turn is
connected to the LINTIN0 pin on all processors. If that is the case, even in
this mode, all the CPUs should see the timer interrupts (which are coming
from the 8259)?
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
Here are the ones from a normal bootup.

I was unable to get info from a kdump boot. I haven't figured out why
yet. With the same patch that I used to capture this, when I tried to
kdump the kernel, it paused a second or two after the backtrace and then
dropped to BIOS and came up normally.

Here is a little trick, at the point where we are trying to get the info
to print out, the kernel command line hasn't been completely parsed yet.
That tricked me for part of the day. I had apic=debug on the command
line but the logic in print_local_APIC saw the default value because the
kernel command line had yet to be parsed.

2007-11-29 17:58:07 ***Here is the info you requested
2007-11-29 17:58:07
2007-11-29 17:58:07 printing local APIC contents on CPU#0/0:
2007-11-29 17:58:07 ... APIC ID: 00000000 (0)
2007-11-29 17:58:07 ... APIC VERSION: 80050010
2007-11-29 17:58:07 ... APIC TASKPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC ARBPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC PROCPRI: 00000000
2007-11-29 17:58:07 ... APIC EOI: 00000000
2007-11-29 17:58:07 ... APIC RRR: 00000002
2007-11-29 17:58:07 ... APIC LDR: 00000000
2007-11-29 17:58:07 ... APIC DFR: ffffffff
2007-11-29 17:58:07 ... APIC SPIV: 0000010f
2007-11-29 17:58:07 ... APIC ISR field:
2007-11-29 17:58:07 ... APIC TMR field:
2007-11-29 17:58:07 ... APIC IRR field:
2007-11-29 17:58:07 ... APIC ESR: 00000000
2007-11-29 17:58:07 ... APIC ICR: 00004630
2007-11-29 17:58:07 ... APIC ICR2: 07000000
2007-11-29 17:58:07 ... APIC LVTT: 00010000
2007-11-29 17:58:07 ... APIC LVTPC: 00010000
2007-11-29 17:58:07 ... APIC LVT0: 00000700
2007-11-29 17:58:07 ... APIC LVT1: 00000400
2007-11-29 17:58:07 ... APIC LVTERR: 0001000f
2007-11-29 17:58:07 ... APIC TMICT: 80000000
2007-11-29 17:58:07 ... APIC TMCCT: 00000000
2007-11-29 17:58:07 ... APIC TDCR: 00000000
2007-11-29 17:58:07
2007-11-29 17:58:07 number of MP IRQ sources: 15.
2007-11-29 17:58:07 number of IO-APIC #8 registers: 0.
2007-11-29 17:58:07 number of IO-APIC #9 registers: 0.
2007-11-29 17:58:07 number of IO-APIC #10 registers: 0.
2007-11-29 17:58:07 testing the IO APIC.......................
2007-11-29 17:58:07
2007-11-29 17:58:07 IO APIC #8......
2007-11-29 17:58:07 .... register #00: 08000000
2007-11-29 17:58:07 ....... : physical APIC id: 08
2007-11-29 17:58:07 .... register #01: 00170011
2007-11-29 17:58:07 ....... : max redirection entries: 0017
2007-11-29 17:58:07 ....... : PRQ implemented: 0
2007-11-29 17:58:07 ....... : IO APIC version: 0011
2007-11-29 17:58:07 .... register #02: 08000000
2007-11-29 17:58:07 ....... : arbitration: 08
2007-11-29 17:58:07 .... IRQ redirection table:
2007-11-29 17:58:07 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
2007-11-29 17:58:07 00 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 01 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 02 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 03 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 04 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 05 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 06 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 07 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 08 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 09 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 0a 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 0b 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 0c 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 0d 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 0e 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 0f 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 10 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 11 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 12 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 13 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 14 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 15 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 16 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 17 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07
2007-11-29 17:58:07 IO APIC #9......
2007-11-29 17:58:07 .... register #00: 09000000
2007-11-29 17:58:07 ....... : physical APIC id: 09
2007-11-29 17:58:07 .... register #01: 00060011
2007-11-29 17:58:07 ....... : max redirection entries: 0006
2007-11-29 17:58:07 ....... : PRQ implemented: 0
2007-11-29 17:58:07 ....... : IO APIC version: 0011
2007-11-29 17:58:07 .... register #02: 00000000
2007-11-29 17:58:07 ....... : arbitration: 00
2007-11-29 17:58:07 .... IRQ redirection table:
2007-11-29 17:58:07 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
2007-11-29 17:58:07 00 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:07 01 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 02 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 03 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 04 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 05 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 06 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08
2007-11-29 17:58:08 IO APIC #10......
2007-11-29 17:58:08 .... register #00: 0A000000
2007-11-29 17:58:08 ....... : physical APIC id: 0A
2007-11-29 17:58:08 .... register #01: 00060011
2007-11-29 17:58:08 ....... : max redirection entries: 0006
2007-11-29 17:58:08 ....... : PRQ implemented: 0
2007-11-29 17:58:08 ....... : IO APIC version: 0011
2007-11-29 17:58:08 .... register #02: 00000000
2007-11-29 17:58:08 ....... : arbitration: 00
2007-11-29 17:58:08 .... IRQ redirection table:
2007-11-29 17:58:08 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
2007-11-29 17:58:08 00 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 01 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 02 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 03 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 04 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 05 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 06 000 00 1 0 0 0 0 0 0 00
2007-11-29 17:58:08 Using vector-based indexing
2007-11-29 17:58:08 IRQ to pin mappings:
2007-11-29 17:58:08 IRQ0 -> 0:0
2007-11-29 17:58:08 IRQ1 -> 0:0
2007-11-29 17:58:08 IRQ2 -> 0:0
2007-11-29 17:58:08 IRQ3 -> 0:0
2007-11-29 17:58:08 IRQ4 -> 0:0
2007-11-29 17:58:08 IRQ5 -> 0:0
2007-11-29 17:58:08 IRQ6 -> 0:0
2007-11-29 17:58:08 IRQ7 -> 0:0
2007-11-29 17:58:08 IRQ8 -> 0:0
2007-11-29 17:58:08 IRQ9 -> 0:0
2007-11-29 17:58:08 IRQ10 -> 0:0
2007-11-29 17:58:08 IRQ11 -> 0:0
2007-11-29 17:58:08 IRQ12 -> 0:0
2007-11-29 17:58:08 IRQ13 -> 0:0
2007-11-29 17:58:08 IRQ14 -> 0:0
2007-11-29 17:58:08 IRQ15 -> 0:0
2007-11-29 17:58:08 IRQ0 -> 0:0
<snip a whole bunch of identical lines>
2007-11-29 17:58:08 .................................... done.
2007-11-29 17:58:08 Built 4 zonelists. Total pages: 3937654
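
For anyone decoding the dump above by hand, here is a minimal sketch of how
the "register #01" words split into the fields that print_IO_APIC reports.
The helper names are made up for illustration and are not kernel functions;
the bit positions follow the usual IO-APIC version register layout (version
in bits 0-7, PRQ in bit 15, max redirection entry in bits 16-23).

```c
#include <stdint.h>

/* Hypothetical helpers, not kernel API: split IO-APIC register #01. */

/* Bits 0..7: IO APIC version (0x11 in the dump above). */
static unsigned int ioapic_version(uint32_t reg_01)
{
    return reg_01 & 0xffu;
}

/* Bit 15: PRQ implemented flag. */
static unsigned int ioapic_prq(uint32_t reg_01)
{
    return (reg_01 >> 15) & 0x1u;
}

/* Bits 16..23: index of the highest redirection entry. */
static unsigned int ioapic_max_redir_entry(uint32_t reg_01)
{
    return (reg_01 >> 16) & 0xffu;
}

/* The redirection table has one more row than the highest index. */
static unsigned int ioapic_nr_pins(uint32_t reg_01)
{
    return ioapic_max_redir_entry(reg_01) + 1;
}
```

With 00170011 (IO APIC #8) this gives a highest entry of 0x17, i.e. the 24
rows 00..17 in the table above; with 00060011 (#9 and #10) it gives the 7
rows 00..06.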


The patch that I was using to get this is as follows:
--- linux/init/main.c (revision 763)
+++ linux/init/main.c (working copy)
@@ -484,6 +484,8 @@
{
}

+extern void print_local_APIC(void);
+
asmlinkage void __init start_kernel(void)
{
char * command_line;
@@ -515,6 +517,11 @@
setup_per_cpu_areas();
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */

+ console_loglevel=10;
+ printk(KERN_DEBUG "***Here is the info you requested\n");
+ print_local_APIC();
+ print_IO_APIC();
+
/*
* Set up the scheduler prior starting any interrupts (such as the
* timer interrupt). Full topology setup happens at smp_init()
Index: linux/arch/x86_64/kernel/io_apic.c
===================================================================
--- linux/arch/x86_64/kernel/io_apic.c (revision 763)
+++ linux/arch/x86_64/kernel/io_apic.c (working copy)
@@ -1012,9 +1012,9 @@
union IO_APIC_reg_02 reg_02;
unsigned long flags;

- if (apic_verbosity == APIC_QUIET)
+/* if (apic_verbosity == APIC_QUIET)
return;
-
+*/
printk(KERN_DEBUG "number of MP IRQ sources: %d.\n",
mp_irq_entries);
for (i = 0; i < nr_ioapics; i++)
printk(KERN_DEBUG "number of IO-APIC #%d registers: %d.\n",
@@ -1131,8 +1131,6 @@
return;
}

-#if 0
-
static __apicdebuginit void print_APIC_bitfield (int base)
{
unsigned int v;
@@ -1158,9 +1156,9 @@
{
unsigned int v, ver, maxlvt;

- if (apic_verbosity == APIC_QUIET)
+/* if (apic_verbosity == APIC_QUIET)
return;
-
+*/
printk("\n" KERN_DEBUG "printing local APIC contents on
CPU#%d/%d:\n",
smp_processor_id(), hard_smp_processor_id());
v = apic_read(APIC_ID);
@@ -1268,8 +1266,6 @@
printk(KERN_DEBUG "... PIC ELCR: %04x\n", v);
}

-#endif /* 0 */
-
static void __init enable_IO_APIC(void)
{
union IO_APIC_reg_01 reg_01;
Post by Vivek Goyal
Thanks
Vivek
--
-ben
-=-
Vivek Goyal
2007-11-30 14:42:50 UTC
Permalink
[..]
Post by Ben Woodard
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
Here are the ones from a normal bootup.
I was unable to get info from a kdump boot. I haven't figured out why yet.
With the same patch that I used to capture this, when I tried to kdump the
kernel, it paused a second or two after the backtrace and then dropped to
BIOS and came up normally.
Here is a little trick, at the point where we are trying to get the info to
print out, the kernel command line hasn't been completely parsed yet. That
tricked me for part of the day. I had apic=debug on the command line but
the logic in print_local_APIC saw the default value because the kernel
command line had yet to be parsed.
2007-11-29 17:58:07 ***Here is the info you requested
2007-11-29 17:58:07
2007-11-29 17:58:07 ... APIC ID: 00000000 (0)
2007-11-29 17:58:07 ... APIC VERSION: 80050010
2007-11-29 17:58:07 ... APIC TASKPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC ARBPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC PROCPRI: 00000000
2007-11-29 17:58:07 ... APIC EOI: 00000000
2007-11-29 17:58:07 ... APIC RRR: 00000002
2007-11-29 17:58:07 ... APIC LDR: 00000000
2007-11-29 17:58:07 ... APIC DFR: ffffffff
2007-11-29 17:58:07 ... APIC SPIV: 0000010f
2007-11-29 17:58:07 ... APIC ESR: 00000000
2007-11-29 17:58:07 ... APIC ICR: 00004630
2007-11-29 17:58:07 ... APIC ICR2: 07000000
2007-11-29 17:58:07 ... APIC LVTT: 00010000
2007-11-29 17:58:07 ... APIC LVTPC: 00010000
2007-11-29 17:58:07 ... APIC LVT0: 00000700
Ok so here boot cpu LVT0 has been set to deliver any interrupt on pin
LINT0 as ExtInt. We do the same thing for kdump cpu local apic too.
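
As a rough illustration of how the LVT values in the dump are read, here is
a small sketch that pulls out the delivery mode (bits 8-10) and the mask bit
(bit 16) of a local APIC LVT entry. The macro and function names are
hypothetical, chosen for this example only; the kernel has its own
definitions for these fields.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the kernel's own macros. */
#define LVT_DELIVERY_MODE(v) (((v) >> 8) & 0x7u)   /* bits 8..10 */
#define LVT_MASK_BIT(v)      (((v) >> 16) & 0x1u)  /* bit 16, 1 = masked */

enum lvt_mode {
    LVT_MODE_FIXED  = 0x0,
    LVT_MODE_SMI    = 0x2,
    LVT_MODE_NMI    = 0x4,
    LVT_MODE_INIT   = 0x5,
    LVT_MODE_EXTINT = 0x7,
};

/* Return the raw delivery mode field of an LVT entry. */
static unsigned int lvt_delivery_mode(uint32_t lvt)
{
    return LVT_DELIVERY_MODE(lvt);
}

/* Return nonzero if the LVT entry is masked. */
static int lvt_is_masked(uint32_t lvt)
{
    return LVT_MASK_BIT(lvt) != 0;
}

/* Human-readable name for the delivery mode, for debug printouts. */
static const char *lvt_mode_name(uint32_t lvt)
{
    switch (lvt_delivery_mode(lvt)) {
    case LVT_MODE_FIXED:  return "Fixed";
    case LVT_MODE_SMI:    return "SMI";
    case LVT_MODE_NMI:    return "NMI";
    case LVT_MODE_INIT:   return "INIT";
    case LVT_MODE_EXTINT: return "ExtINT";
    default:              return "reserved";
    }
}
```

Applied to the dump, LVT0 = 00000700 decodes as ExtINT and unmasked, LVT1 =
00000400 as NMI and unmasked, and LVTT = 00010000 as masked.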

I am not sure which component in the system encodes the interrupt message
from the 8259 to be put on hypertransport, or on what basis it decides
whether it should go to cpu0 only or be broadcast. It looks like in this
case it is going to cpu0 only. That means we are left with no choice but
to work on a patch to initialize the IOAPIC early.

Thanks
Vivek
Neil Horman
2007-11-30 14:51:31 UTC
Permalink
Post by Vivek Goyal
[..]
Post by Ben Woodard
Post by Vivek Goyal
Can you print the LAPIC registers (print_local_APIC) during normal boot
and during kdump boot and paste here?
Here are the ones from a normal bootup.
I was unable to get info from a kdump boot. I haven't figured out why yet.
With the same patch that I used to capture this, when I tried to kdump the
kernel, it paused a second or two after the backtrace and then dropped to
BIOS and came up normally.
Here is a little trick, at the point where we are trying to get the info to
print out, the kernel command line hasn't been completely parsed yet. That
tricked me for part of the day. I had apic=debug on the command line but
the logic in print_local_APIC saw the default value because the kernel
command line had yet to be parsed.
2007-11-29 17:58:07 ***Here is the info you requested
2007-11-29 17:58:07
2007-11-29 17:58:07 ... APIC ID: 00000000 (0)
2007-11-29 17:58:07 ... APIC VERSION: 80050010
2007-11-29 17:58:07 ... APIC TASKPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC ARBPRI: 00000000 (00)
2007-11-29 17:58:07 ... APIC PROCPRI: 00000000
2007-11-29 17:58:07 ... APIC EOI: 00000000
2007-11-29 17:58:07 ... APIC RRR: 00000002
2007-11-29 17:58:07 ... APIC LDR: 00000000
2007-11-29 17:58:07 ... APIC DFR: ffffffff
2007-11-29 17:58:07 ... APIC SPIV: 0000010f
2007-11-29 17:58:07 ... APIC ESR: 00000000
2007-11-29 17:58:07 ... APIC ICR: 00004630
2007-11-29 17:58:07 ... APIC ICR2: 07000000
2007-11-29 17:58:07 ... APIC LVTT: 00010000
2007-11-29 17:58:07 ... APIC LVTPC: 00010000
2007-11-29 17:58:07 ... APIC LVT0: 00000700
Ok so here boot cpu LVT0 has been set to deliver any interrupt on pin
LINT0 as ExtInt. We do the same thing for kdump cpu local apic too.
I am not sure which component in the system encodes the interrupt message
from the 8259 to be put on hypertransport, or on what basis it decides
whether it should go to cpu0 only or be broadcast. It looks like in this
case it is going to cpu0 only. That means we are left with no choice but
to work on a patch to initialize the IOAPIC early.
That's what I'm doing at the moment. I'm working on a RHEL5 patch right now
(since that's what's on the production system that's failing), and will forward
port it once it's working.

And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)

Thanks
Neil
Post by Vivek Goyal
Thanks
Vivek
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman-H+wXaHxf7aLQT0dZR+***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
Neil Horman
2007-12-06 21:39:51 UTC
Permalink
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch right now
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scans for hypertransport controllers and ensures
that the broadcast bit is set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.

I understand that enabling the apic early in the boot process is a nice general
solution, but given the wide range of apic configurations possible, and the
need for large amounts of testing of such a change, this approach seems like it
might be the better way to go, as it corrects a bad behavior only in systems
that would be affected by this problem. It introduces no further complexity in
the kdump shutdown path, and creates no additional instability in systems that
would otherwise be unaffected by this bug.

I think this is the best way for us to go forward. The attached patch applies
cleanly against 2.6.24-rc3-mm2, works for me on several systems unaffected
by the kdump crash problem I originally reported, and fixes the bug on the
affected system.

Thanks & Regards
Neil

Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 41 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)


diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..ea16b53 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,45 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ * We found a hypertransport bus;
+ * make sure that we are broadcasting
+ * interrupts to all cpus on the ht bus.
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if ((htcfg & (1 << 17)) == 0)
+ printk(KERN_INFO "Enabling hypertransport interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+
+}
+
+static void __init check_hypertransport_config(void)
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -69,7 +108,7 @@ static void __init nvidia_bugs(void)
#endif
#endif
/* RED-PEN skip them on mptables too? */
-
+ check_hypertransport_config();
}

static void __init ati_bugs(void)
Vivek Goyal
2007-12-06 22:11:43 UTC
Permalink
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch right now
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scans for hypertransport controllers and ensures
that the broadcast bit is set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,

Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?

Thanks
Vivek
Neil Horman
2007-12-07 00:10:23 UTC
Permalink
Post by Vivek Goyal
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch right now
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scans for hypertransport controllers and ensures
that the broadcast bit is set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
Thanks
Vivek
No, I don't think that's necessary. Once the apics are enabled, interrupts
shouldn't travel across the hypertransport bus anyway, opting instead to use
the dedicated apic bus (at least that's my understanding). The only systems that
what you are suggesting would help are systems that have no apic at all, which I
can only imagine is rare on 64 bit systems, to say the least. The affected
domain is further reduced by the fact that this quirk is currently only being
applied to systems with nvidia PCI bridges, since those are the only systems
that this problem has manifested on. That seems like a rather small subset, if
it exists at all. I suppose we could enable the quirk only if we
are booting a kdump kernel (implying that we would need to do something like
detect the reset_devices command line option), but I think given the limited
effect of this patch, it's not really needed.

Regards
Neil
Vivek Goyal
2007-12-07 14:39:44 UTC
Permalink
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch right now
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scans for hypertransport controllers and ensures
that the broadcast bit is set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
Thanks
Vivek
No, I don't think that's necessary. Once the apics are enabled, interrupts
shouldn't travel across the hypertransport bus anyway, opting instead to use
the dedicated apic bus (at least that's my understanding).
I think all interrupt messages travel on hypertransport, even after the APICs
have been enabled.

Look at the following document.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24674.pdf

Have a look at figure 1, figure 2 and section 3.4.2.2 and 3.4.2.3

It's a separate matter that once the IOAPIC has formed the vectored message,
Hypertransport might not touch the destination field.

Having said that, I am wondering what will happen if a system continues
to operate the timer through IOAPIC in ExtInt mode. Will hypertransport
keep on broadcasting that interrupt to every cpu? And every cpu will
process that interrupt.

Hence, I feel it is safe to restore the broadcast bit back to BIOS value once
we are through calibrate_delay().

Thanks
Vivek
Neil Horman
2007-12-07 14:53:15 UTC
Permalink
Post by Vivek Goyal
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch right now
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scans for hypertransport controllers and ensures
that the broadcast bit is set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
Thanks
Vivek
No, I don't think that's necessary. Once the apics are enabled, interrupts
shouldn't travel across the hypertransport bus anyway, opting instead to use
the dedicated apic bus (at least that's my understanding).
I think all interrupt messages travel on hypertransport, even after the APICs
have been enabled.
Look at the following document.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24674.pdf
Have a look at figure 1, figure 2 and section 3.4.2.2 and 3.4.2.3
It's a separate matter that once the IOAPIC has formed the vectored message,
Hypertransport might not touch the destination field.
Ok, that might be the case then.
Post by Vivek Goyal
Having said that, I am wondering what will happen if a system continues
to operate the timer through IOAPIC in ExtInt mode. Will hypertransport
keep on broadcasting that interrupt to every cpu? And every cpu will
process that interrupt.
I don't think so. IIRC, once the other cpus are started they all disable the
timer interrupt, except for one cpu, opting instead to get the timer tick via
ipi. So while they all might see the interrupt packet on the ht bus, only one
cpu will process it.
Post by Vivek Goyal
Hence, I feel it is safe to restore the broadcast bit back to BIOS value once
we are through calibrate_delay().
I disagree. Looking at what Yinghai said, the default setting for the broadcast
bit isn't actually to unicast the interrupt, it's just to set the broadcast mask
to 0xF, or to 0xFF. Its use is actually to allow cpus with an extended 8 bit
apic id to see interrupts. So it's not so much to direct interrupts to cpu0, but
rather to the first 16 cpus rather than to all 255 available cpus. From what
I've seen in my testing, systems that 'work' already have this bit set by bios,
and my quirk patch above does nothing to them. Disabling this bit after
calibrate_delay is going to introduce more uncertainty in systems that have been
proven to work. We should leave well enough alone, and just enable the bit if
it's off and we see that we are using extended apic ids via bit 18 of the same
register, as Yinghai pointed out. By enabling the quirk that way, all we are
really doing is bringing into alignment two bits that should arguably be
set/cleared in unison anyway.
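
To make the proposed rule concrete, here is a minimal sketch of the
decision only, as a pure function on the register value. Bit 17 is the
broadcast bit used in the quirk patch earlier in this thread, and bit 18 is
the extended apic id bit as described above; the macro and function names
are placeholders for this example and do not come from the patch or the
kernel.

```c
#include <stdint.h>

/* Placeholder names; bit positions are the ones discussed in this thread. */
#define HT_CFG_BROADCAST  (1u << 17)  /* interrupt broadcast bit */
#define HT_CFG_EXT_APICID (1u << 18)  /* extended apic ids in use */

/*
 * Return the value that should be written back to the hypertransport
 * config register: set the broadcast bit only when extended apic ids
 * are enabled and the broadcast bit is currently clear. Every other
 * combination is left exactly as the BIOS programmed it.
 */
static uint32_t ht_cfg_align_broadcast(uint32_t htcfg)
{
    if ((htcfg & HT_CFG_EXT_APICID) && !(htcfg & HT_CFG_BROADCAST))
        htcfg |= HT_CFG_BROADCAST;
    return htcfg;
}

/* Nonzero if the quirk would actually change the register. */
static int ht_cfg_needs_fixup(uint32_t htcfg)
{
    return ht_cfg_align_broadcast(htcfg) != htcfg;
}
```

In the quirk itself, the value would be read with read_pci_config() and
written back with write_pci_config() only when it changes, so systems whose
BIOS already set the bit are not touched.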

Regards
Neil
Post by Vivek Goyal
Thanks
Vivek
Vivek Goyal
2007-12-07 15:16:23 UTC
Permalink
Post by Neil Horman
Post by Vivek Goyal
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch at the moment
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scanned for hypertransport controllers and ensured
that the broadcast bit was set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
Thanks
Vivek
No, I don't think thats necessecary. Once the apics are enabled, interrupts
shouldn't travel accross the hypertransport bus anyway, opting instead to use
the dedicated apic bus (at least thats my understanding).
I think all interrupt message travel on hypertransport. Even after APICS
have been enabled.
Look at the following document.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24674.pdf
Have a look at figure 1, figure 2 and section 3.4.2.2 and 3.4.2.3
That's a different thing that once IOAPIC has formed the vectored message,
Hypertransport might not touch the destination field.
Ok, that might be the case then.
Post by Vivek Goyal
Having said that, I am wondering what will happen if a system continues
to operate the timer through IOAPIC in ExtInt mode. Will hypertransport
keep on broadcasting that interrupt to every cpu? And every cpu will
process that interrupt.
I don't think so. IIRC once the other cpus are started they all disable the
timer interrupt, except for one cpu, opting instead to get the timer tick via
ipi, So while they all might see the interrupt packet on the ht bus, only one
cpu will process it.
Does the LAPIC allow disabling a specific vector so that it does not accept
interrupts? I don't think so. If a timer interrupt is broadcast to every cpu,
I think every cpu will accept it (like a broadcast IPI). That's why the
intelligence is built into the IOAPIC, which directs interrupts to a cpu or
group of cpus.

I am just trying to understand the functionality better. Can somebody help me
understand how we make sure that the same timer interrupt is not processed by
all cpus (assuming hypertransport is broadcasting it)?
Post by Neil Horman
Post by Vivek Goyal
Hence, I feel it is safe to restore the broadcast bit back to BIOS value once
we are through calibrate_delay().
I disagree. Looking at what Yinghai said, the default setting for the broadcast
bit isn't actually to unicast the interrupt, its just to set the broadcast mask
to 0xF, or to 0xFF. Its use is actually to allow cpus with an extended 8 bit
apic id see interrupts. So its not so much to direct interrupts to cpu0, but
rather to the first 16 cpus rather than to all 255 available cpus. From what
I've seen in my testing, systems that 'work' already have this bit set by bios,
and my quirk patch above does nothing to them. Disabling this bit after
calibrate_dealy is going to introduce more uncertainty in systems that have been
proven to work.
Again, for my understanding, I have a few questions.

- Why does nvidia choose not to broadcast the interrupts and still work
fine? Does that mean the nvidia chipset will not work with extended cpu apic
ids?

- IOW, why do other chipsets choose to broadcast the interrupts while
nvidia chooses not to and still works well?

- Why do I need to broadcast the interrupts and not target specific cpus?

- If I am broadcasting interrupts, how do I make sure only one cpu
picks it up?


Thanks
Vivek
Neil Horman
2007-12-07 15:53:31 UTC
Post by Vivek Goyal
Post by Neil Horman
Post by Vivek Goyal
Post by Neil Horman
Post by Vivek Goyal
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch at the moment
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scanned for hypertransport controllers and ensured
that the broadcast bit was set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
Thanks
Vivek
No, I don't think thats necessecary. Once the apics are enabled, interrupts
shouldn't travel accross the hypertransport bus anyway, opting instead to use
the dedicated apic bus (at least thats my understanding).
I think all interrupt message travel on hypertransport. Even after APICS
have been enabled.
Look at the following document.
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24674.pdf
Have a look at figure 1, figure 2 and section 3.4.2.2 and 3.4.2.3
That's a different thing that once IOAPIC has formed the vectored message,
Hypertransport might not touch the destination field.
Ok, that might be the case then.
Post by Vivek Goyal
Having said that, I am wondering what will happen if a system continues
to operate the timer through IOAPIC in ExtInt mode. Will hypertransport
keep on broadcasting that interrupt to every cpu? And every cpu will
process that interrupt.
I don't think so. IIRC once the other cpus are started they all disable the
timer interrupt, except for one cpu, opting instead to get the timer tick via
ipi, So while they all might see the interrupt packet on the ht bus, only one
cpu will process it.
Does LAPIC allow to disable a specific vector and not accept interrupts? I
don't think so. If a timer interrupt is broadcasted to every cpu I think
everybody will accept it (like broadcast IPI). That's why intelligence
is built into IOAPIC and direct interrupts to a cpu or group of cpu.
See disable_APIC_timer(). It seems to set the mask bit in the APIC_LVTT entry.
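A minimal sketch of what masking the local APIC timer amounts to, assuming the usual local vector table layout in which bit 16 is the mask bit (APIC_LVT_MASKED in the kernel headers, as far as I recall). The register is simulated with a plain variable and all helper names are invented for the example; the real disable_APIC_timer() goes through the APIC register accessors.

```c
#include <assert.h>
#include <stdint.h>

#define LVT_MASKED (UINT32_C(1) << 16) /* mask bit of a local vector table entry */

/* Simulated LVT timer register; real code would read/write APIC_LVTT. */
static uint32_t fake_lvtt;

static void mask_lvt_timer(void)   { fake_lvtt |= LVT_MASKED; }
static void unmask_lvt_timer(void) { fake_lvtt &= ~LVT_MASKED; }
static int  lvt_timer_masked(void) { return (fake_lvtt & LVT_MASKED) != 0; }

/* Usage: masking leaves the vector bits alone and only flips bit 16. */
static void lvt_timer_example(void)
{
	fake_lvtt = 0xef;               /* arbitrary vector, unmasked */
	mask_lvt_timer();
	assert(lvt_timer_masked());
	assert(fake_lvtt == 0x100ef);
	unmask_lvt_timer();
	assert(!lvt_timer_masked());
	assert(fake_lvtt == 0xef);
}
```

Note that this only covers the local timer entry; whether a cpu accepts ExtInt interrupts is controlled by a different LVT entry.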
Post by Vivek Goyal
I am just trying to understand the functionality better. Can somebody help me
understand how do we make sure that same timer interrupt is not processed by
all cpus (assuming hypertransport is broadcasting it)?
I understand your desire, but clearly something prevents it. As noted in our
earlier conversation, this bit doesn't actually force a unicast of an interrupt
packet; it simply masks the destination field. When set to zero, it means that
the ht interrupt packet destination field is restricted to 4 bits rather than 8.
So it's not as if setting it to zero guarantees that the interrupt is forced to
a single processor anyway. All setting this bit does is ensure that if any apics
on a system are addressed using an extended apic id, interrupts can
reach them. That's why it was suggested that this bit only be forcibly set if
bit 18 is also set.
Post by Vivek Goyal
Post by Neil Horman
Post by Vivek Goyal
Hence, I feel it is safe to restore the broadcast bit back to BIOS value once
we are through calibrate_delay().
I disagree. Looking at what Yinghai said, the default setting for the broadcast
bit isn't actually to unicast the interrupt, its just to set the broadcast mask
to 0xF, or to 0xFF. Its use is actually to allow cpus with an extended 8 bit
apic id see interrupts. So its not so much to direct interrupts to cpu0, but
rather to the first 16 cpus rather than to all 255 available cpus. From what
I've seen in my testing, systems that 'work' already have this bit set by bios,
and my quirk patch above does nothing to them. Disabling this bit after
calibrate_dealy is going to introduce more uncertainty in systems that have been
proven to work.
Again, for my understanding, I have a few questions.
- Why does nvidia choose not to broadcast the interrupts and still work
fine? Does that mean the nvidia chipset will not work with extended cpu apic
ids?
It doesn't! When booting normally, getting interrupts to apics that use 4 bit
apic ids is sufficient since cpu0 is in that set, but if we crash on a cpu with
an extended id, we hang.
Post by Vivek Goyal
- IOW, why do other chipsets choose to broadcast the interrupts while
nvidia chooses not to and still works well?
It doesn't! It hangs in the kdump kernel. It works well normally because
interrupts are delivered to cpus whose apic ids fit into 4 bits.
Post by Vivek Goyal
- Why do I need to broadcast the interrupts and not target specific cpus?
It's not a forced broadcast, it's a mask on the apic id. The IOAPIC still
addresses specific cpus.
Post by Vivek Goyal
- If I am broadcasting interrupts, how do I make sure only one cpu
picks it up?
The IOAPIC handles that.

Regards
Neil
Post by Vivek Goyal
Thanks
Vivek
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-07 18:46:21 UTC
Post by Neil Horman
Post by Vivek Goyal
Does LAPIC allow to disable a specific vector and not accept interrupts? I
don't think so. If a timer interrupt is broadcasted to every cpu I think
everybody will accept it (like broadcast IPI). That's why intelligence
is built into IOAPIC and direct interrupts to a cpu or group of cpu.
See disable_APIC_timer(). It seems to set the mask bit in the APIC_LVTT entry.
Yes. The local apic allows us to choose whether to accept ExtInt interrupts or
not. We can't do it per vector, but we can do it per interrupt delivery path:
- External Interrupt.
- External NMI.
- Logical Apic Bus (although I don't know if we can disable this one).
Post by Neil Horman
Post by Vivek Goyal
I am just trying to understand the functionality better. Can somebody help me
understand how do we make sure that same timer interrupt is not processed by
all cpus (assuming hypertransport is broadcasting it)?
I understand your desire, but clearly, something prevents it. Note our earlier
conversation, this bit doesn't actually force a unicast of an interrupt packet,
but simply masks the destination field. When set to zero, it simply means that
the ht interrupt packet destination field is restricted to 4 bits rather than 8.
So its not like when its set to zero we are guaranteed that it is forced to a
single processor anyway. All setting this bit does is ensure that if any apics
out on a system are addresed using an extended apic id, that interrupts can
reach them. Thats why it was suggested that this bit only be forcibly set if
bit 18 is also set.
We should only have a single cpu's local apic configured to accept the
interrupt. Further, it would not surprise me if there was some first
come, first served logic for accepting the interrupt in there
somewhere.
Post by Neil Horman
Post by Vivek Goyal
Again for my understanding, I got few questions.
- Why does nvidia choose not to broadcast the interrupts and still works
fine? Does that mean nvidia chipse will not work the extended cpu apic
ids?
It doesn't! When booting normally getting interrupts to apics that use 4 bit
apic ids is sufficient since cpu0 is in that set, but if we crash on a cpu with
an extended id, we hang.
Yes. This sounds like a BIOS bug at present. A chipset feature
residing in the CPUs was not enabled properly.
Post by Neil Horman
Post by Vivek Goyal
- Why do I need to broadcast the interrupts and not target specific cpus?
Its not a forced broadcast, its a mask on the apic id. The IOAPIC still
addresses specific cpus.
This is a broadcast for legacy mode interrupts which don't have a cpu
destination field. Once we are running the timer through the ioapic
we have a destination field and the broadcast logic should not matter.
Post by Neil Horman
Post by Vivek Goyal
- If I am broadcasting interrupts, how do I make sure only one cpu
picks it up.
The IOAPIC handles that.
Well, the local apics do. If you are in ioapic mode and allow a choice
of which cpu to deliver to, it is a hardware anycast irq
transmission. That is, the hardware magically picks one cpu out of the
set of allowed cpus to deliver the irq to. How that happens is
implementation specific.

Eric
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-07 00:33:31 UTC
Post by Neil Horman
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch at the moment
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scanned for hypertransport controllers and ensured
that the broadcast bit was set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
My feeling is that if it is for legacy interrupts only, it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.

Eric
Neil Horman
2007-12-07 02:04:00 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch at the moment
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scanned for hypertransport controllers and ensured
that the broadcast bit was set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.
Eric
Copy that. Thus far, I've tested it on a pure AMD engineering sample, an intel
x86_64 box, and the affected system, a quad socket dual core AMD system with an
nvidia chipset. That last system is currently the only system that this patch
will check/enable the broadcast flag on. I did try a variant of the patch on
the AMD engineering sample where it enabled the bit unconditionally.
Interestingly enough, the bit was already turned on on that system. I'm
wondering whether most systems already have this bit turned on. You should be
able to enable this bit universally by moving the call to
check_hypertransport_config to the top of early_quirks().

Regards
Neil
Yinghai Lu
2007-12-07 08:50:45 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by Ben Woodard
<snip>
Post by Neil Horman
That's what I'm doing at the moment. I'm working on a RHEL5 patch at the moment
(since that's what's on the production system that's failing), and will forward
port it once it's working.
And not to split hairs, but technically that's not our _only_ choice. We could
force kdump boots on cpu0 as well ;)
Thanks
Neil
Post by Neil Horman
Thanks
Vivek
Sorry to have been quiet on this issue for a few days. Interesting news to
report, though. So I was working on a patch to do early apic enabling on
x86_64, and had something working for the old 2.6.18 kernel that we were
originally testing on. Unfortunately, while it worked on 2.6.18, it failed
miserably on 2.6.24-rc3-mm2, causing check_timer to consistently report that the
timer interrupt wasn't getting received (even though we could successfully run
calibrate_delay). Vivek and I were digging into this when I ran across the
description of the hypertransport configuration register in the opteron
specification. It contains a bit that, surprise, configures the ht bus to either
unicast interrupts delivered across the ht bus to a single cpu, or to broadcast
them to all cpus. Since it seemed more likely that the 8259 in the nvidia
southbridge was transporting legacy mode interrupts over the ht bus than
directly to cpu0 via an actual wire, I wrote the attached patch to add a quirk
for nvidia chipsets, which scanned for hypertransport controllers and ensured
that the broadcast bit was set. Test results indicate that this solves the
problem, and kdump kernels boot just fine on the affected system.
Hi Neil,
Should we disable this broadcasting feature once we are through? Otherwise
in normal systems it might mean extra traffic on hypertransport. There
is no need for every interrupt to be broadcasted in normal systems?
My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.
I checked that bit:

http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht.c?revision=2596&view=markup

static void enable_apic_ext_id(u8 node)
{
#if ENABLE_APIC_EXT_ID==1
#warning "FIXME Is the right place to enable apic ext id here?"

	u32 val;

	val = pci_read_config32(NODE_HT(node), 0x68);
	val |= (HTTC_APIC_EXT_SPUR | HTTC_APIC_EXT_ID | HTTC_APIC_EXT_BRD_CST);
	pci_write_config32(NODE_HT(node), 0x68, val);
#endif
}

That bit should only be set when the apic ids are lifted and the cpu apic ids
use 8 bits, which means the broadcast mask is 0xff instead of 0x0f.
For example, on an 8 socket dual core system or a 4 socket quad core
system, you should make the BSP start from 0x04, so the cpus' apic ids will
be [0x04, 0x13)


So if you want to enable that in early_quirk, you need to
make sure the apic id is using 8 bits by checking whether bit 16 (HTTC_APIC_ID) is set.

Most BIOSes already do that. You may ask Supermicro to fix their broken
BIOS instead.
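To make the "lifted" apic id example concrete, here is a small sketch that counts how many ids in a consecutive range no longer fit into 4 bits. It assumes 16 logical cpus numbered consecutively from 0x04, which is my reading of the example above rather than something taken from a specification; the function names are hypothetical.

```c
#include <assert.h>

/* Count the apic ids in [first, first + ncpus) that need more than 4 bits. */
static unsigned ids_beyond_4_bits(unsigned first, unsigned ncpus)
{
	unsigned id, n = 0;

	for (id = first; id < first + ncpus; id++)
		if (id > 0x0f)
			n++;
	return n;
}

/* Usage: 16 cpus numbered from 0x04 occupy 0x04..0x13, so 0x10..0x13 spill over. */
static void ids_beyond_4_bits_example(void)
{
	assert(ids_beyond_4_bits(0x04, 16) == 4);
	assert(ids_beyond_4_bits(0x00, 16) == 0);
}
```

With a 0x0f mask those four cpus would never be addressed, which is consistent with the hang seen when the crash happens on a cpu with an extended id.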

YH
Yinghai Lu
2007-12-07 09:22:04 UTC
...
Post by Yinghai Lu
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.
i checked that bit
http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht.c?revision=2596&view=markup
static void enable_apic_ext_id(u8 node)
{
#if ENABLE_APIC_EXT_ID==1
#warning "FIXME Is the right place to enable apic ext id here?"
u32 val;
val = pci_read_config32(NODE_HT(node), 0x68);
val |= (HTTC_APIC_EXT_SPUR | HTTC_APIC_EXT_ID | HTTC_APIC_EXT_BRD_CST);
pci_write_config32(NODE_HT(node), 0x68, val);
#endif
}
that bit only be should be set when apic id is lifted and cpu apid is
using 8 bits and that mean broadcast is 0xff instead 0x0f.
for example 8 socket dual core system or 4 socket quad core
system,that you should make BSP start from 0x04, so cpus apic id will
be [0x04, 0x13)
So if you want to enable that in early_quirk, you need to
make sure apic id is using 8 bits by check if the bit 16 (HTTC_APIC_ID) is set.
it should be bit 18 (HTTC_APIC_EXT_ID)


YH
Neil Horman
2007-12-07 14:21:44 UTC
Post by Yinghai Lu
...
Post by Yinghai Lu
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.
i checked that bit
http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht.c?revision=2596&view=markup
static void enable_apic_ext_id(u8 node)
{
#if ENABLE_APIC_EXT_ID==1
#warning "FIXME Is the right place to enable apic ext id here?"
u32 val;
val = pci_read_config32(NODE_HT(node), 0x68);
val |= (HTTC_APIC_EXT_SPUR | HTTC_APIC_EXT_ID | HTTC_APIC_EXT_BRD_CST);
pci_write_config32(NODE_HT(node), 0x68, val);
#endif
}
that bit only be should be set when apic id is lifted and cpu apid is
using 8 bits and that mean broadcast is 0xff instead 0x0f.
for example 8 socket dual core system or 4 socket quad core
system,that you should make BSP start from 0x04, so cpus apic id will
be [0x04, 0x13)
So if you want to enable that in early_quirk, you need to
make sure apic id is using 8 bits by check if the bit 16 (HTTC_APIC_ID) is set.
it should be bit 18 (HTTC_APIC_EXT_ID)
YH
This seems reasonable; I can reroll the patch for this. As I think about it, I'm
also going to update the patch to make this check occur for any pci class 0600
device from vendor AMD, since it's possible that more than just nvidia chipsets
can be affected.

I'll repost as soon as I've tested, thanks!
Neil
Neil Horman
2007-12-07 17:58:32 UTC
Post by Neil Horman
Post by Yinghai Lu
...
Post by Yinghai Lu
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.
i checked that bit
http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht.c?revision=2596&view=markup
<snip>
Post by Neil Horman
Post by Yinghai Lu
it should be bit 18 (HTTC_APIC_EXT_ID)
YH
this seems reasonable, I can reroll the patch for this. As I think about it I'm
also going to update the patch to make this check occur for any pci class 0600
device from vendor AMD, since its possible that more than just nvidia chipsets
can be affected.
I'll repost as soon as I've tested, thanks!
Neil
Ok, new patch attached. It performs the same function as previously described,
but is more restricted in its application. As Yinghai pointed out, the
broadcast mask bit (bit 17 in the htcfg register) should only be enabled if the
extended apic id bit (bit 18 in the same register) is also set, so this patch
now checks for that bit to be turned on first. Also, this patch now adds an
independent quirk check for all AMD hypertransport host controllers, since it's
possible for this misconfiguration to be present in systems other than nvidia's.
The net effect of these changes is that the quirk is now applicable to all AMD
systems containing hypertransport busses, and is only activated if extended apic
ids are in use. This means the quirk guarantees that all processors in a system
are eligible to receive interrupts from the ioapic, even if their apic id extends
beyond the nominal 4 bit limitation. Tested successfully by me.
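For readers unfamiliar with the early PCI scan, here is a rough sketch of the matching this description implies: take the upper 16 bits of the class/revision dword, compare them against the host bridge class (0x0600), then check the vendor id. The constants are the usual pci_ids.h values as I remember them, and the helper and its parameters are invented for the example, so treat it as an illustration rather than the patch's code.

```c
#include <assert.h>
#include <stdint.h>

#define CLASS_BRIDGE_HOST 0x0600u /* PCI_CLASS_BRIDGE_HOST */
#define VENDOR_AMD        0x1022u /* PCI_VENDOR_ID_AMD */

/*
 * id_dword:    config dword at offset 0x00 (device id high, vendor id low)
 * class_dword: config dword at offset 0x08 (class code in the top 24 bits)
 */
static int is_amd_host_bridge(uint32_t id_dword, uint32_t class_dword)
{
	uint16_t vendor = id_dword & 0xffff;
	uint16_t base_class = class_dword >> 16;

	return base_class == CLASS_BRIDGE_HOST && vendor == VENDOR_AMD;
}

/* Usage: an AMD host bridge matches, an nvidia PCI-to-PCI bridge does not. */
static void is_amd_host_bridge_example(void)
{
	assert(is_amd_host_bridge(0x11001022u, 0x06000000u) == 1);
	assert(is_amd_host_bridge(0x005e10deu, 0x06040000u) == 0);
}
```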

Thanks & Regards
Neil

Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 76 insertions(+), 7 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..d5a7b30 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if ((htcfg & (1 << 18)) == 1) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,6 +127,12 @@ static void __init ati_bugs(void)
#endif
}

+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
+
struct chipset {
u16 vendor;
void (*f)(void);
@@ -95,9 +145,16 @@ static struct chipset early_qrk[] __initdata = {
{}
};

+static struct chipset early_host_qrk[] __initdata = {
+ { PCI_VENDOR_ID_AMD, amd_host_bugs},
+ {}
+};
+
void __init early_quirks(void)
{
int num, slot, func;
+ u8 found_bridge = 0;
+ u8 found_host = 0;

if (!early_pci_allowed())
return;
@@ -115,18 +172,30 @@ void __init early_quirks(void)
if (class == 0xffffffff)
break;

- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
+ class >>= 16;
+ if ((class != PCI_CLASS_BRIDGE_PCI) &&
+ (class != PCI_CLASS_BRIDGE_HOST))
continue;

vendor = read_pci_config(num, slot, func,
PCI_VENDOR_ID);
vendor &= 0xffff;
-
- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
- }
+ if ((class == PCI_CLASS_BRIDGE_PCI) && (!found_bridge)) {
+ for (i = 0; early_qrk[i].f; i++)
+ if (early_qrk[i].vendor == vendor) {
+ early_qrk[i].f();
+ found_bridge = 1;;
+ }
+ } else if (!found_host) {
+ for (i = 0; early_host_qrk[i].f; i++)
+ if (early_host_qrk[i].vendor == vendor) {
+ early_host_qrk[i].f();
+ found_host = 1;
+ }
+ }
+
+ if (found_bridge && found_host)
+ return;

type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
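One thing worth double-checking in fix_hypertransport_config() as posted: the expression (htcfg & (1 << 18)) == 1 compares a value that is either 0 or 0x40000 against 1, so as written that branch cannot be taken. A small stand-alone sketch of the difference, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define EXT_APIC_ID_BIT (UINT32_C(1) << 18)

/* Form used in the posted patch: the masked value is 0 or 0x40000, never 1. */
static int ext_id_test_as_posted(uint32_t htcfg)
{
	return (htcfg & EXT_APIC_ID_BIT) == 1;
}

/* Conventional form: any nonzero result means the bit is set. */
static int ext_id_test_nonzero(uint32_t htcfg)
{
	return (htcfg & EXT_APIC_ID_BIT) != 0;
}

/* Usage: with bit 18 set, only the nonzero test reports it. */
static void ext_id_test_example(void)
{
	assert(ext_id_test_as_posted(0x00040000u) == 0);
	assert(ext_id_test_nonzero(0x00040000u) == 1);
	assert(ext_id_test_nonzero(0x00000000u) == 0);
}
```

If that reading is right, the quirk would need the nonzero form of the test before it could ever enable the broadcast bit.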
yhlu
2007-12-07 19:19:10 UTC
Post by Ben Woodard
Post by Neil Horman
Post by Yinghai Lu
...
Post by Yinghai Lu
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
My feel is that if it is for legacy interrupts only it should not be a problem.
Let's investigate and see if we can unconditionally enable this quirk
for all opteron systems.
i checked that bit
http://www.openbios.org/viewvc/trunk/LinuxBIOSv2/src/northbridge/amd/amdk8/coherent_ht.c?revision=2596&view=markup
<snip>
Post by Neil Horman
Post by Yinghai Lu
it should be bit 18 (HTTC_APIC_EXT_ID)
YH
this seems reasonable, I can reroll the patch for this. As I think about it I'm
also going to update the patch to make this check occur for any pci class 0600
device from vendor AMD, since it's possible that more than just nvidia chipsets
can be affected.
I'll repost as soon as I've tested, thanks!
Neil
Ok, new patch attached. It performs the same function as previously described,
but is more restricted in its application. As Yinghai pointed out, the
broadcast mask bit (bit 17 in the htcfg register) should only be enabled if the
extended apic id bit (bit 18 in the same register) is also set, so this patch
now checks for that bit to be turned on first. Also, this patch now adds an
independent quirk check for all AMD hypertransport host controllers, since it's
possible for this misconfiguration to be present in systems other than nvidia's.
The net effect of these changes is that the quirk is now applicable to all AMD
systems containing hypertransport buses, and is only activated if extended apic
ids are in use, meaning that this quirk guarantees that all processors in a
system are eligible to receive interrupts from the ioapic, even if their apicid
extends beyond the nominal 4 bit limitation. Tested successfully by me.
Thanks & Regards
Neil
early-quirks.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 76 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..d5a7b30 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if ((htcfg & (1 << 18)) == 1) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,6 +127,12 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
+
struct chipset {
u16 vendor;
void (*f)(void);
@@ -95,9 +145,16 @@ static struct chipset early_qrk[] __initdata = {
{}
};
+static struct chipset early_host_qrk[] __initdata = {
+ { PCI_VENDOR_ID_AMD, amd_host_bugs},
+ {}
+};
+
void __init early_quirks(void)
{
int num, slot, func;
+ u8 found_bridge = 0;
+ u8 found_host = 0;
if (!early_pci_allowed())
return;
@@ -115,18 +172,30 @@ void __init early_quirks(void)
if (class == 0xffffffff)
break;
- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
+ class >>= 16;
+ if ((class != PCI_CLASS_BRIDGE_PCI) &&
+ (class != PCI_CLASS_BRIDGE_HOST))
continue;
vendor = read_pci_config(num, slot, func,
PCI_VENDOR_ID);
vendor &= 0xffff;
-
- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
- }
+ if ((class == PCI_CLASS_BRIDGE_PCI) && (!found_bridge)) {
+ for (i = 0; early_qrk[i].f; i++)
+ if (early_qrk[i].vendor == vendor) {
+ early_qrk[i].f();
+ found_bridge = 1;;
+ }
+ } else if (!found_host) {
+ for (i = 0; early_host_qrk[i].f; i++)
+ if (early_host_qrk[i].vendor == vendor) {
+ early_host_qrk[i].f();
+ found_host = 1;
+ }
+ }
+
+ if (found_bridge && found_host)
+ return;
type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
1. The k8 northbridge is always on bus 00, at dev 0x18~0x1f; at some
point later it could be on bus 0xff.
2. Check the first node (BSP) first: if bit 18 and bit 17 are already
set there, you don't need to go wild checking and setting all over the bus.

YH
Neil Horman
2007-12-07 20:13:33 UTC
Permalink
Post by yhlu
1. The k8 northbridge is always on bus 00, at dev 0x18~0x1f; at some
point later it could be on bus 0xff.
2. Check the first node (BSP) first: if bit 18 and bit 17 are already
set there, you don't need to go wild checking and setting all over the bus.
Not really sure what you're saying here. I understand that the k8 northbridge
will for the foreseeable future be on bus 00, but why not look for all ht hosts?
If we find an ht host, and bit 18 is set while bit 17 is not, we can safely
correct it.
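
For reference, the rule described here (set bit 17 only when bit 18 is set and bit 17 is clear) can be sketched as a pure function on the register value. This is only an illustration: the macro and function names are made up for the example, and the real quirk reads and writes the register at config offset 0x68 with read_pci_config/write_pci_config.

```c
#include <stdint.h>

/* Illustrative names for the two htcfg bits discussed in this thread. */
#define HTCFG_APIC_EXT_ID   (UINT32_C(1) << 18)  /* extended apic ids in use */
#define HTCFG_APIC_EXT_BRD  (UINT32_C(1) << 17)  /* broadcast mask bit */

/*
 * Return the value that should be written back to the register.
 * The broadcast bit is only turned on when extended apic ids are
 * enabled and the bit is not already set; otherwise the value is
 * returned unchanged.
 */
static uint32_t ht_fixup(uint32_t htcfg)
{
    if ((htcfg & HTCFG_APIC_EXT_ID) && !(htcfg & HTCFG_APIC_EXT_BRD))
        htcfg |= HTCFG_APIC_EXT_BRD;
    return htcfg;
}
```

A caller would compare the result with the value it read and skip the config write when nothing changed.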

Neil
Post by yhlu
YH
_______________________________________________
kexec mailing list
http://lists.infradead.org/mailman/listinfo/kexec
Neil Horman
2007-12-10 15:39:59 UTC
Permalink
Sorry to reply to myself, but do we have consensus on this patch? I'd like to
figure out its disposition if possible.

Thanks & Regards
Neil
Vivek Goyal
2007-12-10 16:20:12 UTC
Permalink
Post by Neil Horman
Sorry to reply to myself, but do we have consensus on this patch? I'd like to
figure out its disposition if possible.
I agree with the approach taken. Somebody needs to review the changes made
to early_quirks. I am not well versed in it.

Thanks
Vivek
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-11 01:17:06 UTC
Permalink
Post by Neil Horman
Sorry to reply to myself, but do we have consensus on this patch? I'd like to
figure out its disposition if possible.
What the patch tries to do looks like the right thing. So if we can get
a version that is clean and actually works we should merge it.

Eric
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-11 01:08:03 UTC
Permalink
Post by Eric W. Biederman
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..d5a7b30 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if ((htcfg & (1 << 18)) == 1) {
Ok. This test is broken. Please remove the == 1. You are looking
for == (1 << 18). So just saying: "if (htcfg & (1 << 18))" should be clearer.
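
A small standalone sketch of why the "== 1" form can never be true: masking with (1 << 18) yields either 0 or 0x40000, so comparing the result against 1 always fails. The helper names below are invented for the illustration and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define HT_APIC_EXT_ID (UINT32_C(1) << 18)  /* bit 18, as discussed above */

/* Broken form: the masked value is 0 or 0x40000, never 1. */
static bool ext_id_broken(uint32_t htcfg)
{
    return (htcfg & HT_APIC_EXT_ID) == 1;
}

/* Fixed form: any non-zero result means the bit is set. */
static bool ext_id_fixed(uint32_t htcfg)
{
    return (htcfg & HT_APIC_EXT_ID) != 0;
}
```

With bit 18 set (0x40000), the broken helper returns false while the fixed one returns true, which is why the quirk body was never entered.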
Post by Eric W. Biederman
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
The rest of this quirk looks fine, including the fact that it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.


For what is below I don't like the way the infrastructure has been
extended as what you are doing quickly devolves into a big mess.

Please extend struct chipset to be something like:
struct chipset {
u16 vendor;
u16 device;
u32 class, class_mask;
void (*f)(void);
};

And then the test for matching the chipset can be something like:
if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
(id->device == PCI_ANY_ID || id->device == dev->device) &&
!((id->class ^ dev->class) & id->class_mask))

Essentially a subset of pci_match_one_device from drivers/pci/pci.h

That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.

The extended form should be much more maintainable in the long
run. Given that we may want this before we enable the timer,
which is very early, doing this in the pci early quirks seems
to make sense.
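
The matching test suggested above can be sketched as a standalone predicate. This is a minimal illustration rather than kernel code: the struct names, the QRK_ANY_ID wildcard (standing in for PCI_ANY_ID) and the 32-bit vendor/device fields are assumptions chosen so the wildcard compares cleanly, and the class field is called class_code only to keep the example free of reserved words in other compilers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wildcard value, standing in for PCI_ANY_ID. */
#define QRK_ANY_ID UINT32_C(0xffffffff)

/* One table entry: what a quirk wants to match. */
struct quirk_id {
    uint32_t vendor;      /* widened to 32 bits for the wildcard compare */
    uint32_t device;
    uint32_t class_code;
    uint32_t class_mask;  /* 0 means "any class" */
};

/* What was read from the config space of the device being examined. */
struct pci_ident {
    uint32_t vendor;
    uint32_t device;
    uint32_t class_code;
};

/*
 * Vendor and device match either exactly or via the wildcard; the class
 * matches when no bit selected by class_mask differs.
 */
static bool quirk_matches(const struct quirk_id *id,
                          const struct pci_ident *dev)
{
    return (id->vendor == QRK_ANY_ID || id->vendor == dev->vendor) &&
           (id->device == QRK_ANY_ID || id->device == dev->device) &&
           !((id->class_code ^ dev->class_code) & id->class_mask);
}
```

Note the negation on the class term: the XOR is non-zero when the classes differ, so the entry matches only when the masked difference is zero.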

Eric
Neil Horman
2007-12-11 03:43:49 UTC
Permalink
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Ok. This test is broken. Please remove the == 1. You are looking
for == (1 << 18). So just saying: "if (htcfg & (1 << 18))" should be clearer.
Fixed. Thanks!
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Eric W. Biederman
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
The rest of this quirk looks fine, including the fact that it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.
Copy that.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
For what is below I don't like the way the infrastructure has been
extended as what you are doing quickly devolves into a big mess.
struct chipset {
u16 vendor;
u16 device;
u32 class, class_mask;
void (*f)(void);
};
if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
(id->device == PCI_ANY_ID || id->device == dev->device) &&
!((id->class ^ dev->class) & id->class_mask))
Essentially a subset of pci_match_one_device from drivers/pci/pci.h
That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.
copy that. Fixed. Thanks!
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
The extended form should be much more maintainable in the long
run. Given that we may want this before we enable the timer,
which is very early, doing this in the pci early quirks seems
to make sense.
Eric
New patch attached, with suggestions incorporated.

Thanks & regards
Neil

Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 73 insertions(+), 9 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}

+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
+
struct chipset {
u16 vendor;
+ u16 device;
+ u32 class;
+ u32 class_mask;
void (*f)(void);
};

static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, amd_host_bugs },
{}
};

@@ -108,25 +162,35 @@ void __init early_quirks(void)
for (func = 0; func < 8; func++) {
u32 class;
u32 vendor;
+ u32 device;
u8 type;
int i;
+
class = read_pci_config(num,slot,func,
PCI_CLASS_REVISION);
if (class == 0xffffffff)
break;

- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
+ class >>= 16;

vendor = read_pci_config(num, slot, func,
PCI_VENDOR_ID);
vendor &= 0xffff;

- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config(num, slot, func,
+ PCI_DEVICE_ID);
+ device >>= 16;
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ ((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask)) {
+ early_qrk[i].f();
}
+ }

type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-11 04:48:11 UTC
Permalink
Neil Horman <nhorman-***@public.gmane.org> writes:

Almost there.
Post by Ben Woodard
<snip>
early-quirks.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 73 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
We should not need check_hypertransport_config as the generic loop
now does the work for us.
Post by Ben Woodard
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
Likewise this function is unneeded and the printk is likely confusing
for users.
Post by Ben Woodard
struct chipset {
u16 vendor;
+ u16 device;
+ u32 class;
+ u32 class_mask;
void (*f)(void);
};
static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, amd_host_bugs },
{}
So make that fix_hypertransport_config and we should be good.
Post by Ben Woodard
};
@@ -108,25 +162,35 @@ void __init early_quirks(void)
for (func = 0; func < 8; func++) {
u32 class;
u32 vendor;
+ u32 device;
u8 type;
int i;
+
class = read_pci_config(num,slot,func,
PCI_CLASS_REVISION);
if (class == 0xffffffff)
break;
- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
+ class >>= 16;
vendor = read_pci_config(num, slot, func,
PCI_VENDOR_ID);
vendor &= 0xffff;
- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config(num, slot, func,
+ PCI_DEVICE_ID);
+ device >>= 16;
We don't need to shift device. Although we can do:
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor >> 16;
vendor = device_vendor & 0xffff;

Assuming the early read_pci_config is limited to 32bit reads.
Post by Ben Woodard
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ ((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask)) {
+ early_qrk[i].f();
}
+ }
type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
Yinghai Lu
2007-12-11 06:31:19 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Almost there.
Post by Ben Woodard
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Ok. This test is broken. Please remove the == 1. You are looking
for == (1 << 18). So just saying: "if (htcfg & (1 << 18))" should be clearer.
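As a worked illustration of why the `== 1` comparison fails: masking with `(1 << 18)` yields either 0 or 0x40000, so the result can never equal 1. The sketch below is standalone C, not kernel code. The macro and function names are made up for this example; only the bit positions (17 and 18 of the register at offset 0x68) come from the patch under review.

```c
#include <stdint.h>

/* Bit positions as used by the quirk; the macro names are illustrative only. */
#define HT_EXT_APIC_ID    (1u << 18)  /* extended apic ids in use */
#define HT_EXT_APIC_BCAST (1u << 17)  /* extended apic interrupt broadcast */

/* Broken form: the masked value is 0 or 0x40000, never 1. */
static int ext_ids_broken(uint32_t htcfg)
{
    return (htcfg & HT_EXT_APIC_ID) == 1;
}

/* Suggested form: any non-zero result means the bit is set. */
static int ext_ids_fixed(uint32_t htcfg)
{
    return (htcfg & HT_EXT_APIC_ID) != 0;
}

/* Returns the value the quirk would write back (unchanged when no fix is needed). */
static uint32_t ht_fixup(uint32_t htcfg)
{
    if ((htcfg & HT_EXT_APIC_ID) && !(htcfg & HT_EXT_APIC_BCAST))
        htcfg |= HT_EXT_APIC_BCAST;
    return htcfg;
}
```

With bit 18 set and bit 17 clear, `ext_ids_broken()` reports 0 while `ext_ids_fixed()` reports 1, and `ht_fixup()` returns the input with bit 17 added.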
Fixed. Thanks!
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Eric W. Biederman
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Eric W. Biederman
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
The rest of this quirk looks fine, including the fact that it is only intended
to be applied to PCI_VENDOR_ID_AMD PCI_DEVICE_ID_AMD_K8_NB.
Copy that.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
For what is below I don't like the way the infrastructure has been
extended as what you are doing quickly devolves into a big mess.
struct chipset {
u16 vendor;
u16 device;
u32 class, class_mask;
void (*f)(void);
};
if ((id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
(id->device == PCI_ANY_ID || id->device == dev->device) &&
!((id->class ^ dev->class) & id->class_mask))
Essentially a subset of pci_match_one_device from drivers/pci/pci.h
That way you don't need to increase the number of tables or the
number of passes through the pci busses, just update the early_qrk
table with a few more bits of information.
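The match rule quoted above can be tried outside the kernel. This is only a sketch under assumptions: the struct and function names are invented for the example, `ANY_ID` stands in for the kernel's PCI_ANY_ID, and the field is called `class_code` purely so the file also compiles as C++. Note the leading `!` on the class test: the entry matches when no masked bit differs.

```c
#include <stdint.h>

#define ANY_ID 0xffffffffu  /* stand-in for PCI_ANY_ID */

struct quirk_id {
    uint32_t vendor, device;
    uint32_t class_code, class_mask;
};

struct dev_id {
    uint32_t vendor, device, class_code;
};

/* Non-zero when the device satisfies every field of the table entry. */
static int quirk_matches(const struct quirk_id *id, const struct dev_id *dev)
{
    return (id->vendor == ANY_ID || id->vendor == dev->vendor) &&
           (id->device == ANY_ID || id->device == dev->device) &&
           !((id->class_code ^ dev->class_code) & id->class_mask);
}
```

For instance, an entry of `{0x1022, 0x1100, 0x0600, ANY_ID}` (sample values for an AMD K8 northbridge host bridge) matches a device reporting exactly those IDs and class, and stops matching if the device class is 0x0604 instead.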
copy that. Fixed. Thanks!
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
The extended form should be much more maintainable in the long
run. Given that we may want this before we enable the timer
which is very early doing this in the pci early quirks seems
to make sense.
Eric
New patch attached, with suggestions incorporated.
Thanks & regards
Neil
early-quirks.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 73 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..4b0cee1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -44,6 +44,50 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
+static void __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+}
+
+static void __init check_hypertransport_config()
+{
+ int num, slot, func;
+ u32 device, vendor;
+ func = 0;
+ for (num = 0; num < 32; num++) {
+ for (slot = 0; slot < 32; slot++) {
+ vendor = read_pci_config(num,slot,func,
+ PCI_VENDOR_ID);
+ device = read_pci_config(num,slot,func,
+ PCI_DEVICE_ID);
+ vendor &= 0x0000ffff;
+ device >>= 16;
+ if ((vendor == PCI_VENDOR_ID_AMD) &&
+ (device == PCI_DEVICE_ID_AMD_K8_NB))
+ fix_hypertransport_config(num,slot,func);
+ }
+ }
+
+ return;
+
+}
We should not need check_hypertransport_config as the generic loop
now does the work for us.
Post by Ben Woodard
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
Likewise this function is unneeded and the printk is likely confusing
for users.
Post by Ben Woodard
struct chipset {
u16 vendor;
+ u16 device;
+ u32 class;
+ u32 class_mask;
void (*f)(void);
};
static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, amd_host_bugs },
{}
So make that fix_hypertransport_config and we should be good.
Agreed.

struct chipset {
u16 vendor;
u16 device;
u32 class;
+ u32 class_mask;
void (*f)(void); ============> int (*f) (int num, int slot, int func);
};
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Ben Woodard
};
@@ -108,25 +162,35 @@ void __init early_quirks(void)
for (func = 0; func < 8; func++) {
u32 class;
u32 vendor;
+ u32 device;
u8 type;
int i;
+
class = read_pci_config(num,slot,func,
PCI_CLASS_REVISION);
if (class == 0xffffffff)
break;
- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
+ class >>= 16;
vendor = read_pci_config(num, slot, func,
PCI_VENDOR_ID);
vendor &= 0xffff;
- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config(num, slot, func,
+ PCI_DEVICE_ID);
+ device >>= 16;
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor >> 16;
vendor = device_vendor & 0xffff;
Assuming the early read_pci_config is limited to 32bit reads.
Post by Ben Woodard
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ ((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask)) {
+ early_qrk[i].f();
}
+ early_qrk[i].f();
===>
int status;
status = early_qrk[i].f(num, slot, func);
if (status == 1)
break;
else if (status == 2)
return;
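To make the proposed return-code convention concrete, here is a rough standalone sketch. It assumes, as in the snippet above, that 1 means "stop walking the table for this device" and 2 means "stop the whole scan". The enum names, `run_quirks`, and the three sample handlers are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

enum { QUIRK_CONTINUE = 0, QUIRK_NEXT_DEVICE = 1, QUIRK_STOP_SCAN = 2 };

typedef int (*quirk_fn)(int num, int slot, int func);

static int calls;  /* counts handler invocations, for demonstration only */

static int h_continue(int num, int slot, int func)
{
    (void)num; (void)slot; (void)func;
    calls++;
    return QUIRK_CONTINUE;
}

static int h_next(int num, int slot, int func)
{
    (void)num; (void)slot; (void)func;
    calls++;
    return QUIRK_NEXT_DEVICE;
}

static int h_stop(int num, int slot, int func)
{
    (void)num; (void)slot; (void)func;
    calls++;
    return QUIRK_STOP_SCAN;
}

/*
 * Walk a NULL-terminated handler table for one device. Returns
 * QUIRK_STOP_SCAN when the caller should abandon the bus scan,
 * QUIRK_CONTINUE otherwise.
 */
static int run_quirks(const quirk_fn *table, int num, int slot, int func)
{
    for (size_t i = 0; table[i] != NULL; i++) {
        int status = table[i](num, slot, func);

        if (status == QUIRK_NEXT_DEVICE)
            break;
        if (status == QUIRK_STOP_SCAN)
            return QUIRK_STOP_SCAN;
    }
    return QUIRK_CONTINUE;
}
```

A table of `{h_continue, h_next, h_continue, NULL}` runs only the first two handlers, while a table starting with `h_stop` runs one handler and tells the caller to stop scanning.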


YH
Neil Horman
2007-12-11 14:39:10 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Almost there.
cool! :)

<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
We should not need check_hypertransport_config as the generic loop
now does the work for us.
Post by Neil Horman
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
Likewise this function is unneeded and the printk is likely confusing
for users.
Copy that. Fixed

<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
{}
So make that fix_hypertransport_config and we should be good.
Done
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor >> 16;
vendor = device_vendor & 0xffff;
I'm not so sure about this. In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The original code had to do the same thing with the class field, which is
similarly positioned in the pci config space.


Other than that, new patch attached. Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incorporates your comments above, Eric, as well as yours, Yinghai.

Thanks & Regards
Neil

Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 76 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 61 insertions(+), 15 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..e13c999 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,7 +21,7 @@
#include <asm/gart.h>
#endif

-static void __init via_bugs(void)
+static int __init via_bugs(int num, int slot, int func)
{
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
@@ -32,6 +32,7 @@ static void __init via_bugs(void)
gart_iommu_aperture_disabled = 1;
}
#endif
+ return 1;
}

#ifdef CONFIG_ACPI
@@ -44,7 +45,31 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

-static void __init nvidia_bugs(void)
+static int __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that we are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+ return 1;
+
+}
+
+static int __init nvidia_bugs(int num, int slot, int func)
{
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
@@ -56,7 +81,7 @@ static void __init nvidia_bugs(void)
* at least allow a command line override.
*/
if (acpi_use_timer_override)
- return;
+ return 1;

if (acpi_table_parse(ACPI_SIG_HPET, nvidia_hpet_check)) {
acpi_skip_timer_override = 1;
@@ -70,9 +95,10 @@ static void __init nvidia_bugs(void)
#endif
/* RED-PEN skip them on mptables too? */

+ return 1;
}

-static void __init ati_bugs(void)
+static int __init ati_bugs(int num, int slot, int func)
{
#ifdef CONFIG_X86_IO_APIC
if (timer_over_8254 == 1) {
@@ -81,17 +107,22 @@ static void __init ati_bugs(void)
"ATI board detected. Disabling timer routing over 8254.\n");
}
#endif
+ return 1;
}

struct chipset {
- u16 vendor;
- void (*f)(void);
+ u32 vendor;
+ u32 device;
+ u32 class;
+ u32 class_mask;
+ int (*f)(int num, int slot, int func);
};

static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
{}
};

@@ -108,25 +139,40 @@ void __init early_quirks(void)
for (func = 0; func < 8; func++) {
u32 class;
u32 vendor;
+ u32 device;
u8 type;
+ int ret;
int i;
+
class = read_pci_config(num,slot,func,
PCI_CLASS_REVISION);
if (class == 0xffffffff)
break;

- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
+ class >>= 16;

vendor = read_pci_config(num, slot, func,
PCI_VENDOR_ID);
vendor &= 0xffff;

- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config(num, slot, func,
+ PCI_DEVICE_ID);
+ device >>= 16;
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ (!((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask))) {
+ ret = early_qrk[i].f(num, slot, func);
+ if (ret == 1)
+ break;
+ if (ret == 2)
+ return;
}
+ }

type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-11 15:29:20 UTC
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Almost there.
cool! :)
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
We should not need check_hypertransport_config as the generic loop
now does the work for us.
Post by Neil Horman
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
Likewise this function is unneeded and the printk is likely confusing
for users.
Copy that. Fixed
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
{}
So make that fix_hypertransport_config and we should be good.
Done
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor >> 16;
vendor = device_vendor & 0xffff;
I'm not so sure about this. In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The original code had to do the same thing with the class field, which is
similarly positioned in the pci config space.
Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.

Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.

The current code doing a shift where none should be needed (because
we ignore the two low order bits in our read) is totally weird
when looking at it.
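A small model may make the alignment point easier to follow. This is not the kernel's read_pci_config; `read32_aligned` is a hypothetical stand-in that drops the two low-order offset bits, as described above, so a 32-bit read "at offset 2" really returns the dword at offset 0. The vendor ID then sits in the low 16 bits and the device ID in the high 16 bits, which is why the shift appeared to be needed.

```c
#include <stdint.h>

/* Dword 0 of PCI config space: vendor ID in bits 0-15, device ID in bits 16-31. */
static uint16_t id_vendor(uint32_t dword0)
{
    return (uint16_t)(dword0 & 0xffff);
}

static uint16_t id_device(uint32_t dword0)
{
    return (uint16_t)(dword0 >> 16);
}

/*
 * Hypothetical reader over an in-memory copy of config space.
 * The two low-order bits of the offset are ignored, so every
 * access is rounded down to a dword boundary.
 */
static uint32_t read32_aligned(const uint32_t *cfg, unsigned int offset)
{
    return cfg[(offset & ~3u) / 4];
}
```

With a first dword of 0x11001022 (sample values), reads at offset 0 and offset 2 both return 0x11001022, from which `id_vendor()` gives 0x1022 and `id_device()` gives 0x1100.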
Post by Neil Horman
Other than that, new patch attached. Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incorporates your comments above, Eric, as well as yours, Yinghai.
You almost got YH's comment. You need to return 2 for the old functions
so we don't try to apply a per-chipset fixup for every device in
the system.

I'm actually inclined to remove the return magic and just do something
like:
static int fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
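For readers unfamiliar with the idiom, a minimal standalone sketch of the run-once guard follows. `apply_once` is a hypothetical name, and the return value is added here only so the behaviour can be observed; in the quirk functions themselves the guard would simply return early.

```c
/* Returns 1 the first time it is called and 0 on every later call. */
static int apply_once(void)
{
    /* Function-local static: zero-initialised, keeps its value across calls. */
    static int fix_applied;

    if (fix_applied++)
        return 0;

    /* The one-time fixup would go here. */
    return 1;
}
```

Because the counter lives inside the function, each quirk that uses this pattern gets its own independent guard.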

Eric
Post by Neil Horman
early-quirks.c | 76 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 61 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..e13c999 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,7 +21,7 @@
#include <asm/gart.h>
#endif
-static void __init via_bugs(void)
+static int __init via_bugs(int num, int slot, int func)
{
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
@@ -32,6 +32,7 @@ static void __init via_bugs(void)
gart_iommu_aperture_disabled = 1;
}
#endif
+ return 1;
return 2;
Post by Neil Horman
}
#ifdef CONFIG_ACPI
@@ -44,7 +45,31 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static int __init fix_hypertransport_config(int num, int slot, int func)
+{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+ return 1;
+
+}
+
Hmm. I don't think we want this code positioned in the middle of the
nvidia bug checks.
Post by Neil Horman
+static int __init nvidia_bugs(int num, int slot, int func)
{
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
@@ -56,7 +81,7 @@ static void __init nvidia_bugs(void)
* at least allow a command line override.
*/
if (acpi_use_timer_override)
- return;
+ return 1;
if (acpi_table_parse(ACPI_SIG_HPET, nvidia_hpet_check)) {
acpi_skip_timer_override = 1;
@@ -70,9 +95,10 @@ static void __init nvidia_bugs(void)
#endif
/* RED-PEN skip them on mptables too? */
+ return 1;
return 2;
Post by Neil Horman
}
-static void __init ati_bugs(void)
+static int __init ati_bugs(int num, int slot, int func)
{
#ifdef CONFIG_X86_IO_APIC
if (timer_over_8254 == 1) {
@@ -81,17 +107,22 @@ static void __init ati_bugs(void)
"ATI board detected. Disabling timer routing over 8254.\n");
}
#endif
+ return 1;
return 2;
Post by Neil Horman
}
struct chipset {
- u16 vendor;
- void (*f)(void);
+ u32 vendor;
+ u32 device;
+ u32 class;
+ u32 class_mask;
+ int (*f)(int num, int slot, int func);
};
static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
{}
};
@@ -108,25 +139,40 @@ void __init early_quirks(void)
for (func = 0; func < 8; func++) {
u32 class;
u32 vendor;
+ u32 device;
u8 type;
+ int ret;
int i;
+
class = read_pci_config(num,slot,func,
PCI_CLASS_REVISION);
if (class == 0xffffffff)
break;
- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
+ class >>= 16;
vendor = read_pci_config(num, slot, func,
PCI_VENDOR_ID);
vendor &= 0xffff;
vendor = read_pci_config_16(num, slot, func,
PCI_VENDOR_ID);
Post by Neil Horman
- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config(num, slot, func,
+ PCI_DEVICE_ID);
+ device >>= 16;
device = read_pci_config_16(num, slot, func,
PCI_DEVICE_ID);
Post by Neil Horman
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ (!((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask))) {
+ ret = early_qrk[i].f(num, slot, func);
+ if (ret == 1)
+ break;
+ if (ret == 2)
+ return;
}
+ }
type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
Yinghai Lu
2007-12-11 18:00:00 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Almost there.
cool! :)
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
We should not need check_hypertransport_config as the generic loop
now does the work for us.
Post by Neil Horman
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
Likewise this function is unneeded and the printk is likely confusing
for users.
Copy that. Fixed
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
{}
So make that fix_hypertransport_config and we should be good.
Done
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor >> 16;
vendor = device_vendor & 0xffff;
I'm not so sure about this. In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The original code had to do the same thing with the class field, which is
similarly positioned in the pci config space.
Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.
Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.
The current code doing a shift where none should be needed (because
we ignore the two low order bits in our read) is totally weird
when looking at it.
Post by Neil Horman
Other than that, new patch attached. Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incorporates your comments above, Eric, as well as yours, Yinghai.
You almost got YH's comment. You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.
I'm actually inclined to remove the return magic and just do something
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
It seems we need two tables: one for the northbridge (sweep all
the NB_K8) and another for the SB (like Nvidia, ATI, ...: one touch and
leave).

YH
Neil Horman
2007-12-11 18:29:51 UTC
Post by Yinghai Lu
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Almost there.
cool! :)
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
We should not need check_hypertransport_config as the generic loop
now does the work for us.
Post by Neil Horman
+
static void __init nvidia_bugs(void)
{
#ifdef CONFIG_ACPI
@@ -83,15 +127,25 @@ static void __init ati_bugs(void)
#endif
}
+static void __init amd_host_bugs(void)
+{
+ printk(KERN_CRIT "IN AMD_HOST_BUGS\n");
+ check_hypertransport_config();
+}
Likewise this function is unneeded and the printk is likely confusing
for users.
Copy that. Fixed
<snip>
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
{}
So make that fix_hypertransport_config and we should be good.
Done
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
device_vendor = read_pci_config(num, slot, func, PCI_VENDOR_ID);
device = device_vendor >> 16;
vendor = device_vendor & 0xffff;
I'm not so sure about this. In my testing, it was clear that I needed to do a
shift on device to make valid comparisons to the defined PCI_DEVICE_* macros.
The original code had to do the same thing with the class field, which is
similarly positioned in the pci config space.
Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.
Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.
The current code doing a shift where none should be needed (because
we ignore the two low order bits in our read) is totally weird
when looking at it.
Post by Neil Horman
Other than that, new patch attached. Enables the detection of AMD
hypertransport functions and checks for the proper quirk just as before, and
incorporates your comments above, Eric, as well as yours, Yinghai.
You almost got YH's comment. You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.
I'm actually inclined to remove the return magic and just do something
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
it seems we need to have two tables. one for northbridge (sweep all
the NB_K8) and another for SB ( like Nvidia, ati..., one touch and
leave)
YH
I like Eric's idea better, I think. My original patch had two tables, and it
seems that it made the early quirk detection logic that much more convoluted.
This way each quirk can determine if it needs to be applied to more than one pci
device.

Neil
Yinghai Lu
2007-12-11 18:45:01 UTC
Post by Neil Horman
Post by Yinghai Lu
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
I'm actually inclined to remove the return magic and just do something
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
it seems we need to have two tables. one for northbridge (sweep all
the NB_K8) and another for SB ( like Nvidia, ati..., one touch and
leave)
YH
I like Eric's idea better, I think. My original patch had two tables, and it
seems that it made the early quirk detection logic that much more convoluted.
This way each quirk can determine if it needs to be applied to more than one pci
device.
The nvidia or ati chip will come first, and then the amd NB (K8). So you need
to make sure the "fix_applied return" is not going to skip your fix to
K8_NB.
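One way to check this concern is a toy simulation of the scan order. The sketch assumes each guard is a function-local static, so a guard in a southbridge-style quirk cannot suppress the northbridge fix. All names, the counters, and the bus/slot numbers here are invented for illustration and are not taken from the patch.

```c
/* Counters stand in for the real side effects. */
static int sb_applied;  /* southbridge-style quirk: wanted once */
static int nb_applied;  /* K8 northbridge fix: wanted once per node */

static void sb_quirk(int num, int slot, int func)
{
    static int fix_applied;  /* private to sb_quirk */

    (void)num; (void)slot; (void)func;
    if (fix_applied++)
        return;
    sb_applied++;
}

static void nb_fix(int num, int slot, int func)
{
    /* No guard: every matching northbridge node gets the fix. */
    (void)num; (void)slot; (void)func;
    nb_applied++;
}

/* Simulated scan order: bridge, NB node 0, another bridge, NB node 1. */
static void simulate_scan(void)
{
    sb_quirk(0, 1, 0);
    nb_fix(0, 24, 0);
    sb_quirk(0, 2, 0);
    nb_fix(0, 25, 0);
}
```

After one simulated scan the southbridge quirk has run once and the northbridge fix has run on both nodes, regardless of which device is seen first.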

YH
Neil Horman
2007-12-11 18:22:54 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.
Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.
The former seems like a reasonable solution to me. Corrected in this updated
patch.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
You almost got YH's comment. You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.
I'm actually inclined to remove the return magic and just do something
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
I like the latter approach better. It seems less convoluted to me.

New patch attached.

Thanks & Regards
Neil

Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)


diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f307285 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,13 @@
#include <asm/gart.h>
#endif

-static void __init via_bugs(void)
+static void __init via_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +49,36 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

-static void __init nvidia_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that we are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init nvidia_bugs(int num, int slot, int func)
+{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
/*
@@ -72,8 +105,13 @@ static void __init nvidia_bugs(void)

}

-static void __init ati_bugs(void)
+static void __init ati_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_X86_IO_APIC
if (timer_over_8254 == 1) {
timer_over_8254 = 0;
@@ -84,14 +122,18 @@ static void __init ati_bugs(void)
}

struct chipset {
- u16 vendor;
- void (*f)(void);
+ u32 vendor;
+ u32 device;
+ u32 class;
+ u32 class_mask;
+ void (*f)(int num, int slot, int func);
};

static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
{}
};

@@ -106,27 +148,33 @@ void __init early_quirks(void)
for (num = 0; num < 32; num++) {
for (slot = 0; slot < 32; slot++) {
for (func = 0; func < 8; func++) {
- u32 class;
- u32 vendor;
+ u16 class;
+ u16 vendor;
+ u16 device;
u8 type;
int i;
- class = read_pci_config(num,slot,func,
+
+ class = read_pci_config_16(num,slot,func,
PCI_CLASS_REVISION);
- if (class == 0xffffffff)
+ if (class == 0xffff)
break;

- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
-
- vendor = read_pci_config(num, slot, func,
+ vendor = read_pci_config_16(num, slot, func,
PCI_VENDOR_ID);
- vendor &= 0xffff;

- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config_16(num, slot, func,
+ PCI_DEVICE_ID);
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ (!((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask))) {
+ early_qrk[i].f(num, slot, func);
}
+ }

type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-11 18:46:34 UTC
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Ok. I just looked at read_pci_config. It doesn't do the right thing for
a non-aligned 32bit access. (Not that I am convinced there is a right
thing we can do). Please make this read_pci_config_16 instead
and you won't need the shift.
Either that or as I earlier suggested just do a 32bit read from offset 0
and use shifts and masks to get vendor and device fields.
The former seems like a reasonable solution to me. Corrected in this updated
patch.
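For reference, the second option mentioned above (one aligned 32bit read from offset 0, then shifts and masks) could look roughly like the sketch below. The helper names are hypothetical and not part of the patch. The only fact relied on is the standard PCI config header layout, with the vendor id in the low 16 bits and the device id in the high 16 bits of the first dword.

```c
#include <stdint.h>

/* Hypothetical helper names; the thread only describes the idea. */

/* The vendor id occupies the low 16 bits of the dword at config offset 0. */
static inline uint16_t pci_id_vendor(uint32_t id_dword)
{
	return (uint16_t)(id_dword & 0xffff);
}

/* The device id occupies the high 16 bits of the same dword. */
static inline uint16_t pci_id_device(uint32_t id_dword)
{
	return (uint16_t)(id_dword >> 16);
}
```

In the early quirk loop the dword would come from a single read_pci_config(num, slot, func, PCI_VENDOR_ID) call, which avoids the unaligned access entirely.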
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
You almost got YH's comment. You need return 2 for the old functions
so we don't try and apply a per chipset fixup for every device in
the system.
I'm actually inclined to remove the return magic and just do something
static fix_applied;
if (fix_applied++)
return;
In those functions that should be called only once.
I like the latter approach better. It seems less convoluted to me.
New patch attached.
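The run-once guard discussed above can be sketched in isolation as follows. This is only an illustration: example_quirk and quirk_body_runs are hypothetical names standing in for a per-chipset fixup such as via_bugs, and the counter exists purely so the behaviour can be observed.

```c
/* Hypothetical counter: how many times the real fixup work executed. */
static int quirk_body_runs;

/* Hypothetical stand-in for a chipset-wide quirk called once per device. */
static void example_quirk(int num, int slot, int func)
{
	static int fix_applied;	/* zero-initialized, persists across calls */

	(void)num;
	(void)slot;
	(void)func;

	if (fix_applied++)
		return;		/* already applied for an earlier device */

	quirk_body_runs++;	/* the chipset-wide fixup would go here */
}
```

Because the guard lives inside each function, the scan loop no longer needs a special return value to know when to stop calling it, while per-device quirks simply omit the guard.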
Ok. My only remaining nit to pick is that fix_hypertransport_config
is right in the middle of the nvidia quirks, which can be a bit
confusing when reading through the code. Otherwise I think this
is a version that we can merge.

Let's get a clean description on this thing and send it to the
current x86 maintainers. Thomas, Ingo, and HPA
Post by Neil Horman
Thanks & Regards
Neil
early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f307285 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,13 @@
#include <asm/gart.h>
#endif
-static void __init via_bugs(void)
+static void __init via_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +49,36 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init nvidia_bugs(int num, int slot, int func)
+{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
/*
@@ -72,8 +105,13 @@ static void __init nvidia_bugs(void)
}
-static void __init ati_bugs(void)
+static void __init ati_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_X86_IO_APIC
if (timer_over_8254 == 1) {
timer_over_8254 = 0;
@@ -84,14 +122,18 @@ static void __init ati_bugs(void)
}
struct chipset {
- u16 vendor;
- void (*f)(void);
+ u32 vendor;
+ u32 device;
+ u32 class;
+ u32 class_mask;
+ void (*f)(int num, int slot, int func);
};
static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
{}
};
@@ -106,27 +148,33 @@ void __init early_quirks(void)
for (num = 0; num < 32; num++) {
for (slot = 0; slot < 32; slot++) {
for (func = 0; func < 8; func++) {
- u32 class;
- u32 vendor;
+ u16 class;
+ u16 vendor;
+ u16 device;
u8 type;
int i;
- class = read_pci_config(num,slot,func,
+
+ class = read_pci_config_16(num,slot,func,
PCI_CLASS_REVISION);
- if (class == 0xffffffff)
+ if (class == 0xffff)
break;
- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
-
- vendor = read_pci_config(num, slot, func,
+ vendor = read_pci_config_16(num, slot, func,
PCI_VENDOR_ID);
- vendor &= 0xffff;
- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config_16(num, slot, func,
+ PCI_DEVICE_ID);
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ (!((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask))) {
+ early_qrk[i].f(num, slot, func);
}
+ }
type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
Neil Horman
2007-12-11 19:24:34 UTC
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Ok. My only remaining nit to pick is that fix_hypertransport_config
is right in the middle of the nvidia quirks, which can be a bit
confusing when reading through the code. Otherwise I think this
is a version that we can merge.
Sure, I'll move it to the top of the file
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Let's get a clean description on this thing and send it to the
current x86 maintainers. Thomas, Ingo, and HPA
Clean Summary:

Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On systems affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids), but failed to configure bit 18 of the hypertransport
transaction config register, which controls the mask for the destination
field of interrupt packets across the ht bus (see section 3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not receive interrupts during the kdump kernel boot, and this hang will be the
result. The fix is to add this patch, which adds an early pci quirk check, to
forcibly enable this bit in the httcfg register. This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to proceed
normally.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..c0d0c69 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,36 @@
#include <asm/gart.h>
#endif

-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init via_bugs(int num, int slot, int func)
+{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +72,13 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

-static void __init nvidia_bugs(void)
+static void __init nvidia_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
/*
@@ -72,8 +105,13 @@ static void __init nvidia_bugs(void)

}

-static void __init ati_bugs(void)
+static void __init ati_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_X86_IO_APIC
if (timer_over_8254 == 1) {
timer_over_8254 = 0;
@@ -84,14 +122,18 @@ static void __init ati_bugs(void)
}

struct chipset {
- u16 vendor;
- void (*f)(void);
+ u32 vendor;
+ u32 device;
+ u32 class;
+ u32 class_mask;
+ void (*f)(int num, int slot, int func);
};

static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
{}
};

@@ -106,27 +148,33 @@ void __init early_quirks(void)
for (num = 0; num < 32; num++) {
for (slot = 0; slot < 32; slot++) {
for (func = 0; func < 8; func++) {
- u32 class;
- u32 vendor;
+ u16 class;
+ u16 vendor;
+ u16 device;
u8 type;
int i;
- class = read_pci_config(num,slot,func,
+
+ class = read_pci_config_16(num,slot,func,
PCI_CLASS_REVISION);
- if (class == 0xffffffff)
+ if (class == 0xffff)
break;

- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
-
- vendor = read_pci_config(num, slot, func,
+ vendor = read_pci_config_16(num, slot, func,
PCI_VENDOR_ID);
- vendor &= 0xffff;

- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config_16(num, slot, func,
+ PCI_DEVICE_ID);
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ (!((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask))) {
+ early_qrk[i].f(num, slot, func);
}
+ }

type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
Yinghai Lu
2007-12-11 19:51:39 UTC
Post by Neil Horman
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Ok. My only remaining nit to pick is that fix_hypertransport_config
is right in the middle of the nvidia quirks, which can be a bit
confusing when reading through the code. Otherwise I think this
is a version that we can merge.
Sure, I'll move it to the top of the file
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Let's get a clean description on this thing and send it to the
current x86 maintainers. Thomas, Ingo, and HPA
Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On systems affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids), but failed to configure bit 18 of the hypertransport
should be bit 17.

YH
Neil Horman
2007-12-11 20:59:55 UTC
Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On systems affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids), but failed to configure bit 17 of the hypertransport
transaction config register, which controls the mask for the destination
field of interrupt packets across the ht bus (see section 3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not receive interrupts during the kdump kernel boot, and this hang will be the
result. The fix is to add this patch, which adds an early pci quirk check, to
forcibly enable this bit in the httcfg register. This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to proceed
normally.
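
As a rough illustration of the register fixup described in the summary, the bit logic can be written as a pure function on the value read from config offset 0x68. The macro and function names below are invented for this sketch. Only the bit positions (17 and 18) and the offset come from the patch itself.

```c
#include <stdint.h>

/* Bit positions as used in the patch; the symbolic names are made up here. */
#define HTCFG_APIC_EXT_BRDCST (1u << 17) /* extended apic interrupt broadcast */
#define HTCFG_APIC_EXT_ID     (1u << 18) /* extended apic ids in use */

/*
 * Return the value that should be written back to register 0x68.
 * The broadcast bit is only forced on when extended ids are enabled,
 * so correctly configured systems are left untouched.
 */
static uint32_t htcfg_fixup(uint32_t htcfg)
{
	if ((htcfg & HTCFG_APIC_EXT_ID) && !(htcfg & HTCFG_APIC_EXT_BRDCST))
		htcfg |= HTCFG_APIC_EXT_BRDCST;
	return htcfg;
}
```

In the quirk itself the input would come from read_pci_config(num, slot, func, 0x68), and the result would be written back with write_pci_config only when it differs from what was read.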

Regards
Neil


Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 90 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 69 insertions(+), 21 deletions(-)


diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..c0d0c69 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,36 @@
#include <asm/gart.h>
#endif

-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init via_bugs(int num, int slot, int func)
+{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +72,13 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header)
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

-static void __init nvidia_bugs(void)
+static void __init nvidia_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
/*
@@ -72,8 +105,13 @@ static void __init nvidia_bugs(void)

}

-static void __init ati_bugs(void)
+static void __init ati_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
+
#ifdef CONFIG_X86_IO_APIC
if (timer_over_8254 == 1) {
timer_over_8254 = 0;
@@ -84,14 +122,18 @@ static void __init ati_bugs(void)
}

struct chipset {
- u16 vendor;
- void (*f)(void);
+ u32 vendor;
+ u32 device;
+ u32 class;
+ u32 class_mask;
+ void (*f)(int num, int slot, int func);
};

static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
{}
};

@@ -106,27 +148,33 @@ void __init early_quirks(void)
for (num = 0; num < 32; num++) {
for (slot = 0; slot < 32; slot++) {
for (func = 0; func < 8; func++) {
- u32 class;
- u32 vendor;
+ u16 class;
+ u16 vendor;
+ u16 device;
u8 type;
int i;
- class = read_pci_config(num,slot,func,
+
+ class = read_pci_config_16(num,slot,func,
PCI_CLASS_REVISION);
- if (class == 0xffffffff)
+ if (class == 0xffff)
break;

- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
-
- vendor = read_pci_config(num, slot, func,
+ vendor = read_pci_config_16(num, slot, func,
PCI_VENDOR_ID);
- vendor &= 0xffff;

- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config_16(num, slot, func,
+ PCI_DEVICE_ID);
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ (!((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask))) {
+ early_qrk[i].f(num, slot, func);
}
+ }

type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
Ben Woodard
2007-12-12 00:16:32 UTC
We may need to go back and do some additional work on this. It doesn't
seem to be quite as cut and dried as we initially thought.

This quirk doesn't appear to work on virtually the same motherboard with
the Barcelona processors in it. It also may be sensitive to the firmware
version. More extensive testing on a larger number of pre-production systems is
not showing it to be as effective as it appeared to be initially on the
testbed.

I'm doing some retesting to figure out what exact situations and
collection of patches were able to make it work before.

-ben
Neil Horman
2007-12-12 00:52:02 UTC
Post by Ben Woodard
We may need to go back and do some additional work on this. It doesn't
seem to be quite as cut and dried as we initially thought.
This quirk doesn't appear to work on virtually the same motherboard with
the Barcelona processors in it. It also may be sensitive to the firmware
version. More extensive testing on a larger number of pre-production systems is
not showing it to be as effective as it appeared to be initially on the
testbed.
I'm doing some retesting to figure out what exact situations and
collection of patches were able to make it work before.
Ben, please let's be clear about this. You say this patch doesn't help on a new
system. Even though it's almost the exact same system, it's not the same system.
Does this patch work consistently on the system you initially reported the
problem on? I've done enough work on this at this point that I'm invested in
not abandoning this fix. If this solves the problem on dual core systems, but
not quad core, I'd much rather move forward with this fix and address your quad
core problem as a separate issue.

Neil
Yinghai Lu
2007-12-12 01:07:18 UTC
Post by Neil Horman
Post by Ben Woodard
We may need to go back and do some additional work on this. It doesn't
seem to be quite as cut and dried as we initially thought.
This quirk doesn't appear to work on virtually the same motherboard with
the Barcelona processors in it. It also may be sensitive to the firmware
version. More extensive testing on a larger number of pre-production systems is
not showing it to be as effective as it appeared to be initially on the
testbed.
I'm doing some retesting to figure out what exact situations and
collection of patches were able to make it work before.
Ben, please let's be clear about this. You say this patch doesn't help on a new
system. Even though it's almost the exact same system, it's not the same system.
Does this patch work consistently on the system you initially reported the
problem on? I've done enough work on this at this point that I'm invested in
not abandoning this fix. If this solves the problem on dual core systems, but
not quad core, I'd much rather move forward with this fix and address your quad
core problem as a separate issue.
Neil
...
Post by Neil Horman
Post by Ben Woodard
Post by Neil Horman
static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
==>

+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB,
PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, fix_hypertransport_config },
+ { PCI_VENDOR_ID_AMD, 0x1200 , PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID,
fix_hypertransport_config },

I still think a good way is for you to ask Supermicro to update their BIOS
to use newer code from AMD.

YH
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-12 08:43:31 UTC
Permalink
Post by Neil Horman
Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On systems affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids), but failed to configure bit 17 of the hypertransport
transaction config register, which indicates that the mask for the destination
field of interrupt packets across the ht bus is extended as well (see section
3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not receive interrupts during the kdump kernel boot, and this hang will be the
result. The fix is to add this patch, which adds an early pci quirk check, to
forcibly enable this bit in the httcfg register. This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to proceed
normally.
Regards
Neil
Acked-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/***@public.gmane.org>

Eric
Andi Kleen
2007-12-12 14:21:32 UTC
Permalink
Post by Neil Horman
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
I'm not convinced the message is correct. e.g. on a system with only a dual core not enabling
that is fine, but the extended IDs might be still set.
Post by Neil Horman
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static void __init nvidia_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
This looks like the wrong place to do this. Better add a flag or something
in the structure. Ditto for the others.

Also while not a problem here in general it's bad style to add potential
wrapping bugs like this. Never use ++ for flags.
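
To illustrate the wrapping concern, here is a minimal standalone sketch. It is not code from the patch: the 8-bit counter and both function names are hypothetical, chosen so the wrap shows up after 256 calls. With the `int` used in the patch it would take on the order of 2^31 calls (and signed overflow is undefined behavior in C), which is why it is not a practical problem here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 8-bit "flag" that is really a counter, so the wrap is easy to hit. */
static uint8_t counted_flag;

/* Returns true when the fix should run. After 256 calls the counter
 * wraps back to 0 and this returns true a second time. */
static bool should_apply_counting(void)
{
	return counted_flag++ == 0;
}

/* A plain flag: set once, never incremented, so it cannot wrap. */
static bool plain_flag;

/* Returns true on the first call only, no matter how often it is called. */
static bool should_apply_once(void)
{
	if (plain_flag)
		return false;
	plain_flag = true;
	return true;
}
```

The second form costs one extra line and stays correct for any number of calls.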

-Andi
Neil Horman
2007-12-12 15:55:15 UTC
Permalink
Post by Andi Kleen
Post by Neil Horman
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
I'm not convinced the message is correct. e.g. on a system with only a dual core not enabling
that is fine, but the extended IDs might be still set.
I'm not sure that would be fine. In the situation you describe, not setting
this bit means the second core won't receive interrupts. If we crash on that
core and boot the kdump kernel with it, we get exactly the same problem that we
currently see.
Post by Andi Kleen
Post by Neil Horman
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static void __init nvidia_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
This looks like the wrong place to do this. Better add a flag or something
in the structure. Ditto for the others.
I suppose I can, but I'm not sure what benefit that provides. Can you
elaborate?
Post by Andi Kleen
Also while not a problem here in general it's bad style to add potential
wrapping bugs like this. Never use ++ for flags.
I can fix that up. I'll hold off though until ben redoes all his testing. He
mentioned earlier this morning, that some of the results he was getting may have
been caused by a kexec utility bug. He's re-confirming that this patch solves
the reported problem. Once he does, I'll repost.

Thanks & Regards
Neil
Post by Andi Kleen
-Andi
--
/***************************************************
*Neil Horman
*nhorman-***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
Andi Kleen
2007-12-12 16:07:22 UTC
Permalink
Post by Neil Horman
Post by Andi Kleen
Post by Neil Horman
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
I'm not convinced the message is correct. e.g. on a system with only a dual core not enabling
that is fine, but the extended IDs might be still set.
I'm not sure that would be fine. In the situation you describe, not setting
this bit means the second core won't receive interrupts. If we crash on that
core and boot the kdump kernel with it, we get exactly the same problem that we
currently see.
It could enable the extended APIC IDs but not use them?


Anyways I haven't got docs on that NV bridge so I might be wrong.
Post by Neil Horman
Post by Andi Kleen
Post by Neil Horman
#endif /* CONFIG_ACPI */
-static void __init nvidia_bugs(void)
+static void __init nvidia_bugs(int num, int slot, int func)
{
+ static int fix_applied = 0;
+
+ if (fix_applied++)
+ return;
This looks like the wrong place to do this. Better add a flag or something
in the structure. Ditto for the others.
I suppose I can, but I'm not sure what benefit that provides. Can you
elaborate?
The code would be smaller and cleaner.

-Andi

ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-12 19:43:34 UTC
Permalink
Post by Andi Kleen
It could enable the extended APIC IDs but not use them?
In which case complaining is still correct (the BIOS was out of sync),
enabling bit 17 is still correct and we are just in overkill mode.
Post by Andi Kleen
Anyways I haven't got docs on that NV bridge so I might be wrong.
This has everything to do with how AMD coherent hypertransport works and
little if anything to do with how the NV bridge operated.

Basically the NV bridge seems to be sending a standard hypertransport
x86 legacy interrupt packet (that doesn't have any target information)
and when that packet hits the coherent hypertransport domain it isn't
being converted into whatever would send it to all cpus.

.....

The real practical problem is if somehow the BIOS goofs up this way
and it then decides to ask us to boot on one of these cpus with
an extended apic id. We will hang in calibrate_delay. So far
this only seems to happen in the kdump case but in theory the BIOS
could be completely crazy.

Eric
Neil Horman
2007-12-12 20:22:15 UTC
Permalink
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Andi Kleen
It could enable the extended APIC IDs but not use them?
In which case complaining is still correct (the BIOS was out of sync),
enabling bit 17 is still correct and we are just in overkill mode.
Post by Andi Kleen
Anyways I haven't got docs on that NV bridge so I might be wrong.
This has everything to do with how AMD coherent hypertransport works and
little if anything to do with how the NV bridge operated.
Basically the NV bridge seems to be sending a standard hypertransport
x86 legacy interrupt packet (that doesn't have any target information)
and when that packet hits the coherent hypertransport domain it isn't
being converted into whatever would send it to all cpus.
.....
The real practical problem is if somehow the BIOS goofs up this way
and it then decides to ask us to boot on one of these cpus with
an extended apic id. We will hang in calibrate_delay. So far
this only seems to happen in the kdump case but in theory the BIOS
could be completely crazy.
I think this just leaves us with deciding on a mechanism for how to do
single-application quirks. I take Andi's point that adding a flag set to the
quirk data structure is a fine solution, but I'm really ok with static integers
in individual functions. Do we have consensus on how to handle that? I'm happy
either way, but I'd rather have agreement on how to handle it before I post
another iteration of this patch.

Thanks & Regards
Neil
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
--
/***************************************************
*Neil Horman
*nhorman-***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-12 21:32:02 UTC
Permalink
Post by Neil Horman
I think this just leaves us with deciding on a mechanism for how to do
single-application quirks. I take Andi's point that adding a flag set to the
quirk data structure is a fine solution, but I'm really ok with static integers
in individual functions. Do we have consensus on how to handle that? I'm happy
either way, but I'd rather have agreement on how to handle it before I post
another iteration of this patch.
As long as the solution is simple, small and concise I don't care.

And since what will make Andi happy seems to meet those criteria,
that should be fine.


Eric
Neil Horman
2007-12-13 14:39:22 UTC
Permalink
Ok, new patch attached, taking into account Andi's request for a cleaner method
to implement single application quirks. I've spoken with Ben, who is continuing
to retest, and reports that clean methodical testing results in success with
this patch.
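
The apply-once mechanism used in the patch below can be sketched in isolation as follows. This is a simplified illustration, not the patch itself: the QFLAG_* values mirror the patch, while `struct quirk`, the two counting callbacks and `run_quirks()` are hypothetical stand-ins with the PCI matching stripped out.

```c
#include <stddef.h>

#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED    0x2
#define QFLAG_DONE       (QFLAG_APPLY_ONCE | QFLAG_APPLIED)

/* Hypothetical, reduced quirk entry: only the flags and the callback. */
struct quirk {
	unsigned int flags;
	void (*f)(void);
};

static int once_calls, every_calls;
static void once_quirk(void)  { once_calls++; }
static void every_quirk(void) { every_calls++; }

static struct quirk quirks[] = {
	{ QFLAG_APPLY_ONCE, once_quirk },  /* run for the first match only */
	{ 0,                every_quirk }, /* run for every match */
	{ 0,                NULL }         /* terminator */
};

/* Called once per "matching device". An entry is skipped only when it is
 * both marked apply-once and already applied. */
static void run_quirks(void)
{
	for (size_t i = 0; quirks[i].f != NULL; i++) {
		if ((quirks[i].flags & QFLAG_DONE) != QFLAG_DONE)
			quirks[i].f();
		quirks[i].flags |= QFLAG_APPLIED;
	}
}
```

Keeping the state in the table means the individual quirk functions no longer need their own static counters.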


Summary:

Recently a kdump bug was discovered in which a system would hang inside
calibrate_delay during the booting of the kdump kernel. This was caused by the
fact that the jiffies counter was not being incremented during timer
calibration. The root cause of this problem was found to be a bios
misconfiguration of the hypertransport bus. On systems affected by this hang,
the bios had assigned APIC ids which used extended apic bits (more than the
nominal 4 bit ids), but failed to configure bit 17 of the hypertransport
transaction config register, which indicates that the mask for the destination
field of interrupt packets across the ht bus is extended as well (see section
3.3.9 of
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF).
If a crash occurs on a cpu with an APIC id that extends beyond 4 bits, it will
not receive interrupts during the kdump kernel boot, and this hang will be the
result. The fix is to add this patch, which adds an early pci quirk check, to
forcibly enable this bit in the httcfg register. This enables all cpus on a
system to receive interrupts, and allows kdump kernel bootup to proceed
normally.
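
The decision described above, separated from the PCI config space access, might look like this minimal sketch. The macro names and `ht_fixup()` are hypothetical; only the bit positions (17 and 18 of the register at offset 0x68) come from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed names for the two bits discussed in this thread. */
#define HT_APIC_EXT_BRDCST (1u << 17) /* extended destination mask for interrupt packets */
#define HT_APIC_EXT_ID     (1u << 18) /* extended apic ids in use */

/* Given the current register value, set bit 17 if extended apic ids are
 * enabled without it. Returns true if the value was changed, i.e. the
 * bios left the two bits out of sync. */
static bool ht_fixup(uint32_t *htcfg)
{
	if ((*htcfg & HT_APIC_EXT_ID) && !(*htcfg & HT_APIC_EXT_BRDCST)) {
		*htcfg |= HT_APIC_EXT_BRDCST;
		return true;
	}
	return false;
}
```

In the real quirk the value would come from read_pci_config() and be written back with write_pci_config() only when the function reports a change.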


Regards
Neil

Signed-off-by: Neil Horman <nhorman-***@public.gmane.org>


early-quirks.c | 86 +++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 65 insertions(+), 21 deletions(-)



diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 88bb83e..f4ed3d1 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -21,8 +21,31 @@
#include <asm/gart.h>
#endif

-static void __init via_bugs(void)
+static void __init fix_hypertransport_config(int num, int slot, int func)
{
+ u32 htcfg;
+ /*
+ *we found a hypertransport bus
+ *make sure that are broadcasting
+ *interrupts to all cpus on the ht bus
+ *if we're using extended apic ids
+ */
+ htcfg = read_pci_config(num, slot, func, 0x68);
+ if (htcfg & (1 << 18)) {
+ printk(KERN_INFO "Detected use of extended apic ids on hypertransport bus\n");
+ if ((htcfg & (1 << 17)) == 0) {
+ printk(KERN_INFO "Enabling hypertransport extended apic interrupt broadcast\n");
+ printk(KERN_INFO "Note this is a bios bug, please contact your hw vendor\n");
+ htcfg |= (1 << 17);
+ write_pci_config(num, slot, func, 0x68, htcfg);
+ }
+ }
+
+
+}
+
+static void __init via_bugs(int num, int slot, int func)
+{
#ifdef CONFIG_GART_IOMMU
if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
@@ -44,8 +67,8 @@
#endif /* CONFIG_X86_IO_APIC */
#endif /* CONFIG_ACPI */

-static void __init nvidia_bugs(void)
+static void __init nvidia_bugs(int num, int slot, int func)
{
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
/*
@@ -72,8 +95,8 @@

}

-static void __init ati_bugs(void)
+static void __init ati_bugs(int num, int slot, int func)
{
#ifdef CONFIG_X86_IO_APIC
if (timer_over_8254 == 1) {
timer_over_8254 = 0;
@@ -83,15 +106,27 @@
#endif
}

+#define QFLAG_APPLY_ONCE 0x1
+#define QFLAG_APPLIED 0x2
+#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
struct chipset {
- u16 vendor;
- void (*f)(void);
+ u32 vendor;
+ u32 device;
+ u32 class;
+ u32 class_mask;
+ u32 flags;
+ void (*f)(int num, int slot, int func);
};

static struct chipset early_qrk[] __initdata = {
- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs },
- { PCI_VENDOR_ID_VIA, via_bugs },
- { PCI_VENDOR_ID_ATI, ati_bugs },
+ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+ PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, nvidia_bugs },
+ { PCI_VENDOR_ID_VIA, PCI_ANY_ID,
+ PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, via_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_ANY_ID,
+ PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, ati_bugs },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB,
+ PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, fix_hypertransport_config },
{}
};

@@ -106,27 +141,36 @@
for (num = 0; num < 32; num++) {
for (slot = 0; slot < 32; slot++) {
for (func = 0; func < 8; func++) {
- u32 class;
- u32 vendor;
+ u16 class;
+ u16 vendor;
+ u16 device;
u8 type;
int i;
- class = read_pci_config(num,slot,func,
+
+ class = read_pci_config_16(num,slot,func,
PCI_CLASS_REVISION);
- if (class == 0xffffffff)
+ if (class == 0xffff)
break;

- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI)
- continue;
-
- vendor = read_pci_config(num, slot, func,
+ vendor = read_pci_config_16(num, slot, func,
PCI_VENDOR_ID);
- vendor &= 0xffff;

- for (i = 0; early_qrk[i].f; i++)
- if (early_qrk[i].vendor == vendor) {
- early_qrk[i].f();
- return;
+ device = read_pci_config_16(num, slot, func,
+ PCI_DEVICE_ID);
+
+ for(i=0;early_qrk[i].f != NULL;i++) {
+ if (((early_qrk[i].vendor == PCI_ANY_ID) ||
+ (early_qrk[i].vendor == vendor)) &&
+ ((early_qrk[i].device == PCI_ANY_ID) ||
+ (early_qrk[i].device == device)) &&
+ (!((early_qrk[i].class ^ class) &
+ early_qrk[i].class_mask))) {
+ if ((early_qrk[i].flags & QFLAG_DONE) != QFLAG_DONE)
+ early_qrk[i].f(num, slot, func);
+ early_qrk[i].flags |= QFLAG_APPLIED;
+
}
+ }

type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
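
For readers following the new table lookup, the matching rule in the loop above can be sketched on its own. This is an illustration under assumptions, not the kernel code: `ANY_ID` stands in for PCI_ANY_ID, and `struct id_entry` and `id_matches()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define ANY_ID 0xffffffffu /* stand-in for PCI_ANY_ID (all bits set) */

/* Hypothetical reduced table entry: ids plus a class and class mask. */
struct id_entry {
	uint32_t vendor;
	uint32_t device;
	uint32_t class;
	uint32_t class_mask;
};

/* An entry matches when vendor and device are equal or wildcarded, and
 * the class agrees on every bit selected by class_mask. A class_mask of
 * 0 therefore matches any class. */
static bool id_matches(const struct id_entry *e,
		       uint16_t vendor, uint16_t device, uint16_t class)
{
	return (e->vendor == ANY_ID || e->vendor == vendor) &&
	       (e->device == ANY_ID || e->device == device) &&
	       !((e->class ^ class) & e->class_mask);
}
```

The XOR leaves a 1 in every bit position where the two class values differ; masking and negating turns that into "no relevant bit differs".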
Andi Kleen
2007-12-13 15:16:29 UTC
Permalink
Post by Neil Horman
Ok, new patch attached, taking into account Andi's request for a cleaner method
Sorry for not noticing that earlier, but was there a specific reason this needs
to be an early quirk at all? kexec can only happen after the standard quirks ran.
I think it should be fine as a standard "late" quirk.

-Andi
Neil Horman
2007-12-13 15:32:22 UTC
Permalink
Post by Andi Kleen
Post by Neil Horman
Ok, new patch attached, taking into account Andi's request for a cleaner method
Sorry for not noticing that earlier, but was there a specific reason this needs
to be an early quirk at all? kexec can only happen after the standard quirks ran.
I think it should be fine as a standard "late" quirk.
-Andi
Early quirk seemed like the right thing to do to me. Starting from boot up,
this (mis)configuration by the bios can mean that some cpus just don't get
interrupts. I could imagine situations like serial console not working if the
serial port interrupt was routed to a cpu that used an extended APIC id. I've
never actually observed it happening, but making sure that all cpus were
eligible to get interrupts early in the boot process made sense to me.

Neil
Post by Andi Kleen
_______________________________________________
kexec mailing list
http://lists.infradead.org/mailman/listinfo/kexec
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-18 00:13:25 UTC
Permalink
Post by Neil Horman
Post by Neil Horman
Ok, new patch attached, taking into account Andi's request for a cleaner method
Sorry for not noticing that earlier, but was there a specific reason this needs
to be an early quirk at all? kexec can only happen after the standard quirks ran.
I think it should be fine as a standard "late" quirk.
Just to document things. The important thing is this quirk happens
before calibrate_delay(). Which is still before the normal pci
subsystem gets initialized. So that seems to require an early_quirk
as the pci subsystem is not initialized by that point.

The only case where we are likely to hit this is kdump, because the BIOS almost
always boots us on the cpu with apic id == 0. However in the case
of this bug if we happen to boot on a cpu with apic id >= 16 we
won't be able to boot the linux kernel either, because calibrate_delay
will fail.

Eric

ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-12-07 18:36:58 UTC
Permalink
Post by Neil Horman
this seems reasonable, I can reroll the patch for this. As I think about it I'm
also going to update the patch to make this check occur for any pci class 0600
device from vendor AMD, since it's possible that more than just nvidia chipsets
can be affected.
I'll repost as soon as I've tested, thanks!
Thanks.

Neil in your testing please confirm the preconditions for setting
the Apic Extended Broadcast flag (bit 17) are present.

If that is the case it makes sense to always set that bit on conforming
systems but we will also want to print a message noting that the
BIOS has a bug, and we are working around it.

Thanks,
Eric
Neil Horman
2007-12-07 18:48:10 UTC
Permalink
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
this seems reasonable, I can reroll the patch for this. As I think about it I'm
also going to update the patch to make this check occur for any pci class 0600
device from vendor AMD, since it's possible that more than just nvidia chipsets
can be affected.
I'll repost as soon as I've tested, thanks!
Thanks.
Neil in your testing please confirm the preconditions for setting
the Apic Extended Broadcast flag (bit 17) are present.
The systems that I have here do _not_ in fact have that precondition, but the
systems from Ben, who originally reported the problem, do have that
precondition, and he has reported that this fixes the hang in the kdump boot.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
If that is the case it makes sense to always set that bit on conforming
systems but we will also want to print a message noting that the
BIOS has a bug, and we are working around it.
I've got two printk's in this patch, one that indicates that Extended APIC ID's
are in use, and a second that indicates that there is a mismatch between the use
of extended APIC ids (bit 18) and the lack of an extended APIC id dest mask for
interrupt packets (bit 17). Not sure if that meets your requirements, but I
think it's sufficient. If you disagree, let me know and we can enhance them.

Thanks
Neil
Neil Horman
2007-11-27 13:53:56 UTC
Permalink
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
What makes you say this? I don't see any need for interrupts prior to
calibrate_delay()
Yes. calibrate_delay() is the first place we send interrupts over
hypertransport. However I/O still works. Thus hypertransport from
the first cpu is working, and hypertransport itself is working.
This is an interrupt specific problem not some generic hypertransport
problem.
Is it possible that the hypertransport bus can be in a state where I/O would
work, but not interrupt routing? I confess my knowledge of this system bus is
lacking.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Post by Neil Horman
Post by Eric W. Biederman
I agree that there is a problem.
The reliable fix is to totally skip the PIC interrupt mode and go directly
to apic mode.
To make the code kexec on panic code path reliable we need to remove code
not add it.
Frankly I think switching cpus is one of the least reliable things that
we can do in general.
I understand the sentiment here, but its not like we're adding additional
functionality with this patch. We're already sending an IPI to all the
processors to halt them
And we don't care if they halt. If they don't get the IPI we timeout.
Making the IPI mandatory is a _significant_ change.
But how likely is a kdump kernel to work properly if an errant cpu is running
unhalted while we try to boot? I understand your point regarding the
significance of the need for reliable IPI's, but in fairness, I think that we
rely on IPI delivery here, whether we want to or not.
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
The only reason that code is on the kexec on panic code path is that
there is no other possible place we could put it.
Post by Neil Horman
, we're just adding logic here so that we can detect the
boot cpu and use it to jump to the kexec image instead of halting. I don't
think this is any less reliable than what we have currently.
It doesn't make things more reliable, and it adds code to a code path
that already has too much code to be solidly reliable (thus your
problem).
Putting the system back in PIC legacy mode on the kexec on panic path
was supposed to be a short term hack until we could remove the need
by always delivering interrupts in apic mode.
If you can't root cause your problem and figure out how the apics
are misconfigured for legacy mode let's remove the need for going into
legacy PIC mode and do what we should be able to do reliably. The
reward is much higher, as we kill all possibility of restoring PIC
mode wrong because we don't need to bother.
I understand your suggestion, but to do that don't we need to do more than just
not move the apic to legacy pic mode? It was my understanding that the ioapic
delivered timer interrupts to one cpu, whose interrupt handler then distributed
it to the other cpus via IPI. That suggests to me that we will need to
re-write the apic config so that the crashing processor is the target of the
ioapic interrupt delivery.

And if this is truly the case, I would really like to further understand why
this isn't working on this specific system before I implement anything. Any
suggestions for how to further root cause this problem?

Regards
Neil
Post by ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
Eric
Andi Kleen
2007-11-27 10:55:03 UTC
Permalink
Post by Neil Horman
Hey all-
I've been working on an issue lately involving multi socket x86_64
systems connected via hypertransport bridges. It appears that some systems,
disable the hypertransport connections during a kdump operation when all but the
crashing processor gets halted in machine_crash_shutdown. This becomes a
That would be hard to believe. Linux cannot shut down CPUs
completely (put them back to SINIT state) because there is no generic
interface for this and it might be impossible. They just get put
into a HLT loop.

And even if they were in SINIT they would still need the HT connections
otherwise it would be impossible to ever wake them up.

The way HT setup is normally done is that the BIOS sets it all up
and then it never changes (except for some error conditions that
may cause SYNC flood)
Post by Neil Horman
problem when the ioapic attempts to route interrupts to the only remaining
processor. Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive result in
interrupts not getting delivered. The effective result is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay. I've found that I've been able to avoid
this hang, by forcing a transition to the bios defined boot cpu during the
crashing kernel shutdown. This patch accomplished that. Tested by myself and
the original reporter with successful results.
While that may help I doubt your analysis of the source of the problem
is correct. Most likely something is broken with the IO-APIC, but not
related to HyperTransport.

It would be better to properly root cause before applying.

-Andi
ebiederm-aS9lmoZGLiVWk0Htik3J/ (Eric W. Biederman)
2007-11-27 11:19:48 UTC
Permalink
Post by Andi Kleen
Post by Neil Horman
Hey all-
I've been working on an issue lately involving multi socket x86_64
systems connected via hypertransport bridges. It appears that some systems,
disable the hypertransport connections during a kdump operation when all but the
crashing processor gets halted in machine_crash_shutdown. This becomes a
That would be hard to believe. Linux cannot shut down CPUs
completely (put them back to SINIT state) because there is no generic
interface for this and it might be impossible. They just get put
into a HLT loop.
And even if they were in SINIT they would still need the HT connections
otherwise it would be impossible to ever wake them up.
I forget off the top of my head, but my memory is that if you send
the proper IPI, cpus go back into SINIT. However linux doesn't do this
currently.
Post by Andi Kleen
The way HT setup is normally done is that the BIOS sets it all up
and then it never changes (except for some error conditions that
may cause SYNC flood)
Yep.
Post by Andi Kleen
It would be better to properly root cause before applying.
Agreed.

Eric
Neil Horman
2007-11-27 13:28:11 UTC
Permalink
Post by Andi Kleen
Post by Neil Horman
Hey all-
I've been working on an issue lately involving multi socket x86_64
systems connected via hypertransport bridges. It appears that some systems,
disable the hypertransport connections during a kdump operation when all but the
crashing processor gets halted in machine_crash_shutdown. This becomes a
That would be hard to believe. Linux cannot shut down CPUs
completely (put them back to SINIT state) because there is no generic
interface for this and it might be impossible. They just get put
into a HLT loop.
And even if they were in SINIT they would still need the HT connections
otherwise it would be impossible to ever wake them up.
I understand that this is how it's supposed to work, but my analysis nevertheless
points, in my mind, to an inability to route irqs to any processor other than the
boot cpu. I conducted a test whereby I forced a crash on cpu0, and then again
on cpu3. I've found that on all the systems I have available, I'm able to boot
to a kexec kernel without issue. However, on this system:
http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm

Crashing on any cpu other than the boot cpu leads to a hang in calibrate_delay
on reboot. I certainly make room for the notion that this could be an ioapic
programming error, but I don't see that this system has a different ioapic than
my other test systems. I've also simply tried booting the system that is
failing with noapic, to force the system to not use the ioapic, to no avail. If
you could suggest another test to point to another root cause, I would happily
run it, but the evidence that I have at the moment suggests to me that the
ioapic, while normally getting successfully programmed to deliver interrupts to
the appropriate cpu, is unable to on this system, and removing the traversal of
the system bus on the affected system restores functionality. That suggests a
system bus error to me.
Post by Andi Kleen
The way HT setup is normally done is that the BIOS sets it all up
and then it never changes (except for some error conditions that
may cause SYNC flood)
I think that's part of the issue. Somehow (and in fairness I don't know how this
occurs), the crash of the kernel affects the functionality of the hypertransport
bus in such a way that it can no longer deliver interrupts. I'm not sure how,
beyond the testing that I describe above, I can further prove or disprove
this.
Post by Andi Kleen
Post by Neil Horman
problem when the ioapic attempts to route interrupts to the only remaining
processor. Even though the active processor is targeted for interrupt
reception, the fact that the hypertransport connections are inactive result in
interrupts not getting delivered. The effective result is that timer interrupts
are not delivered to the running cpu, and the system hangs on reboot into the
kdump kernel during calibrate_delay. I've found that I've been able to avoid
this hang, by forcing a transition to the bios defined boot cpu during the
crashing kernel shutdown. This patch accomplished that. Tested by myself and
the original reporter with successful results.
While that may help I doubt your analysis of the source of the problem
is correct. Most likely something is broken with the IO-APIC, but not
related to HyperTransport.
If you could suggest a test or observation to make so that I could further
diagnose this, I would appreciate it. Currently the code seems to configure the
ioapic properly on all systems available to me, except the supermicro board
above. Any thoughts welcome here.


Thanks & Regards
Neil
Post by Andi Kleen
It would be better to properly root cause before applying.
-Andi
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman-H+wXaHxf7aLQT0dZR+***@public.gmane.org
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/