Discussion:
sata_sil24 broken since 2.6.23-rc4-mm1
Torsten Kaiser
2007-09-27 17:34:34 UTC
Permalink
I compared the dmesg form good and bad boots with -rc7-mm1 but could
not see any difference, so do you think that these additional
diagnostics could show a difference?
Or could you suggest any other debugging options I should try?
I think since its a reproducible problem, I think it's easiest to get
you straight to git-bisect. In this case, that would be
Sorry, but I don't think that will work.
It seems that I am able to reproduce the bug, but not reliable. And my
current best guess to make it happen involves the step "leaf the
computer powered off for 8 hours".
I estimate that even with the 8 hour pause only at ~50% of the boots
one drive fails. So I have no safe point to mark a bisect step as
'good'.
a) start with a known good point (v2.6.22? v2.6.23?) and
known bad point (HEAD, aka the most recent commit in
libata-dev.git#upstream)
Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.

What I just noticed, as I wanted two re-add the drive to the RAID:
This time it was not sda, but sdb that was kicked. But otherwise the
errors are perfectly identical.

I will try to make a 2.6.23-rc3.5-mm1 to narrow it down some more...

Torsten
Tejun Heo
2007-09-27 20:22:28 UTC
Permalink
Post by Torsten Kaiser
Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.
I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
works fine and my previous debug patch is pretty much meaningless if
address initialization failure isn't the cause.

Thanks.
--
tejun
Torsten Kaiser
2007-09-28 05:36:26 UTC
Permalink
Post by Tejun Heo
Post by Torsten Kaiser
Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.
I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
works fine and my previous debug patch is pretty much meaningless if
address initialization failure isn't the cause.
After the first trouble with -rc4-mm1 I switched back to -rc3-mm1. I
booted that kernel 7 times over 4 days and never had trouble. (Before
-rc4-mm1 came out, I used -rc3-mm1 for over a week)

So in case of -rc3-mm1 I'm pretty sure that it works.

Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )

Torsten
Torsten Kaiser
2007-09-30 06:00:26 UTC
Permalink
Post by Torsten Kaiser
So in case of -rc3-mm1 I'm pretty sure that it works.
That's still the case.
Post by Torsten Kaiser
Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
[ 77.504292] ata1: soft resetting port
[ 77.544221] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 77.642567] ata1.00: configured for UDMA/100
[ 77.642572] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE,SUGGEST_OK
[ 77.642574] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current]
[descriptor]
[ 77.642578] Descriptor sense data with sense descriptors (in hex):
[ 77.642580] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 77.642584] 00 00 00 af
[ 77.642586] sd 0:0:0:0: [sda] Add. Sense: No additional sense information
[ 77.642589] end_request: I/O error, dev sda, sector 625137161
[ 77.642593] ata1: EH complete
[ 77.642609] sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
[ 77.642616] sd 0:0:0:0: [sda] Write Protect is off
[ 77.642618] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 77.642628] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 77.666927] md: super_written gets error=-5, uptodate=0
[ 77.666930] raid5: Disk failure on sda2, disabling device.
Operation continuing on 1 devices

That "Operation continuing on 1 devices" is a 'little' bit misleading.
A RAID5 with two failed devices will not continue to operate. :(
The same error is repeated several times, so I expect the first error
also looked like that.

Other things I have done to narrow it down:
Comparing 2.6.23-rc3-mm1 and 2.6.23-rc4-mm1 I found the following hunk
from libata-core.c:
@@ -2182,6 +2183,17 @@
if (ap->ops->cable_detect)
ap->cbl = ap->ops->cable_detect(ap);

+ /* We may have SATA bridge glue hiding here irrespective of the
+ reported cable types and sensed types */
+ ata_link_for_each_dev(dev, &ap->link) {
+ if (!ata_dev_enabled(dev))
+ continue;
+ /* SATA drives indicate we have a bridge. We don't know which
+ end of the link the bridge is which is a problem */
+ if (ata_id_is_sata(dev->id))
+ ap->cbl = ATA_CBL_SATA;
+ }
+

But removing this from the current -mm does still show the error, so
thats not the source of the bug. (That test kicked the first drive
(sdb) from the RAID5 with the same error as most times)

Rebooting into the "safe" rc7-sglist-kernel I got the above error that
killed the RAID5.
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
the system locked up with this error:

Sep 29 22:25:37 treogen [ 205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [ 205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [ 205.407901] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:25:38 treogen [ 205.720531] ata2: soft resetting port
Sep 29 22:25:38 treogen [ 205.720554] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:25:38 treogen [ 205.730560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:25:38 treogen [ 205.740560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:26:07 treogen [ 235.343001] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x2 frozen
Sep 29 22:26:07 treogen [ 235.343007] ata1.00: cmd
61/08:00:bf:52:2a/00:00:01:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 29 22:26:07 treogen [ 235.343009] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:26:08 treogen [ 235.655644] ata1: soft resetting port
Sep 29 22:26:08 treogen [ 235.655666] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:08 treogen [ 235.675602] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:08 treogen [ 235.675608] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:08 treogen [ 235.675610] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:08 treogen [ 235.675612] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:13 treogen [ 240.668113] ata2: hard resetting port
Sep 29 22:26:15 treogen [ 242.833438] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:15 treogen [ 242.843444] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:15 treogen [ 242.853443] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:38 treogen [ 265.590759] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:26:38 treogen [ 265.590765] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:38 treogen [ 265.590767] ata1.00: revalidation failed (errno=-5)
Sep 29 22:26:38 treogen [ 265.590769] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:43 treogen [ 270.583271] ata1: hard resetting port
Sep 29 22:26:45 treogen [ 272.748596] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:45 treogen [ 272.758601] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [ 272.768602] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [ 272.788490] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:45 treogen [ 272.788494] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:45 treogen [ 272.788496] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:45 treogen [ 272.788501] ata2.00: limiting speed to UDMA/100:PIO3
Sep 29 22:26:45 treogen [ 272.788503] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:50 treogen [ 277.781001] ata2: hard resetting port
Sep 29 22:26:52 treogen [ 279.946326] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:52 treogen [ 279.956331] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:52 treogen [ 279.966330] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:27:15 treogen [ 302.703652] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:27:15 treogen [ 302.703661] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:15 treogen [ 302.703663] ata1.00: revalidation failed (errno=-5)
Sep 29 22:27:15 treogen [ 302.703669] ata1.00: limiting speed to UDMA/100:PIO3
Sep 29 22:27:15 treogen [ 302.703671] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:27:20 treogen [ 307.696161] ata1: hard resetting port
Sep 29 22:27:22 treogen [ 309.884773] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:27:22 treogen [ 309.894784] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [ 309.904786] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [ 309.904940] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:27:22 treogen [ 309.904946] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:22 treogen [ 309.904948] ata2.00: revalidation failed (errno=-5)
Sep 29 22:27:22 treogen [ 309.904951] ata2.00: disabled
Sep 29 22:27:23 treogen [ 310.406968] ata2: EH complete
Sep 29 22:27:23 treogen [ 310.407013] sd 1:0:0:0: [sdb] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 29 22:27:23 treogen [ 310.407018] end_request: I/O error, dev
sdb, sector 19550911

Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?

Currently I try a 2.6.23-rc4-mm1 with only 629 of 1628 mm-patches applied...

Torsten
Tejun Heo
2007-09-30 14:34:58 UTC
Permalink
Hello, Torsten.
Post by Torsten Kaiser
Post by Torsten Kaiser
So in case of -rc3-mm1 I'm pretty sure that it works.
That's still the case.
Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P
Post by Torsten Kaiser
Post by Torsten Kaiser
Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.
[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
Hmm... SGT not on qword boundary? Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.

printk("XXX sil24 cb=%p cb_dma=%llx\n",
cb, (unsigned long long)cb_dma);

Also, does 'dmesg -s 10000000' give full dmesg?
Post by Torsten Kaiser
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
Sep 29 22:25:37 treogen [ 205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [ 205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [ 205.407901] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?
I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet. I think it might be
coming somewhere else than libata or sata_sil24 itself. Hmmm... It
would be really great if you can break -rc3-mm1 too. Correct behavior
for something like that for just one -mm version sounds very weird.

Thanks.
--
tejun
Torsten Kaiser
2007-09-30 16:19:19 UTC
Permalink
Post by Tejun Heo
Hello, Torsten.
Post by Torsten Kaiser
Post by Torsten Kaiser
So in case of -rc3-mm1 I'm pretty sure that it works.
That's still the case.
Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P
It would be much better if it would break all the time. Then bisecting
would not be guesswork. :-P
Post by Tejun Heo
Post by Torsten Kaiser
Post by Torsten Kaiser
Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.
[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
Hmm... SGT not on qword boundary? Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.
printk("XXX sil24 cb=%p cb_dma=%llx\n",
cb, (unsigned long long)cb_dma);
Added to my current "2.6.23-rc4-mm0.5"
I will report it's output.
Post by Tejun Heo
Also, does 'dmesg -s 10000000' give full dmesg?
That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not "internal error" but "timeout".
But one time I had another SGT related error:
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [ 33.340000] res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)

What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
Post by Tejun Heo
Post by Torsten Kaiser
Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?
I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet. I think it might be
coming somewhere else than libata or sata_sil24 itself. Hmmm... It
would be really great if you can break -rc3-mm1 too. Correct behavior
for something like that for just one -mm version sounds very weird.
It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.

I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Next try: up to patch fs-remove-some-aop_truncated_page.patch

That means from the patches added to the rc4 variant of the mm-kernel
the following are remaining:
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch
minix-convert-to-new-aops-fix2.patch
affs-convert-to-new-aops-fix-fix.patch
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch

But due to the unreliable nature of the bug, I can't be to sure about that.

Next version is compiled, now again switching the PC off for an hour...

Torsten
Tejun Heo
2007-09-30 17:39:48 UTC
Permalink
Post by Torsten Kaiser
That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not "internal error" but "timeout".
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [ 33.340000] res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)
What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
That's NCQ write command. You'll be using it a lot if you're rebuilding
md5. It seems like something is going wrong with request DMA or sg
mapping. Maybe some change in block/*.[hc]?
Post by Torsten Kaiser
It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.
Have you tested 2.6.23-rc4 without mm patches? It could be something
introduced between -rc3 and 4.
Post by Torsten Kaiser
[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
Next try: up to patch fs-remove-some-aop_truncated_page.patch
That means from the patches added to the rc4 variant of the mm-kernel
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch
minix-convert-to-new-aops-fix2.patch
affs-convert-to-new-aops-fix-fix.patch
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch
But due to the unreliable nature of the bug, I can't be to sure about that.
Yeah, that's what I'm worried about. Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.
Post by Torsten Kaiser
Next version is compiled, now again switching the PC off for an hour...
Thanks a lot. Much appreciated.
--
tejun
Torsten Kaiser
2007-09-30 18:39:44 UTC
Permalink
Post by Tejun Heo
Post by Torsten Kaiser
What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
That's NCQ write command. You'll be using it a lot if you're rebuilding
md5.
It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a "safe" kernel, fix the RAID and
that way try the next boot with a clean RAID again.

The error happens when the RAID is initialized, might be the first
write into the superblock to mark it dirty/inuse that triggers the
error.
Post by Tejun Heo
It seems like something is going wrong with request DMA or sg
mapping. Maybe some change in block/*.[hc]?
The sg-chaining-patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, a long time I thought that rc7 with the
sg-chaining-patch was safe, but one time it also showed the error.
Post by Tejun Heo
Post by Torsten Kaiser
It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.
Have you tested 2.6.23-rc4 without mm patches? It could be something
introduced between -rc3 and 4.
Not directly, but I have 4 good boots with one part of the mm-patches.
So I would tend to say that mainline 2.6.23-rc4 does not have this
bug.
Post by Tejun Heo
Post by Torsten Kaiser
[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
Next try: up to patch fs-remove-some-aop_truncated_page.patch
Looks more like this is OK too.
Post by Tejun Heo
Post by Torsten Kaiser
That means from the patches added to the rc4 variant of the mm-kernel
[snip]
Post by Tejun Heo
Post by Torsten Kaiser
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch
But due to the unreliable nature of the bug, I can't be to sure about that.
Yeah, that's what I'm worried about. Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.
Yes...

As for the remaining patches:
memoryless-nodes-*
Don't think so: I do have a NUMA system, but both nodes have memory.
flush-cache-before-*
Don't think so: No ia64 system, unchanged from rc3
# grouping pages by mobility patches
... no idee, but seem unchanged
maps2.*
Don't think that related...
remaining slub-* patches
Might be...

As for you printk:
From two goot boots, I had not had any failures with it:
First one:
Sep 30 19:24:53 treogen [ 3.810000] XXX sil24 cb=ffff810037ef0000
cb_dma=37ef0000
Sep 30 19:24:53 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Second:
Sep 30 20:06:22 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Sep 30 20:06:22 treogen [ 3.830000] XXX sil24 cb=ffff810037f10000
cb_dma=37f10000

Torsten
Torsten Kaiser
2007-10-01 18:00:14 UTC
Permalink
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
That's NCQ write command. You'll be using it a lot if you're rebuilding
md5.
It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a "safe" kernel, fix the RAID and
that way try the next boot with a clean RAID again.
The error happens when the RAID is initialized, might be the first
write into the superblock to mark it dirty/inuse that triggers the
error.
To make that comment "cmd part of the output was always the same" more
clear: I did not only mean that the first (cmd-)byte was the same, but
the whole line:
cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out

I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
seems wrong...
Post by Torsten Kaiser
Post by Tejun Heo
It seems like something is going wrong with request DMA or sg
mapping. Maybe some change in block/*.[hc]?
The sg-chaining-patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, a long time I thought that rc7 with the
sg-chaining-patch was safe, but one time it also showed the error.
If I look at what patches remain, it seems that some other (earlier)
patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
fail together with a second patch.
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) fs-remove-some-aop_truncated_page.patch (line 956): 2 good boots
Up to (incl.) update-n_high_memory-node-state-for-memory-hotadd.patch (line978)
1 good boot, so I would tend to say that this patches are also OK, but
I will recheck this version.

Up to (incl.) maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch
(line 1038): 1 failed
Post by Torsten Kaiser
Post by Tejun Heo