Discussion:
sata_sil24 broken since 2.6.23-rc4-mm1
(too old to reply)
Torsten Kaiser
2007-09-27 17:34:34 UTC
Permalink
I compared the dmesg form good and bad boots with -rc7-mm1 but could
not see any difference, so do you think that these additional
diagnostics could show a difference?
Or could you suggest any other debugging options I should try?
I think since its a reproducible problem, I think it's easiest to get
you straight to git-bisect. In this case, that would be
Sorry, but I don't think that will work.
It seems that I am able to reproduce the bug, but not reliable. And my
current best guess to make it happen involves the step "leaf the
computer powered off for 8 hours".
I estimate that even with the 8 hour pause only at ~50% of the boots
one drive fails. So I have no safe point to mark a bisect step as
'good'.
a) start with a known good point (v2.6.22? v2.6.23?) and
known bad point (HEAD, aka the most recent commit in
libata-dev.git#upstream)
Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.

What I just noticed, as I wanted two re-add the drive to the RAID:
This time it was not sda, but sdb that was kicked. But otherwise the
errors are perfectly identical.

I will try to make a 2.6.23-rc3.5-mm1 to narrow it down some more...

Torsten
Tejun Heo
2007-09-27 20:22:28 UTC
Permalink
Post by Torsten Kaiser
Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.
I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
works fine and my previous debug patch is pretty much meaningless if
address initialization failure isn't the cause.

Thanks.
--
tejun
Torsten Kaiser
2007-09-28 05:36:26 UTC
Permalink
Post by Tejun Heo
Post by Torsten Kaiser
Known good is for me 2.6.23-rc3-mm1, the first known bad is 2.6.23-rc4-mm1.
I will try to look at the diff between these revisions some more, but
the change in sata_sil24.c looked like a perfect match for the
symptoms I was seeing.
I think the first thing to do here is to verify 2.6.23-rc3-mm1 still
works fine and my previous debug patch is pretty much meaningless if
address initialization failure isn't the cause.
After the first trouble with -rc4-mm1 I switched back to -rc3-mm1. I
booted that kernel 7 times over 4 days and never had trouble. (Before
-rc4-mm1 came out, I used -rc3-mm1 for over a week)

So in case of -rc3-mm1 I'm pretty sure that it works.

Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )

Torsten
Torsten Kaiser
2007-09-30 06:00:26 UTC
Permalink
Post by Torsten Kaiser
So in case of -rc3-mm1 I'm pretty sure that it works.
That's still the case.
Post by Torsten Kaiser
Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.

[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
[ 77.504292] ata1: soft resetting port
[ 77.544221] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 77.642567] ata1.00: configured for UDMA/100
[ 77.642572] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE,SUGGEST_OK
[ 77.642574] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current]
[descriptor]
[ 77.642578] Descriptor sense data with sense descriptors (in hex):
[ 77.642580] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 77.642584] 00 00 00 af
[ 77.642586] sd 0:0:0:0: [sda] Add. Sense: No additional sense information
[ 77.642589] end_request: I/O error, dev sda, sector 625137161
[ 77.642593] ata1: EH complete
[ 77.642609] sd 0:0:0:0: [sda] 625142448 512-byte hardware sectors (320073 MB)
[ 77.642616] sd 0:0:0:0: [sda] Write Protect is off
[ 77.642618] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 77.642628] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[ 77.666927] md: super_written gets error=-5, uptodate=0
[ 77.666930] raid5: Disk failure on sda2, disabling device.
Operation continuing on 1 devices

That "Operation continuing on 1 devices" is a 'little' bit misleading.
A RAID5 with two failed devices will not continue to operate. :(
The same error is repeated several times, so I expect the first error
also looked like that.

Other things I have done to narrow it down:
Comparing 2.6.23-rc3-mm1 and 2.6.23-rc4-mm1 I found the following hunk
from libata-core.c:
@@ -2182,6 +2183,17 @@
if (ap->ops->cable_detect)
ap->cbl = ap->ops->cable_detect(ap);

+ /* We may have SATA bridge glue hiding here irrespective of the
+ reported cable types and sensed types */
+ ata_link_for_each_dev(dev, &ap->link) {
+ if (!ata_dev_enabled(dev))
+ continue;
+ /* SATA drives indicate we have a bridge. We don't know which
+ end of the link the bridge is which is a problem */
+ if (ata_id_is_sata(dev->id))
+ ap->cbl = ATA_CBL_SATA;
+ }
+

But removing this from the current -mm does still show the error, so
thats not the source of the bug. (That test kicked the first drive
(sdb) from the RAID5 with the same error as most times)

Rebooting into the "safe" rc7-sglist-kernel I got the above error that
killed the RAID5.
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
the system locked up with this error:

Sep 29 22:25:37 treogen [ 205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [ 205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [ 205.407901] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:25:38 treogen [ 205.720531] ata2: soft resetting port
Sep 29 22:25:38 treogen [ 205.720554] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:25:38 treogen [ 205.730560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:25:38 treogen [ 205.740560] ATA: abnormal status 0xEA on
port 0xffffc20001432007
Sep 29 22:26:07 treogen [ 235.343001] ata1.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x2 frozen
Sep 29 22:26:07 treogen [ 235.343007] ata1.00: cmd
61/08:00:bf:52:2a/00:00:01:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 29 22:26:07 treogen [ 235.343009] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Sep 29 22:26:08 treogen [ 235.655644] ata1: soft resetting port
Sep 29 22:26:08 treogen [ 235.655666] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:08 treogen [ 235.675602] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:08 treogen [ 235.675608] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:08 treogen [ 235.675610] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:08 treogen [ 235.675612] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:13 treogen [ 240.668113] ata2: hard resetting port
Sep 29 22:26:15 treogen [ 242.833438] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:15 treogen [ 242.843444] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:15 treogen [ 242.853443] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:38 treogen [ 265.590759] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:26:38 treogen [ 265.590765] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:38 treogen [ 265.590767] ata1.00: revalidation failed (errno=-5)
Sep 29 22:26:38 treogen [ 265.590769] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:43 treogen [ 270.583271] ata1: hard resetting port
Sep 29 22:26:45 treogen [ 272.748596] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:45 treogen [ 272.758601] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [ 272.768602] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:26:45 treogen [ 272.788490] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:26:45 treogen [ 272.788494] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:26:45 treogen [ 272.788496] ata2.00: revalidation failed (errno=-5)
Sep 29 22:26:45 treogen [ 272.788501] ata2.00: limiting speed to UDMA/100:PIO3
Sep 29 22:26:45 treogen [ 272.788503] ata2: failed to recover some
devices, retrying in 5 secs
Sep 29 22:26:50 treogen [ 277.781001] ata2: hard resetting port
Sep 29 22:26:52 treogen [ 279.946326] ata2: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:26:52 treogen [ 279.956331] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:26:52 treogen [ 279.966330] ATA: abnormal status 0xEC on
port 0xffffc20001432007
Sep 29 22:27:15 treogen [ 302.703652] ata1.00: qc timeout (cmd 0xec)
Sep 29 22:27:15 treogen [ 302.703661] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:15 treogen [ 302.703663] ata1.00: revalidation failed (errno=-5)
Sep 29 22:27:15 treogen [ 302.703669] ata1.00: limiting speed to UDMA/100:PIO3
Sep 29 22:27:15 treogen [ 302.703671] ata1: failed to recover some
devices, retrying in 5 secs
Sep 29 22:27:20 treogen [ 307.696161] ata1: hard resetting port
Sep 29 22:27:22 treogen [ 309.884773] ata1: SATA link up 3.0 Gbps
(SStatus 123 SControl 300)
Sep 29 22:27:22 treogen [ 309.894784] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [ 309.904786] ATA: abnormal status 0xEC on
port 0xffffc20001430007
Sep 29 22:27:22 treogen [ 309.904940] ata2.00: qc timeout (cmd 0xec)
Sep 29 22:27:22 treogen [ 309.904946] ata2.00: failed to IDENTIFY
(I/O error, err_mask=0x104)
Sep 29 22:27:22 treogen [ 309.904948] ata2.00: revalidation failed (errno=-5)
Sep 29 22:27:22 treogen [ 309.904951] ata2.00: disabled
Sep 29 22:27:23 treogen [ 310.406968] ata2: EH complete
Sep 29 22:27:23 treogen [ 310.407013] sd 1:0:0:0: [sdb] Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
Sep 29 22:27:23 treogen [ 310.407018] end_request: I/O error, dev
sdb, sector 19550911

Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?

Currently I try a 2.6.23-rc4-mm1 with only 629 of 1628 mm-patches applied...

Torsten
Tejun Heo
2007-09-30 14:34:58 UTC
Permalink
Hello, Torsten.
Post by Torsten Kaiser
Post by Torsten Kaiser
So in case of -rc3-mm1 I'm pretty sure that it works.
That's still the case.
Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P
Post by Torsten Kaiser
Post by Torsten Kaiser
Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.
[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
Hmm... SGT not on qword boundary? Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.

printk("XXX sil24 cb=%p cb_dma=%llx\n",
cb, (unsigned long long)cb_dma);

Also, does 'dmesg -s 10000000' give full dmesg?
Post by Torsten Kaiser
I rebooted into a system (kernel 2.6.21-rc5-mm2, please not the
2.6.*21*, that is only a rescue system and not updated often) on a
separate partition to rebuild it. When I tried to readd the failed sdb
Sep 29 22:25:37 treogen [ 205.407893] ata2.00: exception Emask 0x0
SAct 0x0 SErr 0x0 action 0x2 frozen
Sep 29 22:25:37 treogen [ 205.407900] ata2.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Sep 29 22:25:37 treogen [ 205.407901] res
40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (time out)
Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?
I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet. I think it might be
coming somewhere else than libata or sata_sil24 itself. Hmmm... It
would be really great if you can break -rc3-mm1 too. Correct behavior
for something like that for just one -mm version sounds very weird.

Thanks.
--
tejun
Torsten Kaiser
2007-09-30 16:19:19 UTC
Permalink
Post by Tejun Heo
Hello, Torsten.
Post by Torsten Kaiser
Post by Torsten Kaiser
So in case of -rc3-mm1 I'm pretty sure that it works.
That's still the case.
Ah... that's weird. It would be much better if -rc3-mm1 is broken too. :-P
It would be much better if it would break all the time. Then bisecting
would not be guesswork. :-P
Post by Tejun Heo
Post by Torsten Kaiser
Post by Torsten Kaiser
Not completely sure is if 2.6.23-rc7-sglist kernel works. I booted
that 9 times, but from a quick look in /var/log/messages, I might not
have hit the "correct" situation to trigger the error.
That kernel is vanilla 2.6.23-rc7 plus the patch from
http://www.kernel.org/pub/linux/kernel/people/tomo/misc/v2.6.23-rc7-sglist-arch.diff.bz2
( http://marc.info/?l=linux-ide&m=119055574826083&w=2 )
That is no longer the case. Yesterday this kernel did also show the failure.
The error looked a little bit different, but happend at the same
location during the bootup.
Sadly dmesg overflowed and I was not able to capture the first error.
[ 53.462632] Freeing unused kernel memory: 332k freed
ite cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 77.170903] ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x0 action 0x2
[ 77.170905] ata1.00: irq_stat 0x00020002, SGT no on qword boundary
[ 77.170908] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 77.170909] res 50/00:00:af:ea:42/00:00:25:00:00/e0 Emask
0x40 (internal error)
Hmm... SGT not on qword boundary? Please add the following to the end
of sil24_port_start() and report successful and failed kernel boot log.
printk("XXX sil24 cb=%p cb_dma=%llx\n",
cb, (unsigned long long)cb_dma);
Added to my current "2.6.23-rc4-mm0.5"
I will report it's output.
Post by Tejun Heo
Also, does 'dmesg -s 10000000' give full dmesg?
That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not "internal error" but "timeout".
But one time I had another SGT related error:
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [ 33.340000] res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)

What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
Post by Tejun Heo
Post by Torsten Kaiser
Is it possible that the bug is much older, and only some change in
rc3-mm->rc4-mm makes it visible that the initialization of the SiI3132
is incomplete?
I can't tell but there is a pretty large userbase of sil24/32 and you
seem to be the only one to report this problem yet. I think it might be
coming somewhere else than libata or sata_sil24 itself. Hmmm... It
would be really great if you can break -rc3-mm1 too. Correct behavior
for something like that for just one -mm version sounds very weird.
It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.

I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:

[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Next try: up to patch fs-remove-some-aop_truncated_page.patch

That means from the patches added to the rc4 variant of the mm-kernel
the following are remaining:
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch
minix-convert-to-new-aops-fix2.patch
affs-convert-to-new-aops-fix-fix.patch
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch

But due to the unreliable nature of the bug, I can't be to sure about that.

Next version is compiled, now again switching the PC off for an hour...

Torsten
Tejun Heo
2007-09-30 17:39:48 UTC
Permalink
Post by Torsten Kaiser
That boot ended in a minimal initrd environment that normally only
starts the RAID5 and then opens contained encrypted real root.
I was just able to push the output from dmesg through the serial link,
but had no man pages to tell me about -s ...
And that kind of error was until now a one-of-a-kind one. All other
errors where not "internal error" but "timeout".
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: exception Emask 0x20
SAct 0x1 SErr 0x0 action 0x2
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: irq_stat 0x00020002,
PCI master abort while fetching SGT
Sep 11 19:19:24 treogen [ 33.340000] ata1.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Sep 11 19:19:24 treogen [ 33.340000] res
50/00:00:af:ea:42/00:00:25:00:00/e0 Emask 0x20 (host bus error)
What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
That's NCQ write command. You'll be using it a lot if you're rebuilding
md5. It seems like something is going wrong with request DMA or sg
mapping. Maybe some change in block/*.[hc]?
Post by Torsten Kaiser
It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.
Have you tested 2.6.23-rc4 without mm patches? It could be something
introduced between -rc3 and 4.
Post by Torsten Kaiser
[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
Next try: up to patch fs-remove-some-aop_truncated_page.patch
That means from the patches added to the rc4 variant of the mm-kernel
git-xfs-build-fix.patch
enforce-noreplace-smp-in-alternative_instructions.patch
paravirt-fix-preemptible-lazy-mode-bug.patch
i386-apic-fix-4-bit-apicid-assumption-of-mach-default.patch
fix-the-max-path-calculation-in-radix-treec-update.patch
mm-no-need-to-cast-vmalloc-return-value-in-zone_wait_table_init.patch
introduce-write_begin-write_end-aops-fix2.patch
implement-simple-fs-aops-fix.patch
ext2-convert-to-new-aops-fix2.patch
ext3-convert-to-new-aops-fix-fix.patch
ext4-convert-to-new-aops-fix-fix.patch
gfs2-convert-to-new-aops-fix.patch
reiserfs-convert-to-new-aops-fix2.patch
hostfs-convert-to-new-aops-fix-fix.patch
ufs-convert-to-new-aops-fix2.patch
sysv-convert-to-new-aops-fix2.patch
minix-convert-to-new-aops-fix2.patch
affs-convert-to-new-aops-fix-fix.patch
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch
But due to the unreliable nature of the bug, I can't be to sure about that.
Yeah, that's what I'm worried about. Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.
Post by Torsten Kaiser
Next version is compiled, now again switching the PC off for an hour...
Thanks a lot. Much appreciated.
--
tejun
Torsten Kaiser
2007-09-30 18:39:44 UTC
Permalink
Post by Tejun Heo
Post by Torsten Kaiser
What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
That's NCQ write command. You'll be using it a lot if you're rebuilding
md5.
It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a "safe" kernel, fix the RAID and
that way try the next boot with a clean RAID again.

The error happens when the RAID is initialized, might be the first
write into the superblock to mark it dirty/inuse that triggers the
error.
Post by Tejun Heo
It seems like something is going wrong with request DMA or sg
mapping. Maybe some change in block/*.[hc]?
The sg-chaining-patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, a long time I thought that rc7 with the
sg-chaining-patch was safe, but one time it also showed the error.
Post by Tejun Heo
Post by Torsten Kaiser
It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
narrowing on rc4-mm1 because that was the first version to break.
Have you tested 2.6.23-rc4 without mm patches? It could be something
introduced between -rc3 and 4.
Not directly, but I have 4 good boots with one part of the mm-patches.
So I would tend to say that mainline 2.6.23-rc4 does not have this
bug.
Post by Tejun Heo
Post by Torsten Kaiser
[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
Next try: up to patch fs-remove-some-aop_truncated_page.patch
Looks more like this is OK too.
Post by Tejun Heo
Post by Torsten Kaiser
That means from the patches added to the rc4 variant of the mm-kernel
[snip]
Post by Tejun Heo
Post by Torsten Kaiser
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
slub-do-not-use-page-mapping.patch
slub-move-page-offset-to-kmem_cache_cpu-offset.patch
slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
slub-optimize-cacheline-use-for-zeroing.patch
But due to the unreliable nature of the bug, I can't be to sure about that.
Yeah, that's what I'm worried about. Bisection is extremely difficult
if errors are intermittent and takes long time to reproduce.
Yes...

As for the remaining patches:
memoryless-nodes-*
Don't think so: I do have a NUMA system, but both nodes have memory.
flush-cache-before-*
Don't think so: No ia64 system, unchanged from rc3
# grouping pages by mobility patches
... no idee, but seem unchanged
maps2.*
Don't think that related...
remaining slub-* patches
Might be...

As for you printk:
From two goot boots, I had not had any failures with it:
First one:
Sep 30 19:24:53 treogen [ 3.810000] XXX sil24 cb=ffff810037ef0000
cb_dma=37ef0000
Sep 30 19:24:53 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Second:
Sep 30 20:06:22 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Sep 30 20:06:22 treogen [ 3.830000] XXX sil24 cb=ffff810037f10000
cb_dma=37f10000

Torsten
Torsten Kaiser
2007-10-01 18:00:14 UTC
Permalink
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
What I find kind of interessing is, that while I got three different
error codes the cmd part of the output was always the same.
That's NCQ write command. You'll be using it a lot if you're rebuilding
md5.
It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a "safe" kernel, fix the RAID and
that way try the next boot with a clean RAID again.
The error happens when the RAID is initialized, might be the first
write into the superblock to mark it dirty/inuse that triggers the
error.
To make that comment "cmd part of the output was always the same" more
clear: I did not only mean that the first (cmd-)byte was the same, but
the whole line:
cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out

I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
seems wrong...
Post by Torsten Kaiser
Post by Tejun Heo
It seems like something is going wrong with request DMA or sg
mapping. Maybe some change in block/*.[hc]?
The sg-chaining-patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, a long time I thought that rc7 with the
sg-chaining-patch was safe, but one time it also showed the error.
If I look at what patches remain, it seems that some other (earlier)
patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
fail together with a second patch.
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
[the 2.6.23-rc4-mm1 series-file has 2013 lines]
Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
(line779): 2 good boots
Up to (incl.) fs-remove-some-aop_truncated_page.patch (line 956): 2 good boots
Up to (incl.) update-n_high_memory-node-state-for-memory-hotadd.patch (line978)
1 good boot, so I would tend to say that this patches are also OK, but
I will recheck this version.

Up to (incl.) maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch
(line 1038): 1 failed
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
1045): 1 failed
Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed
That means from the patches added to the rc4 variant of the mm-kernel
[snip]
Post by Tejun Heo
Post by Torsten Kaiser
memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
update-n_high_memory-node-state-for-memory-hotadd.patch
And I think these are OK too.
So I think some earlier patch triggers bug together with one of these patches:

categorize-gfp-flags.patch
categorize-gfp-flags-fix.patch
make-swappiness-safer-to-use.patch
... these should not change any behaivor from there description
Post by Torsten Kaiser
flush-cache-before-*
Don't think so: No ia64 system, unchanged from rc3
# grouping pages by mobility patches
... no idee, but seem unchanged
maps2.*
Don't think that related...
Will continue to break this down...
Post by Torsten Kaiser
Sep 30 19:24:53 treogen [ 3.810000] XXX sil24 cb=ffff810037ef0000
cb_dma=37ef0000
Sep 30 19:24:53 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Sep 30 20:06:22 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Sep 30 20:06:22 treogen [ 3.830000] XXX sil24 cb=ffff810037f10000
cb_dma=37f10000
Next boot with this kernel was also good:
Oct 1 06:52:32 treogen [ 3.740000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Oct 1 06:52:32 treogen [ 3.750000] XXX sil24 cb=ffff810037f10000
cb_dma=37f10000

Next kernel (maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch)
Oct 1 18:07:09 treogen [ 3.760000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000
Oct 1 18:07:09 treogen [ 3.770000] XXX sil24 cb=ffff810005e00000
cb_dma=5e00000
-> failed, but when rebooting instead of a cold boot that kernel worked:
Oct 1 18:14:42 treogen [ 3.800000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000
Oct 1 18:14:42 treogen [ 3.810000] XXX sil24 cb=ffff810005e00000
cb_dma=5e00000
... same values, so thats not an indicator that trouble will follow.

Next kernel (update-n_high_memory-node-state-for-memory-hotadd.patch)
Oct 1 19:31:38 treogen [ 3.840000] XXX sil24 cb=ffff81007fa60000
cb_dma=7fa60000
Oct 1 19:31:38 treogen [ 3.850000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
that was also a good boot.

Torsten
Torsten Kaiser
2007-10-03 15:21:13 UTC
Permalink
Post by Torsten Kaiser
To make that comment "cmd part of the output was always the same" more
clear: I did not only mean that the first (cmd-)byte was the same, but
cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
seems wrong...
Even all of the last bisect steps still showed the same line...
Post by Torsten Kaiser
If I look at what patches remain, it seems that some other (earlier)
patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
fail together with a second patch.
I'm now finished with bisecting, still 2 patches, but I don't want to
spend another two hours waiting...

And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch

Before these patches I have never seen the bug, with these I only got
two good boots when trying to recreate the problem. But even the
kernels that did one good boot failed on the second try.
From the last testkernel (up to incl.
maps2-move-clear_refs-code-to-task_mmuc.patch)

Good:
Oct 3 15:50:39 treogen [ 3.780000] XXX sil24 cb=ffff810005dd0000
cb_dma=5dd0000
Oct 3 15:50:39 treogen [ 3.790000] XXX sil24 cb=ffff810005de0000
cb_dma=5de0000

Bad:
Oct 3 16:56:46 treogen [ 3.750000] XXX sil24 cb=ffff810005de0000
cb_dma=5de0000
Oct 3 16:56:46 treogen [ 3.760000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000

I have *NO* clue how these procfs changes can kill the sata_sil24
driver, but all bisect steps point to these two patches as cause or
trigger of the bug.

Torsten
Torsten Kaiser
2007-10-03 15:55:10 UTC
Permalink
[CC added to author of the bad patch]

Short recap:
Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
my Silicon Image 3132. This failure happens when my initramfs wants to
start the RAID that is on these drives.

The first error libata throws is:
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Oct 3 16:56:46 treogen [ 63.320000] res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: status: {DRDY }

Resetting the sata link fails, the drive is no longer reachable until a reboot.
Post by Torsten Kaiser
I'm now finished with bisecting, still 2 patches, but I don't want to
spend another two hours waiting...
And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch
Before these patches I have never seen the bug, with these I only got
two good boots when trying to recreate the problem. But even the
kernels that did one good boot failed on the second try.
The simplify-patch just seems to move some code around, but I see a
real change in the other one:
This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
its code to a new function. But during the move the main for-loop from
clear_refs_smap was changed:

old:
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);

new:
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);

The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
I don't know how this can kill the sata_sil24-driver, but at least it
looks suspicious.
As I'm not really a kernel hacker, I defer this question to the ones that are.

Torsten
Matt Mackall
2007-10-03 16:38:04 UTC
Permalink
Post by Torsten Kaiser
[CC added to author of the bad patch]
Since 2.6.23-rc4-mm1 all mm-kernel randomly fail one of two drives on
my Silicon Image 3132. This failure happens when my initramfs wants to
start the RAID that is on these drives.
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: exception Emask 0x0
SAct 0x1 SErr 0x0 action 0x6 frozen
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: cmd
61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
Oct 3 16:56:46 treogen [ 63.320000] res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 3 16:56:46 treogen [ 63.320000] ata2.00: status: {DRDY }
Resetting the sata link fails, the drive is no longer reachable until a reboot.
Post by Torsten Kaiser
I'm now finished with bisecting, still 2 patches, but I don't want to
spend another two hours waiting...
And the winners are: (from broken-out patchset of 2.6.23-rc4-mm1)
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch
Before these patches I have never seen the bug, with these I only got
two good boots when trying to recreate the problem. But even the
kernels that did one good boot failed on the second try.
The simplify-patch just seems to move some code around, but I see a
This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
its code to a new function. But during the move the main for-loop from
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
I don't know how this can kill the sata_sil24-driver, but at least it
looks suspicious.
That code should be fine. Further, it's pretty unlikely that this code
ever gets invoked. This whole interface was only recently added by
Google folks and its usage is pretty obscure.

Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
using clear_refs.

Well I can see no reason why the vma we just got to by the mm->mmap
would have a vm_mm != mm, but I've certainly been wrong before.

Try changing it to:

for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma)) {
if (vma->vm_mm != mm)
printk("WTF: vma->vm_mm %p mm %p\n",
vma->vm_mm, mm);
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
}
--
Mathematics is the supreme nostalgia of our time.
Torsten Kaiser
2007-10-03 17:36:55 UTC
Permalink
Post by Matt Mackall
Post by Torsten Kaiser
This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
its code to a new function. But during the move the main for-loop from
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
I don't know how this can kill the sata_sil24-driver, but at least it
looks suspicious.
That code should be fine. Further, it's pretty unlikely that this code
ever gets invoked. This whole interface was only recently added by
Google folks and its usage is pretty obscure.
Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
using clear_refs.
No. I'm just using Google Mail.

And I'm not using any additional patches apart from what is in the mm-patchset.
Post by Matt Mackall
Well I can see no reason why the vma we just got to by the mm->mmap
would have a vm_mm != mm, but I've certainly been wrong before.
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma)) {
if (vma->vm_mm != mm)
printk("WTF: vma->vm_mm %p mm %p\n",
vma->vm_mm, mm);
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
}
Will try this, but as the error only seems to trigger, if I leave the
pc powered off for more then a hour, you will need to wait until
tomorrow.

Of note might be, that at the time of this error init has not been
started. I'm using a program from initramfs to start the RAID.
The initramfs was primarily build using the genkernel package from
gentoo, but I have removed some parts of it (kernel modules and some
other part I no longer remember)

Torsten
Matt Mackall
2007-10-03 17:51:05 UTC
Permalink
Post by Torsten Kaiser
Post by Matt Mackall
Post by Torsten Kaiser
This patch removes clear_refs_smap() from fs/proc/task_mmu.c by moving
its code to a new function. But during the move the main for-loop from
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
The walk_page_range() is no longer called on vma->vm_mm, but on mm directly.
I don't know how this can kill the sata_sil24-driver, but at least it
looks suspicious.
That code should be fine. Further, it's pretty unlikely that this code
ever gets invoked. This whole interface was only recently added by
Google folks and its usage is pretty obscure.
Oh wait - you're _at_ Google, aren't you? Perhaps you're actually
using clear_refs.
No. I'm just using Google Mail.
Ahh, I see. Wasn't aware of the trademark dispute..
Post by Torsten Kaiser
Of note might be, that at the time of this error init has not been
started. I'm using a program from initramfs to start the RAID.
The initramfs was primarily build using the genkernel package from
gentoo, but I have removed some parts of it (kernel modules and some
other part I no longer remember)
Then I really don't see how these patches can have anything to do with
your problem. They only affect some currently obscure files in /proc
and it's unlikely to have any impact at init time. It's more likely
that there's some code alignment Heisenbug or you got a false negative
somewhere in your bisection.
--
Mathematics is the supreme nostalgia of our time.
Torsten Kaiser
2007-10-03 18:06:48 UTC
Permalink
Post by Matt Mackall
Post by Torsten Kaiser
Of note might be, that at the time of this error init has not been
started. I'm using a program from initramfs to start the RAID.
The initramfs was primarily build using the genkernel package from
gentoo, but I have removed some parts of it (kernel modules and some
other part I no longer remember)
Then I really don't see how these patches can have anything to do with
your problem. They only affect some currently obscure files in /proc
and it's unlikely to have any impact at init time. It's more likely
that there's some code alignment Heisenbug or you got a false negative
somewhere in your bisection.
Heisenbug via code alignment might be, but the bisection result seemed
rather reliable.
I have 4 good boots with all mm-patches before the maps2-* ones and 2
out of 3 boots failed with part or all maps2-* patches applied. (Any
many more failed boots with more mm-patches applied.)

But I do think it might be very possible that some (maybe even
correct) change in the behavior of this code triggers the error. So
I'm very interested in what you additional printk outputs to track
down what changed and why this breaks.

I looked at the scripts from the initramfs, but sadly I don't see
anything that crawls throug /proc and might hit your files. Other than
that there is nearly only busybox in there. cryptsetup should run
later and modprobe does not have any modules it could try to load...

Torsten
Torsten Kaiser
2007-10-04 05:32:52 UTC
Permalink
Post by Matt Mackall
Well I can see no reason why the vma we just got to by the mm->mmap
would have a vm_mm != mm, but I've certainly been wrong before.
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma)) {
if (vma->vm_mm != mm)
printk("WTF: vma->vm_mm %p mm %p\n",
vma->vm_mm, mm);
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
}
You were right.
I was able to trigger the error with above printk added, but nothing
was written to the syslog.

So now I'm rather out of ideas what to test... :(

Torsten
Matt Mackall
2007-10-04 17:05:36 UTC
Permalink
Post by Torsten Kaiser
Post by Matt Mackall
Well I can see no reason why the vma we just got to by the mm->mmap
would have a vm_mm != mm, but I've certainly been wrong before.
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma)) {
if (vma->vm_mm != mm)
printk("WTF: vma->vm_mm %p mm %p\n",
vma->vm_mm, mm);
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
}
You were right.
I was able to trigger the error with above printk added, but nothing
was written to the syslog.
So now I'm rather out of ideas what to test... :(
I'd give your previous bisect step another try.

Looking back at the thread a bit, anything that requires the machine
to be off for more than a couple seconds to manifest stops looking
like software and firmware and starts looking like a heat-related
electrical or mechanical issue. Make sure your backups are current.
--
Mathematics is the supreme nostalgia of our time.
Torsten Kaiser
2007-10-05 06:06:11 UTC
Permalink
Post by Matt Mackall
Post by Torsten Kaiser
So now I'm rather out of ideas what to test... :(
I'd give your previous bisect step another try.
Yes, I thought about that too. But I never seemed to need more than
two tries to make it fail.
So I would only suspect the last good step as wrong positive.
That would then point to the first of your maps2-patches, the moving
of the pagewalker code.
Would you thing that this is a plausible cause?
Post by Matt Mackall
Looking back at the thread a bit, anything that requires the machine
to be off for more than a couple seconds to manifest stops looking
like software and firmware and starts looking like a heat-related
electrical or mechanical issue. Make sure your backups are current.
What backups? :-)

Yes, I also thought about hardware trouble, but the bisect result
seemed to consistent.
Also that its not always the same drive that fails, only every time
one of the sil-drives.

I now have activated ATA_DEBUG to see if the good and the bad boots differ.
It looks the same until the RAID5 starts.

Good boot:
[ 40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.160000] ata_sg_setup: 1 sg elements mapped
[ 40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.160000] ata_sg_setup: 1 sg elements mapped
[ 40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.320000] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[ 40.320000] nv_swncq_sdbfis: over
[ 40.320000] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.320000] ata_exec_command: ata3: cmd 0xEA
[ 40.390000] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[ 40.390000] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[ 40.420000] md: considering sdb1 ...
[ 40.440000] md: adding sdb1 ...
[ 40.440000] md: adding sda1 ...
[ 40.450000] md: created md0
[ 40.460000] md: bind<sda1>
[ 40.470000] md: bind<sdb1>
[ 40.480000] md: running: <sdb1><sda1>
[ 40.500000] raid1: raid set md0 active with 2 out of 2 mirrors

Bad boot:
[ 40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.060000] ata_sg_setup: 1 sg elements mapped
[ 40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[ 40.060000] ata_sg_setup: 1 sg elements mapped
[ 40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.200000] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
dhfis 0x1 dmafis 0x1 sactive 0x0
[ 40.200000] nv_swncq_sdbfis: over
[ 40.200000] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[ 40.200000] ata_exec_command: ata3: cmd 0xEA
[ 40.270000] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[ 40.270000] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[ 70.060000] ata_scsi_timed_out: ENTER
[ 70.060000] ata_scsi_timed_out: EXIT, ret=0
[ 70.080000] ata_scsi_error: ENTER
[ 70.080000] ata_port_flush_task: ENTER
[ 70.100000] ata1: ata_port_flush_task: EXIT
[ 70.110000] __ata_port_freeze: ata1 port frozen
[ 70.220000] __ata_port_freeze: ata1 port frozen
[ 70.230000] ata_eh_link_autopsy: ENTER
[ 70.240000] ata_eh_link_autopsy: EXIT
[ 70.250000] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[ 70.270000] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
[ 70.270000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask
0x4 (timeout)

After [ 40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
the drive sda falls of the earth and can't be recovered through soft-
or hard-resetting the port by the error handler.

So I will use the weekend to see if I can find out who issues this
command and add more debug to that place...

Torsten
Torsten Kaiser
2007-10-07 08:44:25 UTC
Permalink
Post by Torsten Kaiser
So I will use the weekend to see if I can find out who issues this
command and add more debug to that place...
I added some DPRINTK to sil24_qc_issue and sil24_fill_sg, but I only
found one suspicious thing.

My sil24_fill_sg now looks like this:
static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
struct sil24_sge *sge)
{
struct scatterlist *sg;

ata_for_each_sg(sg, qc) {
sge->addr = cpu_to_le64(sg_dma_address(sg));
sge->cnt = cpu_to_le32(sg_dma_len(sg));
if (ata_sg_is_last(sg, qc))
sge->flags = cpu_to_le32(SGE_TRM);
else
sge->flags = 0;
DPRINTK("flags,addr,cnt = 0x%x, 0x%X, 0x%X\n", sge->flags,
sge->addr, sge->cnt);
sge++;
}
}

Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
so the last sg is never terminated (SGE_TRM is 1<<31)?
But if that is the cause, how is this working at all? Or am I doing
something stupid?

Timing and outputs from five boots:
good: bad:
more moreboot more
3->35 3->35 3->35 3->35 3->35
3->2a 2->35 2->35 3->2a 3->2a
3->setup 2->2a 2->2a 3->setup 3->setup
2->35 2->35 2->35 2->35 2->35
1->35 3->2a 3->2a 1->35 1->35
2->2a 3->setup 3->setup 2->2a 2->2a
1->2a 1->35 1->35 1->2a 1->2a
2->35 1->2a 1->2a 2->35
1->35 1->35 1->35 1->35
3->int 3->int 3->int 3->int 3->int
3->35 3->35 3->35 3->35 3->35
1->5DF/1439C 1->5DC/1439C 1->5DE/1439C
2->5E0/143BC 2->5DE/143BC 2->5DF/143BC
sg:170E sg:1AAB sg:1A60
XXX:
5DD 5DF 5DC 5DF 5DE
5E0 5E0 5DE 5E0 5DF

The first three columns where working tries, the last two failed one drive.
column 1: ATA_DEBUG added, reboot
column 2: +my additions, reboot
column 3: +my additions, cold boot, wanted to make it fail, but worked
column 4: ATA_DEBUG added, cold boot
column 5: +my additions, cold boot
[x]->[y]: x is the ata-port, 1+2 on the sata_sil24, 3 on sata_nv with swncq
y:35 -> SYNCHRONIZE_CACHE commands that where send to the drive
y:2a -> WRITE_10 commands that where send to the drive
y:setup -> Debug from swncq: nv_swncq_dmafis: dma setup tag 0x0
y:int -> Debug from swncq: nv_swncq_host_interrupt: id 0x3 SWNCQ:
qc_active 0x1 ...

The lines before the XXX:
x->a/b: x is the ata-port, a the paddr from sil24_qc_issue, b the
activate from sil24_qc_issue
All outputs from sil24_qc_issue where identical in each boot sequence,
only differed from run to run.
sg:a: a is the sge->addr from sil24_fill_sg

The lines after the XXX:
This are the addresses that the XXX-printk from sil24_port_start prints.

I hope I explained enough what above table should mean.
This hole sequence (two syncs and one write to each drive) happens
between the output:
[ 40.300000] md1: bitmap initialized from disk: read 10/10 pages, set 87 bits
[ 40.320000] created bitmap (145 pages) for device md1
and the error on a bad boot:
[ 70.680000] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[ 70.700000] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
or if on a good boot:
[ 40.910000] md: considering sdb1 ...
(sdb1 is part of another raid)

(If someone whats to complete bootlogs, just ask)

So now I have two questions:
1) What happens in sil24_fill_sg with SGE_TRM?
2) If that is ok, should I try to add debug to sil24_error_intr and/or
sil24_host_intr?

Torsten
Torsten Kaiser
2007-10-07 14:39:15 UTC
Permalink
[Adding Jens Axboe, the author of what looks like the probable cause]
Post by Torsten Kaiser
static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
struct sil24_sge *sge)
{
struct scatterlist *sg;
ata_for_each_sg(sg, qc) {
sge->addr = cpu_to_le64(sg_dma_address(sg));
sge->cnt = cpu_to_le32(sg_dma_len(sg));
if (ata_sg_is_last(sg, qc))
sge->flags = cpu_to_le32(SGE_TRM);
else
sge->flags = 0;
DPRINTK("flags,addr,cnt = 0x%x, 0x%X, 0x%X\n", sge->flags,
sge->addr, sge->cnt);
sge++;
}
}
Suspicious is, that *all* output from this DPRINTK shows flags as 0x0,
so the last sg is never terminated (SGE_TRM is 1<<31)?
But if that is the cause, how is this working at all? Or am I doing
something stupid?
Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
the change to libata.h seems bogus :

in ata_qc_first_sg:
old new
return qc->__sg return qc->__sg
qc->__sg - qc->__sg == 0 qc->n_iter=0
-> sg - qc->__sg corresponds to qc->n_iter

in ata_qc_next_sg:
sg++; sg_next(sg); qc->n_iter++;
sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
-> sg - qc->__sg corresponds to qc->n_iter

but in ata_sg_is_last:
(sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
qc->n_iter+1 == qc->n_elem?

That missing +1 would explain, why the SGE_TRM never gets set.

And it would fit the symptoms, that the boot would fail at random. If
the "correct" garbage was in place to where the sglist runs off it
hangs the drive.
And that would even fit the two different errors that I only got one time each:
* a completely illegal access (PCI master abort while fetching SGT)
* wrong alignment of the SGT (SGT no on qword boundary)
At that that times the garbage seemed to point invalid addresses.

But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?

Torsten
Tejun Heo
2007-10-11 03:25:49 UTC
Permalink
Post by Torsten Kaiser
Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
old new
return qc->__sg return qc->__sg
qc->__sg - qc->__sg == 0 qc->n_iter=0
-> sg - qc->__sg corresponds to qc->n_iter
sg++; sg_next(sg); qc->n_iter++;
sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
-> sg - qc->__sg corresponds to qc->n_iter
(sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
qc->n_iter+1 == qc->n_elem?
That missing +1 would explain, why the SGE_TRM never gets set.
Thanks a lot for tracking this down. Does changing the above code fix
your problem?

Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
during iteration doesn't look like a good idea. Those iterators are not
supposed to have side effects. Would it be difficult to implement
sg_last() test?
Post by Torsten Kaiser
And it would fit the symptoms, that the boot would fail at random. If
the "correct" garbage was in place to where the sglist runs off it
hangs the drive.
* a completely illegal access (PCI master abort while fetching SGT)
* wrong alignment of the SGT (SGT no on qword boundary)
At that that times the garbage seemed to point invalid addresses.
But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?
I have no idea why it fails only sometimes.
--
tejun
Torsten Kaiser
2007-10-11 05:54:53 UTC
Permalink
Post by Tejun Heo
Post by Torsten Kaiser
Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
old new
return qc->__sg return qc->__sg
qc->__sg - qc->__sg == 0 qc->n_iter=0
-> sg - qc->__sg corresponds to qc->n_iter
sg++; sg_next(sg); qc->n_iter++;
sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
-> sg - qc->__sg corresponds to qc->n_iter
(sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
qc->n_iter+1 == qc->n_elem?
That missing +1 would explain, why the SGE_TRM never gets set.
Thanks a lot for tracking this down. Does changing the above code fix
your problem?
I did not try it.
I'm not an libata expert and while this change looks suspicios, I
can't be 100% sure if that change was intended.
And I did not want to experiment this deep in the code and risk
corrupting the hole drive.
Post by Tejun Heo
Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
during iteration doesn't look like a good idea. Those iterators are not
supposed to have side effects. Would it be difficult to implement
sg_last() test?
Post by Torsten Kaiser
And it would fit the symptoms, that the boot would fail at random. If
the "correct" garbage was in place to where the sglist runs off it
hangs the drive.
* a completely illegal access (PCI master abort while fetching SGT)
* wrong alignment of the SGT (SGT no on qword boundary)
At that that times the garbage seemed to point invalid addresses.
But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?
I have no idea why it fails only sometimes.
And that is, why I'm so unsure.
The error looks to serious to only cause random failures on one of two
drives on bootup.
I never had trouble with the remaining drive on the SiI-chip or both
drives if one got killed during booting.

I'm guessing that leaving the computer powered down long enough fills
the RAM with a special pattern that really hangs the drive, while
normaly it would just reject the invalid data. (I have ECC-RAM, does
this matter?)

Another guess might be that most of the time the Sil-chip correctly
terminates after the transfer-length is reached, even if SGE_TRM is
missing...

Torsten
Tejun Heo
2007-10-11 06:26:27 UTC
Permalink
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
That missing +1 would explain, why the SGE_TRM never gets set.
Thanks a lot for tracking this down. Does changing the above code fix
your problem?
I did not try it.
I'm not an libata expert and while this change looks suspicios, I
can't be 100% sure if that change was intended.
And I did not want to experiment this deep in the code and risk
corrupting the hole drive.
I don't think you would risk too much by changing that bit of code.
Please try it.
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
But I'm still not understanding, how the kernel could only fail
sometimes at bootup, but after that working without any visible
errors? Is the sil-chip rather intelligent about detecting corrupted
sglists and silently ignoring them?
I have no idea why it fails only sometimes.
And that is, why I'm so unsure.
The error looks to serious to only cause random failures on one of two
drives on bootup.
I never had trouble with the remaining drive on the SiI-chip or both
drives if one got killed during booting.
I'm guessing that leaving the computer powered down long enough fills
the RAM with a special pattern that really hangs the drive, while
normaly it would just reject the invalid data. (I have ECC-RAM, does
this matter?)
Another guess might be that most of the time the Sil-chip correctly
terminates after the transfer-length is reached, even if SGE_TRM is
missing...
I have no idea either. We'll probably need a PCI bus tracer to tell
exactly what's going on.

Thanks.
--
tejun
Torsten Kaiser
2007-10-11 17:51:11 UTC
Permalink
Post by Tejun Heo
Post by Torsten Kaiser
Post by Tejun Heo
Post by Torsten Kaiser
That missing +1 would explain, why the SGE_TRM never gets set.
Thanks a lot for tracking this down. Does changing the above code fix
your problem?
I did not try it.
I'm not an libata expert and while this change looks suspicios, I
can't be 100% sure if that change was intended.
And I did not want to experiment this deep in the code and risk
corrupting the hole drive.
I don't think you would risk too much by changing that bit of code.
Please try it.
The debug output now shows that SGE_TRM is getting set.
I will try to reproduce the error tomorrow morning, and report if I'm
still seeing it.

Reporting that I'm no longer see it is sadly not a good indicator, as
it also seems to depend on other factors like memory layout and/or
code alignment.

Torsten

Jens Axboe
2007-10-11 08:26:04 UTC
Permalink
Post by Tejun Heo
Post by Torsten Kaiser
Looking closer at
http://git.kernel.org/?p=linux/kernel/git/axboe/linux-2.6-block.git;a=commitdiff;h=ec6fdded4d76aa54aa57341e5dfdd61c507b1dcd
old new
return qc->__sg return qc->__sg
qc->__sg - qc->__sg == 0 qc->n_iter=0
-> sg - qc->__sg corresponds to qc->n_iter
sg++; sg_next(sg); qc->n_iter++;
sg - qc->__sg < qc->n_elem qc->n_iter < qc->nelem
-> sg - qc->__sg corresponds to qc->n_iter
(sg - qc->__sg) +1 == qc->n_elem qc->n_iter == qc->n_elem
if sg - qc->__sg corresponds to qc->n_iter then shoudn't it be
qc->n_iter+1 == qc->n_elem?
That missing +1 would explain, why the SGE_TRM never gets set.
Thanks a lot for tracking this down. Does changing the above code fix
your problem?
Jens, Torsten's analysis looks correct && depending on qc state (n_iter)
during iteration doesn't look like a good idea. Those iterators are not
supposed to have side effects. Would it be difficult to implement
sg_last() test?
This is the old ata_sg_is_last:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (((sg - qc->__sg) + 1) == qc->n_elem)
return 1;
return 0;
}

and the new one:

static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (qc->n_iter == qc->n_elem)
return 1;
return 0;
}

->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
understand your reference to why depending on that during iteration
would be bad?

So we could add a test for sg_last() there, but that would turn sg table
iteration into an O(N^2) operation for those drivers that use
ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
ata_sg_is_last(), it's only used to mark end-of-table entries for
hardware. That logic can be performed cheaper.

Torsten, your analysis does look correct. Does it work with this simple
patch?

diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2784163..0152bf4 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -947,7 +947,7 @@ ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
return 1;
if (qc->pad_len)
return 0;
- if (qc->n_iter == qc->n_elem)
+ if ((qc->n_iter + 1) == qc->n_elem)
return 1;
return 0;
}
--
Jens Axboe
Tejun Heo
2007-10-11 08:36:45 UTC
Permalink
Post by Jens Axboe
static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (((sg - qc->__sg) + 1) == qc->n_elem)
return 1;
return 0;
}
static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (qc->n_iter == qc->n_elem)
return 1;
return 0;
}
->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
understand your reference to why depending on that during iteration
would be bad?
Because that makes ata_sg iterator macros have hidden side effects
(nothing in the interface suggests it can't be nested and when somebody
actually nests it, finding what went wrong can be pretty difficult). I
think it would be better to have explicit ata_sg_iter passed around if
sg itself isn't enough to walk the sg list.
Post by Jens Axboe
So we could add a test for sg_last() there, but that would turn sg table
iteration into an O(N^2) operation for those drivers that use
ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
ata_sg_is_last(), it's only used to mark end-of-table entries for
hardware. That logic can be performed cheaper.
Yeap, it can be removed but having "is this the last one?" test is just
nicer to low level drivers. With ata_sg_iter, I think we can do it in
O(N) by looking up and caching the next entry.

Thanks.
--
tejun
Jens Axboe
2007-10-11 10:28:40 UTC
Permalink
Post by Tejun Heo
Post by Jens Axboe
static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (((sg - qc->__sg) + 1) == qc->n_elem)
return 1;
return 0;
}
static inline int
ata_sg_is_last(struct scatterlist *sg, struct ata_queued_cmd *qc)
{
if (sg == &qc->pad_sgent)
return 1;
if (qc->pad_len)
return 0;
if (qc->n_iter == qc->n_elem)
return 1;
return 0;
}
->n_iter is how ata_qc_next_sg() walks over the sglist, I don't
understand your reference to why depending on that during iteration
would be bad?
Because that makes ata_sg iterator macros have hidden side effects
(nothing in the interface suggests it can't be nested and when somebody
actually nests it, finding what went wrong can be pretty difficult). I
think it would be better to have explicit ata_sg_iter passed around if
sg itself isn't enough to walk the sg list.
I think it's implicit in the interface, since you don't pass an sgtable
in. But it's not a big deal to me, if you want it changed, go ahead :-)
Post by Tejun Heo
Post by Jens Axboe
So we could add a test for sg_last() there, but that would turn sg table
iteration into an O(N^2) operation for those drivers that use
ata_sg_is_last() with chained sg tables. I'd much rather just get rid of
ata_sg_is_last(), it's only used to mark end-of-table entries for
hardware. That logic can be performed cheaper.
Yeap, it can be removed but having "is this the last one?" test is just
nicer to low level drivers. With ata_sg_iter, I think we can do it in
O(N) by looking up and caching the next entry.
Sure, of you could just lookup sg_last() in the beginning of the loop.
Still seems a bit silly to me just to keep the ata_sg_is_last()
interface, since it can be done for free basically by maintaining an
sglast element in the loop that you then use when the loop is done to
set your SG_END marker (or whatever the driver needs).
--
Jens Axboe
Continue reading on narkive:
Loading...