Discussion:
2.6.29 regression: ATA bus errors on resume (was: EXT4: __ext4_get_inode_loc errors after s2disk)
(too old to reply)
Niel Lambrechts
2009-03-25 05:19:04 UTC
Permalink
On 03/25/2009 03:30 AM, Theodore Tso wrote:
> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
>> Hi,
>>
>> After upgrading to 2.6.29 I get the below errors after resuming from
>> hibernating with s2disk. I ran fsck and tried doing the same thing again
>> in 2.6.28.9-pae, but do not get any errors there.
>
> The ext4 errors are interleaved with hardware errors, and the ext4
> errors are about I/O errors.
>
> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>
> This looks more like a hibernation problem than an ext4 problem.
> Looks like the hard drive is being left in some inconsistent state
> after resuming from hibernation.
>
> - Ted

Thanks for the info Theodore, this is definitely looks like some type of
regression in 2.6.29, as the problem is not evident when I s2disk using
2.6.28.9, even after multiple suspend/resume cycles.

I found some 'ATA bus errors' and 'SError' messages in
/var/log/messages, so I've attached the messages from both 2.6.29 and
2.6.28 for comparison.

Regards,
Niel
---
Jeff Garzik
2009-03-25 06:06:59 UTC
Permalink
Niel Lambrechts wrote:
> On 03/25/2009 03:30 AM, Theodore Tso wrote:
>> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
>>> Hi,
>>>
>>> After upgrading to 2.6.29 I get the below errors after resuming from
>>> hibernating with s2disk. I ran fsck and tried doing the same thing again
>>> in 2.6.28.9-pae, but do not get any errors there.
>> The ext4 errors are interleaved with hardware errors, and the ext4
>> errors are about I/O errors.
>>
>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>
>> This looks more like a hibernation problem than an ext4 problem.
>> Looks like the hard drive is being left in some inconsistent state
>> after resuming from hibernation.
>>
>> - Ted
>
> Thanks for the info Theodore, this is definitely looks like some type of
> regression in 2.6.29, as the problem is not evident when I s2disk using
> 2.6.28.9, even after multiple suspend/resume cycles.
>
> I found some 'ATA bus errors' and 'SError' messages in
> /var/log/messages, so I've attached the messages from both 2.6.29 and
> 2.6.28 for comparison.

Well, here is the interpretation of messages:

> ata1.00: irq_stat 0x00400008, PHY RDY changed
> ata1: SError: { PHYRdyChg CommWake }

Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
which requires us to abort a bunch of queued commands:

> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
[...]
> ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)

...through the 14th command (tag 13).

> Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
> Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133


SATA link is reset, and ACPI is re-run.

> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> Mar 24 21:29:14 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455

The SCSI subsystem aborts each of the queued commands.

> Mar 24 21:29:14 linux-7vph kernel: ata1: EH complete

SATA error handling completes

> Mar 24 21:29:14 linux-7vph kernel: EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2330178, block=9338883

ext4 pukes, because of SATA/SCSI errors

> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors: (200 GB/186 GiB)
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write Protect is off
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

SCSI finished re-initialization.

So from this, a few observations and open questions:

Why didn't these SCSI commands get retried?

Were they left over from prior to resume?

Did SCSI error out the commands too soon? The probing is not complete
until AFTER the sector errors, I note.

Did the system resume before ACPI resume and SCSI resume completed? It
sure looks that way, from the log.

Does the asynchronous probing play a role here?

Jeff



--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Niel Lambrechts
2009-03-25 21:40:12 UTC
Permalink
On 03/25/2009 08:06 AM, Jeff Garzik wrote:
>
> So from this, a few observations and open questions:
>
> Why didn't these SCSI commands get retried?
>
> Were they left over from prior to resume?
>
> Did SCSI error out the commands too soon? The probing is not complete
> until AFTER the sector errors, I note.
>
> Did the system resume before ACPI resume and SCSI resume completed?
> It sure looks that way, from the log.
>
> Does the asynchronous probing play a role here?
It looks like there are a few different scenarios in what could happen:

I had another go at hibernating on 2.6.29 running with and without X,
and this time round it resumed with a *clean* file-system. Without X, it
looked to be better off as there were not even any ATA errors logged,
but I'm not sure if this is just because of the system being almost
entirely idle.

Going on your comments of earlier, when hibernating in X it looks like
this time the SCSI initialization completes normally before any errors,
being 'ATA bus error' messages for a while until a hard reset occurs.

I've attached the messages from both, hopefully this gives someone more
of a clue of what is happening.

cheers
Niel
James Bottomley
2009-03-25 22:16:18 UTC
Permalink
On Wed, 2009-03-25 at 02:06 -0400, Jeff Garzik wrote:
> Niel Lambrechts wrote:
> > On 03/25/2009 03:30 AM, Theodore Tso wrote:
> >> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
> >>> Hi,
> >>>
> >>> After upgrading to 2.6.29 I get the below errors after resuming from
> >>> hibernating with s2disk. I ran fsck and tried doing the same thing again
> >>> in 2.6.28.9-pae, but do not get any errors there.
> >> The ext4 errors are interleaved with hardware errors, and the ext4
> >> errors are about I/O errors.
> >>
> >> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
> >> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
> >>
> >> This looks more like a hibernation problem than an ext4 problem.
> >> Looks like the hard drive is being left in some inconsistent state
> >> after resuming from hibernation.
> >>
> >> - Ted
> >
> > Thanks for the info Theodore, this is definitely looks like some type of
> > regression in 2.6.29, as the problem is not evident when I s2disk using
> > 2.6.28.9, even after multiple suspend/resume cycles.
> >
> > I found some 'ATA bus errors' and 'SError' messages in
> > /var/log/messages, so I've attached the messages from both 2.6.29 and
> > 2.6.28 for comparison.
>
> Well, here is the interpretation of messages:
>
> > ata1.00: irq_stat 0x00400008, PHY RDY changed
> > ata1: SError: { PHYRdyChg CommWake }
>
> Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
> which requires us to abort a bunch of queued commands:
>
> > ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
> > res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
> [...]
> > ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
> > res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>
> ...through the 14th command (tag 13).
>
> > Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
> > Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> > Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>
>
> SATA link is reset, and ACPI is re-run.
>
> > Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> > Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> > Mar 24 21:29:14 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> > Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> > Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
> > Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> > Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455
>
> The SCSI subsystem aborts each of the queued commands.

No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
sense data for each of the outstanding I/Os

The only place these are generated is in ata_sense_to_error() which only
occurs if there's some type of ata error.

If I had to theorise, I'd say the system suspended with commands
outstanding to the device. On resume, the device gets reset and returns
some type of ATA error which gets translated to ABORTED COMMAND which
causes a failure.

In the mid layer, we translate ABORTED_COMMAND into a retry until the
command runs out of them ... could it be there's a race readying the
device and we run through the retries before it can accept the command?

James


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Niel Lambrechts
2009-03-25 22:43:38 UTC
Permalink
On 03/26/2009 12:30 AM, Pavel Machek wrote:
> On Tue 2009-03-24 22:25:57, Niel Lambrechts wrote:
>> After upgrading to 2.6.29 I get the below errors after resuming from
>> hibernating with s2disk. I ran fsck and tried doing the same thing again
>> in 2.6.28.9-pae, but do not get any errors there.
>
> what kind of disk/what kind of controller is that?

Hi Pavel,

It is a Lenovo W500 Thinkpad.

lspci shows:

00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI
Controller (rev 03)

and for the disk, hdparm shows:
/dev/sda:

Model=HITACHI HTS722020K9SA00 FDE , FwRev=DC4LC75A,
SerialNo=080726DP0470DTGMUNPC
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=15203kB, MaxMultSect=16, MultSect=?16?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=390721968
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7

* signifies the current active mode


Niel
Niel Lambrechts
2009-03-27 19:10:52 UTC
Permalink
On 03/26/2009 12:20 AM, James Bottomley wrote:
> On Wed, 2009-03-25 at 02:06 -0400, Jeff Garzik wrote:
>> Niel Lambrechts wrote:
>>> On 03/25/2009 03:30 AM, Theodore Tso wrote:
>>>> On Tue, Mar 24, 2009 at 10:25:57PM +0200, Niel Lambrechts wrote:
>>>>> Hi,
>>>>>
>>>>> After upgrading to 2.6.29 I get the below errors after resuming from
>>>>> hibernating with s2disk. I ran fsck and tried doing the same thing again
>>>>> in 2.6.28.9-pae, but do not get any errors there.
>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>> errors are about I/O errors.
>>>>
>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>>>
>>>> This looks more like a hibernation problem than an ext4 problem.
>>>> Looks like the hard drive is being left in some inconsistent state
>>>> after resuming from hibernation.
>>>>
>>>> - Ted
>>> Thanks for the info Theodore, this is definitely looks like some type of
>>> regression in 2.6.29, as the problem is not evident when I s2disk using
>>> 2.6.28.9, even after multiple suspend/resume cycles.
>>>
>>> I found some 'ATA bus errors' and 'SError' messages in
>>> /var/log/messages, so I've attached the messages from both 2.6.29 and
>>> 2.6.28 for comparison.
>> Well, here is the interpretation of messages:
>>
>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>> ata1: SError: { PHYRdyChg CommWake }
>> Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
>> which requires us to abort a bunch of queued commands:
>>
>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>> [...]
>>> ata1.00: cmd 60/30:68:07:b3:10/00:00:0c:00:00/40 tag 13 ncq 24576 in
>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>> ...through the 14th command (tag 13).
>>
>>> Mar 24 21:29:14 linux-7vph kernel: ata1: hard resetting link
>>> Mar 24 21:29:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>>> Mar 24 21:29:14 linux-7vph kernel: ata1.00: configured for UDMA/133
>>
>> SATA link is reset, and ACPI is re-run.
>>
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
>>> Mar 24 21:29:14 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
>>> Mar 24 21:29:14 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
>>> Mar 24 21:29:14 linux-7vph kernel: 0c 10 b3 07
>>> Mar 24 21:29:14 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
>>> Mar 24 21:29:14 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190455
>> The SCSI subsystem aborts each of the queued commands.
>
> No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
> sense data for each of the outstanding I/Os
>
> The only place these are generated is in ata_sense_to_error() which only
> occurs if there's some type of ata error.
>
> If I had to theorise, I'd say the system suspended with commands
> outstanding to the device. On resume, the device gets reset and returns
> some type of ATA error which gets translated to ABORTED COMMAND which
> causes a failure.
>
> In the mid layer, we translate ABORTED_COMMAND into a retry until the
> command runs out of them ... could it be there's a race readying the
> device and we run through the retries before it can accept the command?

I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
although with all the 2.6.28 variants I used s2disk/resume has always
worked. I was wondering if these "errors" perhaps play more of a role in
2.6.29, perhaps due to the async. changes that was mentioned?


In particular, I notice some the following before "EH complete":
ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
ata1: irq_stat 0x00400040, connection status changed

but before this last message there is a
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

Why would the connection status change if a moment ago it was "link up"?

Niel
Arjan van de Ven
2009-03-27 22:30:44 UTC
Permalink
On Fri, 27 Mar 2009 21:10:52 +0200
Niel Lambrechts <***@gmail.com> wrote:

> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
> although with all the 2.6.28 variants I used s2disk/resume has always
> worked. I was wondering if these "errors" perhaps play more of a role
> in 2.6.29, perhaps due to the async. changes that was mentioned?

unless you actively enabled this via a kernel command line option there
are no async changes in 2.6.29 in terms of behavior.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Niel Lambrechts
2009-03-28 10:22:24 UTC
Permalink
On 03/28/2009 12:30 AM, Arjan van de Ven wrote:
> On Fri, 27 Mar 2009 21:10:52 +0200
> Niel Lambrechts <***@gmail.com> wrote:
>
>
>> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
>> although with all the 2.6.28 variants I used s2disk/resume has always
>> worked. I was wondering if these "errors" perhaps play more of a role
>> in 2.6.29, perhaps due to the async. changes that was mentioned?
>>
>
> unless you actively enabled this via a kernel command line option there
> are no async changes in 2.6.29 in terms of behavior.
>
>
>
The only non-default option I had was 'modeset=1'. From Jeff's earlier
comment I understood the probing behaviour changed.

The fundamental difference is that in 2.6.29 everything initially seems
okay, but then there is a
ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
ata1.00: irq_stat 0x00400008, PHY RDY changed

There's nothing frozen it 2.6.28.

Should I log a kernel bug, what's the best way forward and is there
anything more I can do to help?

cheers
Niel
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki
2009-03-28 14:06:02 UTC
Permalink
On Saturday 28 March 2009, Niel Lambrechts wrote:
> On 03/28/2009 12:30 AM, Arjan van de Ven wrote:
> > On Fri, 27 Mar 2009 21:10:52 +0200
> > Niel Lambrechts <***@gmail.com> wrote:
> >
> >
> >> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
> >> although with all the 2.6.28 variants I used s2disk/resume has always
> >> worked. I was wondering if these "errors" perhaps play more of a role
> >> in 2.6.29, perhaps due to the async. changes that was mentioned?
> >>
> >
> > unless you actively enabled this via a kernel command line option there
> > are no async changes in 2.6.29 in terms of behavior.
> >
> >
> >
> The only non-default option I had was 'modeset=1'. From Jeff's earlier
> comment I understood the probing behaviour changed.
>
> The fundamental difference is that in 2.6.29 everything initially seems
> okay, but then there is a
> ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
> ata1.00: irq_stat 0x00400008, PHY RDY changed
>
> There's nothing frozen it 2.6.28.
>
> Should I log a kernel bug, what's the best way forward and is there
> anything more I can do to help?

Let Tejun have a look a this, perhaps?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo
2009-03-30 08:43:17 UTC
Permalink
Hello,

Rafael J. Wysocki wrote:
> On Saturday 28 March 2009, Niel Lambrechts wrote:
>> On 03/28/2009 12:30 AM, Arjan van de Ven wrote:
>>> On Fri, 27 Mar 2009 21:10:52 +0200
>>> Niel Lambrechts <***@gmail.com> wrote:
>>>
>>>
>>>> I'm seeing some dubious looking ATA messages even on 2.6.28.9-pae,
>>>> although with all the 2.6.28 variants I used s2disk/resume has always
>>>> worked. I was wondering if these "errors" perhaps play more of a role
>>>> in 2.6.29, perhaps due to the async. changes that was mentioned?
>>>>
>>> unless you actively enabled this via a kernel command line option there
>>> are no async changes in 2.6.29 in terms of behavior.
>>>
>>>
>>>
>> The only non-default option I had was 'modeset=1'. From Jeff's earlier
>> comment I understood the probing behaviour changed.
>>
>> The fundamental difference is that in 2.6.29 everything initially seems
>> okay, but then there is a
>> ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x50000 action0xe frozen
>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>
>> There's nothing frozen it 2.6.28.
>>
>> Should I log a kernel bug, what's the best way forward and is there
>> anything more I can do to help?
>
> Let Tejun have a look a this, perhaps?

What Niel is seeing is probably caused by libata EH somehow moving
forward too fast and receiving the second PHY changed event after the
initial reset is complete. That or the thaw routine is broken and
doesn't clear hotplug event properly. Actually, this double reset
seems to happen quite often, so it might be about time to drill it
down and find out what's really going on. But, generally, it isn't a
serious problem, all that happens is EH doing another round. The
original one looks quite serious tho. I'll reply separately.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo
2009-03-30 08:55:23 UTC
Permalink
Hello,

For some reason, I can't find the original thread, so replying here.

Niel Lambrechts wrote:
>>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>>> errors are about I/O errors.
>>>>>
>>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>>>>
>>>>> This looks more like a hibernation problem than an ext4 problem.
>>>>> Looks like the hard drive is being left in some inconsistent state
>>>>> after resuming from hibernation.

Yeap, ext4 is just the victim here.

>>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>>> ata1: SError: { PHYRdyChg CommWake }
>>> Your SATA hardware flags a connect-or-disconnect event ("PHY RDY"),
>>> which requires us to abort a bunch of queued commands:
>>>
>>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>>> [...]
...
>>> The SCSI subsystem aborts each of the queued commands.
>> No .. this is the SCSI subsystem receives an ABORTED COMMAND return in
>> sense data for each of the outstanding I/Os
>>
>> The only place these are generated is in ata_sense_to_error() which only
>> occurs if there's some type of ata error.
>>
>> If I had to theorise, I'd say the system suspended with commands
>> outstanding to the device. On resume, the device gets reset and returns
>> some type of ATA error which gets translated to ABORTED COMMAND which
>> causes a failure.
>>
>> In the mid layer, we translate ABORTED_COMMAND into a retry until the
>> command runs out of them ... could it be there's a race readying the
>> device and we run through the retries before it can accept the command?

When libata-eh thinks that the problem isn't worth retrying, it sets
scmd->retries to scmd->allowed so that it gets aborted immediately.
The code is in ata_eh_qc_complete().

Whether a command is to be retried or not is determined with
ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
command. Immediate-failure criteria is pretty strict - only driver
software errors (AC_ERR_INVALID) and PC or other special commands
which failed which got aborted by the device get the immediate pink
slip. In this case, the commands are from FS and failed with
AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
Strange.

How reproducible is the problem? Are you interested in trying out
some debug patches?

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Niel Lambrechts
2009-04-01 19:48:11 UTC
Permalink
On 03/30/2009 04:40 PM, Jeff Garzik wrote:
> Niel Lambrechts wrote:
>> On 03/30/2009 11:00 AM, Tejun Heo wrote:
>>> Hello,
>>>
>>> For some reason, I can't find the original thread, so replying here.
>>>
>>> Niel Lambrechts wrote:
>>>>>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>>>>>> errors are about I/O errors.
>>>>>>>>
>>>>>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
>>>>>>>> read inode block - inode=2346519
>>>>>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO
>>>>>>>> failure
>>>>>>>>
>>>>>>>> This looks more like a hibernation problem than an ext4 problem.
>>>>>>>> Looks like the hard drive is being left in some inconsistent state
>>>>>>>> after resuming from hibernation.
>>> Yeap, ext4 is just the victim here.
>>>
>>>>>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>>>>>> ata1: SError: { PHYRdyChg CommWake }
>>>>>> Your SATA hardware flags a connect-or-disconnect event ("PHY
>>>>>> RDY"), which requires us to abort a bunch of queued commands:
>>>>>>
>>>>>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>>>>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA
>>>>>>> bus error)
>>>>>> [...]
>>> ...
>>>>>> The SCSI subsystem aborts each of the queued commands.
>>>>> No .. this is the SCSI subsystem receives an ABORTED COMMAND
>>>>> return in
>>>>> sense data for each of the outstanding I/Os
>>>>>
>>>>> The only place these are generated is in ata_sense_to_error()
>>>>> which only
>>>>> occurs if there's some type of ata error.
>>>>>
>>>>> If I had to theorise, I'd say the system suspended with commands
>>>>> outstanding to the device. On resume, the device gets reset and
>>>>> returns
>>>>> some type of ATA error which gets translated to ABORTED COMMAND which
>>>>> causes a failure.
>>>>>
>>>>> In the mid layer, we translate ABORTED_COMMAND into a retry until the
>>>>> command runs out of them ... could it be there's a race readying the
>>>>> device and we run through the retries before it can accept the
>>>>> command?
>>> When libata-eh thinks that the problem isn't worth retrying, it sets
>>> scmd->retries to scmd->allowed so that it gets aborted immediately.
>>> The code is in ata_eh_qc_complete().
>>>
>>> Whether a command is to be retried or not is determined with
>>> ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
>>> command. Immediate-failure criteria is pretty strict - only driver
>>> software errors (AC_ERR_INVALID) and PC or other special commands
>>> which failed which got aborted by the device get the immediate pink
>>> slip. In this case, the commands are from FS and failed with
>>> AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
>>> Strange.
>>>
>>> How reproducible is the problem? Are you interested in trying out
>>> some debug patches?
>>
>> Hi Tejun,
>>
>> I think I should be able to reproduce when actively using X with 2.6.29,
>> and I have an external disk where I could backup to / boot from if the
>> corruption became a problem.
>>
>> These issues are keeping me from 2.6.29 so I'll gladly help where I can,
>> if you can please provide me the patches and the .config settings that
>> may be required?
>>
>> Niel
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-kernel" in
>> the body of a message to ***@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>
> Any chance you could use bisect to narrow down the problem commit?
>
> http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
>
>
> This should identify which patch caused your problems, if you have a
> known good starting point (such as 2.6.28).
I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
can tell the closest thing good kernel I can tell git to use is 2.6.28
base itself. So now what happens is that resume entirely fails during
some of the bisects due to entirely other regressions that are present
in older and newer kernels than mine, so I can't test the real issue! :(

cheers
Niel
Jeff Garzik
2009-04-03 20:09:14 UTC
Permalink
Niel Lambrechts wrote:
> On 03/30/2009 04:40 PM, Jeff Garzik wrote:
>> Niel Lambrechts wrote:
>>> On 03/30/2009 11:00 AM, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> For some reason, I can't find the original thread, so replying here.
>>>>
>>>> Niel Lambrechts wrote:
>>>>>>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>>>>>>> errors are about I/O errors.
>>>>>>>>>
>>>>>>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
>>>>>>>>> read inode block - inode=2346519
>>>>>>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO
>>>>>>>>> failure
>>>>>>>>>
>>>>>>>>> This looks more like a hibernation problem than an ext4 problem.
>>>>>>>>> Looks like the hard drive is being left in some inconsistent state
>>>>>>>>> after resuming from hibernation.
>>>> Yeap, ext4 is just the victim here.
>>>>
>>>>>>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>>>>>>> ata1: SError: { PHYRdyChg CommWake }
>>>>>>> Your SATA hardware flags a connect-or-disconnect event ("PHY
>>>>>>> RDY"), which requires us to abort a bunch of queued commands:
>>>>>>>
>>>>>>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>>>>>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA
>>>>>>>> bus error)
>>>>>>> [...]
>>>> ...
>>>>>>> The SCSI subsystem aborts each of the queued commands.
>>>>>> No .. this is the SCSI subsystem receives an ABORTED COMMAND
>>>>>> return in
>>>>>> sense data for each of the outstanding I/Os
>>>>>>
>>>>>> The only place these are generated is in ata_sense_to_error()
>>>>>> which only
>>>>>> occurs if there's some type of ata error.
>>>>>>
>>>>>> If I had to theorise, I'd say the system suspended with commands
>>>>>> outstanding to the device. On resume, the device gets reset and
>>>>>> returns
>>>>>> some type of ATA error which gets translated to ABORTED COMMAND which
>>>>>> causes a failure.
>>>>>>
>>>>>> In the mid layer, we translate ABORTED_COMMAND into a retry until the
>>>>>> command runs out of them ... could it be there's a race readying the
>>>>>> device and we run through the retries before it can accept the
>>>>>> command?
>>>> When libata-eh thinks that the problem isn't worth retrying, it sets
>>>> scmd->retries to scmd->allowed so that it gets aborted immediately.
>>>> The code is in ata_eh_qc_complete().
>>>>
>>>> Whether a command is to be retried or not is determined with
>>>> ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each failed
>>>> command. Immediate-failure criteria is pretty strict - only driver
>>>> software errors (AC_ERR_INVALID) and PC or other special commands
>>>> which failed which got aborted by the device get the immediate pink
>>>> slip. In this case, the commands are from FS and failed with
>>>> AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
>>>> Strange.
>>>>
>>>> How reproducible is the problem? Are you interested in trying out
>>>> some debug patches?
>>> Hi Tejun,
>>>
>>> I think I should be able to reproduce when actively using X with 2.6.29,
>>> and I have an external disk where I could backup to / boot from if the
>>> corruption became a problem.
>>>
>>> These issues are keeping me from 2.6.29 so I'll gladly help where I can,
>>> if you can please provide me the patches and the .config settings that
>>> may be required?
>>>
>>> Niel
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-kernel" in
>>> the body of a message to ***@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>
>> Any chance you could use bisect to narrow down the problem commit?
>>
>> http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
>>
>>
>> This should identify which patch caused your problems, if you have a
>> known good starting point (such as 2.6.28).
> I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
> can tell the closest thing good kernel I can tell git to use is 2.6.28
> base itself. So now what happens is that resume entirely fails during
> some of the bisects due to entirely other regressions that are present
> in older and newer kernels than mine, so I can't test the real issue! :(

"git help bisect" or "man git-bisect" has a wealth of information.

Most notably, you can use "git bisect skip" if the current commit cannot
be tested, and thus cannot be declared good or bad.

Jeff
Niel Lambrechts
2009-04-03 20:54:23 UTC
Permalink
On 04/03/2009 10:09 PM, Jeff Garzik wrote:
> Niel Lambrechts wrote:
>> On 03/30/2009 04:40 PM, Jeff Garzik wrote:
>>> Niel Lambrechts wrote:
>>>> On 03/30/2009 11:00 AM, Tejun Heo wrote:
>>>>> Hello,
>>>>>
>>>>> For some reason, I can't find the original thread, so replying here.
>>>>>
>>>>> Niel Lambrechts wrote:
>>>>>>>>>> The ext4 errors are interleaved with hardware errors, and the
>>>>>>>>>> ext4
>>>>>>>>>> errors are about I/O errors.
>>>>>>>>>>
>>>>>>>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
>>>>>>>>>> read inode block - inode=2346519
>>>>>>>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO
>>>>>>>>>> failure
>>>>>>>>>>
>>>>>>>>>> This looks more like a hibernation problem than an ext4 problem.
>>>>>>>>>> Looks like the hard drive is being left in some inconsistent
>>>>>>>>>> state
>>>>>>>>>> after resuming from hibernation.
>>>>> Yeap, ext4 is just the victim here.
>>>>>
>>>>>>>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>>>>>>>> ata1: SError: { PHYRdyChg CommWake }
>>>>>>>> Your SATA hardware flags a connect-or-disconnect event ("PHY
>>>>>>>> RDY"), which requires us to abort a bunch of queued commands:
>>>>>>>>
>>>>>>>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq
>>>>>>>>> 12288 in
>>>>>>>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA
>>>>>>>>> bus error)
>>>>>>>> [...]
>>>>> ...
>>>>>>>> The SCSI subsystem aborts each of the queued commands.
>>>>>>> No .. this is the SCSI subsystem receives an ABORTED COMMAND
>>>>>>> return in
>>>>>>> sense data for each of the outstanding I/Os
>>>>>>>
>>>>>>> The only place these are generated is in ata_sense_to_error()
>>>>>>> which only
>>>>>>> occurs if there's some type of ata error.
>>>>>>>
>>>>>>> If I had to theorise, I'd say the system suspended with commands
>>>>>>> outstanding to the device. On resume, the device gets reset and
>>>>>>> returns
>>>>>>> some type of ATA error which gets translated to ABORTED COMMAND
>>>>>>> which
>>>>>>> causes a failure.
>>>>>>>
>>>>>>> In the mid layer, we translate ABORTED_COMMAND into a retry
>>>>>>> until the
>>>>>>> command runs out of them ... could it be there's a race readying
>>>>>>> the
>>>>>>> device and we run through the retries before it can accept the
>>>>>>> command?
>>>>> When libata-eh thinks that the problem isn't worth retrying, it sets
>>>>> scmd->retries to scmd->allowed so that it gets aborted immediately.
>>>>> The code is in ata_eh_qc_complete().
>>>>>
>>>>> Whether a command is to be retried or not is determined with
>>>>> ATA_QCFLAG_RETRY which is set in ata_eh_link_autopsy() for each
>>>>> failed
>>>>> command. Immediate-failure criteria is pretty strict - only driver
>>>>> software errors (AC_ERR_INVALID) and PC or other special commands
>>>>> which failed which got aborted by the device get the immediate pink
>>>>> slip. In this case, the commands are from FS and failed with
>>>>> AC_ERR_ATA_BUS, so it definitely doesn't fit into the criteria.
>>>>> Strange.
>>>>>
>>>>> How reproducible is the problem? Are you interested in trying out
>>>>> some debug patches?
>>>> Hi Tejun,
>>>>
>>>> I think I should be able to reproduce when actively using X with
>>>> 2.6.29,
>>>> and I have an external disk where I could backup to / boot from if the
>>>> corruption became a problem.
>>>>
>>>> These issues are keeping me from 2.6.29 so I'll gladly help where I
>>>> can,
>>>> if you can please provide me the patches and the .config settings that
>>>> may be required?
>>>>
>>>> Niel
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-kernel" in
>>>> the body of a message to ***@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at http://www.tux.org/lkml/
>>>>
>>> Any chance you could use bisect to narrow down the problem commit?
>>>
>>> http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
>>>
>>>
>>>
>>> This should identify which patch caused your problems, if you have a
>>> known good starting point (such as 2.6.28).
>> I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
>> can tell the closest thing good kernel I can tell git to use is 2.6.28
>> base itself. So now what happens is that resume entirely fails during
>> some of the bisects due to entirely other regressions that are present
>> in older and newer kernels than mine, so I can't test the real issue! :(
>
> "git help bisect" or "man git-bisect" has a wealth of information.
>
> Most notably, you can use "git bisect skip" if the current commit
> cannot be tested, and thus cannot be declared good or bad.
Thanks Jeff, still finding my way around git, it was not that clear to
me if there is a way to tell git that 2.6.28.9 was good, to me it looked
like the main kernel only consists of the RC and the final release, not
any subsequent patches?

Anyhow, Tejun's patch is holding up well apart from the glitch I
reported on yesterday - I guess it could have simply been part of the
debugging output Tejun included in the patch and not a real issue.

I haven't heard from him (yet), but I wonder if you might know if the
results give enough of a clue for a patch that could be included in say
2.6.29.1?

Thanks!
Niel
Tejun Heo
2009-04-02 01:50:32 UTC
Permalink
Hello,

Sorry about the delay.

Niel Lambrechts wrote:
> I think I should be able to reproduce when actively using X with 2.6.29,
> and I have an external disk where I could backup to / boot from if the
> corruption became a problem.
>
> These issues are keeping me from 2.6.29 so I'll gladly help where I can,
> if you can please provide me the patches and the .config settings that
> may be required?

Attached is the debug patch. Please reproduce the problem with the
patch applied and post full kernel log.

Thanks.

--
tejun
Niel Lambrechts
2009-04-02 06:20:11 UTC
Permalink
On 04/02/2009 03:50 AM, Tejun Heo wrote:
> Hello,
>
> Sorry about the delay.
>
> Niel Lambrechts wrote:
>
>> I think I should be able to reproduce when actively using X with 2.6.29,
>> and I have an external disk where I could backup to / boot from if the
>> corruption became a problem.
>>
>> These issues are keeping me from 2.6.29 so I'll gladly help where I can,
>> if you can please provide me the patches and the .config settings that
>> may be required?
>>
>
> Attached is the debug patch. Please reproduce the problem with the
> patch applied and post full kernel log.
>
Hi there,

Here is the output, it took 2 attempts to reproduce the EXT4 corruption
(clean with errors).

Niel
Tejun Heo
2009-04-02 06:52:06 UTC
Permalink
Niel Lambrechts wrote:
>> Attached is the debug patch. Please reproduce the problem with the
>> patch applied and post full kernel log.
>>
> Hi there,
>
> Here is the output, it took 2 attempts to reproduce the EXT4 corruption
> (clean with errors).

Strange. Maybe IO commands are getting through while the sdev is
still in quiesce state? Can you please repeat the test with the
attached patch?

Thanks.

--
tejun
Niel Lambrechts
2009-04-02 11:03:38 UTC
Permalink
On 04/02/2009 08:52 AM, Tejun Heo wrote:
> Niel Lambrechts wrote:
>
>>> Attached is the debug patch. Please reproduce the problem with the
>>> patch applied and post full kernel log.
>>>
>>>
>> Hi there,
>>
>> Here is the output, it took 2 attempts to reproduce the EXT4 corruption
>> (clean with errors).
>>
>
> Strange. Maybe IO commands are getting through while the sdev is
> still in quiesce state? Can you please repeat the test with the
> attached patch?
>
Hi Tejun,

Thanks for your help, I've done at least 5 hibernate cycles without the
problem recurring, I'll keep at it for a while... :)

For the sake of being thorough, I'd like to mention some of the
remaining issues/messages, but to be honest some of them were there
before and may not be relevant to your efforts:

1) Can you perhaps confirm if the remaining ATA messages are harmless
enough to ignore?

dmesg:
ata2: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x1e frozen
ata2: irq_stat 0x00400040, connection status changed
ata2: SError: { PHYRdyChg CommWake DevExch }
Clocksource tsc unstable (delta = -412838835 ns)

and in messages:
Apr 2 12:31:44 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4

2) The screen remains blank on resume, right until I both press a key
_and_ touch the touchpad. Weird, but this happens in 2.6.28.9 as well,
perhaps this is i915 related.

Thanks very much for sorting the more serious issue so quickly!

Regards,
Niel
Niel Lambrechts
2009-04-02 14:15:34 UTC
Permalink
On 04/02/2009 01:03 PM, Niel Lambrechts wrote:
> On 04/02/2009 08:52 AM, Tejun Heo wrote:
>
>> Niel Lambrechts wrote:
>>
>>
>>>> Attached is the debug patch. Please reproduce the problem with the
>>>> patch applied and post full kernel log.
>>>>
>>>>
>>>>
>>> Hi there,
>>>
>>> Here is the output, it took 2 attempts to reproduce the EXT4 corruption
>>> (clean with errors).
>>>
>>>
>> Strange. Maybe IO commands are getting through while the sdev is
>> still in quiesce state? Can you please repeat the test with the
>> attached patch?
>>
>>
> Hi Tejun,
>
> Thanks for your help, I've done at least 5 hibernate cycles without the
> problem recurring, I'll keep at it for a while... :)
>
> For the sake of being thorough, I'd like to mention some of the
> remaining issues/messages, but to be honest some of them were there
> before and may not be relevant to your efforts:
>
> 1) Can you perhaps confirm if the remaining ATA messages are harmless
> enough to ignore?
>
> dmesg:
> ata2: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x1e frozen
> ata2: irq_stat 0x00400040, connection status changed
> ata2: SError: { PHYRdyChg CommWake DevExch }
> Clocksource tsc unstable (delta = -412838835 ns)
>
> and in messages:
> Apr 2 12:31:44 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
> SErr 0x0 action 0x9 t4
>
> 2) The screen remains blank on resume, right until I both press a key
> _and_ touch the touchpad. Weird, but this happens in 2.6.28.9 as well,
> perhaps this is i915 related.
>
> Thanks very much for sorting the more serious issue so quickly!
>
> Regards,
> Niel
>
UPDATE:

Just arrived back home, before suspending I enabled all the powertop
laptop mode/SATA link management etc and I removed a CD-ROM from the CD
drive so I'm not sure what could be responsible for this:

ata2.00: XXX setting retry on qc0
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000001
ata2.00: XXX terminating qc0 (SENSE), retries=0
XXX scsi_eh_flush_done_q: online=1(2) noretry=0 retries=3 allowed=3
scsi_eh_1: flush finish cmd: f5dea740
ata2.00: XXX setting retry on qc0
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000001
ata2.00: XXX terminating qc0 (SENSE), retries=0
XXX scsi_eh_flush_done_q: online=1(2) noretry=0 retries=2 allowed=2
.. loop ..

I hope it's simply something like "drive not ready" debugging ouput,
since ata2 seems associated with the CD-Rom drive... :)

This continued, until I again in powertop enabled SATA link management.

The full dmesg is attached...

Niel
Tejun Heo
2009-04-04 04:54:18 UTC
Permalink
Hello, Neil.

Niel Lambrechts wrote:
>> Thanks for your help, I've done at least 5 hibernate cycles without the
>> problem recurring, I'll keep at it for a while... :)

Hmmm..

>> For the sake of being thorough, I'd like to mention some of the
>> remaining issues/messages, but to be honest some of them were there
>> before and may not be relevant to your efforts:
>>
>> 1) Can you perhaps confirm if the remaining ATA messages are harmless
>> enough to ignore?
>>
>> dmesg:
>> ata2: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x1e frozen
>> ata2: irq_stat 0x00400040, connection status changed
>> ata2: SError: { PHYRdyChg CommWake DevExch }
>> Clocksource tsc unstable (delta = -412838835 ns)

Yeah, that happens on some machines during resume. It doesn't on some
machines (including mine) and dunno what's the difference yet. It's a
bit annoying and probably adds a bit to resume time but libata handles
things like this just fine, so no need to be alarmed too much.

>> and in messages:
>> Apr 2 12:31:44 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
>> SErr 0x0 action 0x9 t4
>>
>> 2) The screen remains blank on resume, right until I both press a key
>> _and_ touch the touchpad. Weird, but this happens in 2.6.28.9 as well,
>> perhaps this is i915 related.

Heh... yeah, black magic of display suspend/resume. Hopefully it will
get better with KMS.

> UPDATE:
>
> Just arrived back home, before suspending I enabled all the powertop
> laptop mode/SATA link management etc and I removed a CD-ROM from the
> CD drive so I'm not sure what could be responsible for this:
>
> ata2.00: XXX setting retry on qc0
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata2.00: irq_stat 0x40000001
> ata2.00: XXX terminating qc0 (SENSE), retries=0
> XXX scsi_eh_flush_done_q: online=1(2) noretry=0 retries=3 allowed=3
> scsi_eh_1: flush finish cmd: f5dea740
> ata2.00: XXX setting retry on qc0
> ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata2.00: irq_stat 0x40000001
> ata2.00: XXX terminating qc0 (SENSE), retries=0
> XXX scsi_eh_flush_done_q: online=1(2) noretry=0 retries=2 allowed=2
> .. loop ..
>
> I hope it's simply something like "drive not ready" debugging ouput,
> since ata2 seems associated with the CD-Rom drive... :)
>
> This continued, until I again in powertop enabled SATA link management.

That's probably hal polling for media presence. It uses
TEST_UNIT_READY and if there's no media in the drive, it will raise
CHECK_SENSE condition which is handled by the libata EH which is
usually quiet about it as it's not an error but with the debugging
patch, it got more whiny. If you're annoyed by it just put a media in
the drive.

Thanks.

--
tejun
Niel Lambrechts
2009-04-06 05:01:18 UTC
Permalink
On 04/04/2009 06:54 AM, Tejun Heo wrote:
> Hello, Neil.
>
> Niel Lambrechts wrote:
>
>>> Thanks for your help, I've done at least 5 hibernate cycles without the
>>> problem recurring, I'll keep at it for a while... :)
>>>
>
> Hmmm..
>
Hi Tejun,

I have tested your patch some more over the weekend as well as on
2.6.29.1, and suspend-to-disk and ram still works without a glitch for
me apart from the extra debugging being active.

Will the fix naturally make its way into the mainline kernel, or is
there any extra debugging/testing I can help with?

Regards,
Niel
Tejun Heo
2009-04-06 10:09:28 UTC
Permalink
Niel Lambrechts wrote:
> On 04/04/2009 06:54 AM, Tejun Heo wrote:
>> Hello, Neil.
>>
>> Niel Lambrechts wrote:
>>
>>>> Thanks for your help, I've done at least 5 hibernate cycles without the
>>>> problem recurring, I'll keep at it for a while... :)
>>>>
>> Hmmm..
>>
> Hi Tejun,
>
> I have tested your patch some more over the weekend as well as on
> 2.6.29.1, and suspend-to-disk and ram still works without a glitch for
> me apart from the extra debugging being active.
>
> Will the fix naturally make its way into the mainline kernel, or is
> there any extra debugging/testing I can help with?

Well, the problem is the debug patch doesn't actually do anything
other than printing out messages. It could be that the problem is
timing dependent (which is likely anyway). You still can reporduce
the problem with the patch, right?

Thanks.

--
tejun
Niel Lambrechts
2009-04-06 18:23:26 UTC
Permalink
On 04/06/2009 12:09 PM, Tejun Heo wrote:
>
>> Will the fix naturally make its way into the mainline kernel, or is
>> there any extra debugging/testing I can help with?
>>
>
> Well, the problem is the debug patch doesn't actually do anything
> other than printing out messages. It could be that the problem is
> timing dependent (which is likely anyway). You still can reporduce
> the problem with the patch, right?
>
Heh? You provided two patches, with the last one you said:
> Strange. Maybe IO commands are getting through while the sdev is
> still in quiesce state? Can you please repeat the test with the
> attached patch?
>
With the latter, I have not encountered the original problem i.e. any
severe EXT4 corruption again, not in 2.6.29 and not in 2.6.29.1.

Do I also need to try the last patch without any debugging messages?

cheers
Niel
Tejun Heo
2009-04-06 19:39:47 UTC
Permalink
Hello,

Niel Lambrechts wrote:
> On 04/06/2009 12:09 PM, Tejun Heo wrote:
>>> Will the fix naturally make its way into the mainline kernel, or is
>>> there any extra debugging/testing I can help with?
>>>
>> Well, the problem is the debug patch doesn't actually do anything
>> other than printing out messages. It could be that the problem is
>> timing dependent (which is likely anyway). You still can reporduce
>> the problem with the patch, right?
>>
> Heh? You provided two patches, with the last one you said:

Yeah, the second one actually only added printks to see whether that's
the case. No behavior change.

>> Strange. Maybe IO commands are getting through while the sdev is
>> still in quiesce state? Can you please repeat the test with the
>> attached patch?
>
> With the latter, I have not encountered the original problem i.e. any
> severe EXT4 corruption again, not in 2.6.29 and not in 2.6.29.1.

Eh... so, we're definitely seeing something which is dependent on
timing.

> Do I also need to try the last patch without any debugging messages?

Then there will be nothing left. :-) Can you please try the attached
patch? It's still only debug messages but lighter; hopefully, it
won't mask the problem.

Thanks.

--
tejun
Niel Lambrechts
2009-04-06 21:26:42 UTC
Permalink
On 04/06/2009 09:39 PM, Tejun Heo wrote:
> Hello,
>
> Niel Lambrechts wrote:
>
>> On 04/06/2009 12:09 PM, Tejun Heo wrote:
>>
>>>> Will the fix naturally make its way into the mainline kernel, or is
>>>> there any extra debugging/testing I can help with?
>>>>
>>>>
>>> Well, the problem is the debug patch doesn't actually do anything
>>> other than printing out messages. It could be that the problem is
>>> timing dependent (which is likely anyway). You still can reporduce
>>> the problem with the patch, right?
>>>
>>>
>> Heh? You provided two patches, with the last one you said:
>>
>
> Yeah, the second one actually only added printks to see whether that's
> the case. No behavior change.
>
>
>>> Strange. Maybe IO commands are getting through while the sdev is
>>> still in quiesce state? Can you please repeat the test with the
>>> attached patch?
>>>
>> With the latter, I have not encountered the original problem i.e. any
>> severe EXT4 corruption again, not in 2.6.29 and not in 2.6.29.1.
>>
>
> Eh... so, we're definitely seeing something which is dependent on
> timing.
>
>
>> Do I also need to try the last patch without any debugging messages?
>>
>
> Then there will be nothing left. :-) Can you please try the attached
> patch? It's still only debug messages but lighter; hopefully, it
> won't mask the problem.
>
Sorry, my bad - I assumed the 2nd patch actually made a functional
difference... :)

Here is the output on 2.6.29.1 with your new patch - still nothing
serious happening, resume still seems okay!

cheers
Niel
Tejun Heo
2009-04-09 18:18:11 UTC
Permalink
Hello,

Niel Lambrechts wrote:
> On 04/06/2009 09:39 PM, Tejun Heo wrote:
> Sorry, my bad - I assumed the 2nd patch actually made a functional
> difference... :)
>
> Here is the output on 2.6.29.1 with your new patch - still nothing
> serious happening, resume still seems okay!

Hmm... none of the debugging messages got triggered, so there should
be no noticeable timing change. It seems like you're being lucky for
the time being. Can you please keep testing and report back?

Thanks.

--
tejun
Niel Lambrechts
2009-05-23 09:17:56 UTC
Permalink
On 04/09/2009 08:18 PM, Tejun Heo wrote:
> Hmm... none of the debugging messages got triggered, so there should
> be no noticeable timing change. It seems like you're being lucky for
> the time being. Can you please keep testing and report back?

Hi Tejun,

Just to keep you up informed - this issue is still happening, I've been
testing the latest 2.6.30-git kernels the last couple of weeks and just
checked my logs after having an unexpected fsck upon system boot - the
freeze seems to trigger very infrequently though, since I use s2disk at
least twice a day and the error only seems to have happened twice since
May 2.

I had a look at your 2.6.29 debug patch, it does "Hunk #6 succeeded at
3368" so it should still be good (I hope), I'll try and run with it for
a while.

The fsck output also seems a bit odd, on boot it displays a bunch of
newlines and the partition name without a *real* message (the ^H being
newlines):
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|| 0.9%
<progress bar skipped>
^H- 56.5%
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|
^H\ 56.7%
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|
^H| 57.1%
/dev/disk/by-id/ata-HITACHI_HTS722020K9SA00_FDE_080726DP0470DTGMUNPC-part6:
|


Thanks,
Niel


linux-7vph:~ # uname -r
2.6.30-rc6-pae

linux-7vph:~# rpm -qf `which fsck`
e2fsprogs-1.41.4-2.1

linux-7vph:~# lspci -vvv|grep -i sata
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI
Controller (rev 03) (prog-if 01 [AHCI 1.0])

/var/log/warn:
<snip>
May 22 09:28:55 linux-7vph kernel: Syncing filesystems ... done.
May 22 09:28:55 linux-7vph kernel: Freezing user space processes ...
(elapsed 0.00 seconds) done.
May 22 09:28:55 linux-7vph kernel: Freezing remaining freezable tasks
... (elapsed 0.00 seconds) done.
May 22 09:28:55 linux-7vph kernel: Suspending console(s) (use
no_console_suspend to debug)
May 22 09:28:57 linux-7vph kernel: Disabling non-boot CPUs ...
May 22 09:28:57 linux-7vph kernel: CPU1 is down
May 22 09:28:57 linux-7vph kernel: Enabling non-boot CPUs ...
May 22 09:28:57 linux-7vph kernel: microcode: failed to init CPU1
May 22 09:28:57 linux-7vph kernel: CPU1 is up
May 22 09:28:57 linux-7vph kernel: e1000e 0000:00:19.0:
pci_enable_pcie_error_reporting failed 0xfffffffb
May 22 09:28:58 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 22 09:28:58 linux-7vph kernel: ata1: irq_stat 0x00400040, connection
status changed
May 22 09:28:58 linux-7vph kernel: Restarting tasks ... done.
May 22 09:28:58 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 22 09:28:58 linux-7vph kernel: ata2: irq_stat 0x40000001
May 22 09:28:58 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct
0x1ff SErr 0x50000 action 0xe frozen
May 22 09:28:58 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY
changed
May 22 09:28:58 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
May 22 09:28:58 linux-7vph kernel: ata1.00: cmd
60/40:00:91:ca:b5/00:00:09:00:00/40 tag 0 ncq 32768 in
May 22 09:28:58 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:08:91:d4:b5/00:00:09:00:00/40 tag 1 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:10:51:d2:b5/00:00:09:00:00/40 tag 2 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:18:11:cd:b5/00:00:09:00:00/40 tag 3 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:20:11:0a:b7/00:00:09:00:00/40 tag 4 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:28:d1:2b:b6/00:00:09:00:00/40 tag 5 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/40:30:91:c3:b5/00:00:09:00:00/40 tag 6 ncq 32768 in
May 22 09:28:59 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:28:59 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:28:59 linux-7vph kernel: ata1.00: cmd
60/08:38:5f:88:2e/00:00:0b:00:00/40 tag 7 ncq 4096 in
May 22 09:29:01 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:29:01 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:29:02 linux-7vph kernel: ata1.00: cmd
60/e8:40:5f:8e:2b/00:00:0b:00:00/40 tag 8 ncq 118784 in
May 22 09:29:02 linux-7vph kernel: res
50/00:e8:5f:8e:2b/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 22 09:29:02 linux-7vph kernel: ata1.00: status: { DRDY }
May 22 09:29:03 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 22 09:29:03 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187403871
May 22 09:29:03 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: unable to read inode block - inode=641187,
block=2556122
May 22 09:29:03 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
<snip>

and an older one, could have been 2.6.30-rc5, I'm not sure:
May 14 07:47:17 linux-7vph kernel: Syncing filesystems ... done.
May 14 07:47:18 linux-7vph kernel: Freezing user space processes ...
(elapsed 0.00 seconds) done.
May 14 07:47:18 linux-7vph kernel: Freezing remaining freezable tasks
... (elapsed 0.00 seconds) done.
May 14 07:47:18 linux-7vph kernel: Suspending console(s) (use
no_console_suspend to debug)
May 14 07:47:18 linux-7vph kernel: Disabling non-boot CPUs ...
May 14 07:47:18 linux-7vph kernel: CPU1 is down
May 14 07:47:18 linux-7vph kernel: Enabling non-boot CPUs ...
May 14 07:47:18 linux-7vph kernel: microcode: failed to init CPU1
May 14 07:47:18 linux-7vph kernel: CPU1 is up
May 14 07:47:19 linux-7vph kernel: e1000e 0000:00:19.0:
pci_enable_pcie_error_reporting failed 0xfffffffb
May 14 07:47:19 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 14 07:47:19 linux-7vph kernel: ata1: irq_stat 0x00400040, connection
status changed
May 14 07:47:19 linux-7vph kernel: mac80211-phy0: failed to set key (0,
00:1d:92:1d:1e:8e) to hardware (-22)
May 14 07:47:19 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 14 07:47:19 linux-7vph kernel: ata2: irq_stat 0x40000001
May 14 07:47:19 linux-7vph kernel: Restarting tasks ... done.
May 14 07:47:19 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct
0x3fffff SErr 0x50000 action 0xe frozen
May 14 07:47:19 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY
changed
May 14 07:47:19 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:00:17:70:14/00:00:0d:00:00/40 tag 0 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:08:11:8e:b5/00:00:09:00:00/40 tag 1 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:10:91:a4:b5/00:00:09:00:00/40 tag 2 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:18:5f:08:c8/00:00:0c:00:00/40 tag 3 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:20:6f:88:6f/01:00:0e:00:00/40 tag 4 ncq 135168 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:28:9f:98:6d/00:00:0e:00:00/40 tag 5 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:30:11:91:b5/00:00:09:00:00/40 tag 6 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:38:91:9d:b5/00:00:09:00:00/40 tag 7 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:40:51:7f:b5/00:00:09:00:00/40 tag 8 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/40:48:91:7d:b5/00:00:09:00:00/40 tag 9 ncq 32768 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/a8:50:5f:89:67/00:00:0e:00:00/40 tag 10 ncq 86016 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:58:0f:8a:67/00:00:0e:00:00/40 tag 11 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/b0:60:5f:8e:2b/00:00:0b:00:00/40 tag 12 ncq 90112 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/18:68:17:8f:2b/00:00:0b:00:00/40 tag 13 ncq 12288 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:19 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:19 linux-7vph kernel: ata1.00: cmd
60/08:70:0f:8f:2b/00:00:0b:00:00/40 tag 14 ncq 4096 in
May 14 07:47:19 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/18:78:2f:8f:2b/00:00:0b:00:00/40 tag 15 ncq 12288 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/08:80:07:8a:67/00:00:0e:00:00/40 tag 16 ncq 4096 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/50:88:17:8a:67/00:00:0e:00:00/40 tag 17 ncq 40960 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/58:90:6f:06:44/00:00:0c:00:00/40 tag 18 ncq 45056 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/08:98:d7:06:44/00:00:0c:00:00/40 tag 19 ncq 4096 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/18:a0:17:07:44/00:00:0c:00:00/40 tag 20 ncq 12288 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: ata1.00: cmd
60/08:a8:ef:4a:39/00:00:0b:00:00/40 tag 21 ncq 4096 in
May 14 07:47:20 linux-7vph kernel: res
50/00:08:ef:4a:39/00:00:0b:00:00/40 Emask 0x10 (ATA bus error)
May 14 07:47:20 linux-7vph kernel: ata1.00: status: { DRDY }
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 242190447
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666399
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666575
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187403871
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187404055
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187404079
May 14 07:47:20 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 14 07:47:20 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666583
May 14 07:47:20 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: unable to read inode block - inode=2346519,
block=9404418
May 14 07:47:20 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
Niel Lambrechts
2009-05-23 10:26:46 UTC
Permalink
On 05/23/2009 11:17 AM, Niel Lambrechts wrote:
> On 04/09/2009 08:18 PM, Tejun Heo wrote:
>> Hmm... none of the debugging messages got triggered, so there should
>> be no noticeable timing change. It seems like you're being lucky for
>> the time being. Can you please keep testing and report back?
>
> Hi Tejun,
>
> Just to keep you up informed - this issue is still happening, I've
> been testing the latest 2.6.30-git kernels the last couple of weeks
> and just checked my logs after having an unexpected fsck upon system
> boot - the freeze seems to trigger very infrequently though, since I
> use s2disk at least twice a day and the error only seems to have
> happened twice since May 2.
>
> I had a look at your 2.6.29 debug patch, it does "Hunk #6 succeeded at
> 3368" so it should still be good (I hope), I'll try and run with it
> for a while.

Hi Tejun

Bug triggered with your patch! I played audio while suspending to try
and increase activity (I also removed a CD on boot), and the filesystem
came up dirty! This was on attempt nr. 3 or 4.

Here is the /var/log/messages output - i hope this is sufficient to
trace down the cause:
May 23 12:15:08 linux-7vph kernel: PM: Shrinking memory... done (98443
pages freed)
May 23 12:15:07 linux-7vph pulseaudio[3781]: alsa-util.c: period_time
: 185759
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: tstamp_mode
: ENABLE
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: period_step : 1
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: avail_min
: 15503
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: period_event : 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c:
start_threshold : -1
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c:
stop_threshold : 1073741824
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c:
silence_threshold: 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: silence_size : 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: boundary
: 1073741824
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: Slave:
Hardware PCM card 0 'HDA Intel' device 0 subdevice 0
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: Its setup is:
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: stream
: PLAYBACK
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: access
: MMAP_INTERLEAVED
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: format
: S16_LE
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: subformat
: STD
May 23 12:15:08 linux-7vph kernel: PM: Freed 393772 kbytes in 3.23
seconds (121.91 MB/s)
May 23 12:15:09 linux-7vph kernel: Suspending console(s) (use
no_console_suspend to debug)
May 23 12:15:09 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI
cache
May 23 12:15:09 linux-7vph kernel: ACPI handle has no context!
May 23 12:15:09 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x0
May 23 12:15:09 linux-7vph kernel: ata2: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x0
May 23 12:15:09 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
May 23 12:15:09 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
May 23 12:15:09 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
May 23 12:15:09 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
May 23 12:15:09 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
May 23 12:15:09 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B
disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
May 23 12:15:08 linux-7vph pulseaudio[3781]: alsa-util.c: channels : 2
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: rate
: 44100
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: exact rate
: 44100 (44100/1)
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: msbits
: 16
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: buffer_size
: 16384
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_size
: 8192
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_time
: 185759
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: tstamp_mode
: ENABLE
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_step : 1
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: avail_min
: 15503
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: period_event : 0
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c:
start_threshold : -1
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: wake-up
capability enabled by ACPI
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c:
stop_threshold : 1073741824
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c:
silence_threshold: 0
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: silence_size : 0
May 23 12:15:10 linux-7vph kernel: ACPI: Preparing to enter system sleep
state S4
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-util.c: boundary
: 1073741824
May 23 12:15:10 linux-7vph kernel: PM: Saving platform NVS memory
May 23 12:15:10 linux-7vph kernel: Disabling non-boot CPUs ...
May 23 12:15:10 linux-7vph kernel: CPU 1 is now offline
May 23 12:15:10 linux-7vph kernel: SMP alternatives: switching to UP code
May 23 12:15:10 linux-7vph pulseaudio[3781]: alsa-sink.c: Increasing
wakeup watermark to 40.00 ms
May 23 12:15:10 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: CPU1 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: CPU1 is down
May 23 12:15:10 linux-7vph kernel: Extended CMOS year: 2000
May 23 12:15:10 linux-7vph kernel: PM: Creating hibernation image:
May 23 12:15:10 linux-7vph kernel: PM: Need to copy 124750 pages
May 23 12:15:10 linux-7vph kernel: x86 PAT enabled: cpu 0, old
0x7040600070406, new 0x7010600070106
May 23 12:15:10 linux-7vph kernel: Intel machine check architecture
supported.
May 23 12:15:10 linux-7vph kernel: Intel machine check reporting enabled
on CPU#0.
May 23 12:15:10 linux-7vph kernel: PM: Restoring platform NVS memory
May 23 12:15:10 linux-7vph kernel: Extended CMOS year: 2000
May 23 12:15:10 linux-7vph kernel: Enabling non-boot CPUs ...
May 23 12:15:10 linux-7vph kernel: SMP alternatives: switching to SMP code
May 23 12:15:10 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
May 23 12:15:10 linux-7vph kernel: Initializing CPU#1
May 23 12:15:10 linux-7vph kernel: Calibrating delay using timer
specific routine.. 5054.05 BogoMIPS (lpj=10108117)
May 23 12:15:10 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
May 23 12:15:10 linux-7vph kernel: CPU: L2 cache: 6144K
May 23 12:15:10 linux-7vph kernel: CPU: Physical Processor ID: 0
May 23 12:15:10 linux-7vph kernel: CPU: Processor Core ID: 1
May 23 12:15:10 linux-7vph kernel: Intel machine check architecture
supported.
May 23 12:15:10 linux-7vph kernel: Intel machine check reporting enabled
on CPU#1.
May 23 12:15:10 linux-7vph kernel: x86 PAT enabled: cpu 1, old
0x7040600070406, new 0x7010600070106
May 23 12:15:10 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU
T9400 @ 2.53GHz stepping 06
May 23 12:15:10 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:15:10 linux-7vph kernel: Switched to high resolution mode on CPU 1
May 23 12:15:10 linux-7vph kernel: CPU0 attaching sched-domain:
May 23 12:15:10 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:15:10 linux-7vph kernel: groups: 0 1
May 23 12:15:10 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:15:10 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
May 23 12:15:10 linux-7vph kernel: CPU1 attaching sched-domain:
May 23 12:15:10 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:15:10 linux-7vph kernel: groups: 1 0
May 23 12:15:10 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:15:10 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
May 23 12:15:10 linux-7vph kernel: microcode: failed to init CPU1
May 23 12:15:10 linux-7vph kernel: CPU1 is up
May 23 12:15:10 linux-7vph kernel: ACPI: Waking up from system sleep
state S4
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: restoring config
space at offset 0x1 (was 0x900007, writing 0x900403)
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state
changed by ACPI to D0
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state
changed by ACPI to D0
May 23 12:15:10 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring
config space at offset 0x1 (was 0x100106, writing 0x100102)
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state
changed by ACPI to D0
May 23 12:15:10 linux-7vph kernel: ahci 0000:00:1f.2: restoring config
space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
May 23 12:15:10 linux-7vph kernel: iwlagn 0000:03:00.0: restoring config
space at offset 0x1 (was 0x100106, writing 0x100506)
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.1: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:00:03.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI
20 (level, low) -> IRQ 20
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0:
pci_enable_pcie_error_reporting failed 0xfffffffb
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: wake-up
capability disabled by ACPI
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: wake-up
capability disabled by ACPI
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
May 23 12:15:10 linux-7vph kernel: e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A ->
GSI 20 (level, low) -> IRQ 20
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B ->
GSI 21 (level, low) -> IRQ 21
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C ->
GSI 22 (level, low) -> IRQ 22
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D ->
GSI 23 (level, low) -> IRQ 23
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B ->
GSI 17 (level, low) -> IRQ 17
May 23 12:15:10 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A ->
GSI 16 (level, low) -> IRQ 16
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B ->
GSI 17 (level, low) -> IRQ 17
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C ->
GSI 18 (level, low) -> IRQ 18
May 23 12:15:10 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D ->
GSI 19 (level, low) -> IRQ 19
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting
latency timer to 64
May 23 12:15:10 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:00:1e.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: ahci 0000:00:1f.2: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x16 frozen
May 23 12:15:10 linux-7vph kernel: ata2: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x16 frozen
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16
(level, low) -> IRQ 16
May 23 12:15:10 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1
(PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT
contexts=[4/4]
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.2: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.3: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.4: PME# disabled
May 23 12:15:10 linux-7vph kernel: pci 0000:15:00.5: PME# disabled
May 23 12:15:10 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
May 23 12:15:10 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus
113 SControl 300)
May 23 12:15:10 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus
113 SControl 300)
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:1f:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:02:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 23 12:15:10 linux-7vph kernel: ata1: irq_stat 0x00400040, connection
status changed
May 23 12:15:10 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata1: EH complete
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:1f:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata2.00: ACPI cmd
e3/00:02:00:00:00:a0 succeeded
May 23 12:15:10 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0
SErr 0x0 action 0x9 t4
May 23 12:15:10 linux-7vph kernel: ata2: irq_stat 0x40000001
May 23 12:15:10 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 12:15:10 linux-7vph kernel: ata2: EH complete
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::radio
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::assoc
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::RX
May 23 12:15:10 linux-7vph kernel: Registered led device: iwl-phy0::TX
May 23 12:15:10 linux-7vph kernel: mac80211-phy0: failed to set key (0,
00:1d:92:1d:1e:8e) to hardware (-22)
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: restoring config
space at offset 0x1 (was 0x900407, writing 0x900403)
May 23 12:15:10 linux-7vph kernel: pci 0000:00:02.0: setting latency
timer to 64
May 23 12:15:10 linux-7vph kernel: Restarting tasks ... <3>ata1.00:
exception Emask 0x10 SAct 0x1f SErr 0x50000 action 0xe frozen
May 23 12:15:10 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY
changed
May 23 12:15:10 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:00:5f:89:67/01:00:0e:00:00/40 tag 0 ncq 135168 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:10 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:08:6f:88:6f/01:00:0e:00:00/40 tag 1 ncq 135168 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:10 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:10:5f:08:c8/00:00:0c:00:00/40 tag 2 ncq 4096 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:10 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:10 linux-7vph kernel: ata1.00: cmd
60/08:18:9f:98:6d/00:00:0e:00:00/40 tag 3 ncq 4096 in
May 23 12:15:10 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:11 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:11 linux-7vph kernel: ata1.00: cmd
60/e8:20:5f:8e:2b/00:00:0b:00:00/40 tag 4 ncq 118784 in
May 23 12:15:11 linux-7vph kernel: res
50/00:08:5f:08:c8/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
May 23 12:15:11 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 12:15:11 linux-7vph kernel: ata1: hard resetting link
May 23 12:15:11 linux-7vph kernel: done.
May 23 12:15:11 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus
113 SControl 300)
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/02:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
f5/00:00:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/5f:00:00:00:00:a0 succeeded
May 23 12:15:11 linux-7vph kernel: ata1.00: ACPI cmd
ef/10:03:00:00:00:a0 filtered out
May 23 12:15:11 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:11 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x0 t4
May 23 12:15:11 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=2 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush finish cmd: f6838ec0
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted
Command [current] [descriptor]
May 23 12:15:11 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 23 12:15:11 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a
80 00 00 00 00 00
May 23 12:15:11 linux-7vph kernel: 0c c8 08 5f
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No
additional sense information
May 23 12:15:11 linux-7vph kernel: end_request: I/O error, dev sda,
sector 241666399
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=2 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush finish cmd: f6838680
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted
Command <2>EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to
read inode block - inode=2330983, block=9338933
May 23 12:15:11 linux-7vph kernel: [current] [descriptor]
May 23 12:15:11 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 23 12:15:11 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a
80 00 00 00 00 00
May 23 12:15:11 linux-7vph kernel: 0c c8 08 5f
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No
additional sense information
May 23 12:15:11 linux-7vph kernel: end_request: I/O error, dev sda,
sector 242190447
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: XXX scsi_eh_flush_done_q: online=1(2) noretry=0
retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: unable to read inode block -
inode=2346524, block=9404418
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush retry cmd: f6838800
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=0 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush retry cmd: f403de00
May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2)
noretry=2 retries=0 allowed=5
May 23 12:15:11 linux-7vph kernel: scsi_eh_0: flush finish cmd: f403d200
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted
Command [current] [descriptor]
May 23 12:15:11 linux-7vph kernel: Descriptor sense data with sense
descriptors (in hex):
May 23 12:15:11 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a
80 00 00 00 00 00
May 23 12:15:11 linux-7vph kernel: 0c c8 08 5f
May 23 12:15:11 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No
additional sense information
May 23 12:15:11 linux-7vph kernel: end_request: I/O error, dev sda,
sector 187403871
May 23 12:15:11 linux-7vph kernel: ata1: EH complete
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6):
__ext4_get_inode_loc: unable to read inode block - inode=641187,
block=2556122
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
May 23 12:15:11 linux-7vph kernel: EXT4-fs error (device sda6) in
ext4_reserve_inode_write: IO failure
May 23 12:16:50 linux-7vph kernel: ata1: exception Emask 0x0 SAct 0x0
SErr 0x0 action 0x10
May 23 12:16:50 linux-7vph kernel: CPU0 attaching NULL sched-domain.
May 23 12:16:50 linux-7vph kernel: CPU1 attaching NULL sched-domain.
May 23 12:16:50 linux-7vph kernel: CPU0 attaching sched-domain:
May 23 12:16:50 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:16:50 linux-7vph kernel: groups: 0 1
May 23 12:16:50 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:16:50 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
May 23 12:16:50 linux-7vph kernel: CPU1 attaching sched-domain:
May 23 12:16:50 linux-7vph kernel: domain 0: span 0-1 level MC
May 23 12:16:50 linux-7vph kernel: groups: 1 0
May 23 12:16:50 linux-7vph kernel: domain 1: span 0-1 level CPU
May 23 12:16:50 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)


Regards,
Niel
Tejun Heo
2009-05-25 00:32:33 UTC
Permalink
Hello,

Niel Lambrechts wrote:
> Bug triggered with your patch! I played audio while suspending to try
> and increase activity (I also removed a CD on boot), and the filesystem
> came up dirty! This was on attempt nr. 3 or 4.

Great.

Here's the problem.

May 23 12:15:11 linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2) noretry=2 retries=0 allowed=5

scsi_noretry_cmd() is returning non-zero indicating that the request
shouldn't be retried and failed immediagely. Looks like the return
value 2 is from blk_failfast_dev() which tests REQ_FAILFAST_DEV. It's
most likely to be set in init_request_from_bio() while translating bio
flags.

cc'ing Theodore Tso. Hello, Neil is reporting ext4 checking out after
resuming.

http://thread.gmane.org/gmane.linux.kernel/814466/focus=817937

The origin of the problem is ATA device triggering a PHY event after
resume sequence is complete. I still don't know why this happens but
it does on certain machines. This in itself shouldn't be a big
problem as the device works fine after one more pass of ATA EH and the
in-flight requests would be retried. However, for some reason, the
aborted commands seem to have REQ_FAILFAST_DEV set thus failing
immediately which, in turn, triggers ext4 errors. Does anything ring
a bell?

Thanks.

--
tejun
Niel Lambrechts
2009-05-23 09:36:30 UTC
Permalink
On 05/23/2009 11:20 AM, Niel Lambrechts wrote:
> On 04/09/2009 08:18 PM, Tejun Heo wrote:
>> Hmm... none of the debugging messages got triggered, so there should
>> be no noticeable timing change. It seems like you're being lucky for
>> the time being. Can you please keep testing and report back?
>
> Hi Tejun,
>
> Just to keep you up informed - this issue is still happening, I've been
> testing the latest 2.6.30-git kernels the last couple of weeks and just
> checked my logs after having an unexpected fsck upon system boot - the
> freeze seems to trigger very infrequently though, since I use s2disk at
> least twice a day and the error only seems to have happened twice since
> May 2.

I notice that if I hibernate with a CD in the drive and then remove it as soon
as I boot, I get a hard reset on the ata1 bus.

I'm not sure if this maybe plays a role in triggering the problem at times,
perhaps it is entirely expected - but vim highlights it in red and to a "normal"
user that is bad. :)

Maybe after applying your patch I should play around with this?


/var/log/warn:
May 23 11:28:44 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
May 23 11:28:44 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
May 23 11:28:44 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0x9 t4
May 23 11:28:44 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status
changed
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1: EH complete
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: pci 0000:00:02.0: restoring config space at
offset 0x1 (was 0x900407, writing 0x900403)
May 23 11:28:44 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
May 23 11:28:44 linux-7vph kernel: Restarting tasks ... done.
May 23 11:28:44 linux-7vph kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0
action 0x9 t4
May 23 11:28:44 linux-7vph kernel: ata2: irq_stat 0x40000001
May 23 11:28:44 linux-7vph kernel: ata2.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata2: EH complete
May 23 11:28:44 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct 0x1 SErr
0x10000 action 0xe frozen
May 23 11:28:44 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY changed
May 23 11:28:44 linux-7vph kernel: ata1: SError: { PHYRdyChg }
May 23 11:28:44 linux-7vph kernel: ata1.00: cmd
61/08:00:37:5a:f4/00:00:09:00:00/40 tag 0 ncq 4096 out
May 23 11:28:44 linux-7vph kernel: res
50/00:08:37:5a:f4/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
May 23 11:28:44 linux-7vph kernel: ata1.00: status: { DRDY }
May 23 11:28:44 linux-7vph kernel: ata1: hard resetting link
May 23 11:28:44 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
May 23 11:28:44 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0
filtered out
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1.00: configured for UDMA/133
May 23 11:28:44 linux-7vph kernel: ata1: EH complete

Regards,
Niel
Tejun Heo
2009-05-25 01:10:30 UTC
Permalink
Niel Lambrechts wrote:
> I notice that if I hibernate with a CD in the drive and then remove it
> as soon as I boot, I get a hard reset on the ata1 bus.
>
> I'm not sure if this maybe plays a role in triggering the problem at
> times, perhaps it is entirely expected - but vim highlights it in red
> and to a "normal" user that is bad. :)
>
> Maybe after applying your patch I should play around with this?

That isn't normal. You're getting PHY event on ata1 if you do
something to the cdrom which is attached to ata2. Something very
fishy is going on there. Sounds like an electric or some sort of
interference problem but I'm not sure. :-(

Thanks.

--
tejun
Alan Cox
2009-05-25 08:15:34 UTC
Permalink
> something to the cdrom which is attached to ata2. Something very
> fishy is going on there. Sounds like an electric or some sort of
> interference problem but I'm not sure. :-(

I'm just wondering what happens on a resume if a phy change event is seen
by the phy before or during the time registers (such as the port mappings
on an ICH) are being restored (or not restored even). That seems to be one
way you could get an event on the wrong port.
Niel Lambrechts
2009-05-25 22:06:03 UTC
Permalink
On 05/25/2009 10:15 AM, Alan Cox wrote:
>> something to the cdrom which is attached to ata2. Something very
>> fishy is going on there. Sounds like an electric or some sort of
>> interference problem but I'm not sure. :-(
>>
> I'm just wondering what happens on a resume if a phy change event is seen
> by the phy before or during the time registers (such as the port mappings
> on an ICH) are being restored (or not restored even). That seems to be one
> way you could get an event on the wrong port

I've tested all of the kernels I have again since 2.6.29.4 also came out
just recently. I did a hibernate/resume for each in the console, then
repeated the same in X, then continued to the next kernel.

The 2.6.29.4 log is much larger, since some other badness happened there
- there is a large kernel trace in there as my first X hibernation
attempt failed and came back to X after a few seconds. The system seemed
functional, it did not keep generating kernel messages - when I then
retried a hibernate it worked, along with the resume. Another unrelated
bug perhaps?

As for "hard resetting link" messages, they seemed to always happen
under X the times I tried it.

Kernel EXT4-errors? Console:ata1 reset?
Console:ata2-reset? X:ata1 reset? X:ata2 reset?
2.6.28.10 No no yes
yes no
2.6.29.4* No no no
no no
2.6.29.4** No - -
yes no
2.6.30-rc6 Yes - -
yes no
2.6.30-rc6 No no no
yes no

* Xorg hibernation attempt failed.
* Xorg Second hibernation attempt (no extra reboot)

I also did a side by side comparison of the messages I have for
2.6.30-rc6, the one with EXT4 errors I reported on yesterday, and
another one that worked just fine tonight. I stripped all time-stamps
and some pulseaudio messages from the bad one and attached them here,
and also saved the full messages for each kernel to
http://bugzilla.kernel.org/show_bug.cgi?id=13017 .

Since analysing the code-path is still a bit beyond me, I'll leave you
with a little summary of the differences I notice.

A = 2.6.30-rc6 (EXT4 clean)
B = 2.6.30-rc6 (EXT4 errors triggered)

# B first does an ata2 ACPI cmd, A starts with ata1. Only a slight
sequence difference, output is the same.
B: linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded

# The main difference appears:
A:linux-7vph kernel: Restarting tasks ... done.
B:linux-7vph kernel: Restarting tasks ... <3>ata1.00: exception Emask
0x10 SAct 0x1f SErr 0x50000 action 0xe frozen

# A first shows "done", then only followed by a frozen message, but with
a different SAct value:
A:linux-7vph kernel: ata1.00: exception Emask 0x10 SAct 0x1ff SErr
0x50000 action 0xe frozen

# Both then have:
linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY changed

# from there, A seems to have a little extra sequence:
linux-7vph kernel: ata1.00: cmd 60/08:00:ef:fc:48/00:00:0b:00:00/40 tag
0 ncq 4096 in
linux-7vph kernel: res 50/00:38:97:7f:e6/00:00:0b:00:00/40
Emask 0x10 (ATA bus error)
linux-7vph kernel: ata1.00: status: { DRDY }
linux-7vph kernel: ata1.00: cmd 60/08:08:97:d5:69/00:00:0e:00:00/40 tag
1 ncq 4096 in
linux-7vph kernel: res 50/00:38:97:7f:e6/00:00:0b:00:00/40
Emask 0x10 (ATA bus error)
linux-7vph kernel: ata1.00: status: { DRDY }
linux-7vph kernel: ata1.00: cmd 60/30:10:8f:5f:56/00:00:0f:00:00/40 tag
2 ncq 24576 in
linux-7vph kernel: res 50/00:38:97:7f:e6/00:00:0b:00:00/40
Emask 0x10 (ATA bus error)
linux-7vph kernel: ata1.00: status: { DRDY }
linux-7vph kernel: ata1.00: cmd 60/10:18:c7:5f:56/00:00:0f:00:00/40 tag
3 ncq 8192 in
linux-7vph kernel: res 50/00:38:97:7f:e6/00:00:0b:00:00/40
Emask 0x10 (ATA bus error)
linux-7vph kernel: ata1.00: status: { DRDY }
linux-7vph kernel: ata1.00: cmd 60/10:20:ff:5f:56/00:00:0f:00:00/40 tag
4 ncq 8192 in
linux-7vph kernel: res 50/00:38:97:7f:e6/00:00:0b:00:00/40
Emask 0x10 (ATA bus error)

# B only then shows:
B:linux-7vph kernel: done.

# soon after, B spirals into errors:
linux-7vph kernel: XXX scsi_eh_flush_done_q: online=1(2) noretry=2
retries=0 allowed=5
linux-7vph kernel: scsi_eh_0: flush finish cmd: f6838ec0
linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command
[current] [descriptor]

Hope this helps.

Niel
Tejun Heo
2009-05-26 04:58:19 UTC
Permalink
Hello, Niel.

Niel Lambrechts wrote:
> I've tested all of the kernels I have again since 2.6.29.4 also came out
> just recently. I did a hibernate/resume for each in the console, then
> repeated the same in X, then continued to the next kernel.
>
> The 2.6.29.4 log is much larger, since some other badness happened there
> - there is a large kernel trace in there as my first X hibernation
> attempt failed and came back to X after a few seconds. The system seemed
> functional, it did not keep generating kernel messages - when I then
> retried a hibernate it worked, along with the resume. Another unrelated
> bug perhaps?
>
> As for "hard resetting link" messages, they seemed to always happen
> under X the times I tried it.
>
> Kernel EXT4-errors? Console:ata1 reset? Console:ata2-reset? X:ata1 reset? X:ata2 reset?
> 2.6.28.10 No no yes yes no
> 2.6.29.4* No no no no no
> 2.6.29.4** No - - yes no
> 2.6.30-rc6 Yes - - yes no
> 2.6.30-rc6 No no no yes no
>
> * Xorg hibernation attempt failed.
> * Xorg Second hibernation attempt (no extra reboot)
>
> I also did a side by side comparison of the messages I have for
> 2.6.30-rc6, the one with EXT4 errors I reported on yesterday, and
> another one that worked just fine tonight. I stripped all time-stamps
> and some pulseaudio messages from the bad one and attached them here,
> and also saved the full messages for each kernel to
> http://bugzilla.kernel.org/show_bug.cgi?id=13017 .
>
> Since analysing the code-path is still a bit beyond me, I'll leave you
> with a little summary of the differences I notice.
>
> A = 2.6.30-rc6 (EXT4 clean)
> B = 2.6.30-rc6 (EXT4 errors triggered)

Duplicate PHY events are likely to be dependent on timing and
non-deterministic. The ext4 corrupting or not depends on whether a
request with failfast set was in-flight at the time of the second PHY
event, which again is dependent on timing. At any rate, this looks
like a problem of ext4 (or something between ext4 and the driver). It
either shouldn't issue failfast command or should take appropriate
recovery action if it does. It would be really nice if you can give a
shot at ext3.

Thanks.

--
tejun
Niel Lambrechts
2009-05-26 05:43:47 UTC
Permalink
On 05/26/2009 06:58 AM, Tejun Heo wrote:
> Hello, Niel.
>
> Niel Lambrechts wrote:
>
>> I've tested all of the kernels I have again since 2.6.29.4 also came out
>> just recently. I did a hibernate/resume for each in the console, then
>> repeated the same in X, then continued to the next kernel.
>>
>> The 2.6.29.4 log is much larger, since some other badness happened there
>> - there is a large kernel trace in there as my first X hibernation
>> attempt failed and came back to X after a few seconds. The system seemed
>> functional, it did not keep generating kernel messages - when I then
>> retried a hibernate it worked, along with the resume. Another unrelated
>> bug perhaps?
>>
>> As for "hard resetting link" messages, they seemed to always happen
>> under X the times I tried it.
>>
>> Kernel EXT4-errors? Console:ata1 reset? Console:ata2-reset? X:ata1 reset? X:ata2 reset?
>> 2.6.28.10 No no yes yes no
>> 2.6.29.4* No no no no no
>> 2.6.29.4** No - - yes no
>> 2.6.30-rc6 Yes - - yes no
>> 2.6.30-rc6 No no no yes no
>>
>> * Xorg hibernation attempt failed.
>> * Xorg Second hibernation attempt (no extra reboot)
>>
>> I also did a side by side comparison of the messages I have for
>> 2.6.30-rc6, the one with EXT4 errors I reported on yesterday, and
>> another one that worked just fine tonight. I stripped all time-stamps
>> and some pulseaudio messages from the bad one and attached them here,
>> and also saved the full messages for each kernel to
>> http://bugzilla.kernel.org/show_bug.cgi?id=13017 .
>>
>> Since analysing the code-path is still a bit beyond me, I'll leave you
>> with a little summary of the differences I notice.
>>
>> A = 2.6.30-rc6 (EXT4 clean)
>> B = 2.6.30-rc6 (EXT4 errors triggered)
>>
> Duplicate PHY events are likely to be dependent on timing and
> non-deterministic. The ext4 corrupting or not depends on whether a
> request with failfast set was in-flight at the time of the second PHY
> event, which again is dependent on timing. At any rate, this looks
> like a problem of ext4 (or something between ext4 and the driver). It
> either shouldn't issue failfast command or should take appropriate
> recovery action if it does. It would be really nice if you can give a
> shot at ext3.

Urgh. My root file-system is mounted with extents on, I would have to
re-install entirely.

I'm wondering why no one else is complaining, or whether the problem is
limited to ICH9M/M-E controllers with EXT4 or a certain type of
hard-drive. The laptop is a Lenovo W500 (fairly similar to T500), so
maybe not a lot of people with this type of controller is using EXT4 yet.

Anyhow, I think Theodore may have ruled this out as a EXT4 problem
already (I first copied him) so I'm not sure what to do now, it will
take some strong will (and even more time) for me to re-install EXT3. I
just shouldn't have to, dammit. :-p

Regards,
Niel
Tejun Heo
2009-05-26 05:50:32 UTC
Permalink
Niel Lambrechts wrote:
> Urgh. My root file-system is mounted with extents on, I would have to
> re-install entirely.
>
> I'm wondering why no one else is complaining, or whether the problem is
> limited to ICH9M/M-E controllers with EXT4 or a certain type of
> hard-drive. The laptop is a Lenovo W500 (fairly similar to T500), so
> maybe not a lot of people with this type of controller is using EXT4 yet.

The double PHY events are not too rare but it's not too frequent
either. The number of people using ext4 on machines which have such
issues and are regularly suspending and resuming might not be too
high.

> Anyhow, I think Theodore may have ruled this out as a EXT4 problem
> already (I first copied him) so I'm not sure what to do now, it will
> take some strong will (and even more time) for me to re-install EXT3. I
> just shouldn't have to, dammit. :-p

Well, that is a pretty easy way to rule out many possibilities from my
point of view. :-p If it's difficult, we can try to find out where the
flag is being set with debug patches.

Thanks.

--
tejun
Niel Lambrechts
2009-05-26 06:13:07 UTC
Permalink
On 05/26/2009 07:50 AM, Tejun Heo wrote:
> Niel Lambrechts wrote:
>
>> Urgh. My root file-system is mounted with extents on, I would have to
>> re-install entirely.
>>
>> I'm wondering why no one else is complaining, or whether the problem is
>> limited to ICH9M/M-E controllers with EXT4 or a certain type of
>> hard-drive. The laptop is a Lenovo W500 (fairly similar to T500), so
>> maybe not a lot of people with this type of controller is using EXT4 yet.
>>
> The double PHY events are not too rare but it's not too frequent
> either. The number of people using ext4 on machines which have such
> issues and are regularly suspending and resuming might not be too
> high.
>
>
>> Anyhow, I think Theodore may have ruled this out as a EXT4 problem
>> already (I first copied him) so I'm not sure what to do now, it will
>> take some strong will (and even more time) for me to re-install EXT3. I
>> just shouldn't have to, dammit. :-p
>>
> Well, that is a pretty easy way to rule out many possibilities from my
> point of view. :-p If it's difficult, we can try to find out where the
> flag is being set with debug patches.
>

If you send some patches I'll make every effort to test, it beats having
to re-install, my installation is just too customized. :)

Regards,
Niel
Tejun Heo
2009-05-26 13:33:42 UTC
Permalink
Niel Lambrechts wrote:
> If you send some patches I'll make every effort to test, it beats having
> to re-install, my installation is just too customized. :)

First, let's make sure we aren't balking up the wrong tree. Can you
please apply the attached patch and report the kernel log?

Thanks.

--
tejun
Niel Lambrechts
2009-05-26 18:14:04 UTC
Permalink
On 05/26/2009 03:33 PM, Tejun Heo wrote:
> Niel Lambrechts wrote:
>
>> If you send some patches I'll make every effort to test, it beats having
>> to re-install, my installation is just too customized. :)
>>
> First, let's make sure we aren't balking up the wrong tree. Can you
> please apply the attached patch and report the kernel log?

Hi Tejun,

Okay it took 5 attempts, some of during which I played audio, did 'find
/' etc. but I still do not have a clue whether the extra activity
helped trigger it or not.

linux-7vph:/# dumpe2fs -h /dev/sda6
dumpe2fs 1.41.4 (27-Jan-2009)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 545370aa-9fb4-4efe-be74-0a64c98c9dad
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent flex_bg sparse_super large_file huge_file
uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3016944
Block count: 12060082
Reserved block count: 361802
Free blocks: 4666750
Free inodes: 2471577
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1021
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8176
Inode blocks per group: 511
Filesystem created: Tue Sep 30 22:03:02 2008
Last mount time: Tue May 26 20:37:30 2009
Last write time: Tue May 26 19:04:08 2009
Mount count: 5
Maximum mount count: -1
Last checked: Sat May 23 14:33:27 2009
Check interval: 0 (<none>)
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Journal inode: 8
First orphan inode: 2330902
Default directory hash: tea
Directory Hash Seed: a53491df-10b7-4f13-ad3e-279c67192359
Journal backup: inode blocks
Journal size: 128M

Regards,
Niel
Tejun Heo
2009-05-27 00:07:32 UTC
Permalink
Niel Lambrechts wrote:
> On 05/26/2009 03:33 PM, Tejun Heo wrote:
>> Niel Lambrechts wrote:
>>
>>> If you send some patches I'll make every effort to test, it beats having
>>> to re-install, my installation is just too customized. :)
>>>
>> First, let's make sure we aren't balking up the wrong tree. Can you
>> please apply the attached patch and report the kernel log?
>
> Hi Tejun,
>
> Okay it took 5 attempts, some of during which I played audio, did 'find
> /' etc. but I still do not have a clue whether the extra activity
> helped trigger it or not.

Thanks for testing.

XXX scmd->result=0x8000002 ff_t=4 ff_dev=2 ff_drv=8
XXX DID_OK
XXX CHECK_CONDITION, returning ff_dev
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
09 b8 71 d1
sd 0:0:0:0: [sda] Add. Sense: No additional sense information
end_request: I/O error, dev sda, sector 242190447

The above is the offending failure and all three failfast bits are
set. This corresponds to the following ATA exception.

ata1.00: cmd 60/08:18:6f:88:6f/01:00:0e:00:00/40 tag 3 ncq 135168 in
res 50/00:40:d1:71:b8/00:00:09:00:00/40 Emask 0x10 (ATA bus error)

It's 33 page long read command. Looking at the code the only way all
three fastfail bits can be set seems to be if the request is readahead
- the first part of block/blk-core.c::init_request_from_bio(). Now,
the failure of a readahead request isn't supposed to cause any
problem. If it fails, well, it fails and things should go on as if
nothing happened.

Can you please try the attached patch? It takes suspend/resume cycle
out of the equation and simply induces artificial failure to readahead
requests. It's currently set to fail every 40th readahead. Feel free
to adjust the frequency as you see fit. catting files into /dev/null
would trigger readahead to kick in. Can you reproduce filesystem
failure with this alone?

Thanks.

--
tejun
Niel Lambrechts
2009-05-27 14:01:42 UTC
Permalink
On 05/27/2009 02:07 AM, Tejun Heo wrote:
> The above is the offending failure and all three failfast bits are
> set. This corresponds to the following ATA exception.
>
> ata1.00: cmd 60/08:18:6f:88:6f/01:00:0e:00:00/40 tag 3 ncq 135168 in
> res 50/00:40:d1:71:b8/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
>
> It's 33 page long read command. Looking at the code the only way all
> three fastfail bits can be set seems to be if the request is readahead
> - the first part of block/blk-core.c::init_request_from_bio(). Now,
> the failure of a readahead request isn't supposed to cause any
> problem. If it fails, well, it fails and things should go on as if
> nothing happened.
>
> Can you please try the attached patch? It takes suspend/resume cycle
> out of the equation and simply induces artificial failure to readahead
> requests. It's currently set to fail every 40th readahead. Feel free
> to adjust the frequency as you see fit. catting files into /dev/null
> would trigger readahead to kick in. Can you reproduce filesystem
> failure with this alone?
>

No corruption, I tried cat on an entire directory of mp3s, then also
started X, but there the messages debug output got fairly drastic so I
didn't care to continue for an entire day.

It did trigger the "XXX ...failing readahead" message nearly 300 times,
and I also did a s2disk and resume cycle in there - so I hope this is
enough for us to conclude that it is not the cause.

Regards,
Niel
Tejun Heo
2009-06-03 03:14:00 UTC
Permalink
Hello, Niel.

Niel Lambrechts wrote:
> Did you perhaps have any time to look into my feedback around the
> readahead patch?

Yeah, I've been thinking about it and am a bit out of ideas, so no
immediate follow-up.

>>From my side, I tried on Saturday to bisect this problem again, doing
> 5-8 hibernates per each bisect from 2.6.28. I stopped at 2.6.30-rc2 due
> time (or fatigue), and did not manage to replicate the problem at all
> which is strange since I was playing audio, doing finds and even doing
> an entire dd of the root partition. I saved the bisect logs so perhaps I
> can continue to see if the problem becomes more prevalent in later
> versions - the first time it ever happened to me was somewhere in 2.6.29
> originally.
>
> The other interesting thing was to see that "hard resetting link"
> messages seem to first start appearing at v2.6.29-rc7 or perhaps rc8. Is
> it worth trying to track down the commit that lead to this?
>
> Do you have any other debug patches to try, or should I try to delve
> deeper into finding commits that can be reverted? I'm running out of
> ideas, I even tried to find later firmware for my drive, but I seem to
> be on the latest level.

Given the non-deterministic nature of the failure, I think bisection
would be quite difficult. I think the best way to diagnose the
problem is to track down the owner of the failed request. It has
FAILFAST set and yet the issuer fails to deal with an error condition
which is expected when FAILFAST is set. I'll prep a patch to track
request / bio failure.

Thanks.

--
tejun
Tejun Heo
2009-06-03 04:28:01 UTC
Permalink
Hello,

Can you please do the followings?

1. Apply the attached patch, build & boot

2. Trigger the problem and record dmesg

3. On failed IO, the kernel will print the address of bi_endio. Run
"nm -n" on the vmlinux in the kernel build root and look up which
function it is and post the dmesg and function name.

Thanks.

--
tejun
Niel Lambrechts
2009-06-06 07:05:05 UTC
Permalink
Morning Tejun,

Tejun Heo wrote:
> Hello,
>
> Can you please do the followings?
>
> 1. Apply the attached patch, build & boot
>
I chose 2.6.30-rc7...
> 2. Trigger the problem and record dmesg
>
It took 3 days and quite a few hibernate attempts ... :-)

> 3. On failed IO, the kernel will print the address of bi_endio. Run
> "nm -n" on the vmlinux in the kernel build root and look up which
> function it is and post the dmesg and function name.
I did not have that specific vmlinux.o file any more, but
/boot/System.map-2.6.30-rc7-pae shows:
c01a49fd t end_bio_bh_io_sync

Hope this is sufficient to help you. Sorry if this is silly - being so
inexperienced with the kernel - but I wondered if or why a dump_stack()
in that debug patch would not be helpful?

Regards,
Niel
Pavel Machek
2009-06-19 15:04:42 UTC
Permalink
On Sat 2009-06-06 09:05:05, Niel Lambrechts wrote:
> Morning Tejun,
>
> Tejun Heo wrote:
> > Hello,
> >
> > Can you please do the followings?
> >
> > 1. Apply the attached patch, build & boot
> >
> I chose 2.6.30-rc7...
> > 2. Trigger the problem and record dmesg
> >
> It took 3 days and quite a few hibernate attempts ... :-)
>
> > 3. On failed IO, the kernel will print the address of bi_endio. Run
> > "nm -n" on the vmlinux in the kernel build root and look up which
> > function it is and post the dmesg and function name.
> I did not have that specific vmlinux.o file any more, but
> /boot/System.map-2.6.30-rc7-pae shows:
> c01a49fd t end_bio_bh_io_sync
>
> Hope this is sufficient to help you. Sorry if this is silly - being so
> inexperienced with the kernel - but I wondered if or why a dump_stack()
> in that debug patch would not be helpful?

I guess dump_stack would be very helpful :-).

> Regards,
> Niel
>
>

> Jun 5 09:26:12 linux-7vph -- MARK --
> Jun 5 09:26:12 linux-7vph syslog-ng[24615]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=3777', processed='center(received)=3142', processed='destination(newsnotice)=0', processed='destination(acpid)=13', processed='destination(firewall)=18', processed='destination(null)=208', processed='destination(mail)=0', processed='destination(mailinfo)=0', processed='destination(console)=200', processed='destination(newserr)=0', processed='destination(newscrit)=0', processed='destination(messages)=1909', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(netmgm)=994', processed='destination(mailerr)=0', processed='destination(xconsole)=200', processed='destination(warn)=235', processed='s
ource(src)=3142'
> Jun 5 09:26:11 linux-7vph kernel: Syncing filesystems ... done.
> Jun 5 09:26:12 linux-7vph kernel: Freezing user space processes ... (elapsed 0.00 seconds) done.
> Jun 5 09:26:12 linux-7vph kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
> Jun 5 09:26:12 linux-7vph kernel: PM: Shrinking memory... done (149507 pages freed)
> Jun 5 09:26:12 linux-7vph kernel: PM: Freed 598028 kbytes in 6.49 seconds (92.14 MB/s)
> Jun 5 09:26:12 linux-7vph kernel: Suspending console(s) (use no_console_suspend to debug)
> Jun 5 09:26:12 linux-7vph kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
> Jun 5 09:26:12 linux-7vph kernel: ACPI handle has no context!
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D disabled
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
> Jun 5 09:26:12 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B disabled
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D disabled
> Jun 5 09:26:12 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
> Jun 5 09:26:12 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A disabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PME# enabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability enabled by ACPI
> Jun 5 09:26:13 linux-7vph kernel: ACPI: Preparing to enter system sleep state S4
> Jun 5 09:26:13 linux-7vph kernel: PM: Saving platform NVS memory
> Jun 5 09:26:13 linux-7vph kernel: Disabling non-boot CPUs ...
> Jun 5 09:26:13 linux-7vph kernel: CPU 1 is now offline
> Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to UP code
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: CPU1 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: CPU1 is down
> Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
> Jun 5 09:26:13 linux-7vph kernel: PM: Creating hibernation image:
> Jun 5 09:26:13 linux-7vph kernel: PM: Need to copy 223773 pages
> Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#0.
> Jun 5 09:26:13 linux-7vph kernel: PM: Restoring platform NVS memory
> Jun 5 09:26:13 linux-7vph kernel: Extended CMOS year: 2000
> Jun 5 09:26:13 linux-7vph kernel: Enabling non-boot CPUs ...
> Jun 5 09:26:13 linux-7vph kernel: SMP alternatives: switching to SMP code
> Jun 5 09:26:13 linux-7vph kernel: Booting processor 1 APIC 0x1 ip 0x6000
> Jun 5 09:26:13 linux-7vph kernel: Initializing CPU#1
> Jun 5 09:26:13 linux-7vph kernel: Calibrating delay using timer specific routine.. 5134.67 BogoMIPS (lpj=10269343)
> Jun 5 09:26:13 linux-7vph kernel: CPU: L1 I cache: 32K, L1 D cache: 32K
> Jun 5 09:26:13 linux-7vph kernel: CPU: L2 cache: 6144K
> Jun 5 09:26:13 linux-7vph kernel: CPU: Physical Processor ID: 0
> Jun 5 09:26:13 linux-7vph kernel: CPU: Processor Core ID: 1
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check architecture supported.
> Jun 5 09:26:13 linux-7vph kernel: Intel machine check reporting enabled on CPU#1.
> Jun 5 09:26:13 linux-7vph kernel: x86 PAT enabled: cpu 1, old 0x7040600070406, new 0x7010600070106
> Jun 5 09:26:13 linux-7vph kernel: CPU1: Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53GHz stepping 06
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching NULL sched-domain.
> Jun 5 09:26:13 linux-7vph kernel: Switched to high resolution mode on CPU 1
> Jun 5 09:26:13 linux-7vph kernel: CPU0 attaching sched-domain:
> Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
> Jun 5 09:26:13 linux-7vph kernel: groups: 0 1
> Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
> Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
> Jun 5 09:26:13 linux-7vph kernel: CPU1 attaching sched-domain:
> Jun 5 09:26:13 linux-7vph kernel: domain 0: span 0-1 level MC
> Jun 5 09:26:13 linux-7vph kernel: groups: 1 0
> Jun 5 09:26:13 linux-7vph kernel: domain 1: span 0-1 level CPU
> Jun 5 09:26:13 linux-7vph kernel: groups: 0-1 (__cpu_power = 2048)
> Jun 5 09:26:13 linux-7vph kernel: microcode: failed to init CPU1
> Jun 5 09:26:13 linux-7vph kernel: CPU1 is up
> Jun 5 09:26:13 linux-7vph kernel: ACPI: Waking up from system sleep state S4
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900007, writing 0x900403)
> Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.0: power state changed by ACPI to D0
> Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1a.2: power state changed by ACPI to D0
> Jun 5 09:26:13 linux-7vph kernel: HDA Intel 0000:00:1b.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100102)
> Jun 5 09:26:13 linux-7vph kernel: uhci_hcd 0000:00:1d.0: power state changed by ACPI to D0
> Jun 5 09:26:13 linux-7vph kernel: ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00403, writing 0x2b00407)
> Jun 5 09:26:13 linux-7vph kernel: iwlagn 0000:03:00.0: restoring config space at offset 0x1 (was 0x100106, writing 0x100506)
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: PME# disabled
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:02.1: PME# disabled
> Jun 5 09:26:13 linux-7vph kernel: pci 0000:00:03.0: PME# disabled
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: pci_enable_pcie_error_reporting failed 0xfffffffb
> Jun 5 09:26:13 linux-7vph kernel: e1000e 0000:00:19.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: wake-up capability disabled by ACPI
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.1: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1a.2: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PCI INT D -> GSI 23 (level, low) -> IRQ 23
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1a.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> Jun 5 09:26:14 linux-7vph kernel: HDA Intel 0000:00:1b.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.1: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
> Jun 5 09:26:14 linux-7vph kernel: uhci_hcd 0000:00:1d.2: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PCI INT D -> GSI 19 (level, low) -> IRQ 19
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ehci_hcd 0000:00:1d.7: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:1e.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ahci 0000:00:1f.2: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> Jun 5 09:26:14 linux-7vph kernel: ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[17] MMIO=[f4801000-f48017ff] Max Packet=[2048] IR/IT contexts=[4/4]
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.2: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.3: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.4: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:15:00.5: PME# disabled
> Jun 5 09:26:14 linux-7vph kernel: sd 0:0:0:0: [sda] Starting disk
> Jun 5 09:26:14 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Jun 5 09:26:14 linux-7vph kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:14 linux-7vph kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
> Jun 5 09:26:14 linux-7vph kernel: ata1: irq_stat 0x00400040, connection status changed
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:14 linux-7vph kernel: ata1: EH complete
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: restoring config space at offset 0x1 (was 0x900407, writing 0x900403)
> Jun 5 09:26:14 linux-7vph kernel: pci 0000:00:02.0: setting latency timer to 64
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: ACPI cmd e3/00:02:00:00:00:a0 succeeded
> Jun 5 09:26:14 linux-7vph kernel: ata2.00: configured for UDMA/133
> Jun 5 09:26:14 linux-7vph kernel: Restarting tasks ... <3>ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
> Jun 5 09:26:14 linux-7vph kernel: ata2: irq_stat 0x40000001
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: exception Emask 0x10 SAct 0x1ff SErr 0x50000 action 0xe frozen
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: irq_stat 0x00400008, PHY RDY changed
> Jun 5 09:26:14 linux-7vph kernel: ata1: SError: { PHYRdyChg CommWake }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:00:5f:06:44/00:00:0c:00:00/40 tag 0 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/58:08:6f:06:44/00:00:0c:00:00/40 tag 1 ncq 45056 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:10:d7:06:44/00:00:0c:00:00/40 tag 2 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:18:07:07:44/00:00:0c:00:00/40 tag 3 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/18:20:17:07:44/00:00:0c:00:00/40 tag 4 ncq 12288 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:28:67:88:70/00:00:0e:00:00/40 tag 5 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/08:30:17:70:14/00:00:0d:00:00/40 tag 6 ncq 4096 in
> Jun 5 09:26:14 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:14 linux-7vph kernel: ata1.00: cmd 60/d8:38:5f:89:6f/00:00:0e:00:00/40 tag 7 ncq 110592 in
> Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:15 linux-7vph kernel: ata1.00: cmd 60/28:40:3f:8a:6f/00:00:0e:00:00/40 tag 8 ncq 20480 in
> Jun 5 09:26:15 linux-7vph kernel: res 50/00:28:3f:8a:6f/00:00:0e:00:00/40 Emask 0x10 (ATA bus error)
> Jun 5 09:26:15 linux-7vph kernel: ata1.00: status: { DRDY }
> Jun 5 09:26:15 linux-7vph kernel: ata1: hard resetting link
> Jun 5 09:26:15 linux-7vph kernel: ata2.00: configured for UDMA/133
> Jun 5 09:26:15 linux-7vph kernel: ata2: EH complete
> Jun 5 09:26:15 linux-7vph kernel: done.
> Jun 5 09:26:15 linux-7vph kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 succeeded
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 filtered out
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:16 linux-7vph kernel: ata1.00: configured for UDMA/133
> Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> Jun 5 09:26:16 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> Jun 5 09:26:16 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> Jun 5 09:26:16 linux-7vph kernel: 0e 6f 8a 3f
> Jun 5 09:26:16 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> Jun 5 09:26:16 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190687
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio e1223140 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=e1223140 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa0c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa0c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa140 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa140 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa1c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa1c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa240 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa240 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa2c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa2c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa340 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa340 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa3c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa3c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa440 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa440 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa4c0 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa4c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa540 with -5
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: bio=f5caa540 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:16 linux-7vph kernel: XXX sda: failing bio f5caa5c0 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa5c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa640 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa640 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa6c0 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa6c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa740 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa740 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa7c0 with -5
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: bio=f5caa7c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:19 linux-7vph kernel: XXX sda: failing bio f5caa840 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa840 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa8c0 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa8c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa940 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa940 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caa9c0 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caa9c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaa40 with -5
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: bio=f5caaa40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:21 linux-7vph kernel: XXX sda: failing bio f5caaac0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaac0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caab40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caab40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caabc0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caabc0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caac40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caac40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caacc0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caacc0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caad40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caad40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
> Jun 5 09:26:22 linux-7vph kernel: Descriptor sense data with sense descriptors (in hex):
> Jun 5 09:26:22 linux-7vph kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
> Jun 5 09:26:22 linux-7vph kernel: 0e 6f 8a 3f
> Jun 5 09:26:22 linux-7vph kernel: sd 0:0:0:0: [sda] Add. Sense: No additional sense information
> Jun 5 09:26:22 linux-7vph kernel: end_request: I/O error, dev sda, sector 242190911
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caadc0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caadc0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caae40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caae40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaec0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaec0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio f5caaf40 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=f5caaf40 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: failing bio e12230c0 with -5
> Jun 5 09:26:22 linux-7vph kernel: XXX sda: bio=e12230c0 error=-5 bi_end_io=c01a49fd
> Jun 5 09:26:22 linux-7vph kernel: ata1: EH complete
> Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346994, block=9404448
> Jun 5 09:26:22 linux-7vph kernel: EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Tejun Heo
2009-06-25 12:57:38 UTC
Permalink
Sorry about the long delay.

Niel Lambrechts wrote:
> Morning Tejun,
>
> Tejun Heo wrote:
>> Hello,
>>
>> Can you please do the followings?
>>
>> 1. Apply the attached patch, build & boot
>>
> I chose 2.6.30-rc7...
>> 2. Trigger the problem and record dmesg
>>
> It took 3 days and quite a few hibernate attempts ... :-)
>
>> 3. On failed IO, the kernel will print the address of bi_endio. Run
>> "nm -n" on the vmlinux in the kernel build root and look up which
>> function it is and post the dmesg and function name.
> I did not have that specific vmlinux.o file any more, but
> /boot/System.map-2.6.30-rc7-pae shows:
> c01a49fd t end_bio_bh_io_sync

So, it's coming from submit_bh()

> Hope this is sufficient to help you. Sorry if this is silly - being so
> inexperienced with the kernel - but I wondered if or why a dump_stack()
> in that debug patch would not be helpful?

The result is perfectly good and yeah dump_stack() on the issue path
would help but the problem is that block IO requests are processed
asynchronously so by the time we find out which request fail, the
requester stack is long gone. We can either record the stack trace
with each request or trace it back one step at a time by chasing down
the completion callbacks. The first requires more coding, so... :-)

Looks like the request gotta be coming from __breadahead(). The only
place this is used in ext4 is in __ext4_get_inode_loc(). Ah.. it also
contains the matching error message. I still don't see how the READA
buffer reads can affect the synchronous path. They're doing proper
exclusion via buffer lock. Maybe they're getting merged? Yeap, looks
like block code is merging READAs and regular READs.

Can you please try the attached patch and reproduce the problem and
report the kernel log? Hopefully, this will be the last debug run.

Thanks.

--
tejun
Niel Lambrechts
2009-06-25 15:25:29 UTC
Permalink
On 06/25/2009 02:57 PM, Tejun Heo wrote:
> Sorry about the long delay.
>
> The result is perfectly good and yeah dump_stack() on the issue path
> would help but the problem is that block IO requests are processed
> asynchronously so by the time we find out which request fail, the
> requester stack is long gone. We can either record the stack trace
> with each request or trace it back one step at a time by chasing down
> the completion callbacks. The first requires more coding, so... :-)
>
> Looks like the request gotta be coming from __breadahead(). The only
> place this is used in ext4 is in __ext4_get_inode_loc(). Ah.. it also
> contains the matching error message. I still don't see how the READA
> buffer reads can affect the synchronous path. They're doing proper
> exclusion via buffer lock. Maybe they're getting merged? Yeap, looks
> like block code is merging READAs and regular READs.
>
> Can you please try the attached patch and reproduce the problem and
> report the kernel log? Hopefully, this will be the last debug run.
>

Hi Tejun,

I've recently switched my root partition from OpenSUSE 11.1 to Fedora 11
and since then I've not again seen the issue. I'm still using vanilla
2.6.30 generated with the same .config and EXT4 as before, so I have no
idea why I cannot reproduce the issue. I still use hibernate + sleep
frequently, and I just checked - I have 5 days uptime with a mount count
of 20 and the file-system is still clean.

The one big difference is that my original partition was a EXT2 -> EXT3
-> EXT4 upgrade job over a long period of time, and some of the EXT4
parameters now used by Fedora 11 on the reformatted root partition are
different from what I had then. Here is a summary of the differences in
case it matters at all:

Current settings:
Default mount options: user_xattr acl
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4

Previous settings:
Default mount options: (none)
Inodes per group: 8176
Inode blocks per group: 511
Default directory hash: tea

If I do notice any such errors again I'll apply the debug patch and let
you know, but it does seem as if the upgrade made this issue disappear...

Regards,
Niel
Tejun Heo
2009-06-26 00:46:33 UTC
Permalink
Hello, Niel.

Niel Lambrechts wrote:
> I've recently switched my root partition from OpenSUSE 11.1 to Fedora 11
> and since then I've not again seen the issue. I'm still using vanilla
> 2.6.30 generated with the same .config and EXT4 as before, so I have no
> idea why I cannot reproduce the issue. I still use hibernate + sleep
> frequently, and I just checked - I have 5 days uptime with a mount count
> of 20 and the file-system is still clean.
>
> The one big difference is that my original partition was a EXT2 -> EXT3
> -> EXT4 upgrade job over a long period of time, and some of the EXT4
> parameters now used by Fedora 11 on the reformatted root partition are
> different from what I had then. Here is a summary of the differences in
> case it matters at all:
>
> Current settings:
> Default mount options: user_xattr acl
> Inodes per group: 8192
> Inode blocks per group: 512
> Flex block group size: 16
> Required extra isize: 28
> Desired extra isize: 28
> Default directory hash: half_md4
>
> Previous settings:
> Default mount options: (none)
> Inodes per group: 8176
> Inode blocks per group: 511
> Default directory hash: tea
>
> If I do notice any such errors again I'll apply the debug patch and let
> you know, but it does seem as if the upgrade made this issue disappear...

Eh... if my analysis is correct the bug will be dependent on some
non-deterministic things including timing and layout of blocks of the
file which the filesystem touchs during suspend/resume. I'm fairly
sure the problem is still there. Oh well, it seems I'll have to
construct a synthetic case to test it. Anyways, please let me know if
you encounter the problem again. I'll keep you in loop with further
patches.

Thanks.

--
tejun
Niel Lambrechts
2009-06-26 06:24:24 UTC
Permalink
On 06/26/2009 02:46 AM, Tejun Heo wrote:
> Hello, Niel.
>
> Niel Lambrechts wrote:
>
>> I've recently switched my root partition from OpenSUSE 11.1 to Fedora 11
>> and since then I've not again seen the issue. I'm still using vanilla
>> 2.6.30 generated with the same .config and EXT4 as before, so I have no
>> idea why I cannot reproduce the issue. I still use hibernate + sleep
>> frequently, and I just checked - I have 5 days uptime with a mount count
>> of 20 and the file-system is still clean.
>>
>> The one big difference is that my original partition was a EXT2 -> EXT3
>> -> EXT4 upgrade job over a long period of time, and some of the EXT4
>> parameters now used by Fedora 11 on the reformatted root partition are
>> different from what I had then. Here is a summary of the differences in
>> case it matters at all:
>>
<snip>
> Eh... if my analysis is correct the bug will be dependent on some
> non-deterministic things including timing and layout of blocks of the
> file which the filesystem touchs during suspend/resume. I'm fairly
> sure the problem is still there. Oh well, it seems I'll have to
> construct a synthetic case to test it. Anyways, please let me know if
> you encounter the problem again. I'll keep you in loop with further
> patches.
>

Understood, what I mean to imply is that I might (at least temporarily)
be shielded from the bug because reformatting the root partition could
possibly have removed some fragmentation and improved the initial disk
layout. With the previous partition layout there would also have been a
mixture of extents and non-extents based files since that system was
upgraded from EXT2 to EXT3 to EXT4.

I'll try to boot from 2.6.29 again as well, and put it through some
suspend/resume cycles just in case that made any difference.

Regards,
Niel
Berthold Gunreben
2009-09-18 20:26:39 UTC
Permalink
Am Freitag 26 Juni 2009 schrieb Niel Lambrechts:
> On 06/26/2009 02:46 AM, Tejun Heo wrote:
> > Eh... if my analysis is correct the bug will be dependent on some
> > non-deterministic things including timing and layout of blocks of the
> > file which the filesystem touchs during suspend/resume. I'm fairly
> > sure the problem is still there. Oh well, it seems I'll have to
> > construct a synthetic case to test it. Anyways, please let me know if
> > you encounter the problem again. I'll keep you in loop with further
> > patches.

I am not quite sure if I am at the right place here, however, I get very
similar problems with a totally different setup. What I do is the following:

I added a fourth disk to a software raid5 array and did setup the raid
completely from scratch (the same disks have been running for about
1.5 years without any problems before, the difference is that previously
one of the disks was setup as hot spare).

After the software raid was in sync, I started to copy my data back to
the raid, and after a less than 5 minutes time, one of the disks failed
(from /var/log/warn):

Sep 18 22:04:02 Bacchus kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Sep 18 22:04:02 Bacchus kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
Sep 18 22:04:02 Bacchus kernel: ata3: SError: { PHYRdyChg }
Sep 18 22:04:02 Bacchus kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Sep 18 22:04:02 Bacchus kernel: res 40/00:0c:3f:6c:d5/00:00:15:00:00/40 Emask 0x10 (ATA bus error)
Sep 18 22:04:02 Bacchus kernel: ata3.00: status: { DRDY }
Sep 18 22:04:02 Bacchus kernel: ata3: hard resetting link
Sep 18 22:04:07 Bacchus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 18 22:04:12 Bacchus kernel: ata3.00: qc timeout (cmd 0xec)
Sep 18 22:04:12 Bacchus kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Sep 18 22:04:12 Bacchus kernel: ata3.00: revalidation failed (errno=-5)
Sep 18 22:04:12 Bacchus kernel: ata3: hard resetting link
Sep 18 22:04:12 Bacchus kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 18 22:04:12 Bacchus kernel: ata3.00: configured for UDMA/133
Sep 18 22:04:12 Bacchus kernel: ata3: EH complete
Sep 18 22:04:12 Bacchus kernel: end_request: I/O error, dev sdc, sector 1953519853
Sep 18 22:04:12 Bacchus kernel: md: super_written gets error=-5, uptodate=0
Sep 18 22:04:12 Bacchus kernel: raid5: Disk failure on sdc1, disabling device.
Sep 18 22:04:12 Bacchus kernel: raid5: Operation continuing on 3 devices.
Sep 18 22:04:13 Bacchus kernel: RAID5 conf printout:
Sep 18 22:04:13 Bacchus kernel: --- rd:4 wd:3
Sep 18 22:04:13 Bacchus kernel: disk 0, o:1, dev:sda1
Sep 18 22:04:13 Bacchus kernel: disk 1, o:1, dev:sdb1
Sep 18 22:04:13 Bacchus kernel: disk 2, o:0, dev:sdc1
Sep 18 22:04:13 Bacchus kernel: disk 3, o:1, dev:sdd1
Sep 18 22:04:13 Bacchus kernel: RAID5 conf printout:
Sep 18 22:04:13 Bacchus kernel: --- rd:4 wd:3
Sep 18 22:04:13 Bacchus kernel: disk 0, o:1, dev:sda1
Sep 18 22:04:13 Bacchus kernel: disk 1, o:1, dev:sdb1
Sep 18 22:04:13 Bacchus kernel: disk 3, o:1, dev:sdd1

The filesystem on /dev/md0 was XFS and there was no suspend/resume applied.
The raid is built with four 1TB disks, attached to a Intel ICH7 Controller:
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller (rev 01)

The system is a openSUSE11.1 with all patches applied. When I first
encountered this problem I had the original openSUSE kernel (2.6.27.29)
running, but I thought it was a good idea to go with a current upstream kernel
instead. The difference was, that with the openSUSE kernel, I could copy
several 100 GB before I lost the disk, while with the upstream kernel
(2.6.31 from ftp.kernel.org) I lost the disk after less than 4 GB. The result
however is the same.

smart does not tell me any problems with the disks, and I did not encounter
any problems during the build of the raid5 which took about 6 hours on
this machine. Thus I don't believe that there is a hardware problem.

Maybe you have an idea what I can do to provide more information.

Thanks in advance

Berthold
Berthold Gunreben
2009-09-30 09:58:43 UTC
Permalink
Hi Tejun,

thanks a lot for your reply.

Am Freitag 25 September 2009 schrieb Tejun Heo:
> Hello, Berthold.
>
> The disk is most likely losing power briefly. After boot, run
> "smartctl -a" on the device and record the output. After triggering
> the problem, do it again. See if Start_Stop_Count, Power_Cycle_Count
> or Power-Off_Retract_Count has increased. If so, take out your PSU,
> bury it half-deep in your backyard, apply some gasoline, light it up
> and enjoy the sight of perishing evil with a can of beer.

You might be right. However, I cannot reproduce the problem anymore, since I
switched to the totally unsupported JFS as filesystem.

In the meantime, I was able to copy 1.5TB of data back to the array, and the
system also survived artificially generated high load. If the problem is a
race (which I do not know), it might still be there. Obviously, it does not
show up as often again.

It could still be the power supply of course, but I don't understand why a new
kernel would trigger power outages so often (current kernels triggered the
problem latest after 5 minutes). Maybe it has something to do with the
chipset (ICH7R) which is capable of hot remove/add disks. Or it is related to
the hotswap harddisk slots in the case
(http://www.chenbro.eu/corporatesite/products_detail.php?sku=79 ). I have no
idea....

Thanks

Berthold
Tejun Heo
2009-09-30 10:26:37 UTC
Permalink
Hello,

Berthold Gunreben wrote:
> It could still be the power supply of course, but I don't understand
> why a new kernel would trigger power outages so often (current
> kernels triggered the problem latest after 5 minutes).

Probably because older kernels didn't handle barrier properly for
raid. Changing fs can definitely affect things but if I were you I
wouldn't be trusting any valuable data on that configuration.

--
tejun
Tejun Heo
2009-05-26 04:58:54 UTC
Permalink
Alan Cox wrote:
>> something to the cdrom which is attached to ata2. Something very
>> fishy is going on there. Sounds like an electric or some sort of
>> interference problem but I'm not sure. :-(
>
> I'm just wondering what happens on a resume if a phy change event is seen
> by the phy before or during the time registers (such as the port mappings
> on an ICH) are being restored (or not restored even). That seems to be one
> way you could get an event on the wrong port.

It doesn't really matter. Before the ports are resumed, all such
events are ignored.

Thanks.

--
tejun
Continue reading on narkive:
Loading...