Discussion:
Resume stops working between 2.6.16 and 2.6.17-rc1 on Dell Inspiron 6000
Paul Dickson
2006-05-28 21:02:38 UTC
I follow the Fedora development kernels and noticed that resuming from
suspending (and hibernate) stopped working at 2.6.16-git15 (Fedora Core
kernel 2102). Trouble was, my only previous kernel was 2.6.16-rc6-git12
(FC 2064) because I had been out of town for nearly two weeks (I did have
limited net access and that's how I got that last working version).

So yesterday I embarked on a git bisect of the problem. My first step was to
test my two end points and then the release in between (2.6.16).

good 2.6.16-rc6
good 2.6.16
bad 2.6.17-rc1

Building and testing a good kernel takes me about 70 minutes. If I make
mistakes it can easily take two times (or more!) longer.

I'm currently tracking my work at:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108

I'm currently building my fifth bisect.

00:00.0 Host bridge: Intel Corporation Mobile 915GM/PM/GMS/910GML Express Processor to DRAM Controller (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #3 (rev 03)
00:1d.3 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB UHCI #4 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) USB2 EHCI Controller (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev d3)
00:1e.2 Multimedia audio controller: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) AC'97 Audio Controller (rev 03)
00:1f.0 ISA bridge: Intel Corporation 82801FBM (ICH6M) LPC Interface Bridge (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801FBM (ICH6M) SATA Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) SMBus Controller (rev 03)
03:00.0 Ethernet controller: Broadcom Corporation BCM4401-B0 100Base-TX (rev 02)
03:01.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev b3)
03:01.1 FireWire (IEEE 1394): Ricoh Co Ltd R5C552 IEEE 1394 Controller (rev 08)
03:01.2 Class 0805: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 17)
03:03.0 Network controller: Intel Corporation PRO/Wireless 2200BG Network Connection (rev 05)

-Paul
Paul Dickson
2006-05-28 21:08:54 UTC
Post by Paul Dickson
Building and testing a good kernel takes me about 70 minutes. If I make
mistakes it can easily take two times (or more!) longer.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108
I'm currently building my fifth bisect.
Is there a method of bisecting that means neither "good" nor "bad"? I
have run into kernel problems that are not related to the problem I'm
attempting to track. Some are not avoidable by changing the .config (see
the third bisect in comments 10 and 11 in the bugzilla report).

-Paul
Rafael J. Wysocki
2006-05-28 21:24:12 UTC
Post by Paul Dickson
Post by Paul Dickson
Building and testing a good kernel takes me about 70 minutes. If I make
mistakes it can easily take two times (or more!) longer.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108
I'm currently building my fifth bisect.
Could you please also try if the problems persist if you boot with
init=/bin/bash?

Besides, it would be helpful if you were able to get a serial console log
from the failing system.
Post by Paul Dickson
Is there a method of bisecting that means neither "good" nor "bad"? I
have run into kernel problems that are not related to the problem I'm
attempting to track. Some are not avoidable by changing the .config (see
the third bisect in comments 10 and 11 in the bugzilla report).
There are lots of patches between 2.6.16-rc* and 2.6.17-rc1, most of them
having stayed in -mm for some time. If you found the first failing -mm kernel,
it would be easier to catch the offending patch.

BTW, have you tried any kernel _after_ 2.6.17-rc1? If not, I'd start from
these.

Greetings,
Rafael
Dave Jones
2006-05-28 21:34:14 UTC
Post by Rafael J. Wysocki
Besides, it would be helpful if you were able to get a serial console log
from the failing system.
I think I've seen the same problem on one of my (similar spec) laptops.
Serial console was useless. On resume, there's a short spew of garbage
(just like if the baud rate were misconfigured) over serial before it
locks up completely. Adjusting the speed on the other end of the cable
made no difference, nothing but garbage comes out.
Maybe serial needs some suspend/resume hooks to reinitialise state?
Post by Rafael J. Wysocki
BTW, have you tried any kernel _after_ 2.6.17-rc1? If not, I'd start from
these.
If it's the same problem I'm seeing, it's still there in rc5.
I'll continue to poke at it when I get time.

Dave
--
http://www.codemonkey.org.uk
Sanjoy Mahajan
2006-05-29 11:37:23 UTC
Post by Dave Jones
I think I've seen the same problem on one of my (similar spec) laptops.
Serial console was useless. On resume, there's a short spew of garbage
(just like if the baud rate were misconfigured) over serial before it
locks up completely.
<http://bugzilla.kernel.org/show_bug.cgi?id=4270> discusses a similar
problem on a couple of machines. In my resume script (for a TP 600X),
I have to restore the serial console with

setserial -a /dev/ttyS0

Until that magic executes, garbage characters (like modem noise)
appear across the serial console.

-Sanjoy
Dave Jones
2006-05-29 14:52:55 UTC
Post by Sanjoy Mahajan
Post by Dave Jones
I think I've seen the same problem on one of my (similar spec) laptops.
Serial console was useless. On resume, there's a short spew of garbage
(just like if the baud rate were misconfigured) over serial before it
locks up completely.
<http://bugzilla.kernel.org/show_bug.cgi?id=4270> discusses a similar
problem on a couple of machines. In my resume script (for a TP 600X),
I have to restore the serial console with
setserial -a /dev/ttyS0
Until that magic executes, garbage characters (like modem noise)
appear across the serial console.
With the resume failure I'm seeing, we don't get back to userspace
to run anything like this. It goes bang long before that.

The SATA fix Mark proposed also didn't improve the situation for me :-/

Dave
--
http://www.codemonkey.org.uk
Paul Dickson
2006-05-31 02:45:09 UTC
Post by Dave Jones
The SATA fix Mark proposed also didn't improve the situation for me :-/
Fedora kernel 2230 is supposed to include the patch, yet resuming doesn't
work with that kernel.

-Paul
Pavel Machek
2006-05-30 15:29:26 UTC
Hi!
Post by Dave Jones
Post by Sanjoy Mahajan
Post by Dave Jones
I think I've seen the same problem on one of my (similar spec) laptops.
Serial console was useless. On resume, there's a short spew of garbage
(just like if the baud rate were misconfigured) over serial before it
locks up completely.
<http://bugzilla.kernel.org/show_bug.cgi?id=4270> discusses a similar
problem on a couple of machines. In my resume script (for a TP 600X),
I have to restore the serial console with
setserial -a /dev/ttyS0
Until that magic executes, garbage characters (like modem noise)
appear across the serial console.
With the resume failure I'm seeing, we don't get back to userspace
to run anything like this. It goes bang long before that.
The SATA fix Mark proposed also didn't improve the situation for me :-/
If setserial -a is needed.. it means that someone really needs to fix
suspend/resume support for serial... do it on working machine to
enable debugging of broken ones...

(But x32 has no serials, so I'm unlikely to code it...)
--
Thanks for all the (sleeping) penguins.
Rafael J. Wysocki
2006-06-03 08:58:33 UTC
Hi,
Post by Pavel Machek
Post by Dave Jones
Post by Sanjoy Mahajan
Post by Dave Jones
I think I've seen the same problem on one of my (similar spec) laptops.
Serial console was useless. On resume, there's a short spew of garbage
(just like if the baud rate were misconfigured) over serial before it
locks up completely.
<http://bugzilla.kernel.org/show_bug.cgi?id=4270> discusses a similar
problem on a couple of machines. In my resume script (for a TP 600X),
I have to restore the serial console with
setserial -a /dev/ttyS0
Until that magic executes, garbage characters (like modem noise)
appear across the serial console.
With the resume failure I'm seeing, we don't get back to userspace
to run anything like this. It goes bang long before that.
The SATA fix Mark proposed also didn't improve the situation for me :-/
If setserial -a is needed.. it means that someone really needs to fix
suspend/resume support for serial... do it on working machine to
enable debugging of broken ones...
(But x32 has no serials, so I'm unlikely to code it...)
There's been something wrong with the serial console on the resume front
on my box for quite some time now. However, I don't use it very often and I
have a patch that disables suspending of the console's serial port, if anyone
is interested (no, I'm not going to post it to the list ;-) ).

I only observed that the serial console works just fine wrt suspend/resume
if I boot with init=/bin/bash.

Greetings,
Rafael
Russell King
2006-06-03 09:11:33 UTC
Post by Pavel Machek
Post by Dave Jones
Post by Sanjoy Mahajan
Post by Dave Jones
I think I've seen the same problem on one of my (similar spec) laptops.
Serial console was useless. On resume, there's a short spew of garbage
(just like if the baud rate were misconfigured) over serial before it
locks up completely.
<http://bugzilla.kernel.org/show_bug.cgi?id=4270> discusses a similar
problem on a couple of machines. In my resume script (for a TP 600X),
I have to restore the serial console with
setserial -a /dev/ttyS0
Until that magic executes, garbage characters (like modem noise)
appear across the serial console.
With the resume failure I'm seeing, we don't get back to userspace
to run anything like this. It goes bang long before that.
The SATA fix Mark proposed also didn't improve the situation for me :-/
If setserial -a is needed.. it means that someone really needs to fix
suspend/resume support for serial... do it on working machine to
enable debugging of broken ones...
I've explained why this occurs in bugzilla - but for the sake of
repeating repeating repeating myself at great length, let's repeat
it again here.

The serial layer does _not_ have access to the "current" termios
settings due to the layering by the tty subsystem. If the serial
port being used by serial console has been opened once by the user,
but is closed at the moment when a suspend/resume cycle occurs,
the serial layer and lower level drivers do not have access to the
baud rate.

Hence, it is impossible for the serial layer to do a proper resume
in this scenario. Either always suspend with the console port open
or never open the console port before suspend. Alternatively, we
need the tty layer to mature, so that there is some way for drivers
to get the termios structures for the console from the upper layer.
Or maybe we need the tty layer to be responsible for implementing
suspend/resume support for tty devices.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
Pavel Machek
2006-06-09 08:38:33 UTC
Hi!
Post by Russell King
Post by Pavel Machek
Post by Dave Jones
With the resume failure I'm seeing, we don't get back to userspace
to run anything like this. It goes bang long before that.
The SATA fix Mark proposed also didn't improve the situation for me :-/
If setserial -a is needed.. it means that someone really needs to fix
suspend/resume support for serial... do it on working machine to
enable debugging of broken ones...
I've explained why this occurs in bugzilla - but for the sake of
repeating repeating repeating myself at great length, let's repeat
it again here.
The serial layer does _not_ have access to the "current" termios
settings due to the layering by the tty subsystem. If the serial
port being used by serial console has been opened once by the user,
but is closed at the moment when a suspend/resume cycle occurs,
the serial layer and lower level drivers do not have access to the
baud rate.
Could serial layer just cache "last baud rate" in some kind of
software shadow register? Yes, it is slightly ugly, but should do the trick.
Post by Russell King
Hence, it is impossible for the serial layer to do a proper resume
in this scenario. Either always suspend with the console port open
or never open the console port before suspend. Alternatively, we
need the tty layer to mature, so that there is some way for drivers
to get the termios structures for the console from the upper layer.
Or maybe we need the tty layer to be responsible for implementing
suspend/resume support for tty devices.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Russell King
2006-06-09 08:42:34 UTC
Post by Pavel Machek
Post by Russell King
The serial layer does _not_ have access to the "current" termios
settings due to the layering by the tty subsystem. If the serial
port being used by serial console has been opened once by the user,
but is closed at the moment when a suspend/resume cycle occurs,
the serial layer and lower level drivers do not have access to the
baud rate.
Could serial layer just cache "last baud rate" in some kind of
software shadow register? Yes, it is slightly ugly, but should do the trick.
That's not a new suggestion. How do you deal with the case where
you have console on two or more different serial ports? That's
the problem with this approach.

The only sane solution is for the tty layer to be adjusted to allow
suspend/resume support for consoles.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
Pavel Machek
2006-06-09 08:46:00 UTC
Post by Russell King
Post by Pavel Machek
Post by Russell King
The serial layer does _not_ have access to the "current" termios
settings due to the layering by the tty subsystem. If the serial
port being used by serial console has been opened once by the user,
but is closed at the moment when a suspend/resume cycle occurs,
the serial layer and lower level drivers do not have access to the
baud rate.
Could serial layer just cache "last baud rate" in some kind of
software shadow register? Yes, it is slightly ugly, but should do the trick.
That's not a new suggestion. How do you deal with the case where
you have console on two or more different serial ports? That's
the problem with this approach.
Well, each of serial ports has hardware baud_rate register. I'll need
software baud_rate_shadow for every serial port, setting
baud_rate_shadow each time baud_rate is set. During resume, I restore
baud_rate from baud_rate_shadow for each serial port.
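
A minimal userspace sketch of the shadow idea described above, for illustration only. It is not kernel code: `toy_port`, `NPORTS` and the `toy_*` functions are made-up names, and "suspend" is modelled simply as the hardware register being cleared.

```c
#include <assert.h>

#define NPORTS 4	/* hypothetical number of serial ports */

/* Toy model of one port: the "hardware" register loses its value across
 * suspend, the software shadow kept in RAM does not. */
struct toy_port {
	unsigned int baud_rate;		/* stands in for the hardware register */
	unsigned int baud_rate_shadow;	/* software copy */
};

static struct toy_port ports[NPORTS];

/* Every write to the register also updates the shadow. */
void toy_set_baud(int line, unsigned int baud)
{
	assert(line >= 0 && line < NPORTS);
	ports[line].baud_rate = baud;
	ports[line].baud_rate_shadow = baud;
}

unsigned int toy_get_baud(int line)
{
	assert(line >= 0 && line < NPORTS);
	return ports[line].baud_rate;
}

/* Model suspend: the hardware forgets its setting. */
void toy_suspend(void)
{
	int i;

	for (i = 0; i < NPORTS; i++)
		ports[i].baud_rate = 0;
}

/* Model resume: restore each register from its own shadow. */
void toy_resume(void)
{
	int i;

	for (i = 0; i < NPORTS; i++)
		ports[i].baud_rate = ports[i].baud_rate_shadow;
}
```

In this model, toy_set_baud(0, 115200) followed by toy_suspend() and toy_resume() leaves port 0 back at 115200, independently of the other ports. Any other line setting (word length, stop bits, parity) would need a shadow of its own in the same way.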

What am I missing?
Post by Russell King
The only sane solution is for the tty layer to be adjusted to allow
suspend/resume support for consoles.
Well, solution above is likely to be ugly, but even ugly patch would
help people debug s-to-RAM.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Russell King
2006-06-09 08:51:34 UTC
Post by Pavel Machek
Post by Russell King
Post by Pavel Machek
Post by Russell King
The serial layer does _not_ have access to the "current" termios
settings due to the layering by the tty subsystem. If the serial
port being used by serial console has been opened once by the user,
but is closed at the moment when a suspend/resume cycle occurs,
the serial layer and lower level drivers do not have access to the
baud rate.
Could serial layer just cache "last baud rate" in some kind of
software shadow register? Yes, it is slightly ugly, but should do the trick.
That's not a new suggestion. How do you deal with the case where
you have console on two or more different serial ports? That's
the problem with this approach.
Well, each of serial ports has hardware baud_rate register. I'll need
software baud_rate_shadow for every serial port, setting
baud_rate_shadow each time baud_rate is set. During resume, I restore
baud_rate from baud_rate_shadow for each serial port.
What am I missing?
What about the other parameters like the bit size, number of stop bits,
etc?
Post by Pavel Machek
Post by Russell King
The only sane solution is for the tty layer to be adjusted to allow
suspend/resume support for consoles.
Well, solution above is likely to be ugly, but even ugly patch would
help people debug s-to-RAM.
Why not investigate doing the proper solution? Since you're obviously
one of the ones who is able to reproduce the situation (I'm not), you're
perfectly placed to develop and test such a solution, and I think it's
well within your capability.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
Russell King
2006-06-11 14:08:37 UTC
Post by Russell King
The only sane solution is for the tty layer to be adjusted to allow
suspend/resume support for consoles.
And for those who can't work out how to do that, here's something which
_probably_ does it. Would folk mind testing it out please?

diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -1674,6 +1674,19 @@ release_mem_out:
 }

 /*
+ * Get a copy of the termios structure for the driver/index
+ */
+void tty_get_termios(struct tty_driver *driver, int idx, struct termios *tio)
+{
+	lock_kernel();
+	if (driver->termios[idx])
+		*tio = *driver->termios[idx];
+	else
+		*tio = driver->init_termios;
+	unlock_kernel();
+}
+
+/*
  * Releases memory associated with a tty structure, and clears out the
  * driver table slots.
  */
diff --git a/drivers/serial/serial_core.c b/drivers/serial/serial_core.c
--- a/drivers/serial/serial_core.c
+++ b/drivers/serial/serial_core.c
@@ -1968,16 +1968,16 @@ int uart_resume_port(struct uart_driver
 		struct termios termios;

 		/*
-		 * First try to use the console cflag setting.
+		 * Get the termios for this line
 		 */
-		memset(&termios, 0, sizeof(struct termios));
-		termios.c_cflag = port->cons->cflag;
+		tty_get_termios(drv->tty_driver, port->line, &termios);

 		/*
-		 * If that's unset, use the tty termios setting.
+		 * If the console cflag is still set, substitute that
+		 * for the termios cflag.
 		 */
-		if (state->info && state->info->tty && termios.c_cflag == 0)
-			termios = *state->info->tty->termios;
+		if (port->cons->cflag)
+			termios.c_cflag = port->cons->cflag;

 		port->ops->set_termios(port, &termios, NULL);
 		console_start(port->cons);
diff --git a/include/linux/tty.h b/include/linux/tty.h
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -297,6 +297,8 @@ extern int tty_read_raw_data(struct tty_
 			     int buflen);
 extern void tty_write_message(struct tty_struct *tty, char *msg);

+extern void tty_get_termios(struct tty_driver *drv, int idx, struct termios *tio);
+
 extern int is_orphaned_pgrp(int pgrp);
 extern int is_ignored(int sig);
 extern int tty_signal(int sig, struct tty_struct *tty);
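
For anyone who wants to look at the selection logic of the patch in isolation, here is a rough userspace model of it. This is a sketch under assumptions, not the kernel code: the `toy_*` structs are cut-down stand-ins for struct termios and struct tty_driver, `TOY_NLINES` is arbitrary, and the lock_kernel()/unlock_kernel() pair is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NLINES 2	/* hypothetical number of lines on the driver */

/* Reduced stand-in for struct termios: only the field used here. */
struct toy_termios {
	unsigned int c_cflag;
};

/* Reduced stand-in for struct tty_driver. */
struct toy_tty_driver {
	struct toy_termios *termios[TOY_NLINES];	/* NULL if nothing saved */
	struct toy_termios init_termios;		/* driver defaults */
};

/* Copy the saved per-line settings if any exist, else the defaults. */
void toy_get_termios(const struct toy_tty_driver *drv, int idx,
		     struct toy_termios *tio)
{
	assert(idx >= 0 && idx < TOY_NLINES);
	if (drv->termios[idx])
		*tio = *drv->termios[idx];
	else
		*tio = drv->init_termios;
}

/* Choose the settings to program on resume: start from the per-line
 * termios, then let a non-zero console cflag override c_cflag. */
struct toy_termios toy_resume_termios(const struct toy_tty_driver *drv,
				      int idx, unsigned int cons_cflag)
{
	struct toy_termios tio;

	toy_get_termios(drv, idx, &tio);
	if (cons_cflag)
		tio.c_cflag = cons_cflag;
	return tio;
}
```

The point of the ordering is that the resume path no longer depends on the port being open: a line that has saved settings gets them back, a line without any falls back to the driver defaults, and a console cflag that is still set wins in both cases.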
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
Paul Dickson
2006-05-28 22:02:08 UTC
Post by Rafael J. Wysocki
Post by Paul Dickson
Post by Paul Dickson
Building and testing a good kernel takes me about 70 minutes. If I make
mistakes it can easily take two times (or more!) longer.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108
I'm currently building my fifth bisect.
Could you please also try if the problems persist if you boot with
init=/bin/bash?
I'll try it after I send this message. I'm guessing you're referring to
my third bisect. I still have that and will use it.

I also have my 5th bisect ready for this reboot too...
Post by Rafael J. Wysocki
Besides, it would be helpful if you were able to get a serial console log
from the failing system.
No serial port on this notebook. I've tried
"netconsole=***@192.168.1.9/eth0,***@192.168.1.3/00:01:02:77:7D:E1" but
nothing happens (there's not even a log message that this is unsupported).
Post by Rafael J. Wysocki
Post by Paul Dickson
Is there a method of bisecting that means neither "good" nor "bad"? I
have run into kernel problems that are not related to the problem I'm
attempting to track. Some are not avoidable by changing the .config (see
the third bisect in comments 10 and 11 in the bugzilla report).
There are lots of patches between 2.6.16-rc* and 2.6.17-rc1, most of them
having stayed in -mm for some time. If you found the first failing -mm kernel,
it would be easier to catch the offending patch.
BTW, have you tried any kernel _after_ 2.6.17-rc1? If not, I'd start from
these.
I have been using the Fedora development kernels. The last I'm SURE I
tested was 2211 (2.6.17-rc4-git11). It has the same problems as the
2.6.17-rc1 I compiled from the git database. It's been the same
throughout the series.

I may try sshing into my notebook when I finish these current bisect
tests to see if it's still the HD being made RO. This is assuming ssh
will keep the connection through a suspend.

-Paul
Paul Dickson
2006-05-29 00:12:00 UTC
Post by Paul Dickson
Post by Rafael J. Wysocki
Post by Paul Dickson
Building and testing a good kernel takes me about 70 minutes. If I make
mistakes it can easily take two times (or more!) longer.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108
I'm currently building my fifth bisect.
Could you please also try if the problems persist if you boot with
init=/bin/bash?
I'll try it after I send this message. I'm guessing you're referring to
my third bisect. I still have that and will use it.
The switchroot in the initrd can't find the ext3 root fs, so no root was
found and bash couldn't be found either. This is the third bisect. All of this
is because of a compiler warning in ext3 and reiserfs. If I recall
correctly, something like "generic_slice_[...](?) uninitialized".
Post by Paul Dickson
I may try sshing into my notebook when I finish these current bisect
tests to see if it's still the HD being made RO. This is assuming ssh
will keep the connection through a suspend.
ssh retains its connection on a good kernel, but it gets no response
from a bad kernel.

-Paul
Arjan van de Ven
2006-05-28 21:11:23 UTC
Post by Paul Dickson
I follow the Fedora development kernels and noticed that resuming from
suspending (and hibernate) stopped working at 2.6.16-git15 (Fedora Core
kernel 2102). Trouble was, my only previous kernel was 2.6.16-rc6-git12
(FC 2064) because I had been out of town for nearly two weeks (I did have
limited net access and that's how I got that last working version).
have you verified they have both the same general .config file? Like
both are smp or both UP, same APIC settings etc etc
That's all easy to check and those two are the most likely candidates in
config land that could break resume...
(not saying those are the cause or have changed, no idea, but they're
really cheap to check that none have changed, much cheaper than a
bisect ;)
Paul Dickson
2006-05-28 21:29:51 UTC
Post by Arjan van de Ven
Post by Paul Dickson
I follow the Fedora development kernels and noticed that resuming from
suspending (and hibernate) stopped working at 2.6.16-git15 (Fedora Core
kernel 2102). Trouble was, my only previous kernel was 2.6.16-rc6-git12
(FC 2064) because I had been out of town for nearly two weeks (I did have
limited net access and that's how I got that last working version).
have you verified they have both the same general .config file? Like
both are smp or both UP, same APIC settings etc etc
That's all easy to check and those two are the most likely candidates in
config land that could break resume...
(not saying those are the cause or have changed, no idea, but they're
really cheap to check that none have changed, much cheaper than a
bisect ;)
Not the Fedora kernels, but the ones I'm bisecting have the same .config
(modulo "make oldconfig"). I did lose some time when somehow SMP got
enabled between the test of 2.6.16 and 2.6.17-rc1. I ended up testing
2.6.17-rc1 without suspend being in the kernel (that kernel wouldn't
suspend). After that, I have been verifying that each kernel will have
suspend compiled in before the hour long make session.

-Paul
Mark Lord
2006-05-28 21:49:35 UTC
We've just now put out a one-liner patch to libata that fixes
resume on my own Inspiron, and for other machines as well.

Does it fix the problem here too? (copy of patch is attached)
Paul Dickson
2006-05-29 00:21:01 UTC
Post by Mark Lord
We've just now put out a one-liner patch to libata that fixes
resume on my own Inspiron, and for other machines as well.
Does it fix the problem here too? (copy of patch is attached)
Yes. I compiled 2.6.17-rc5 without it and verified the problem occurs,
then applied the patch and tried it again. This time it worked.

I can suspend AND hibernate with the patch.


I still get the BUG message on resuming that I reported in bugzilla
comment #9:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108#c9
It is likely a separate bug.

Thanks for the patch!

-Paul
Andrew Morton
2006-05-29 00:40:11 UTC
On Sun, 28 May 2006 17:21:01 -0700
Post by Paul Dickson
I still get the BUG message on resuming that I reported in bugzilla
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108#c9
It is likely a separate bug.
That's ACPI doing a GFP_KERNEL allocation while resume has disabled
interrupts. It won't cause much trouble and I'm pretty sure we
subsequently fixed that.
Paul Dickson
2006-05-29 01:47:20 UTC
Post by Andrew Morton
On Sun, 28 May 2006 17:21:01 -0700
Post by Paul Dickson
I still get the BUG message on resuming that I reported in bugzilla
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108#c9
It is likely a separate bug.
That's ACPI doing a GFP_KERNEL allocation while resume has disabled
interrupts. It won't cause much trouble and I'm pretty sure we
subsequently fixed that.
I don't immediately see a fix in the linux-2.6.git/log since 2.6.17-rc5
(within the past 3 days). I do see Mark Lord's patch.

-Paul
Mark Lord
2006-05-29 03:02:07 UTC
Post by Paul Dickson
Post by Mark Lord
We've just now put out a one-liner patch to libata that fixes
resume on my own Inspiron, and for other machines as well.
Does it fix the problem here too? (copy of patch is attached)
Yes. I compiled 2.6.17-rc5 without it and verified the problem occurs,
then applied the patch and tried it again. This time it worked.
I can suspend AND hibernate with the patch.
Good! That patch is in the latest 2.6.17-rc*-git* now.
Post by Paul Dickson
I still get the BUG message on resuming that I reported in bugzilla
...
Post by Paul Dickson
BUG: sleeping function called from invalid context at mm/slab.c:2794
in_atomic():0, irqs_disabled():1
<c01c971b> acpi_os_acquire_object+0xf/0x3c <c0149c48> kmem_cache_alloc+0x27/0x7f
<c01c971b> acpi_os_acquire_object+0xf/0x3c <c01df220> acpi_ut_allocate_object_desc_dbg+0xc/0x40
<c01df26d> acpi_ut_create_internal_object_dbg+0x19/0x70 <c01db3ef> acpi_rs_set_srs_method_data+0x40/0xc5
<c01e545d> acpi_pci_link_set+0x3e/0x16d <c0149c96> kmem_cache_alloc+0x75/0x7f
<c01e5515> acpi_pci_link_set+0xf6/0x16d <c01e55c0> irqrouter_resume+0x34/0x52
<c020eb77> __sysdev_resume+0x12/0x55 <c020ecd4> sysdev_resume+0x16/0x47
<c0213117> device_power_up+0x5/0xa <c01293db> suspend_enter+0x32/0x3a
<c0129504> enter_state+0x121/0x13e <c01295a2> state_store+0x81/0x94
<c0182fa9> sysfs_write_file+0xa3/0xc9 <c014d4c8> vfs_write+0xa2/0x136
<c014d9d2> sys_write+0x3b/0x64 <c0102ab3> syscall_call+0x7/0xb
Yup, pretty obvious bug in the acpi code.
Something probably needs to use GFP_ATOMIC there.
Pavel Machek
2006-05-29 15:12:16 UTC
Hi!
Post by Mark Lord
Post by Paul Dickson
I still get the BUG message on resuming that I reported
in bugzilla
...
Post by Paul Dickson
BUG: sleeping function called from invalid context at
mm/slab.c:2794
in_atomic():0, irqs_disabled():1
<c01c971b> acpi_os_acquire_object+0xf/0x3c <c0149c48>
kmem_cache_alloc+0x27/0x7f
<c01c971b> acpi_os_acquire_object+0xf/0x3c <c01df220>
acpi_ut_allocate_object_desc_dbg+0xc/0x40
<c01df26d>
acpi_ut_create_internal_object_dbg+0x19/0x70
<c01db3ef> acpi_rs_set_srs_method_data+0x40/0xc5
<c01e545d> acpi_pci_link_set+0x3e/0x16d <c0149c96>
kmem_cache_alloc+0x75/0x7f
<c01e5515> acpi_pci_link_set+0xf6/0x16d <c01e55c0>
irqrouter_resume+0x34/0x52
<c020eb77> __sysdev_resume+0x12/0x55 <c020ecd4>
sysdev_resume+0x16/0x47
<c0213117> device_power_up+0x5/0xa <c01293db>
suspend_enter+0x32/0x3a
<c0129504> enter_state+0x121/0x13e <c01295a2>
state_store+0x81/0x94
<c0182fa9> sysfs_write_file+0xa3/0xc9 <c014d4c8>
vfs_write+0xa2/0x136
<c014d9d2> sys_write+0x3b/0x64 <c0102ab3>
syscall_call+0x7/0xb
Yup, pretty obvious bug in the acpi code.
Something probably needs to use GFP_ATOMIC there.
Does it still happen in -rc5?
--
Thanks for all the (sleeping) penguins.
Paul Dickson
2006-05-31 02:38:24 UTC
Post by Pavel Machek
Hi!
Post by Mark Lord
Post by Paul Dickson
I still get the BUG message on resuming that I reported
in bugzilla
...
Post by Paul Dickson
BUG: sleeping function called from invalid context at
mm/slab.c:2794
in_atomic():0, irqs_disabled():1
Yup, pretty obvious bug in the acpi code.
Something probably needs to use GFP_ATOMIC there.
Does it still happen in -rc5?
Yes. That was the kernel that the dmesg came from.

-Paul
Mark Lord
2006-05-29 03:03:25 UTC
Paul, your email address (***@permanentmail.com) bounces.
Please fix it.

Thanks
Robert Hancock
2006-05-29 02:55:03 UTC
Post by Paul Dickson
Post by Andrew Morton
Post by Paul Dickson
I still get the BUG message on resuming that I reported in bugzilla
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=185108#c9
It is likely a separate bug.
That's ACPI doing a GFP_KERNEL allocation while resume has disabled
interrupts. It won't cause much trouble and I'm pretty sure we
subsequently fixed that.
I don't immediately see a fix in the linux-2.6.git/log since 2.6.17-rc5
(within the past 3 days). I do see Mark Lord's patch.
I think Fedora has been carrying a patch for that for some time, last I
checked they still were..
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from ***@nospamshaw.ca
Home Page: http://www.roberthancock.com/
l***@horizon.com
2006-05-29 22:56:32 UTC
(Cc: to the git list, since the people there undoubtedly know much better.)
Post by Paul Dickson
Is there a method of bisecting that means neither "good" nor "bad"? I
have run into kernel problems that are not related to the problem I'm
attempting to track. Some are not avoidable by changing the .config (see
the third bisect in comments 10 and 11 in the bugzilla report).
Yes. While you're bisecting, HEAD is a special "bisect" head used just
for that purpose. If you encounter a compile error or are otherwise
unable to test a version, you can "git reset --hard <commit>" to jump
to some other commit and test that instead. Because that command
unconditionally changes both the current head and the checked-out code,
it's normally somewhat dangerous, but while bisecting, there's no problem.
You can choose anything you like to test instead of git-bisect's suggested
version, but staying near the middle of the uncertain range is usually
a good idea. "HEAD^" (the parent of the current commit) is often a
simple choice. "git bisect visualize" might give you some ideas.

Note that if the problem actually is in the area of the untestable commit,
git bisect might drag you back there, but this lets you try to avoid it.
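
The advice above (when the suggested commit cannot be tested, pick a nearby one and carry on) can be modelled in a few lines of C. This is only a toy on a linear, numbered history and says nothing about how git-bisect itself is implemented; `toy_bisect`, `pick_candidate` and the example history in `toy_history` are all made up for illustration.

```c
#include <assert.h>

enum verdict { V_GOOD, V_BAD, V_UNTESTABLE };

typedef enum verdict (*test_fn)(int commit);

/* Find a testable commit strictly between good and bad, starting at the
 * midpoint and moving outward. Returns -1 if none can be tested. */
static int pick_candidate(int good, int bad, test_fn test, enum verdict *out)
{
	int mid = good + (bad - good) / 2;
	int step;

	for (step = 0; step < bad - good; step++) {
		int lo = mid - step;
		int hi = mid + step;

		if (lo > good && lo < bad) {
			*out = test(lo);
			if (*out != V_UNTESTABLE)
				return lo;
		}
		if (step > 0 && hi > good && hi < bad) {
			*out = test(hi);
			if (*out != V_UNTESTABLE)
				return hi;
		}
	}
	return -1;
}

/* Narrow the range on a linear history. On return *good_out is the last
 * commit known good and the return value is the first commit known bad;
 * any commits still between the two were untestable. */
int toy_bisect(int good, int bad, test_fn test, int *good_out)
{
	assert(good < bad);
	while (bad - good > 1) {
		enum verdict v = V_UNTESTABLE;
		int c = pick_candidate(good, bad, test, &v);

		if (c < 0)
			break;	/* everything in between is untestable */
		if (v == V_GOOD)
			good = c;
		else
			bad = c;
	}
	*good_out = good;
	return bad;
}

/* Hypothetical history: commits 0..15, the regression appears at commit 11,
 * and commits 7 and 8 do not build. */
enum verdict toy_history(int commit)
{
	if (commit == 7 || commit == 8)
		return V_UNTESTABLE;
	return commit >= 11 ? V_BAD : V_GOOD;
}
```

With toy_history, toy_bisect(0, 15, toy_history, &g) steps around the two unbuildable commits and ends with g == 10 and a return value of 11. If the regression had been inside the unbuildable stretch, the function would stop with that stretch still between the two end points, which matches the caveat above.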
Post by Paul Dickson
You can further cut down the number of trials if you know what part of
the tree is involved in the problem you are tracking down, by giving
$ git bisect start arch/i386 include/asm-i386
Linus Torvalds
2006-05-30 00:46:32 UTC
You can further cut down the number of trials if you know what part of
the tree is involved in the problem you are tracking down, by giving
$ git bisect start arch/i386 include/asm-i386
I'm not 100% sure this works - I think it has problems with the ending
condition because there always ends up being more commits in between when
the commit space isn't dense, so the "no commits left" thing doesn't
trigger. But "git bisect visualize" should hopefully help make it obvious

Linus