Ticket #2072 (assigned defect)

Opened 2 years ago

Last modified 2 years ago

Unexpected server shutdown

Reported by: paul.pech@gmx.de Assigned to: ruik (accepted)
Priority: major Milestone:
Component: hardware Version: 2.9.1
Keywords: tyan mainboard server shutdown Cc:

Description

Hi,

I'm using a Tyan Tomcat i875P (S5102) Mainboard, running SuSE Linux 10.0 with kernel 2.6.13-15.10-smp (geeko@buildhost) (gcc version 4.0.2 20050901). Attached is my sensors.conf file.

On two different occasions, the server simply shut down. Here are the server log files:

--- 2nd time ---

Jun  1 22:20:01 secserv /usr/sbin/cron[27086]: (root) CMD (/etc/health/healthd.sh)
Jun  1 22:20:01 secserv kernel:     ACPI-0463: *** Warning: Critical trip point
Jun  1 22:20:01 secserv kernel: Critical temperature reached (95 C), shutting down.
Jun  1 22:20:01 secserv kernel: klogd 1.4.1, ---------- state change ----------
Jun  1 22:20:01 secserv kernel:     ACPI-0212: *** Warning: Device is not power manageable
Jun  1 22:20:01 secserv kernel:     ACPI-0629: *** Warning: Unable to turn cooling device [dffd8a00] 'on'
Jun  1 22:20:02 secserv init: Switching to runlevel: 0
Jun  1 22:20:03 secserv snort: Final Flow Statistics
Jun  1 22:20:03 secserv snort: Snort exiting
Jun  1 22:20:03 secserv ntpd[5243]: ntpd exiting on signal 15
Jun  1 22:20:03 secserv sshd[4810]: Received signal 15; terminating.
Jun  1 22:20:05 secserv kernel: Kernel logging (proc) stopped.
Jun  1 22:20:05 secserv kernel: Kernel log daemon terminating.
Jun  1 22:20:06 secserv exiting on signal 15

---

--- 1st time ---

Apr 16 01:56:01 secserv /usr/sbin/cron[4541]: (root) CMD (/etc/health/healthd.sh)
Apr 16 01:56:01 secserv kernel:     ACPI-0463: *** Warning: Critical trip point
Apr 16 01:56:01 secserv kernel: Critical temperature reached (80 C), shutting down.
Apr 16 01:56:01 secserv kernel: klogd 1.4.1, ---------- state change ----------
Apr 16 01:56:01 secserv kernel:     ACPI-0212: *** Warning: Device is not power manageable
Apr 16 01:56:01 secserv kernel:     ACPI-0629: *** Warning: Unable to turn cooling device [dfdbea00] 'on'
Apr 16 01:56:02 secserv init: Switching to runlevel: 0
Apr 16 01:56:02 secserv snort: Final Flow Statistics
Apr 16 01:56:02 secserv snort: Snort exiting
Apr 16 01:56:03 secserv sshd[10074]: Received signal 15; terminating.
Apr 16 01:56:03 secserv ntpd[30004]: ntpd exiting on signal 15
Apr 16 01:56:04 secserv kernel: Kernel logging (proc) stopped.
Apr 16 01:56:04 secserv kernel: Kernel log daemon terminating.
Apr 16 01:56:05 secserv exiting on signal 15
---

The script "healthd.sh" runs the "sensors" command and, if an alarm is raised, notifies me by email. The CPU temperature is usually around 38°C, so it's rather unlikely that it really reached 80°C or 95°C (especially when the system is not under heavy load).
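For illustration, a check of that kind could look roughly like the Python sketch below. It is not the actual healthd.sh: the function names are made up, the email step is left out, and it assumes that "sensors" marks a tripped limit with the word ALARM on the affected line.

```python
import subprocess


def find_alarms(sensors_output):
    """Return the lines of `sensors` output that carry an ALARM flag."""
    return [line.strip() for line in sensors_output.splitlines() if "ALARM" in line]


def check_once():
    """Run `sensors` once and return any alarm lines (hypothetical helper)."""
    result = subprocess.run(["sensors"], capture_output=True, text=True, check=False)
    return find_alarms(result.stdout)
```

A cron job would call check_once() and send a mail whenever the returned list is not empty.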

Could this possibly be a bug with lm_sensors and the kernel ACPI functions, or is something wrong with my sensors.conf file?

Any help is greatly appreciated.

Yours,

Paul


   

Attachments

sensors.conf (2.3 kB) - added by paul.pech@gmx.de on 06/02/06 13:48:42.
Config file /etc/sensors.conf
dsdt.bin (15.4 kB) - added by paul.pech@gmx.de on 06/04/06 16:57:05.
/proc/acpi/dsdt
dsdt.dsl (131.3 kB) - added by paul.pech@gmx.de on 06/04/06 16:59:26.
iasl -d dsdt.bin --> dsdt.dsl

Change History

06/02/06 13:48:42 changed by paul.pech@gmx.de

  • attachment sensors.conf added.

Config file /etc/sensors.conf

06/02/06 19:53:38 changed by paul.pech@gmx.de

Hi,

I forgot to mention that the script "healthd.sh" runs as a cron job every 2 minutes. It has been running for about four months, and the above-mentioned shutdowns have "only" occurred twice since then.

Yours,

Paul

06/02/06 21:37:40 changed by ruik

  • owner changed from somebody to ruik.

Hello,

I think it is because ACPI and the lm-sensors driver are trying to access the chip at the same time. Please try disabling the thermal module in the kernel, or compile the kernel without "thermal support" in the ACPI section.

I hope it helps, Regards Rudolf

06/03/06 18:14:46 changed by paul.pech@gmx.de

  • owner changed from ruik to anonymous.
  • status changed from new to assigned.

Hi Rudolf,

thanks for the quick answer, I have removed kernel thermal support and have lm-sensors running again. Should the server shut down unexpectedly again, I'll drop another post here. Of course, it could take a while until luck strikes again :-)

Yours,

Paul

06/03/06 18:15:40 changed by anonymous

  • status changed from assigned to new.

06/04/06 14:31:17 changed by ruik

  • owner changed from anonymous to ruik.
  • status changed from new to assigned.

Hello please include the output of cat > /proc/acpi/dsdt > /tmp/dsdt.bin (no need to change kernel configuration)

06/04/06 14:43:22 changed by ruik

Well, of course I meant: cat /proc/acpi/dsdt > /tmp/dsdt.bin

Thanks, Rudolf

06/04/06 16:55:40 changed by paul.pech@gmx.de

Hi,

I have attached files dsdt.bin and the disassembled version of it (iasl -d dsdt.bin --> dsdt.dsl).

If I try to recompile it (iasl -tc -oa dsdt.dsl), I get the following output:

Intel ACPI Component Architecture
ASL Optimizing Compiler version 20050902 [Sep  9 2005]
Copyright (C) 2000 - 2005 Intel Corporation
Supports ACPI Specification Revision 3.0

dsdt.dsl   262:     Method (\_WAK, 1, NotSerialized)
Warning  2078 -                 ^ Reserved method must return a value (_WAK)

dsdt.dsl   288:             Store (Local0, Local0)
Error    1048 -                         ^ Method local variable is not initialized (Local0)

dsdt.dsl   293:             Store (Local0, Local0)
Error    1048 -                         ^ Method local variable is not initialized (Local0)

ASL Input:  dsdt.dsl - 4192 lines, 134486 bytes, 1731 keywords
Compilation complete. 2 Errors, 1 Warnings, 0 Remarks, 0 Optimizations

Frankly, I don't know what to do with that...

Yours,

Paul

06/04/06 16:57:05 changed by paul.pech@gmx.de

  • attachment dsdt.bin added.

/proc/acpi/dsdt

06/04/06 16:59:26 changed by paul.pech@gmx.de

  • attachment dsdt.dsl added.

iasl -d dsdt.bin --> dsdt.dsl

06/04/06 17:34:01 changed by ruik

Hello,

I know what to do with that. I think you get the shutdown because acpi thinks it is too hot. Watch:

            Method (_TMP, 0, NotSerialized)                                                         
            {                                                                                       
                And (SENF, 0x01, Local6)                                                            
                If (LEqual (Local6, 0x01))                                                          
                {                                                                                   
                    Return (RTMP ())                                                                
                }                                                                                   
                Else                                                                                
                {                                                                                   
                    Return (0x0B86)                                                                 
                }                                                                                   
            }        

This is the method that ACPI uses to read the CPU temperature.

As you can see, these readings are enabled by the magic SENF value. If they are enabled, the RTMP method is called; otherwise a fixed value (0x0B86) is returned.

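    Method (RTMP, 0, NotSerialized)
    {
        Store (GBYT (0x4E), Local0)          // save the current bank select register
        Or (Local0, 0x01, Local1)
        SBYT (0x4E, Local1)                  // switch to bank 1
        Store (GBYT (0x50), Local1)          // temperature, high byte
        Store (GBYT (0x51), Local2)          // temperature, low byte
        ShiftLeft (Local1, 0x08, Local1)
        Or (Local1, Local2, Local3)
        ShiftRight (Local3, 0x07, Local3)    // keep the upper 9 bits
        ShiftLeft (Local3, 0x02, Local1)
        Add (Local3, Local1, Local3)         // Local3 = Local3 * 5
        Add (Local3, 0x0AAC, Local3)         // add 2732
        If (LGreater (Local3, 0x0E8A))       // above 3722: fall back to 2732
        {
            Store (0x0AAC, Local3)
        }

        SBYT (0x4E, Local0)                  // restore the saved bank
        Return (Local3)
    }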
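
For anyone who does not read ASL, the arithmetic in RTMP can be sketched in Python. This is only an illustration: the function names are made up, and it assumes that registers 0x50/0x51 hold a 9-bit reading in 0.5 °C steps and that the result is in tenths of a kelvin, which is the unit ACPI specifies for _TMP.

```python
def rtmp_tenths_kelvin(high, low):
    """Mirror the arithmetic of RTMP for the bytes read from 0x50 and 0x51."""
    raw = ((high << 8) | low) >> 7   # upper 9 bits, assumed to be 0.5 degC steps
    value = raw + (raw << 2)         # raw * 5: half degrees -> tenths of a degree
    value += 0x0AAC                  # + 2732: offset to tenths of a kelvin
    if value > 0x0E8A:               # above 3722 (99.0 degC): treated as invalid
        value = 0x0AAC               # report 2732 (0 degC) instead
    return value


def tenths_kelvin_to_celsius(value):
    """Convert an ACPI-style temperature (tenths of a kelvin) to degC."""
    return (value - 2732) / 10.0


print(tenths_kelvin_to_celsius(rtmp_tenths_kelvin(0x26, 0x00)))  # 38.0
print(tenths_kelvin_to_celsius(rtmp_tenths_kelvin(0x5F, 0x00)))  # 95.0
print(tenths_kelvin_to_celsius(0x0B86))                          # 21.8
```

Under these assumptions, a high byte of 0x26 corresponds to the usual 38 °C, while a high byte of 0x5F would be reported as 95 °C. The fixed 0x0B86 that _TMP returns when readings are disabled works out to about 21.8 °C.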

The temperature is in bank 1, address 0x50. Now, what has happened:

1) the ACPI bytecode switches to bank 1
2) the lm-sensors driver switches to bank 0
3) ACPI reads some other register, which holds something other than the temperature value ...
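
The interleaving described above can be reproduced with a toy model in Python. Everything in it is hypothetical: FakeChip is not a real driver interface, and the register contents are invented so that bank 1 holds a 38 °C reading while bank 0 holds unrelated data at the same addresses. It only shows why an unsynchronized bank switch between the ACPI steps produces a bogus temperature.

```python
class FakeChip:
    """Toy sensor chip with a bank select register at 0x4E (hypothetical)."""

    BANK_REG = 0x4E

    def __init__(self):
        self.bank = 0
        # Invented register contents: (bank, address) -> byte
        self.regs = {
            (1, 0x50): 0x26, (1, 0x51): 0x00,  # bank 1: temperature, 38.0 degC
            (0, 0x50): 0x5F, (0, 0x51): 0x00,  # bank 0: unrelated data
        }

    def read(self, addr):
        if addr == self.BANK_REG:
            return self.bank
        return self.regs.get((self.bank, addr), 0)

    def write(self, addr, value):
        if addr == self.BANK_REG:
            self.bank = value & 0x07


def acpi_read_temp(chip, interrupt=None):
    """Follow the RTMP access pattern; `interrupt` runs between the steps."""
    saved = chip.read(0x4E)
    chip.write(0x4E, saved | 0x01)   # step 1: ACPI selects bank 1
    if interrupt is not None:
        interrupt(chip)              # step 2: someone else touches the chip
    high = chip.read(0x50)           # step 3: ACPI reads "the temperature"
    low = chip.read(0x51)
    chip.write(0x4E, saved)          # restore the saved bank
    raw = ((high << 8) | low) >> 7   # 9-bit value, assumed 0.5 degC steps
    return raw / 2.0


def driver_selects_bank0(chip):
    """Stand-in for the lm-sensors driver switching back to bank 0."""
    chip.write(0x4E, 0)


print(acpi_read_temp(FakeChip()))                        # 38.0
print(acpi_read_temp(FakeChip(), driver_selects_bank0))  # 95.0
```

Without the interruption the model reads 38 °C. With the driver switching banks in between, ACPI reads whatever bank 0 holds at 0x50/0x51 and interprets it as a temperature.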

Due to the lame design of ACPI, this will be difficult to fix. Tyan and the Windows drivers solve it via that magic OFF value...

I will try to come up with a solution and ask Tyan about it, but it will take time.

Without the thermal module loaded, I think you should not have any more problems.

Regards,

Rudolf

07/02/06 20:48:09 changed by anonymous

  • priority changed from major to blocker.
  • type changed from defect to task.
  • version changed from 2.9.1 to 2.9.2.
  • component changed from hardware to interface.

07/03/06 12:28:06 changed by anonymous

  • priority changed from blocker to minor.
  • type changed from task to enhancement.
  • version changed from 2.9.2 to older than 2.9.0.
  • component changed from interface to unsorted.