Ticket #2354 (closed defect: invalid)

Opened 6 years ago

Last modified 6 years ago

sensors brought down entire system

Reported by: ticket Owned by: khali
Priority: major Milestone:
Component: sensors Version: 3.0.2
Keywords: Cc: coldfusionpc@…

Description

I foolishly tried to add tempN_max and tempN_min limits to k8temp; apparently this is not supported. Running sensors -s to set them produced error messages about an unsupported feature, and then I immediately lost all connection to the machine. It no longer responds to SSH, HTTP, ping, or anything else, so it has either powered off or hung. I won't know which until I can get to the site. It's a Sun Ultra 20 M2.

Change History

Changed 6 years ago by khali

  • owner changed from somebody to khali
  • status changed from new to assigned

Which version of lm-sensors are you running? Which kernel version?

I confirm that the k8temp driver doesn't support temperature limits, simply because the hardware doesn't have them. I would like to make sure I understand what you did: you simply added "set" statements to /etc/sensors.conf and then ran "sensors -s"? You didn't modify the k8temp driver code, did you?

Changed 6 years ago by ticket

OK, here is everything I did. I added lm87, which worked nicely for a while, then I tweaked some of its settings. Then I added "set" statements for k8temp, ran sensors -s, got a bunch of errors and was returned to the prompt. Just then the machine abruptly powered off. While I was trying to undo what I had done last night, the machine powered itself off every two minutes after startup. I did an interactive startup and left every service stopped except networking. I removed and reinstalled lm_sensors and apcupsd (a dependency), and this fixed the issue. I'm still not clear about why the machine was shutting off.

Kernel 2.6.26.5-45.fc9.x86_64

lm_sensors-3.0.2-1.fc9.x86_64

Changed 6 years ago by khali

  • version set to 3.0.2

It is very unlikely that your experiments with the k8temp configuration settings have anything to do with the power-off. When trying to set values which do not exist for a given chip, libsensors will report an error without even accessing the hardware monitoring chip in question.

If your system also has an LM87 or compatible chip and you changed the parameters of this chip, then this is more likely to be the problem (even though it is of course not supposed to happen.) Did you keep a copy of the modified configuration file which supposedly triggered the power-off?

Can you please attach the output of the "sensors" command when things work?
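The point above, that a "set" statement naming a feature the chip does not have is rejected before the hardware is touched, can be modelled in a few lines. This is a hypothetical Python sketch, not libsensors' code (libsensors is a C library); the `WRITABLE_FEATURES` table and the `apply_set` and `write` names are invented for illustration.

```python
class UnsupportedFeature(Exception):
    """Raised when a 'set' statement names a feature the chip lacks."""


# Hypothetical table of *writable* features per chip driver. k8temp has no
# limit registers, so nothing is writable; lm87 does have fan minimums.
WRITABLE_FEATURES = {
    "k8temp": set(),
    "lm87": {"fan1_min", "fan2_min"},
}


def apply_set(chip, feature, value, write):
    """Apply one 'set' statement, validating the name before any I/O.

    `write` stands in for the code that would actually access the chip.
    """
    if feature not in WRITABLE_FEATURES.get(chip, set()):
        # Rejected up front: `write` is never called, so no register is accessed.
        raise UnsupportedFeature(f"{chip}: cannot set {feature!r}")
    write(feature, value)


if __name__ == "__main__":
    writes = []

    def record(feature, value):
        writes.append((feature, value))

    try:
        apply_set("k8temp", "temp1_max", 70.0, record)
    except UnsupportedFeature as err:
        print("error:", err)
    print("writes performed:", writes)  # writes performed: []
```

The design choice being illustrated is simply ordering: validation against the chip's known features happens first, so an invalid configuration line can only produce an error message.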

Changed 6 years ago by ticket

Unfortunately I do not have a copy. The only things I changed for LM87 were to ignore some temperatures and fans and to set the fan0 and fan1 minimums. HOWEVER, you should know that although LM87 worked on my system, according to sensors-detect I have an ADT7476. I had posted to the mailing list from a different email address, asking to be added to the list of requests for ADT7476 (on the devices page), but got no response, so I tried LM87 on the suggestion of someone else with a nearly identical Tyan Tomcat board.

Changed 6 years ago by khali

  • status changed from assigned to closed
  • resolution set to invalid

I see. I don't know who recommended that you use the lm87 driver for the ADT7476 chip but this was a silly suggestion. Different chips have different register maps and using a driver with a chip it wasn't written for can lead to severe problems, as you just found out. Forcing the lm87 driver to attach to the ADT7476, as you presumably did using a "force" module parameter, was asking for trouble. Just don't do that.

Changed 6 years ago by ticket

I didn't force it; I just added MODULE_2=lm87 and it gave me good data.

If you could, please add my name (Matt Castelein) to the list of requests for ADT7476.

Changed 6 years ago by khali

  • status changed from closed to reopened
  • resolution invalid deleted

Please attach the output of "sensors" as requested.

Please also attach a dump of your chip. You'll have to install the i2c-tools package first. Then:

modprobe i2c-dev
rmmod lm87
i2cdump 0 0x2e b

Replace "0" and "0x2e" in the last command by the actual I2C bus number and device address as reported by "sensors".
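Mapping a chip name from the "sensors" output to the arguments of the last command can be sketched as below. This assumes the usual lm-sensors 3 naming for I2C chips, `<prefix>-i2c-<bus>-<hex address>` (for example a hypothetical `lm87-i2c-1-2e`); the `i2cdump_args` helper is invented for illustration.

```python
import re
from typing import List

# Assumed naming scheme for I2C chips in "sensors" output:
# <prefix>-i2c-<bus number>-<7-bit address in hex>, e.g. "lm87-i2c-1-2e".
I2C_CHIP_NAME = re.compile(r"^(?P<prefix>\w+)-i2c-(?P<bus>\d+)-(?P<addr>[0-9a-fA-F]+)$")


def i2cdump_args(chip_name: str) -> List[str]:
    """Build the i2cdump command line (byte mode) for an I2C chip name."""
    match = I2C_CHIP_NAME.match(chip_name)
    if match is None:
        # e.g. "k8temp-pci-00c3": a PCI device has no I2C bus/address to dump.
        raise ValueError(f"not an I2C chip name: {chip_name!r}")
    bus = int(match.group("bus"))
    addr = int(match.group("addr"), 16)
    return ["i2cdump", str(bus), f"0x{addr:02x}", "b"]


if __name__ == "__main__":
    print(" ".join(i2cdump_args("lm87-i2c-1-2e")))  # i2cdump 1 0x2e b
```

A PCI-attached sensor such as k8temp has no I2C address, so the helper rejects its name instead of guessing.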

Changed 6 years ago by ticket

[root@arturo ~]# sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:  +31.0°C
Core1 Temp:  +24.0°C

[root@arturo ~]# modprobe i2c-dev
[root@arturo ~]# rmmod lm87
ERROR: Module lm87 does not exist in /proc/modules
[root@arturo ~]# i2cdump 1 0x2e b
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-1, address 0x2e, mode byte
Continue? [Y/n]

Changed 6 years ago by ticket

[root@arturo ~]# i2cdump 1 0x2e b
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-1, address 0x2e, mode byte
Continue? [Y/n]
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f    0123456789abcdef
00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
20: be 5e c4 c5 c4 09 28 20 77 0f 99 15 ff ff 00 00    ?^????( w???....
30: 3b 13 13 00 00 00 00 00 ff ff ff 00 00 76 41 69    ;??..........vAi
40: 05 00 00 52 00 ff 00 ff 00 ff 00 ff 00 ff 81 4c    ?..R..........?L
50: 81 7f 81 7f 30 2a ff ff ff ff ff ff 02 42 42 8f    ????0*......?BB?
60: 5f 5f e0 00 3b 0d 0d 39 20 20 46 27 27 44 40 00    __?.;??9  F''D@.
70: ed 00 00 00 6f fb 5f 34 00 00 00 55 03 06 00 02    ?...o?_4...U??.?
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
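The identification bytes can be pulled out of a dump like the one above with a small parser. In this dump, register 0x3D reads 0x76, 0x3E reads 0x41 and 0x3F reads 0x69. Assuming the ADT747x register layout used by the Linux adt7475 driver (0x3D device ID, 0x3E vendor ID with 0x41 meaning Analog Devices, 0x3F revision), those bytes identify an ADT7476. The `parse_i2cdump` and `id_registers` helpers are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_i2cdump(text: str) -> List[Optional[int]]:
    """Return the 256 register values from `i2cdump <bus> <addr> b` output.

    Lines that are not dump rows (prompts, warnings, the column header) are
    skipped. Unreadable cells (printed as "XX") and missing rows stay None.
    """
    regs: List[Optional[int]] = [None] * 256
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        head = head.strip()
        if not sep or len(head) != 2:
            continue
        try:
            base = int(head, 16)
        except ValueError:
            continue
        if base % 16:
            continue  # row labels are always multiples of 0x10
        # Only the first 16 fields are register values; the rest is ASCII.
        for offset, cell in enumerate(rest.split()[:16]):
            try:
                regs[base + offset] = int(cell, 16)
            except ValueError:
                regs[base + offset] = None
    return regs


def id_registers(regs: List[Optional[int]]) -> Tuple[Optional[int], ...]:
    # Assumed ADT747x layout: 0x3D device ID, 0x3E vendor ID, 0x3F revision.
    return regs[0x3D], regs[0x3E], regs[0x3F]


if __name__ == "__main__":
    sample = "30: 3b 13 13 00 00 00 00 00 ff ff ff 00 00 76 41 69    ;??..........vAi"
    device, vendor, revision = id_registers(parse_i2cdump(sample))
    print(f"device=0x{device:02x} vendor=0x{vendor:02x} revision=0x{revision:02x}")
    # device=0x76 vendor=0x41 revision=0x69
```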

Changed 6 years ago by khali

I just tested the dump you provided with the lm87 driver from kernel 2.6.26.6 and as expected, the lm87 driver didn't attach to the emulated chip. So, I can't believe that it did for you without a force parameter. Check the output of

modprobe -c | grep lm87

The output of "sensors" you provided doesn't include any output from the lm87 driver either.
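Why a driver normally refuses to attach to the wrong chip, and why a force parameter defeats that protection, can be shown with a simplified probe. This is a hypothetical sketch: the table assumes National Semiconductor's company ID (0x02) in register 0x3E for the LM87 and the ADT7476 IDs from the dump in this ticket; the real drivers check more than this.

```python
from typing import Mapping

# Hypothetical, simplified identification table: register -> expected value.
# Assumed values: company ID 0x02 in register 0x3E for the LM87, and vendor
# ID 0x41 plus device ID 0x76 for the ADT7476. Real drivers check more (the
# lm87 driver, for example, also supports the ADM1024).
EXPECTED_IDS = {
    "lm87": {0x3E: 0x02},
    "adt7476": {0x3E: 0x41, 0x3D: 0x76},
}


def probe(driver: str, regs: Mapping[int, int], force: bool = False) -> bool:
    """Decide whether `driver` should bind to the chip whose registers are `regs`."""
    if force:
        # Forcing skips identification: the driver binds anyway and then reads
        # and writes registers according to its own, possibly wrong, map.
        return True
    return all(regs.get(reg) == value for reg, value in EXPECTED_IDS[driver].items())


if __name__ == "__main__":
    # ID bytes taken from the dump in this ticket (registers 0x3D-0x3F).
    adt7476 = {0x3D: 0x76, 0x3E: 0x41, 0x3F: 0x69}
    print(probe("lm87", adt7476))              # False
    print(probe("adt7476", adt7476))           # True
    print(probe("lm87", adt7476, force=True))  # True
```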

Changed 6 years ago by ticket

It's not loaded anymore and I'm not going to load it if that's what caused the system to shut down. It's 10 miles away from me right now so I would not be able to bring it back up. Is there anything else I can do for you, or any other data you need?

You're right, though. I went back through the bash history from when I copied the commands from a forum post: it wasn't lm87, it was lm85. This was the command:

modprobe lm85 force_adm1027=0,0x2e

Sorry about the confusion. I'm still getting my bearings with this stuff.
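In the i2c driver model of that era, a `force_<chip>` module parameter such as `force_adm1027=0,0x2e` is a list of adapter,address pairs that the driver is told to treat as that chip without checking. A sketch of how such a value decomposes, assuming that pair format; `parse_force` is a hypothetical helper, not part of any kernel or i2c-tools interface.

```python
from typing import List, Tuple


def parse_force(value: str) -> List[Tuple[int, int]]:
    """Split a force parameter such as "0,0x2e" into (bus, address) pairs."""
    # int(token, 0) accepts both decimal ("0") and hex ("0x2e") notation.
    fields = [int(token, 0) for token in value.split(",") if token.strip()]
    if len(fields) % 2:
        raise ValueError("force parameters come in bus,address pairs")
    return list(zip(fields[0::2], fields[1::2]))


if __name__ == "__main__":
    for bus, addr in parse_force("0,0x2e"):
        print(f"bus {bus}, address 0x{addr:02x}")  # bus 0, address 0x2e
```

Read this way, the command above told the lm85 driver to treat whatever answers at bus 0, address 0x2e as an ADM1027, with no identification check.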

Changed 6 years ago by khali

  • status changed from reopened to closed
  • resolution set to invalid

Oh well... Next time you need help on something, I suggest that you provide correct information right from the beginning, this will save everyone's time. Here it took 10 comments before getting the exact description of what you did that caused the problem. Which, clearly, explains what happened, because you shouldn't have done it.

I can add you to the list of people interested in a driver for the ADT7476 if you tell me your name.

Changed 6 years ago by ticket

Matt Castelein. I gave it to you earlier.
