Author |
Message |
finotti
|
|
Post subject: [Solved] Hardware Error (kernel message)
Posted: 12.04.2017, 19:13
|
|
Joined: 2010-09-12
Posts: 493
Status: Offline
|
|
While converting some videos from x265 to x264, I've got a lot of these error messages:
Code:
[snip]
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.359854] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 00000000880003c3
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.359855] mce: [Hardware Error]: TSC 7700ee7e7ae1a
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.359857] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 5 microcode 7
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.359859] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 00000000880003c3
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.359859] mce: [Hardware Error]: TSC 7700ee7e8550d
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.359861] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 4 microcode 7
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.362514] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 0000000088020282
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.362516] mce: [Hardware Error]: TSC 7700ee81cf106
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.362519] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 5 microcode 7
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.362520] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 0000000088020282
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.362521] mce: [Hardware Error]: TSC 7700ee81d0f2f
Message from syslogd@debian at Apr 12 14:49:09 ...
kernel:[580833.362523] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 4 microcode 7
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.473534] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 00000000880003c3
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.473536] mce: [Hardware Error]: TSC 7710ac08d48a4
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.473538] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 4 microcode 7
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.473539] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 00000000880003c3
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.473540] mce: [Hardware Error]: TSC 7710ac08df8b9
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.473541] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 5 microcode 7
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.474507] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 0000000088020282
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.474509] mce: [Hardware Error]: TSC 7710ac0c3179f
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.474511] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 5 microcode 7
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.474512] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 0000000088020282
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.474513] mce: [Hardware Error]: TSC 7710ac0c32b1c
Message from syslogd@debian at Apr 12 14:54:09 ...
kernel:[581133.474514] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 4 microcode 7
The command was this:
Code:
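#!/bin/bash
# Re-encode every .mkv file with "265" in its name using libx264.
for i in *265*.mkv; do
    # Build the output name by replacing the first "265" with "264".
    nn=${i/265/264}
    ffmpeg -i "$i" -bsf:v h264_mp4toannexb -vcodec libx264 "$nn"
    sleep 3
done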
and here is the info on the system:
Code:
$ inxi -v3
System: Host: debian Kernel: 4.10.0-7.slh.1-aptosid-amd64 x86_64 (64 bit gcc: 6.3.0) Console: tty 20
Distro: aptosid 2013-01 Ἑσπερίδες - kde-full - (201305050307)
Machine: Device: desktop System: ASUS product: All Series
Mobo: ASUSTeK model: Z87-PRO v: Rev 1.xx UEFI: American Megatrends v: 1707 date: 12/13/2013
CPU: Quad core Intel Core i7-4771 (-HT-MCP-) cache: 8192 KB
flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 28833
clock speeds: max: 3900 MHz 1: 3652 MHz 2: 3770 MHz 3: 3730 MHz 4: 3669 MHz 5: 3884 MHz 6: 3878 MHz
7: 3778 MHz 8: 3679 MHz
Graphics: Card: Intel Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller bus-ID: 00:02.0
Display Server: X.org 1.19.3 driver: intel tty size: 133x74 Advanced Data: N/A out of X
Network: Card: Intel Ethernet Connection I217-V driver: e1000e v: 3.2.6-k port: f080 bus-ID: 00:19.0
IF: eth0 state: up speed: 100 Mbps duplex: full mac: e0:3f:49:a3:4c:a6
Drives: HDD Total Size: 9257.9GB (52.2% used)
ID-1: model: WDC_WD10EALS
ID-2: model: WDC_WD40EZRX
ID-3: model: WDC_WD20EARX
ID-4: model: Samsung_SSD_840
ID-5: model: WDC_WD20EARX
Info: Processes: 381 Uptime: 6 days Memory: 12376.0/15736.9MB Init: systemd runlevel: 5 Gcc sys: 6.3.0
Client: Shell (bash 4.4.111) inxi: 2.3.5
(I haven't been able to check if the conversion worked well.)
Does this represent just an error/problem with ffmpeg or is it a real hardware problem?
Thanks and best to all,
Luis |
Last edited by finotti on 19.04.2017, 16:16; edited 1 time in total
|
|
|
|
|
slh
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 12.04.2017, 20:40
|
|
Joined: 2010-08-25
Posts: 962
Status: Offline
|
|
In most cases MCE errors are real. Kernel bugs may be involved as well; that's not very likely, but possible (so if you still have an older kernel such as 4.9 or earlier available, give it a try). Some hardware bugs can be fixed (or rather plastered over) by the CPU/mainboard manufacturer via microcode updates. These are usually delivered by the mainboard vendor through BIOS/UEFI updates, but since BIOS/UEFI updates constantly lag behind, installing intel-microcode from Debian's non-free section usually provides the most current microcode fixes.
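To see whether a newer microcode revision is actually loaded after installing intel-microcode and rebooting, something like the following might help. It is a minimal sketch that assumes /proc/cpuinfo exposes a "microcode" field (as it does on x86 Linux); the function name microcode_revisions is made up for illustration.

```shell
#!/bin/bash
# Print the distinct microcode revisions found in cpuinfo-style text on stdin.
# Assumption: lines look like "microcode<TAB>: 0x7".
microcode_revisions() {
    awk -F: '/^microcode/ { gsub(/[ \t]/, "", $2); print $2 }' | sort -u
}

# Typical use on a running system; skipped if the file is not readable.
if [ -r /proc/cpuinfo ]; then
    microcode_revisions < /proc/cpuinfo
fi
```

Comparing the value before and after the update (the kernel messages in the first post report "microcode 7") shows whether anything changed.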
Keep in mind that transcoding videos is quite hardware intensive and ffmpeg is very aggressively multi-threaded, so it's easily one of the most demanding tasks you can give a CPU in everyday use. On top of that, it relies heavily on SIMD instructions (MMX, SSE2/3/...), which are rarely used by anything but 3D games or media players/encoders. That makes it more likely to expose hardware faults than 'normal' software, which keeps the CPU idle 98% of the time.
Before going shopping, however, there are two things you could check first:
- Is the cooling working? Check the CPU fan and the core temperatures (install lm-sensors and run sensors). Normally Intel CPUs throttle before overheating, which protects them from damage due to insufficient cooling, but it's easy to check.
- Re-seat the RAM modules; connectivity issues (dust) may be involved. Be careful about static electricity, though (touch a radiator or water pipe first, don't walk around on carpet, etc.).
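The temperature check in the first point could be scripted roughly as follows. This sketch assumes the sensors output contains lines of the form "Core N: +TT.T°C ...", as the coretemp driver typically produces; the function name max_core_temp is hypothetical.

```shell
#!/bin/bash
# Print the highest "Core N" temperature found in sensors-style text on stdin.
# Prints nothing if no matching line is present.
max_core_temp() {
    awk '/^Core [0-9]+:/ {
        t = $3                  # third field, e.g. "+61.0°C"
        gsub(/[^0-9.]/, "", t)  # drop the sign and the unit
        if (!found || t + 0 > max) max = t + 0
        found = 1
    } END { if (found) print max }'
}

# Typical use; skipped when lm-sensors is not installed.
if command -v sensors >/dev/null 2>&1; then
    sensors | max_core_temp
fi
```

Running it every few seconds while the conversion is active gives a quick picture of how close the CPU gets to its limits.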
Re-seating the CPU wouldn't be totally insane either, but modern CPU sockets are rather fragile and you'd need to replace the thermal grease afterwards, so I really don't recommend this unless you're very confident in your abilities or out of options. In general, cleaning (compressed air; be careful not to let the fans spin, as they can generate electricity and back-power the chips) and checking cables/connectors can help with quite a few strange issues.
Try the software and less intrusive debugging approaches first, of course (intel-microcode, BIOS update). |
|
|
|
|
|
finotti
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 14.04.2017, 15:22
|
|
Joined: 2010-09-12
Posts: 493
Status: Offline
|
|
Thanks, slh, for the outstanding support, as usual.
I will test it all next week, as I am out of town right now. |
|
|
|
|
|
alexk
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 18.04.2017, 18:00
|
|
Joined: 2010-10-01
Posts: 288
Status: Offline
|
|
I had similar errors last month on an Intel i5-2500K while running graphics applications, and found that at least one CPU core was running at 90+ °C. I cleaned out the dust, checked the RAM seating and re-seated the CPU with new thermal grease. I noticed my heatsink might not have been attached entirely properly, but temperatures were still quite high and I ended up getting some errors again. I then installed thermald; temps went down significantly and there have been no more hardware errors so far. |
|
|
|
|
|
slh
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 18.04.2017, 18:19
|
|
Joined: 2010-08-25
Posts: 962
Status: Offline
|
|
While thermald is recommended on recent Intel CPUs (Sandy Bridge and newer), MCEs shouldn't occur without it either. |
|
|
|
|
|
alexk
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 18.04.2017, 20:47
|
|
Joined: 2010-10-01
Posts: 288
Status: Offline
|
|
Yes, that worried me. I'll check again without it. I've found the 'stress' tool useful for testing. I seemed to have excessively high temperatures for months and hadn't gotten around to tackling them, but the MCEs were new and what finally prompted me to take action. I'll check for microcode and BIOS updates. |
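A load test of this kind could be wrapped in a small helper that samples something (for example the core temperatures) while the load runs. This is only a sketch: the function sample_during_load and its arguments are illustrative, and the stress invocation in the comment assumes its --cpu and --timeout options.

```shell
#!/bin/bash
# Run a load command in the background and call a sampler once per interval
# until the load exits.
#   $1    seconds between samples
#   $2    name of a command or function that prints one sample
#   $3..  the load command and its arguments
sample_during_load() {
    local interval=$1 sampler=$2
    shift 2
    "$@" >/dev/null 2>&1 &      # start the load generator
    local pid=$!
    while kill -0 "$pid" 2>/dev/null; do
        "$sampler"
        sleep "$interval"
    done
    wait "$pid"                 # return the load command's exit status
}

# Example (not run here): 8 CPU workers for 60 seconds, sampled every 5 seconds.
# print_temps() { sensors | grep '^Core'; }
# sample_during_load 5 print_temps stress --cpu 8 --timeout 60
```

Keeping the sampler separate from the load command means the same helper works with ffmpeg or any other workload.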
|
|
|
|
|
slh
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 18.04.2017, 22:41
|
|
Joined: 2010-08-25
Posts: 962
Status: Offline
|
|
By the way, installing and running mcelog usually provides relatively human-readable output about the kind of problem the CPU thinks it has. |
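For a rough idea of what such decoding involves, the status words from the first post can be picked apart by hand. This is only a sketch under an assumption that is not stated in the thread: that the kernel reports thermal throttling events under bank 128 and stores the CPU's thermal status register (IA32_THERM_STATUS) in the status field. The function name and the choice of fields are illustrative.

```shell
#!/bin/bash
# Split a hexadecimal status word into a few thermal status fields.
# Assumed layout (IA32_THERM_STATUS): bit 0 = currently throttling,
# bit 1 = throttling has occurred (sticky log), bits 22:16 = degrees C
# below the maximum junction temperature, bit 31 = reading valid.
decode_therm_status() {
    local status=$((16#$1))
    local throttling=$(( status & 1 ))
    local log=$(( (status >> 1) & 1 ))
    local below_tjmax=$(( (status >> 16) & 0x7f ))
    local valid=$(( (status >> 31) & 1 ))
    echo "throttling=$throttling log=$log below_tjmax=$below_tjmax valid=$valid"
}

decode_therm_status 00000000880003c3
decode_therm_status 0000000088020282
```

Under that assumption, 880003c3 would read as "throttling active, 0 °C below the limit" and 88020282 as "throttling ended, 2 °C below the limit", which would point to a CPU bouncing off its thermal limit rather than to a cache or memory fault.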
|
|
|
|
|
finotti
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 19.04.2017, 16:16
|
|
Joined: 2010-09-12
Posts: 493
Status: Offline
|
|
OK, I finally found some time to look into the issue.
I did not have the time to clean it all up (it's on my to-do list), so I just fiddled with software.
First, I installed intel-microcode, lm-sensors, xsensors and mcelog. I ran the ffmpeg conversion again (on the same 4.10 kernel); the temperature was hitting 100 °C, and then I'd get a message (like the ones above) and the core would be throttled (or so the message said). The temperature would not go above 100 °C (for as long as I waited), but I would get a message every once in a while. So I killed the process.
I then booted into the 4.9 kernel. It seemed to me I was getting similar temperatures, but I did not get any messages.
I then updated the BIOS, and on kernel 4.10 the temperatures did not seem to go above 87 °C. So I guess there was something in the update. I will mark this as solved.
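Whether throttling really happened can also be cross-checked independently of the console messages. The sketch below assumes the kernel exposes per-CPU counters at /sys/devices/system/cpu/cpuN/thermal_throttle/core_throttle_count, which is common on Intel systems but not guaranteed; the function name throttle_counts is made up.

```shell
#!/bin/bash
# Print "cpuN <count>" for every CPU that exposes a core throttle counter.
# The base directory is a parameter so the function can be pointed at a copy.
throttle_counts() {
    local base=${1:-/sys/devices/system/cpu}
    local f cpu
    for f in "$base"/cpu[0-9]*/thermal_throttle/core_throttle_count; do
        [ -r "$f" ] || continue
        cpu=${f%/thermal_throttle/*}    # strip the trailing path components
        cpu=${cpu##*/}                  # keep only "cpuN"
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}

throttle_counts
```

Comparing the counters before and after a conversion run on each kernel would show whether 4.9 throttled as well and merely did not report it.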
P.S.: Thanks, again, slh for the support and very informative answers! |
|
|
|
|
|
slh
|
|
Post subject: RE: Hardware Error (kernel message)
Posted: 19.04.2017, 19:01
|
|
Joined: 2010-08-25
Posts: 962
Status: Offline
|
|
The temperatures really shouldn't go that high on a desktop system (notebooks typically run hotter, but even there it should remain at least 20 °C cooler). Anything significantly beyond 45-55 °C, even under full load, is an indication of cooling problems. Either the fan doesn't spin anymore, dust has clogged the cooling fins or (more likely) the cooler doesn't sit exactly flat on the CPU (one of the push rods loose?). |
|
|
|
|
|
finotti
|
|
Post subject: Re: RE: Hardware Error (kernel message)
Posted: 19.04.2017, 19:57
|
|
Joined: 2010-09-12
Posts: 493
Status: Offline
|
|
slh wrote:
The temperatures really shouldn't go that high on a desktop system (notebooks typically run hotter, but even there it should remain at least 20 °C cooler). Anything significantly beyond 45-55 °C, even under full load, is an indication of cooling problems. Either the fan doesn't spin anymore, dust has clogged the cooling fins or (more likely) the cooler doesn't sit exactly flat on the CPU (one of the push rods loose?).
OK, I wasn't aware of that. I will definitely check it then. Thanks again for the help! |
|
|
|
|
|
slh
|
|
Post subject: RE: Re: RE: Hardware Error (kernel message)
Posted: 19.04.2017, 22:21
|
|
Joined: 2010-08-25
Posts: 962
Status: Offline
|
|
For comparison, an Ivy Bridge system under full load (yes, the temperatures are slightly higher than expected; the fan apparently needs some cleaning).
Code:
$ sensors | grep ^Core
Core 0: +61.0°C (high = +85.0°C, crit = +105.0°C)
Core 1: +62.0°C (high = +85.0°C, crit = +105.0°C)
Core 2: +59.0°C (high = +85.0°C, crit = +105.0°C)
Core 3: +59.0°C (high = +85.0°C, crit = +105.0°C)
(this is with an aftermarket cooler, which isn't as noisy as the stock intel fan but has the same temperature behaviour) |
|
|
|
|
|
finotti
|
|
Post subject: Re: RE: Re: RE: Hardware Error (kernel message)
Posted: 24.04.2017, 15:38
|
|
Joined: 2010-09-12
Posts: 493
Status: Offline
|
|
slh wrote:
For comparison, an Ivy Bridge system under full load (yes, the temperatures are slightly higher than expected; the fan apparently needs some cleaning).
Code:
$ sensors | grep ^Core
Core 0: +61.0°C (high = +85.0°C, crit = +105.0°C)
Core 1: +62.0°C (high = +85.0°C, crit = +105.0°C)
Core 2: +59.0°C (high = +85.0°C, crit = +105.0°C)
Core 3: +59.0°C (high = +85.0°C, crit = +105.0°C)
(this is with an aftermarket cooler, which isn't as noisy as the stock intel fan but has the same temperature behaviour)
OK, over the weekend I cleaned the dust out of the case and fans. The push rods of the CPU fan were tight. It did improve things, as the same job would not go past 80 °C, but that is still too high. I guess a new CPU cooling fan might be necessary...
Does anyone have a recommendation for a good replacement? (A good balance between quiet, price, performance and easy installation...) The system is running an Intel Core i7-4771 CPU (socket LGA1150). No overclocking or gaming (except Minecraft for my son). The options are overwhelming me right now... |