finotti
Post subject: [Solved] Hardware Error (kernel message)  Posted: 12.04.2017, 19:13



Joined: 2010-09-12
Posts: 493

Status: Offline
While converting some videos from x265 to x264, I got a lot of these error messages:

      Code:

[snip]
Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.359854] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 00000000880003c3

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.359855] mce: [Hardware Error]: TSC 7700ee7e7ae1a

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.359857] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 5 microcode 7

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.359859] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 00000000880003c3

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.359859] mce: [Hardware Error]: TSC 7700ee7e8550d

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.359861] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 4 microcode 7

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.362514] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 0000000088020282

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.362516] mce: [Hardware Error]: TSC 7700ee81cf106

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.362519] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 5 microcode 7

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.362520] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 0000000088020282

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.362521] mce: [Hardware Error]: TSC 7700ee81d0f2f

Message from syslogd@debian at Apr 12 14:49:09 ...
 kernel:[580833.362523] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492022949 SOCKET 0 APIC 4 microcode 7

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.473534] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 00000000880003c3

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.473536] mce: [Hardware Error]: TSC 7710ac08d48a4

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.473538] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 4 microcode 7

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.473539] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 00000000880003c3

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.473540] mce: [Hardware Error]: TSC 7710ac08df8b9

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.473541] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 5 microcode 7

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.474507] mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 128: 0000000088020282

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.474509] mce: [Hardware Error]: TSC 7710ac0c3179f

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.474511] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 5 microcode 7

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.474512] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 128: 0000000088020282

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.474513] mce: [Hardware Error]: TSC 7710ac0c32b1c

Message from syslogd@debian at Apr 12 14:54:09 ...
 kernel:[581133.474514] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1492023249 SOCKET 0 APIC 4 microcode 7


The command was this:

      Code:

#!/bin/bash

for i in *265*.mkv ; do
    # Derive the output name; quoting keeps file names with spaces intact
    nn="${i/265/264}"
    ffmpeg -i "$i" -bsf:v h264_mp4toannexb -vcodec libx264 "$nn"
    sleep 3
done

and here is the info on the system:
      Code:

$ inxi -v3
System:    Host: debian Kernel: 4.10.0-7.slh.1-aptosid-amd64 x86_64 (64 bit gcc: 6.3.0) Console: tty 20
           Distro: aptosid 2013-01 Ἑσπερίδες - kde-full - (201305050307)
Machine:   Device: desktop System: ASUS product: All Series
           Mobo: ASUSTeK model: Z87-PRO v: Rev 1.xx UEFI: American Megatrends v: 1707 date: 12/13/2013
CPU:       Quad core Intel Core i7-4771 (-HT-MCP-) cache: 8192 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 28833
           clock speeds: max: 3900 MHz 1: 3652 MHz 2: 3770 MHz 3: 3730 MHz 4: 3669 MHz 5: 3884 MHz 6: 3878 MHz
           7: 3778 MHz 8: 3679 MHz
Graphics:  Card: Intel Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller bus-ID: 00:02.0
           Display Server: X.org 1.19.3 driver: intel tty size: 133x74 Advanced Data: N/A out of X
Network:   Card: Intel Ethernet Connection I217-V driver: e1000e v: 3.2.6-k port: f080 bus-ID: 00:19.0
           IF: eth0 state: up speed: 100 Mbps duplex: full mac: e0:3f:49:a3:4c:a6
Drives:    HDD Total Size: 9257.9GB (52.2% used)
           ID-1: model: WDC_WD10EALS
           ID-2: model: WDC_WD40EZRX
           ID-3: model: WDC_WD20EARX
           ID-4: model: Samsung_SSD_840
           ID-5: model: WDC_WD20EARX
Info:      Processes: 381 Uptime: 6 days Memory: 12376.0/15736.9MB Init: systemd runlevel: 5 Gcc sys: 6.3.0
           Client: Shell (bash 4.4.111) inxi: 2.3.5


(I haven't been able to check if the conversion worked well.)

Does this represent just an error/problem with ffmpeg or is it a real hardware problem?

Thanks and best to all,

Luis


Last edited by finotti on 19.04.2017, 16:16; edited 1 time in total
 
slh
Post subject: RE: Hardware Error (kernel message)  Posted: 12.04.2017, 20:40



Joined: 2010-08-25
Posts: 962

Status: Offline
In most cases MCE errors are real. Kernel bugs may be involved as well; that's not very likely, but possible (so if you still have an older kernel available, like 4.9 or earlier, give it a try). Some hardware bugs can be fixed (or rather plastered over) by the CPU/mainboard manufacturer via microcode updates. These are usually shipped by the mainboard vendor in BIOS/UEFI updates, but as BIOS/UEFI updates constantly lag behind, installing intel-microcode from Debian's non-free section usually offers the most current microcode fixes.
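To see which microcode revision the kernel currently reports (the log at the top of the thread shows "microcode 7"), something like the following should do. It is only a sketch: the helper name microcode_revisions is made up for illustration, and the apt line in the comment assumes the non-free section is enabled.

```shell
#!/bin/bash
# Hypothetical helper: list the distinct microcode revisions the kernel reports.
# Reads /proc/cpuinfo by default; another file can be passed as first argument.
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}

# Only query the running system if /proc/cpuinfo is available.
[ -r /proc/cpuinfo ] && microcode_revisions

# To install newer microcode on Debian (assumes non-free is enabled), as root:
#   apt install intel-microcode
# then reboot and compare the reported revision again.
```

Comparing the value before and after installing intel-microcode (or after a BIOS/UEFI update) shows whether a newer revision was actually loaded.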

You have to consider that transcoding videos is quite hardware intensive and ffmpeg is very aggressively multi-threaded, so it's easily one of the most demanding tasks you can give a CPU in normal life. On top of that, it tends to execute SIMD instructions (MMX, SSE2/3/...) which are rarely used by anything but 3D games or media players/encoders. That makes it more likely to expose hardware faults than 'normal' software, which keeps the CPU idle 98% of the time.

Before going shopping, however, there are two things you could check first:
- Is the cooling working? Check the CPU fan and the core temperatures (install lm-sensors, then run sensors). Normally Intel CPUs throttle before overheating and thereby protect themselves from damage due to insufficient cooling, but it's easy to check.
- Re-seat the RAM modules; there may be connectivity issues (dust) involved. Be careful of static electricity though (try to touch a radiator or water pipe first, don't walk around on carpet, etc.).
Re-seating the CPU wouldn't be totally insane either, but modern CPU sockets are rather fragile and you'd need to replace the thermal grease afterwards, so I really don't recommend this unless you're very confident about your abilities or out of options. In general, cleaning (with compressed air; be careful not to let the fans spin, as they can generate electricity and backpower/overpower the chips) and checking cables/connectors can help with quite a few strange issues.

Try the software and less intrusive debugging approaches first, of course (intel-microcode, BIOS update).
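For the temperature check, a small filter over the sensors output can make hot cores easier to spot. This is a minimal sketch: the function name hot_cores and the 85 °C limit are assumptions for illustration, and it expects the "Core N: +TT.T°C" line format that the coretemp driver usually produces.

```shell
#!/bin/bash
# Hypothetical helper: print every "Core N:" line from `sensors` output whose
# temperature is at or above the given limit (in °C).
# Usage: sensors | hot_cores 85
hot_cores() {
    awk -v limit="$1" '
        /^Core/ {
            t = $3
            gsub(/[^0-9.]/, "", t)      # "+61.0°C" -> "61.0"
            if (t + 0 >= limit) print $1, $2, t
        }'
}

# Only query the hardware if lm-sensors is installed.
if command -v sensors >/dev/null 2>&1; then
    sensors | hot_cores 85 || true
fi
```

No output means no core reached the limit at the moment of the reading, so it is worth running this while the load is active.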
 
finotti
Post subject: RE: Hardware Error (kernel message)  Posted: 14.04.2017, 15:22



Joined: 2010-09-12
Posts: 493

Status: Offline
Thanks, slh, for the outstanding support, as usual.

I will test it all next week, as I am out of town right now.
 
alexk
Post subject: RE: Hardware Error (kernel message)  Posted: 18.04.2017, 18:00



Joined: 2010-10-01
Posts: 288

Status: Offline
I had similar errors last month on an Intel i5-2500K while running graphics applications, and found that at least one CPU core was running at 90+ °C. I cleaned out the dust, checked the RAM seating, reseated the CPU with new thermal grease, and noticed that my heatsink might not have been attached entirely properly. Temperatures were still quite high, though, and I ended up getting some errors again. I then installed thermald; temps went down significantly and there have been no more hardware errors so far.
 
slh
Post subject: RE: Hardware Error (kernel message)  Posted: 18.04.2017, 18:19



Joined: 2010-08-25
Posts: 962

Status: Offline
While thermald is recommended on recent Intel CPUs (Sandy Bridge and newer), MCEs shouldn't occur without it either.
 
alexk
Post subject: RE: Hardware Error (kernel message)  Posted: 18.04.2017, 20:47



Joined: 2010-10-01
Posts: 288

Status: Offline
Yes, that worried me. I'll check again without it. I've found the 'stress' tool useful for testing. I seemed to have excessively high temperatures for months and hadn't gotten around to tackling them, but the MCEs were new and what finally prompted me to take action. I'll check for microcode and BIOS updates.
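A test run with stress can be combined with periodic temperature readings along these lines. This is a rough sketch, not a definitive recipe: watch_temps_during, TEMP_CMD and INTERVAL are names made up here, and the commented example assumes the stress and lm-sensors packages are installed.

```shell
#!/bin/bash
# Hypothetical helper: run a load command in the background and print a
# timestamped line of core temperatures every INTERVAL seconds until it exits.
# TEMP_CMD (default: sensors) and INTERVAL (default: 5) are illustrative knobs.
watch_temps_during() {
    local temp_cmd=${TEMP_CMD:-sensors} interval=${INTERVAL:-5}
    "$@" &                          # start the load in the background
    local pid=$!
    while kill -0 "$pid" 2>/dev/null; do
        # Collapse all "Core N:" lines into a single ';'-separated line.
        printf '%s %s\n' "$(date +%T)" \
            "$($temp_cmd | grep '^Core' | tr -s ' ' | paste -sd';' -)"
        sleep "$interval"
    done
    wait "$pid"                     # return the load command's exit status
}

# Example (one worker per logical CPU, for 60 seconds):
#   watch_temps_during stress --cpu "$(nproc)" --timeout 60
```

Keeping the output next to the kernel log makes it easier to see whether an MCE message coincides with a temperature peak.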
 
slh
Post subject: RE: Hardware Error (kernel message)  Posted: 18.04.2017, 22:41



Joined: 2010-08-25
Posts: 962

Status: Offline
Btw., installing and running mcelog usually provides relatively human-readable output about the kind of problem the CPU thinks it has.
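As a rough illustration of what such decoding involves, the two status words from the log at the top of the thread can be picked apart by hand. This sketch rests on two assumptions that should be checked against the kernel source and Intel's manuals: that "Bank 128" is the kernel's pseudo-bank for thermal events, and that the status word is the raw IA32_THERM_STATUS register (bit 0 = currently throttling, bit 1 = throttled since the log bit was last cleared, bits 22:16 = degrees below the maximum junction temperature, bit 31 = reading valid). The function name decode_therm_status is made up.

```shell
#!/bin/bash
# Hypothetical decoder for a "Bank 128" status word, ASSUMING it carries the
# raw IA32_THERM_STATUS value. Takes the hex string as printed in the log.
decode_therm_status() {
    local status=$(( 16#$1 ))
    local throttling=$(( status & 1 ))               # bit 0
    local throttled_log=$(( (status >> 1) & 1 ))     # bit 1
    local below_tjmax=$(( (status >> 16) & 0x7f ))   # bits 22:16
    local valid=$(( (status >> 31) & 1 ))            # bit 31
    echo "throttling_now=$throttling throttled_since_reset=$throttled_log below_tjmax=$below_tjmax reading_valid=$valid"
}

decode_therm_status 00000000880003c3
# throttling_now=1 throttled_since_reset=1 below_tjmax=0 reading_valid=1
decode_therm_status 0000000088020282
# throttling_now=0 throttled_since_reset=1 below_tjmax=2 reading_valid=1
```

Under those assumptions, the first message would mark a core hitting its temperature limit and throttling, and the second the core dropping back just below the limit, which points at cooling rather than at a defective CPU.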
 
finotti
Post subject: RE: Hardware Error (kernel message)  Posted: 19.04.2017, 16:16



Joined: 2010-09-12
Posts: 493

Status: Offline
OK, I finally found some time to look into the issue.

I did not have the time to clean it all up (it's on my to-do list), so I just fiddled with software.

First, I installed intel-microcode, lm-sensors, xsensors and mcelog. I ran the ffmpeg conversion again (on the same 4.10 kernel). The temperature was hitting 100 °C, and then I'd get a message (like the ones above) and the core would be throttled (or so the message said). The temperature would not go above 100 °C (for as long as I waited), but I would get a message every once in a while. So I killed the process.
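Whether the cores were really throttled can also be cross-checked through the per-CPU counters the kernel exposes in sysfs. A minimal sketch, assuming the usual thermal_throttle/core_throttle_count files are present on this kernel; the function name throttle_counts is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print "cpuN <count>" for every CPU that exposes a
# core_throttle_count file. The sysfs root can be overridden for testing.
throttle_counts() {
    local root=${1:-/sys/devices/system/cpu} f cpu
    for f in "$root"/cpu[0-9]*/thermal_throttle/core_throttle_count; do
        [ -r "$f" ] || continue         # skip if the glob matched nothing
        cpu=${f#"$root"/}               # strip the leading sysfs root
        cpu=${cpu%%/*}                  # keep only the "cpuN" component
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}

throttle_counts
```

Running it before and after a conversion shows which cores were throttled in between, since the counters only ever increase until reboot.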

I then booted into the 4.9 kernel. It seemed to me I was getting similar temperatures, but I did not get any messages.

I then updated the BIOS, and on kernel 4.10 the temperatures did not seem to go above 87 °C. So I guess there was something in the update. I will mark this as solved.

P.S.: Thanks, again, slh for the support and very informative answers!
 
slh
Post subject: RE: Hardware Error (kernel message)  Posted: 19.04.2017, 19:01



Joined: 2010-08-25
Posts: 962

Status: Offline
The temperatures really shouldn't go that high on a desktop system (notebooks typically run hotter, but even there they should remain at least 20 °C cooler). Anything significantly beyond 45-55 °C, even under full load, is an indication of cooling problems. Either the fan doesn't spin anymore, dust has clogged the cooling fins or (more likely) the cooler doesn't sit exactly flat on the CPU (one of the push rods loose?).
 
finotti
Post subject: Re: RE: Hardware Error (kernel message)  Posted: 19.04.2017, 19:57



Joined: 2010-09-12
Posts: 493

Status: Offline
      slh wrote:
The temperatures really shouldn't go that high on a desktop system (notebooks typically run hotter, but even there they should remain at least 20 °C cooler). Anything significantly beyond 45-55 °C, even under full load, is an indication of cooling problems. Either the fan doesn't spin anymore, dust has clogged the cooling fins or (more likely) the cooler doesn't sit exactly flat on the CPU (one of the push rods loose?).


OK, I wasn't aware of that. I will definitely check it then. Thanks again for the help!
 
slh
Post subject: RE: Re: RE: Hardware Error (kernel message)  Posted: 19.04.2017, 22:21



Joined: 2010-08-25
Posts: 962

Status: Offline
For comparison, here is an Ivy Bridge system under full load (yes, the temperatures are slightly higher than expected; the fan apparently needs some cleaning):
      Code:
$ sensors | grep ^Core
Core 0:        +61.0°C  (high = +85.0°C, crit = +105.0°C)
Core 1:        +62.0°C  (high = +85.0°C, crit = +105.0°C)
Core 2:        +59.0°C  (high = +85.0°C, crit = +105.0°C)
Core 3:        +59.0°C  (high = +85.0°C, crit = +105.0°C)
(this is with an aftermarket cooler, which isn't as noisy as the stock intel fan but has the same temperature behaviour)
 
finotti
Post subject: Re: RE: Re: RE: Hardware Error (kernel message)  Posted: 24.04.2017, 15:38



Joined: 2010-09-12
Posts: 493

Status: Offline
      slh wrote:
For comparison, here is an Ivy Bridge system under full load (yes, the temperatures are slightly higher than expected; the fan apparently needs some cleaning):
      Code:
$ sensors | grep ^Core
Core 0:        +61.0°C  (high = +85.0°C, crit = +105.0°C)
Core 1:        +62.0°C  (high = +85.0°C, crit = +105.0°C)
Core 2:        +59.0°C  (high = +85.0°C, crit = +105.0°C)
Core 3:        +59.0°C  (high = +85.0°C, crit = +105.0°C)
(this is with an aftermarket cooler, which isn't as noisy as the stock intel fan but has the same temperature behaviour)


OK, over the weekend I cleaned the dust out of the case and the fans. The push rods of the CPU fan were tight. It did improve things, as the same job would not go past 80 °C, but that is still too high. I guess a new CPU cooling fan might be necessary...

Does anyone have a recommendation for a good replacement? (A good balance between quiet, price, performance and easy installation...) The system is running an Intel Core i7-4771 CPU (socket LGA1150). No overclocking or gaming (except Minecraft for my son). The options are overwhelming me right now...
 
All times are GMT - 12 Hours