Thursday, September 3, 2015

Linux x86_64: Detecting Hardware Errors

ORIGINAL ARTICLE: http://www.cyberciti.biz/tips/linux-server-predicting-hardware-failure.html

I am copying the article from the link above here to spread the word and in case the original disappears.

Microsoft Windows shows the Blue Screen of Death (BSoD) after encountering a critical system error. Linux / UNIX-like operating systems may get a kernel panic, which is the equivalent of a BSoD. Both a BSoD and a kernel panic can be triggered by a Machine Check Exception (MCE). MCE is a feature of AMD / Intel 64-bit systems that is used to detect unrecoverable hardware problems. MCE can detect:
  • Communication error between CPU and motherboard.
  • Memory error - ECC problems.
  • CPU cache errors and so on.
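
As a quick sanity check before relying on MCE reporting, you can look at the CPU feature flags that the Linux kernel exposes in /proc/cpuinfo; on x86 systems the flags `mce` and `mca` indicate machine check support. The snippet below is only a minimal sketch: the helper name `has_cpu_flag` is made up for illustration, and it assumes a grep that supports `-m` and `-w` (GNU grep does).

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]
# Hypothetical helper (not part of mcelog): succeeds if FLAG appears as a
# whole word on the first "flags" line of CPUINFO_FILE.
# CPUINFO_FILE defaults to /proc/cpuinfo.
has_cpu_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" | grep -qw "$1"
}

# Example use:
#   has_cpu_flag mce && echo "CPU advertises Machine Check Exception support"
#   has_cpu_flag mca && echo "CPU advertises Machine Check Architecture support"
```

The optional second argument exists only so the function can be tried against a saved copy of /proc/cpuinfo from another machine.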

A program such as mcelog decodes machine check events (hardware errors) on x86-64 machines running a 64-bit Linux kernel. It should be run regularly as a cron job on any x86-64 Linux system. This is useful for predicting server hardware failure before an actual server crash.

Install mcelog

Type the following command under RHEL / CentOS / Fedora Linux, 64 bit kernel:
yum install mcelog
Type the following command under Debian / Ubuntu Linux, 64 bit kernel:
apt-get update && apt-get install mcelog

Default Cronjob

mcelog should be run regularly as a cron job on any x86-64 Linux system. By default, the following cron settings are used on Debian / Ubuntu Linux in /etc/cron.d/mcelog:
# /etc/cron.d/mcelog: crontab entry for the mcelog package
*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
CentOS / RHEL / Fedora Linux run an hourly cron job via /etc/cron.hourly/mcelog.cron:
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

How do I view error logs?

Use tail or grep command:
# tail -f /var/log/mcelog
# grep -i "hardware error" /var/log/mcelog
# grep -ci "hardware error" /var/log/mcelog
Alternatively, you can send an email alert when a hardware error is found on the system (write a shell script and call it via a cron job):
# [ $(grep -ci "hardware error" /var/log/mcelog) -gt 0 ] && echo "Hardware Error Found $(hostname) @ $(date)" | mail -s 'H/w Error' pager@example.com
With this tool I was able to pick up a couple of hardware problems before a kernel panic, i.e. a server crash.
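
The one-line alert above fires on every cron run for as long as the log contains a match. One possible way to turn it into the shell script mentioned earlier is to remember the previous match count and alert only when it grows. This is a rough sketch, not the original author's script: the function names and the state-file path are made up for illustration, and the "hardware error" pattern is taken from the grep commands above, matched case-insensitively.

```shell
#!/bin/sh
# Sketch: alert only when the number of logged hardware errors has grown.
# Function names and the state-file location are illustrative assumptions.

# Print the number of lines in log file $1 that mention "hardware error".
count_hw_errors() {
    if [ -r "$1" ]; then
        # grep -c prints 0 but exits 1 when nothing matches, hence "|| true"
        grep -ci "hardware error" "$1" || true
    else
        echo 0
    fi
}

# Compare the current count for log $1 with the count saved in state file $2,
# then save the current count. Prints an alert line and returns 0 if the
# count increased; prints nothing and returns 1 otherwise.
check_new_hw_errors() {
    log=$1
    state=$2
    current=$(count_hw_errors "$log")
    previous=0
    [ -r "$state" ] && previous=$(cat "$state")
    echo "$current" > "$state"
    if [ "$current" -gt "$previous" ]; then
        echo "Hardware Error Found $(hostname) @ $(date): $current total, $((current - previous)) new"
        return 0
    fi
    return 1
}

# Example cron usage (state-file path is an assumption):
#   msg=$(check_new_hw_errors /var/log/mcelog /var/tmp/mcelog.count) \
#       && echo "$msg" | mail -s 'H/w Error' pager@example.com
```

If the log is rotated, the count drops, the saved value is reset, and no alert is sent until new matches appear.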

A Note About mcelog

  • You need a 64-bit Linux kernel and operating system to run mcelog. Machine checks can indicate failing hardware, system overheating, bad DIMMs or other problems. Some MCEs are fatal and generally cannot be survived without a reboot and hardware replacement, but I was able to catch lots of bad hardware before a crash with this tool.
  • mcat - A Windows command-line program from AMD to decode MCEs from AMD K8, Family 0x10 and 0x11 processors.
  • mcelog project home page.
  • mcedaemon - a daemon that can get MCE notifications as soon as the kernel finds them. It does not try to interpret the MCE data; it just alerts other apps.
  • Linux Kernel panic source code.
  • man mcelog
  • Machine check exception support information for MS-Windows server 2003 and XP operating systems.