The Sysadmin Notebook  

Smart Disk Monitoring

Checking Disk Health with smartctl

HDD Technology

Internally, hard disk drives (HDDs) consist of a number of rigid platters on a spindle and read/write heads that float over the platters. The surface of each platter is coated with magnetic material which can be magnetised to represent 0 or 1.

Colder drives (average temperature < 27C) have twice the failure rate of hotter drives (~50C).

SMART Attributes

SMART (Self-Monitoring, Analysis and Reporting Technology) is used to predict failure in hard drives by monitoring various attributes of a drive. Each SMART attribute is reported as a 'Normalised' value and a 'Raw' value. Normalised values are calculated so that higher values are always better than lower values.
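
As a rough illustration of how the normalised values are meant to be read, the sketch below scans the attribute table printed by 'smartctl -A' and reports any attribute whose normalised value has fallen to or below its threshold. It assumes the usual ten-column ATA layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function name below_threshold is made up for this example.

```shell
# below_threshold: read 'smartctl -A' output on stdin and print attributes
# whose normalised VALUE (column 4) is at or below THRESH (column 6).
# Assumes the standard ATA attribute table layout; attribute rows are
# recognised by a numeric ID in column 1. Rows with a THRESH of 0 are
# skipped, since a normalised value cannot meaningfully fail against it.
below_threshold() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
        printf "%s value=%d thresh=%d\n", $2, $4, $6
    }'
}
```

A typical invocation would be 'smartctl -A /dev/sda | below_threshold'. No output means no attribute has crossed its threshold.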

Data for SMART attributes are updated:

Always
During normal device operation
Offline
Only during offline tests

Not all failures are predictable, so a good backup strategy remains important for maintaining data security.

Common SMART Attributes
| Attribute Name | Description | Additional Notes |
|---|---|---|
| Raw Read Error Rate | Rate of hardware read errors that occurred when reading data from the disk surface | A raw value of 0 is the best result for this attribute. Read errors indicate problems with the disk surface or read/write heads. |
| Throughput Performance | Value representing the throughput performance of the disk | A decreasing value is indicative of a mechanical problem with the disk. |
| Spin Up Time | Value representing the time the disk takes to spin up to full operational speed | An increasing raw value indicates mechanical problems with the disk. |
| Start Stop Count | Count of spindle start/stop cycles | Raw value will normally match the raw value for Power Cycle Count. |
| Reallocated Sector Count | Count of reallocated sectors or "remaps". When errors are encountered during read/write verification, the sectors are marked as "reallocated" and data is remapped to special reserved areas. | As remaps increase, the read/write performance of the disk will degrade. Increasing remaps indicate problems with the disk surface. |
| Seek Error Rate | Measures errors in the mechanical positioning of the read/write heads | Can be caused by a number of factors, including thermal widening or problems with the servo mechanism. Worth assessing this value in conjunction with thermal readings for the disk. |
| Seek Time Performance | Average performance of seek operations of the read/write heads | Decreasing values indicate problems with the mechanical system of the hard disk. |
| Power On Hours | Cumulative count of power-on hours for the disk | |
| Spin Retry Count | Count of the number of times the disk has had to retry a spin start at power on | Under normal conditions a disk will spin up at power on. The retry count is incremented if the initial spin up fails, and is indicative of a failing mechanical subsystem or power supply problem. |
| Power Cycle Count | Count of full power-on cycles | |
| Runtime Bad Block | | |
| End-to-End Error | Mismatch between host and hard drive parity bits | Could indicate problems with the hard drive cache or IO subsystem. |
| Reported Uncorrect | Number of errors that could not be corrected using hardware ECC | |
| Command Timeout | Number of aborted operations due to HDD timeout | Indicates a power supply or interface cable problem. |
| High Fly Writes | Number of unsafe write operations outside the normal head flying range | |
| Airflow Temperature Celsius | Internal HDD air temperature | Raw value should increase while the disk is on and actively in use, but should reach an optimal value of around 35 to 45C. |
| G-Sense Error Rate | Number of errors resulting from external shock and vibration | Indicates someone may be kicking your server. |
| Power-Off Retract Count | Number of times the heads are loaded off the media | Simply indicates the number of power-off events. |
| Current Pending Sector | Number of sectors waiting to be remapped because of errors | If the errors are read errors the sector will not be remapped, because it may become readable later. Such sectors are considered pending and will be remapped when a write operation to the area is needed. |
| Offline Uncorrectable | Total number of sectors with uncorrectable errors | Indicates defects in the disk surface or mechanical subsystem. |
| UDMA CRC Error Count | Number of errors in high-speed data transfer via the interface cable | Errors are detected using the Interface Cyclic Redundancy Check (ICRC). |
| Load Retry Count | Number of times head changes position | Number of times drive head enters and leaves data zones. Indicates level of activity on drive. |
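
The sector-count attributes in the table above are often the first ones worth watching, since a raw value of 0 is the healthy state for them. The sketch below prints any of three watched attributes whose raw value is non-zero. The watch list and the function name nonzero_raw are illustrative choices. The attribute names are the labels smartctl normally prints for the Reallocated Sector Count, Current Pending Sector and Offline Uncorrectable attributes, and the raw value is assumed to sit in column 10 of the 'smartctl -A' table.

```shell
# nonzero_raw: read 'smartctl -A' output on stdin and print watched
# attributes whose raw value (column 10) is greater than zero.
# The watch list is an illustrative selection, not a definitive one.
nonzero_raw() {
    awk '
        BEGIN {
            watch["Reallocated_Sector_Ct"] = 1
            watch["Current_Pending_Sector"] = 1
            watch["Offline_Uncorrectable"] = 1
        }
        # Attribute rows start with a numeric ID; the name is column 2.
        $1 ~ /^[0-9]+$/ && ($2 in watch) && $10 + 0 > 0 {
            print $2 ": " $10
        }
    '
}
```

Run as 'smartctl -A /dev/sda | nonzero_raw'. Other attributes can be added to the watch list in the same way, using the names exactly as they appear in the ATTRIBUTE_NAME column.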

SMART Data Collection

Online data collection occurs during normal operation and collects data for attributes that are described as 'Updated Always' in the output from 'smartctl -A'. Offline data is collected every four hours and involves scanning the disk for defects. SMART tests can be run manually with 'smartctl -t testtype'. The test types are:

offline
runs an immediate offline data collection
short
runs short self test, checking the electrical and mechanical performance of the disk
long
runs extended self test, a more thorough test than the short self test
conveyance
runs a conveyance self test, designed to identify damage caused during transportation of a device

Running these tests with the '-C' option causes the tests to be run in captive mode, which will impact disk responsiveness and should not be used on drives with mounted partitions. (The lower-case '-c' option is different: it shows the drive's capabilities.)
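
To keep typos out of scripted test runs, the test type can be validated before smartctl is called. The following is a minimal sketch: the helper name selftest_cmd is hypothetical, and it only prints the command rather than running it, so it is safe to try on any machine.

```shell
# selftest_cmd: print the smartctl command for a given test type and device
# without running it. Only the four test types listed above are accepted.
selftest_cmd() {
    testtype=$1
    device=$2
    case $testtype in
        offline|short|long|conveyance)
            printf 'smartctl -t %s %s\n' "$testtype" "$device"
            ;;
        *)
            # Anything else is reported on stderr and rejected.
            printf 'unknown test type: %s\n' "$testtype" >&2
            return 1
            ;;
    esac
}
```

For example, 'selftest_cmd short /dev/sda' prints 'smartctl -t short /dev/sda'. The printed line can then be run by hand or passed to a scheduler.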

smartctl

The smartctl command gives access to the SMART functions of your hard disk, allowing you to monitor the health status of the device.

smartctl -i /dev/sda
list model and firmware information for the disk
 smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HTS725050A9A364
Serial Number:    091123PC6400VLG5X7RA
Firmware Version: PC4OC70E
User Capacity:    500,107,862,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Sat Jun 19 10:15:29 2010 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
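
When collecting an inventory of disks, individual fields can be pulled out of the information section shown above. The sketch below is one way to do that. The function name info_field is made up, and it assumes the 'Field name: value' layout seen in this output.

```shell
# info_field: read 'smartctl -i' output on stdin and print the value of one
# named field, for example "Device Model" or "Serial Number".
# Only the first matching line is used, and matching is done on the text
# before the first colon so that values containing colons stay intact.
info_field() {
    awk -v key="$1" '
        index($0, key ":") == 1 {
            value = substr($0, length(key) + 2)
            sub(/^[ \t]+/, "", value)
            print value
            exit
        }
    '
}
```

For example, 'smartctl -i /dev/sda | info_field "Serial Number"' would print the serial number line's value for the disk above.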

 
smartctl -Hc /dev/sda
show the health status (-H) and capabilities (-c) of the disk
 smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 645) seconds.
Offline data collection
capabilities: 			 (0x51) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 131) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.
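
For use in scripts, the overall health line in the output above can be turned into a simple pass/fail check. This is a minimal sketch that matches the text of that line. The function name health_passed is hypothetical, and it assumes the wording shown above for ATA disks.

```shell
# health_passed: read 'smartctl -H' output on stdin and succeed (exit 0)
# only if the overall-health line reports PASSED.
health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}
```

It could be used as 'smartctl -H /dev/sda | health_passed || echo "check /dev/sda"'. smartctl also reports problems through its own exit status, which may be a more robust signal than matching text. The smartctl man page describes the individual bits.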

 
smartctl -A /dev/sda
lists the disk's table of attributes
smartctl -l error /dev/sda
lists the log of disk errors
 smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

 
smartctl -l selftest /dev/sda
lists the results of self-tests run on the disk
smartctl -t short /dev/sda
run a short (couple of minutes) self-test on the disk. Does not require disk to be taken off-line
smartctl -t long /dev/sda
run a long (hour or more) self-test on the disk. Does not require disk to be taken off-line
smartctl -t offline /dev/sda
run an immediate off-line test (off-line data collection)

Configure a long self-test as a cron job if you suspect a disk is failing.
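
As a sketch of what such a cron job might look like, the helper below prints a crontab entry that starts a long self-test at 02:00 every Sunday. The schedule, the /usr/sbin/smartctl path and the function name long_test_cron_line are all assumptions to be adjusted for the system in question.

```shell
# long_test_cron_line: print a crontab entry that starts a long self-test
# on the given device at 02:00 every Sunday (day-of-week 0).
# The schedule and the smartctl path are assumptions, not requirements.
long_test_cron_line() {
    printf '0 2 * * 0 /usr/sbin/smartctl -t long %s\n' "$1"
}
```

Running 'long_test_cron_line /dev/sda' prints the entry, which can be pasted into root's crontab with 'crontab -e'. The test itself runs inside the drive, so the results should be read later with 'smartctl -l selftest /dev/sda'.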

The smartd daemon regularly monitors SMART status for your hard disks according to the configuration contained in /etc/smartd.conf.
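
For illustration, the sketch below writes a one-line example configuration to a scratch file. The directive line follows the pattern documented in the smartd.conf man page. The choice of /dev/sda, the test schedule, the mail recipient and the helper name write_smartd_example are assumptions made for this example.

```shell
# write_smartd_example: write a one-line example smartd configuration to
# the file named in $1. Use a scratch path rather than /etc/smartd.conf
# until the line has been reviewed for the system in question.
write_smartd_example() {
    cat > "$1" <<'EOF'
# Monitor /dev/sda: check all attributes (-a), enable automatic offline
# testing (-o on) and attribute autosave (-S on), run a short self-test
# daily at 02:00 and a long self-test on Saturdays at 03:00 (-s), and
# mail warnings to root (-m).
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
EOF
}
```

After 'write_smartd_example ./smartd.conf.example', the generated line can be compared with the existing /etc/smartd.conf. smartd needs to be restarted or reloaded before a changed configuration takes effect.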