Smart Disk Monitoring
Checking Disk Health with smartctl
HDD Technology
Internally, hard disk drives (HDDs) consist of a number of rigid platters on a spindle and read/write heads that float over the platters. The surface of each platter is coated with magnetic material which can be magnetised to represent 0 or 1.
Colder drives (average temperature < 27C) have twice the failure rate of hotter drives (~50C).
SMART Attributes
SMART is used to predict failure in hard drives by monitoring various attributes of a drive. SMART attribute values are represented as 'Normalised' and 'Raw' values. Normalised values are calculated so that higher values are always better than lower values.
Data for SMART attributes are updated:
- Always
  - During normal device operation
- Offline
  - Only during offline tests
Not all failures are predictable, so a good backup strategy is important to keep data safe.
| Attribute Name | Description | Additional Notes |
|---|---|---|
| Raw Read Error Rate | Rate of hardware read errors that occurred when reading data from the disk surface | A raw value of 0 is the best result for this attribute. Read Errors indicate problems with disk surface or read/write heads. |
| Throughput Performance | Value representing the throughput performance of the disk | A decreasing value is indicative of a mechanical problem with the disk |
| Spin Up Time | Value representing the time the disk takes to spin up to full operational speed | An increasing raw value indicates mechanical problems with the disk |
| Start Stop Count | Count of spindle start/stop cycles | Raw value will normally match raw value for Power Cycle Count |
| Reallocated Sector Count | Count of reallocated sectors or "remaps". | When errors are encountered during read/write verification, the sectors are marked as "reallocated" and data is remapped to special reserved areas. As remaps increase, the read/write performance of the disk will degrade. Increasing remaps indicate problems with the disk surface. |
| Seek Error Rate | Measures errors in the mechanical positioning of the read/write heads | Can be caused by a number of factors including thermal widening or problems with servo mechanism. Worth assessing this value in conjunction with thermal readings for the disk. |
| Seek Time Performance | Average performance of seek operations of the read/write heads | Decreasing values indicate problems with the mechanical system of the hard disk |
| Power On Hours | Cumulative count of power on hours for the disk | |
| Spin Retry Count | Count of the number of times the disk has had to retry a spin start at power on | Under normal conditions a disk will spin up at power on. The retry count is incremented if the initial spin up fails, and is indicative of a failing mechanical subsystem or power supply problem. |
| Power Cycle Count | Count of Full Power On cycles | |
| Runtime Bad Block | | |
| End-to-End Error | Mismatch between host and hard drive parity bits | Could indicate problems with hard drive cache or IO subsystem. |
| Reported Uncorrect | Number of errors that could not be corrected using hardware ECC | |
| Command Timeout | Number of aborted operations due to HDD timeout | Indicates power supply or interface cable problem |
| High Fly Writes | Number of unsafe write operations outside the normal head flying-range | |
| Airflow Temperature Celsius | Internal HDD air temperature | Raw value should increase while the disk is on and actively in use, but should reach an optimal value around 35 to 45C. |
| G-Sense Error Rate | Number of errors resulting from external shock and vibration | Indicates someone may be kicking your server |
| Power-Off Retract Count | Number of times the heads are loaded off the media | Simply indicates the number of power-off events |
| Current Pending Sector | Number of sectors waiting to be remapped because of errors | If the errors are read errors the sector will not be remapped, because it may become readable later. Such sectors are considered pending and will be remapped when a write operation to the area is needed. |
| Offline Uncorrectable | Total number of sectors with uncorrectable errors | Indicates defects in the disk surface or mechanical subsystem |
| UDMA CRC Error Count | Number of errors in high-speed data transfer via the interface cable | Errors are detected using Interface Cyclical Redundancy Check |
| Load Retry Count | Number of times head changes position | Number of times drive head enters and leaves data zones. Indicates level of activity on drive. |
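The attribute table can be checked mechanically from the output of 'smartctl -A'. The following is a minimal sketch, not a definitive health check: it assumes the usual ten-column layout (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) and the underscore-style names smartctl prints, such as Reallocated_Sector_Ct. The function names and the choice of which raw counters to watch are illustrative assumptions.

```shell
# Read `smartctl -A` output on stdin and print every attribute whose
# normalised VALUE has dropped to or below its THRESH.
# A THRESH of 0 is skipped, since such an attribute can never fail.
attrs_at_threshold() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
        printf "%s value=%d thresh=%d\n", $2, $4, $6
    }'
}

# Read `smartctl -A` output on stdin and print the sector-related
# attributes whose raw count is non-zero (assumed watch list).
nonzero_sector_counts() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
        printf "%s raw=%d\n", $2, $10
    }'
}

# Typical use (needs root and a SMART-capable disk):
#   smartctl -A /dev/sda | attrs_at_threshold
#   smartctl -A /dev/sda | nonzero_sector_counts
```

Raw value formats are vendor specific, so a non-zero raw count is a prompt to look closer rather than proof of failure.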
SMART Data Collection
Online data collection occurs during normal operation and collects data for attributes that are described as 'Always' in the UPDATED column of the output from 'smartctl -A'. When automatic offline data collection is enabled, offline data is collected every four hours, which involves scanning the disk for defects. SMART tests can be run manually with 'smartctl -t testtype'. The test types are:
- offline
  - runs an immediate offline data collection
- short
  - runs a short self-test, checking the electrical and mechanical performance of the disk
- long
  - runs an extended self-test, a more thorough test than the short self-test
- conveyance
  - runs a conveyance self-test, designed to identify damage caused during transportation of a device
Running these tests with the '-C' option causes the tests to be run in captive mode, which will impact disk responsiveness and should not be used on drives with mounted partitions.
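As a rough sketch of how the test types might be wrapped in a script, the function below rejects anything other than the four types listed and then calls 'smartctl -t'. The name start_selftest and the SMARTCTL override variable are hypothetical, introduced only so the wrapper can be dry-run without touching a disk.

```shell
# Command to invoke; overridable for a dry run (hypothetical convention).
SMARTCTL=${SMARTCTL:-smartctl}

# Start a self-test of the given type on the given device.
# Returns 2 without calling smartctl if the type is not recognised.
start_selftest() {
    testtype=$1
    device=$2
    case $testtype in
        offline|short|long|conveyance) ;;
        *)
            echo "unknown test type: $testtype" >&2
            return 2
            ;;
    esac
    "$SMARTCTL" -t "$testtype" "$device"
}

# Typical use (needs root):
#   start_selftest short /dev/sda
```

The test runs inside the drive, so the command returns immediately; results appear later in the self-test log.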
smartctl
The smartctl command gives access to the SMART functions of your hard disk, allowing you to monitor the health status of the device.
- smartctl -i /dev/sda
  - list model and firmware information for the disk

      smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
      Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

      === START OF INFORMATION SECTION ===
      Device Model:     Hitachi HTS725050A9A364
      Serial Number:    091123PC6400VLG5X7RA
      Firmware Version: PC4OC70E
      User Capacity:    500,107,862,016 bytes
      Device is:        Not in smartctl database [for details use: -P showall]
      ATA Version is:   8
      ATA Standard is:  ATA-8-ACS revision 6
      Local Time is:    Sat Jun 19 10:15:29 2010 BST
      SMART support is: Available - device has SMART capability.
      SMART support is: Enabled

- smartctl -Hc /dev/sda
  - show the health status and SMART capabilities of the disk

      smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
      Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

      === START OF READ SMART DATA SECTION ===
      SMART overall-health self-assessment test result: PASSED

      General SMART Values:
      Offline data collection status:  (0x00) Offline data collection activity
                                              was never started.
                                              Auto Offline Data Collection: Disabled.
      Self-test execution status:      (   0) The previous self-test routine completed
                                              without error or no self-test has ever
                                              been run.
      Total time to complete Offline
      data collection:                 ( 645) seconds.
      Offline data collection
      capabilities:                    (0x51) SMART execute Offline immediate.
                                              No Auto Offline data collection support.
                                              Suspend Offline collection upon new
                                              command.
                                              No Offline surface scan supported.
                                              Self-test supported.
                                              No Conveyance Self-test supported.
                                              Selective Self-test supported.
      SMART capabilities:            (0x0003) Saves SMART data before entering
                                              power-saving mode.
                                              Supports SMART auto save timer.
      Error logging capability:        (0x01) Error logging supported.
                                              General Purpose Logging supported.
      Short self-test routine
      recommended polling time:        (   2) minutes.
      Extended self-test routine
      recommended polling time:        ( 131) minutes.
      SCT capabilities:              (0x003d) SCT Status supported.
                                              SCT Feature Control supported.
                                              SCT Data Table supported.

- smartctl -A /dev/sda
  - lists the disk's table of attributes
- smartctl -l error /dev/sda
  - lists the log of disk errors

      smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
      Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

      === START OF READ SMART DATA SECTION ===
      SMART Error Log Version: 1
      No Errors Logged

- smartctl -l selftest /dev/sda
  - lists the results of self-tests run on the disk
- smartctl -t short /dev/sda
  - run a short (couple of minutes) self-test on the disk. Does not require the disk to be taken off-line
- smartctl -t long /dev/sda
  - run a long (hour or more) self-test on the disk. Does not require the disk to be taken off-line
- smartctl -t offline /dev/sda
  - run an immediate off-line test (off-line data collection)
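The health result and the recommended polling time can be pulled out of the 'smartctl -Hc' output with standard text tools. This is a minimal sketch that assumes the wording shown in the sample output above ("self-assessment test result:" and "Short self-test routine recommended polling time:"); other smartctl versions may word things differently. The function names are hypothetical.

```shell
# Read `smartctl -H` (or -Hc) output on stdin and print the overall
# health result, e.g. PASSED.
health_result() {
    { tr -s '[:space:]' ' '; echo; } |
        sed -n 's/.*self-assessment test result: *\([A-Za-z!]*\).*/\1/p'
}

# Read `smartctl -c` (or -Hc) output on stdin and print the recommended
# polling time, in minutes, for the short self-test. Whitespace is
# collapsed first because smartctl wraps this field over two lines.
short_test_minutes() {
    { tr -s '[:space:]' ' '; echo; } |
        sed -n 's/.*Short self-test routine recommended polling time: *( *\([0-9][0-9]*\)) minutes.*/\1/p'
}

# Typical use (needs root):
#   smartctl -Hc /dev/sda | health_result
#   smartctl -Hc /dev/sda | short_test_minutes
```

A script could wait for the printed number of minutes after starting a short test before reading the self-test log.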
If you suspect a disk is failing, configure a long self-test as a cron job.
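One possible way to set up such a cron job is sketched below. The schedule (Saturdays at 02:00), the path /usr/sbin/smartctl, and the function name are all assumptions; adjust them to the system in question.

```shell
# Print a crontab entry that starts a long self-test on the given
# device every Saturday at 02:00 (assumed schedule and smartctl path).
longtest_cron_line() {
    device=$1
    printf '0 2 * * 6 /usr/sbin/smartctl -t long %s\n' "$device"
}

# To append the entry to root's crontab, run as root:
#   (crontab -l 2>/dev/null; longtest_cron_line /dev/sda) | crontab -
```

The results of each scheduled run can then be reviewed with 'smartctl -l selftest /dev/sda'.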
The smartd daemon regularly monitors SMART status for your hard disks according to the configuration contained in /etc/smartd.conf.
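As an illustration only, the function below prints a single smartd.conf entry of the kind described in the smartd.conf documentation: monitor all SMART properties (-a), schedule self-tests with a regular expression (-s), and mail warnings to a user (-m). The schedule, the recipient root, and the function name are assumptions, not a recommended configuration.

```shell
# Print an example smartd.conf entry for the given device.
smartd_conf_example() {
    device=$1
    cat <<EOF
# Monitor all SMART properties of $device (-a), run a short self-test
# daily between 02:00 and 03:00 and a long self-test on Saturdays
# between 03:00 and 04:00 (-s), and mail warnings to root (-m).
$device -a -s (S/../.././02|L/../../6/03) -m root
EOF
}

# Typical use: review the output, then merge it into /etc/smartd.conf
#   smartd_conf_example /dev/sda
```

smartd needs to be restarted or reloaded before a changed configuration takes effect.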
