To subscribe to updates watch my GitHub repository. New posts get added as pull requests.

Stress testing and monitoring hard drives

26 Nov 2019

Context of this post

In this blog post, I am responding to a poster in this reddit discussion. This post is intended for Linux users on personal computers who want to play with RAID, monitor their drives, and stress test drives purchased online to get the maximum lifetime out of them.

Why do I stress test

I do the following to stress test all new drives I purchase. The reason I stress test is to catch early "infant mortality" failure of a drive. In layman’s terms: if a drive doesn’t fail almost immediately, it generally does a good job of lasting its lifetime. This has been my experience, and it has also been documented by others (Backblaze has several good storage write-ups; I recommend them as a regular read).

I can only speak to my personal experience on my own computers, and in general drives that survive my stress test have lasted the lifetime I needed them for (roughly 5 years). Your experience may differ, and it’s worth noting that I have had drives fail within the first 3 days of stress testing. When that happens I just return the drive as failed to the place I bought it and get a replacement of the same model; it’s usually no hassle when I’ve done it.

How to stress test

When I mount a drive in Linux I usually start 3 terminals with the following commands. Let’s say your drive is /dev/XDD (note I’m intentionally making this a weird non-existent Linux drive so that someone doesn’t accidentally copy and paste it destroying their own data).

# terminal 1
while true; do dd if=/dev/zero of=/dev/XDD bs=1000M; done

# terminal 2
while true; do dd if=/dev/urandom of=/dev/XDD bs=1000M; done

# terminal 3
while true; do dd if=/dev/XDD of=/dev/null bs=1000M; done

After 3 days of this stress testing, go to each terminal and press Ctrl+C to cancel the loops so that data stops being written to and read from the disk.
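If you want to watch throughput while the loops run, GNU dd supports a status=progress flag. Here is a bounded sketch of the same invocation style, pointed at a scratch file (a hypothetical path) with count= added so it terminates instead of filling a device:

```shell
# Bounded demo of the dd invocation used above: write 10 MiB of zeros
# to a scratch file with live progress reporting on stderr. Against a
# real drive you would drop count= and target the device instead.
dd if=/dev/zero of=/tmp/dd_scratch bs=1M count=10 status=progress

# Confirm 10 MiB (10485760 bytes) landed in the file.
stat -c %s /tmp/dd_scratch
```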


  • Note that I reference /dev/urandom and not /dev/random. /dev/urandom generates pseudo-random data at a consistent rate, whereas /dev/random is more cryptographically secure but will not generate data if the kernel does not have enough entropy. Using /dev/urandom gives the fastest possible random write rate.
  • while true is necessary so that the loop runs indefinitely. When dd fills up the disk (or finishes reading it) it exits; being in a loop restarts it from the beginning.
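The non-blocking behavior of /dev/urandom is easy to verify: a bounded read always completes immediately, which is what lets it sustain the random-write loop above. A small sketch (scratch path is a placeholder):

```shell
# /dev/urandom never blocks waiting for entropy, so a bounded read
# completes right away. Pull 4 MiB into a scratch file and check its size.
dd if=/dev/urandom of=/tmp/urandom_sample bs=1M count=4

# 4 MiB = 4194304 bytes
stat -c %s /tmp/urandom_sample
```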

I let these three terminals run on my computer for 3 days. Don’t forget to disable sleep and hibernate so that your computer runs continuously during this time period.
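On systemd-based distros (an assumption; other setups need their own power settings changed) you can block sleep for just the duration of a wrapped command instead of changing settings globally:

```shell
# systemd-inhibit holds a sleep/idle inhibitor while the wrapped command
# runs, then releases it -- handy for a multi-day stress test. In
# practice you would wrap one of the long-running dd loops; `true` is a
# placeholder here, and the fallback keeps this runnable without systemd.
systemd-inhibit --what=sleep:idle --why="drive stress test" true \
  2>/dev/null || echo "systemd-inhibit not available here"
```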

Analyzing stress test results and running more drive tests

Install smartmontools package. It provides smartctl which can be used to read the SMART data from a drive.

apt-get install smartmontools

The Wikipedia page on SMART attributes gives you some hints on failures to look for. It mostly applies to spinning-disk HDDs, but SSDs have SMART data as well.

Note: even though your SSD has SMART data, its vendor-specific attributes might not be documented in smartmontools’ drive database, so you could see a lot of “unknown” attributes. Your option then is to research the manufacturer in the hope that they publish their SMART attributes and what they mean. However, SSDs will still typically report standard SMART attributes like Reallocated Sectors Count.

To read the SMART data from your drive run

smartctl -a /dev/XDD

If it looks good, then run a short test. This should take roughly 1-15 minutes.

smartctl -t short /dev/XDD

After about 15 minutes, run the following and look over the results.

smartctl -a /dev/XDD

If all looks good, then run the long test. This can take upwards of 4 hours depending on the hard drive vendor and how thorough their testing is. It varies.

smartctl -t long /dev/XDD

To analyze the results, once again run

smartctl -a /dev/XDD

New drives should have zero reallocated sectors, so if you want to be really conservative, replace the drive if there are any. Other errors mentioned on the SMART Wikipedia page would also prompt me to replace a drive. I’ve had drives show failure indicators in their SMART data and have replaced them. I have also had one 7200 RPM disk drive mechanically fail entirely within the 3-day stress test. Drives which have passed my 3-day stress test have lasted me several years. I have had drives fail years later, so it’s not a guarantee, but it is definitely a good practice.
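To make that conservative check mechanical, you can pull the raw value out of smartctl’s attribute table, where the raw value is the last column of each row. A sketch; the sample row below is a made-up stand-in for real `smartctl -A /dev/XDD` output, since exact rows vary by drive:

```shell
# Flag the drive if Reallocated_Sector_Ct is nonzero. The echoed row
# is a fabricated stand-in for one line of real smartctl -A output.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0'

realloc=$(echo "$sample" | awk '/Reallocated_Sector_Ct/ { print $NF }')
if [ "$realloc" -eq 0 ]; then
  echo "OK: no reallocated sectors"
else
  echo "WARN: $realloc reallocated sectors - consider replacing"
fi
```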

How did I configure RAID0

I use RAID0 on my system partition. For software RAID (of any level) you must configure a non-RAID boot partition. The non-RAID boot partition is responsible for starting your system up and initializing the RAID disks (1st- and 2nd-stage boot loader; see this link to learn more).

I documented how I configured all of this in Linux in this post of my notes. Since these are personal notes, I tend to gloss over technical details I am highly familiar with. They are also not very well organized, since it’s a scratchpad I keep adding to every time I re-install my own computer with RAID (usually after getting new drives; I can typically fish myself out of any hole I dig with system misconfiguration). The one thing I do need to reference, if you’re going to review my notes, is how to chroot into Linux from a live disk, which I do not document in my notes (I just remind myself to do it).

I recommend reading all of my notes before you try it. It’s like building Lego, where reading the instructions first makes the build easier because you understand what needs to be done.

Detecting disk failures

walteweiss asked

Since you use it, I am interested what do you do to prevent the fail, beside the 5 years thing. Do you keep just the system on your drives, or your data too? And how many drives do you have?

I’ll address these parts a little bit at a time. Preventing failures is not really about prevention; it’s about being proactive with monitoring.

On Linux, to get disk monitoring utilities you install the smartmontools package (in Debian). This gives you disk troubleshooting utilities such as smartctl. Another neat thing it provides is automatic monitoring of SMART data for all drives mounted in your computer, and smartmontools will email you when it detects a potentially imminent failure. Most people do not configure their computers to send email, but if you do, you’ll get a “your drive is about to fail, so replace it” kind of message from your own system.
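As a sketch of what that monitoring looks like, the smartd daemon that ships with smartmontools is configured via /etc/smartd.conf; a minimal hedged example (the email address is a placeholder):

```
# /etc/smartd.conf -- scan all drives, monitor all SMART attributes,
# and mail a warning on trouble (-M test sends one test mail at startup
# so you can confirm delivery works).
DEVICESCAN -a -m you@example.com -M test
```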

For this to work, you must configure your system to send email. Rather than trying to maintain your own mail server (I don’t personally), I would configure my computer to relay email through Gmail by configuring postfix (which also provides a sendmail-compatible interface).
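A sketch of the relevant postfix settings for relaying through Gmail, assuming your credentials (e.g. an app password) live in /etc/postfix/sasl_passwd; details vary, so treat this as a starting point rather than a complete recipe:

```
# /etc/postfix/main.cf (relevant lines only)
relayhost = [smtp.gmail.com]:587
smtp_tls_security_level = encrypt
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
```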

To continuously monitor your hard drives, see this great write-up on SMART monitoring with smartmontools.

There are more thorough monitoring solutions that are open source but probably not worth mentioning since my intention for this reply is personal computers and not monitoring many computers (thousands).

Do you keep just the system on your drives, or your data too?

I usually do not keep sensitive data on my RAID0 drives. Just software, operating system, steam games, and cloud-synced data such as dropbox and steam game saves. I have separate hard drives for data I really care about.

This article was last updated Mar 29, 2020.

Posted in Engineering with tags: Reddit Response, Tips