Why I Chose Non-ECC RAM for my FreeNAS


Since posting the first of my FreeNAS build blogs, I’ve had a steady stream of informed, well-wishing people leave comments about the fact that I did not select ECC RAM for my personal build or for any of my other FreeNAS build suggestions. In the comments of these blogs alone, I’ve probably written a couple of blogs’ worth of material about my thoughts on FreeNAS and ECC RAM. It made sense to consolidate all of that into its own article that I can refer to in the future.

The Risk

Let’s start off by clarifying a bit: in no way am I suggesting that ECC is an inferior choice to Non-ECC for FreeNAS. All things being equal, ECC is hands down the better choice. In fact, for almost all things computing, ECC is a better choice. That is even more true for a ZFS-based system like FreeNAS, because bad RAM with ZFS could potentially do more than just corrupt a file or two. It could render the pool unmountable, and it has. Add the lack of recovery tools for the ZFS file system and this becomes a catastrophic failure. There is a discussion thread on the FreeNAS users’ forum covering ECC vs. Non-ECC. In my opinion there’s not really a discussion at all in the thread, and opposing opinions don’t seem to be either encouraged or welcome. However, if you can get past the totalitarian tone of the thread, there is actually some good information in there about what could happen to your ZFS filesystem when the filesystem data gets corrupted by bad RAM. And that is just the scariest single risk; there are plenty of others. Due to data decay, there’s a possibility that a bit may get flipped while a file is in RAM and has not yet been written to disk. If that happens, ZFS will write that flipped bit to disk, causing undetected corruption in the file.
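To make that last risk concrete, here is a minimal toy sketch in Python. It is not how ZFS is implemented; it assumes SHA-256 as a stand-in checksum, and the helper names (`flip_bit`, `write_block`, `verify_block`) are made up for illustration. It shows why a bit flipped in RAM before the write is invisible to a checksum computed at write time, while a bit flipped on disk after the write is caught.

```python
import hashlib
from typing import Tuple


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def write_block(data: bytes) -> Tuple[bytes, str]:
    """Toy 'write': keep the data plus a checksum computed at write time."""
    return data, hashlib.sha256(data).hexdigest()


def verify_block(stored: bytes, checksum: str) -> bool:
    """Toy 'read': recompute the checksum and compare it to the stored one."""
    return hashlib.sha256(stored).hexdigest() == checksum


original = b"important family photo bytes"

# Case 1: the bit flips in RAM *before* the write. The checksum is computed
# over the already-damaged data, so verification passes anyway.
in_ram = flip_bit(original, 3)
stored, checksum = write_block(in_ram)
corrupted_silently = verify_block(stored, checksum) and stored != original

# Case 2: the bit flips on disk *after* the write. The stored checksum no
# longer matches, so the damage is detected on read.
on_disk_damaged = flip_bit(stored, 10)
detected_later = not verify_block(on_disk_damaged, checksum)

print("silent corruption from RAM flip:", corrupted_silently)
print("on-disk flip detected:", detected_later)
```

The point of the sketch is only that a write-time checksum can vouch for whatever was in memory at the moment of the write, good or bad.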

Many other comments have also pointed to the FreeNAS Hardware Recommendations, which state:

If your system supports it and your budget allows for it, install ECC RAM.

From these two sources we can more or less establish two facts:

  1. There’s a greater-than-zero chance that corruption could happen to your ZFS filesystem when using Non-ECC RAM.
  2. FreeNAS encourages you to use ECC in their hardware recommendations.

My Thoughts

As a quick catch-up, my FreeNAS box was built for a single primary purpose: to house backups from the computers in my household. After a few months of using my FreeNAS box, I decided that I wanted to upgrade the primary hard drive in my PC to an SSD. Because of SSDs’ smaller sizes, especially those in my price range, I was forced to do a little bit of housecleaning and decided that a large amount of the important data kept on my desktop would actually be safer on my FreeNAS box.

It is tempting at this point to arrive at the conclusion that ECC must be used with ZFS without exception. That raises the question: why didn’t I choose ECC? And why do I continue to suggest that others consider the same?

  1. The real-world evidence, both for and against, is anecdotal when it comes to how often memory errors actually damage ZFS and its contents.
  2. Every bad stick of RAM I’ve experienced came to me that way from the factory and could be found via some burn-in testing.
  3. ECC RAM is more expensive.
  4. There are more useful ways to protect yourself against this kind of failure.

Firstly, there’s no debating these risks. The risks are real and they do happen. Despite what you might see claimed in the comments below, I acknowledge these risks and I’m not trying to explain them away. What I do question, however, is the likelihood of these events happening and whether the price premium for ECC is worth that benefit. As best as I can find, no data like that exists yet. There are studies that show the overall data decay error rates (see below and the comments), but I’ve yet to see any kind of study connecting those errors with actual problems found on a ZFS file system. What percentage of correctable data decay errors actually wind up corrupting a file or the filesystem in ZFS? Does the amount vary with utilization? If someone has some data on that, I’d be very interested in going over it. Until that exists, I think people are well within their rights to debate the value of ECC vs. Non-ECC RAM.

A friend made an apt analogy the other day when we were discussing this blog. Consider the street in front of your house. The street is your memory throughput, the traffic in the street is your file writes to ZFS, and the pedestrians are your bits of data decay. When it comes to the sanctity of your file system and files, corruption happens each time a pedestrian is struck by a car. A study by Google, quoted on Wikipedia and in the comments of this article, found that roughly 92% (11/12) of Google’s DIMMs never experienced this kind of error. In my analogy, that’s basically saying that 11 out of 12 roads have never had any pedestrians on them. Furthermore, the approximation of 5 flipped bits per 8GB of throughput per hour is the equivalent of 5 pedestrians playing in the street. Given that the usage of the FreeNAS box in my house (and in most DIYers’ houses) is on the low side, I’m not anticipating a whole slew of file writes throughout the hour. In my mind, my utilization of FreeNAS is low, the chance of a problematic DIMM is low (~8.33%), and even if I do have one of those problematic DIMMs, the actual error rate is low too, based on Google’s data. However, I completely understand that in someone else’s shoes (like Google’s) I might have a completely different definition of low. To them, low justifiably should mean “virtually nonexistent.”
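The 11-out-of-12 figure is per DIMM, so the odds shift a little with the number of sticks installed. As a rough back-of-the-envelope sketch in Python, assuming each DIMM is independently affected with probability 1/12 (an assumption of mine, not something the study states), and using a made-up helper name `chance_any_dimm_affected`:

```python
def chance_any_dimm_affected(dimm_count: int, per_dimm_rate: float = 1 / 12) -> float:
    """Probability that at least one of dimm_count DIMMs is a problematic one.

    Assumes each DIMM is affected independently with probability per_dimm_rate.
    """
    if dimm_count < 0:
        raise ValueError("dimm_count must be non-negative")
    return 1 - (1 - per_dimm_rate) ** dimm_count


# Hypothetical stick counts for a small home NAS.
for count in (1, 2, 4):
    print(f"{count} DIMM(s): {chance_any_dimm_affected(count):.1%}")
```

This only estimates the chance of owning a problematic DIMM at all. It says nothing about how often such a DIMM would actually corrupt a write, which is the number I’d really like to see.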

Secondly, I’ve been tinkering with and building PCs for nearly two decades. I’ve worked as a technician in a computer store where we built custom computers and repaired customers’ PCs, and I’ve worked in Information Technology dealing with all sorts of server hardware. Every bad stick of RAM that I can remember was bad from the get-go. In my experience and the experience of people that I trust, I can’t come up with any scenario where a stick of RAM was good one day and then bad some day later. A reasonable amount of burn-in testing, especially a few passes of Memtest86+, should catch this bad RAM. If you were suspicious that your RAM could spontaneously go bad, it wouldn’t be that much work to periodically run Memtest86+ and make sure everything is in working order. In that ECC discussion thread from the FreeNAS forums, the ECC advocate makes an analogy:

For those that don’t want to read, just understand that ECC is one of the legs on your kitchen table, and you’ve removed that leg because you wanted to reuse old hardware that uses non-ECC RAM.

Straw Man and Slippery Slope fallacies notwithstanding, running Memtest86+ would be the equivalent of testing that table out and determining how reliably it can hold any weight.

Thirdly, ECC RAM is expensive. It’s not nearly the price premium that it used to be, but for my last build blog, ECC RAM would have cost between $150 and $200 more. Measured against the cost of the build before buying drives, that has been roughly a 40-50% premium. For a DIYer, that’s quite a price to pay to avoid a scenario of indeterminate probability. I would assert that in this regard, the FreeNAS hardware recommendations are pretty well written. They advocate ECC use, but they acknowledge that there is a budgetary reason not to use ECC and allow for the possibility that there may be a better way to spend that money to improve your NAS.
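For a sense of scale, the two figures above can be turned around to see what build cost they imply. This is only a rough arithmetic sketch in Python: it assumes the premium is measured against the cost of the build without drives, and `implied_base_cost` is a helper name made up for illustration.

```python
def implied_base_cost(extra_cost: float, premium: float) -> float:
    """Base build cost implied when extra_cost equals premium (a fraction) of the base."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return extra_cost / premium


extras = (150.0, 200.0)   # extra dollars for ECC, from the paragraph above
premiums = (0.40, 0.50)   # premium as a fraction of the pre-drive build cost

# Try every pairing, since it is not stated which extra goes with which premium.
estimates = [implied_base_cost(extra, premium) for extra in extras for premium in premiums]
low, high = min(estimates), max(estimates)

print(f"Implied pre-drive build cost: ${low:.0f} to ${high:.0f}")
```

In other words, on a build in that price range the ECC upgrade is a large slice of the budget, which is why I think the trade-off deserves a real conversation.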

And lastly, even if ECC is within your budget, I believe that you can get more utility by spending that money on something else, something that protects you from more than just dataset or file corruption caused by bad RAM or bit decay. In my case, I’m leveraging a co-located server that hosts my blog and backing up the contents of my NAS to that machine. Thankfully, because I have some nice connections, this isn’t costing me more than it would cost to put ECC hardware in my NAS. I’m essentially placing a wager that my NAS is at more risk from other threats like burglary, fire, flood, etc. than it is from memory errors that could be caught and fixed by ECC RAM. Furthermore, a backup of your data is not something that can be replaced by a RAID. Having either a backup or a RAID is good, and having both is even better.

Conclusion

I think that the benefit of using Non-ECC (cost) and the benefits of ZFS (far too many to succinctly list) add up to a very cost-effective and powerful NAS to use at home. Compared to similarly priced systems from other sources, its features and redundancy will put the other devices to shame. In my opinion, those added features and redundancy make up for the added risk of using Non-ECC RAM. I believe the investment in ECC RAM is a reasonable upgrade, but I also believe that the extra money could be spent elsewhere to help protect against other kinds of problems and potentially to help with some of the failures that could happen as a result of using Non-ECC RAM.

Update (4/19): This blog was shared on the FreeNAS forums, which resulted in some good discussion. Check it out!

Update (4/01): I re-organized this article a bit and included a bit of background to further illustrate what can go wrong when using non-ECC memory, explain how I’m using my FreeNAS system, and why I still feel comfortable living with these risks.

Update (3/29): Andrew Galloway a.k.a. nex7 wrote an excellent response to this blog that serves as a strong counterpoint to what I’ve suggested. I think that anybody who’s tempted to build a FreeNAS box and use non-ECC memory should give it the same consideration given to anything I’ve suggested in this blog. If you choose non-ECC RAM (like I did, and still would), just make sure you understand what you could be getting into. I still think that risk is minor and I think it’s manageable, but that’s ultimately your decision to make and I wouldn’t fault you for going in either direction.
