Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing
Reply #3 – 2023-09-02 08:14:43
> Is a hash code automatically created for every user file (e.g., document, photo, audio, video) the first time it gets written to the NAS? Or do you have to use some kind of app or NAS utility and enable it to generate and assign a hash code to every one of your files?

You seem to have a misunderstanding: RAID is for protecting against disk failure, not data corruption. Data corruption is extremely rare, because of the defences built into the HDD itself and the way the file system exploits them.

For example (and this is what happens on any HDD made in the last 30 years or so): when the HDD reads a sector of data off the platter, it checks the data against the CRC stored with it, and if they disagree the read is retried multiple times to obtain (if possible) a valid CRC. Chances are this will succeed, because a HDD is essentially analogue, and a sector that is going a bit marginal is not actually dead. Marginal is good – the HDD's built-in controller can then swap that sector out (invisibly to the user) for a sector from the pool of spares.

The file system (not RAID, although the file system may be stored on a RAID) is responsible for ensuring further data integrity. If a particular file system uses enhanced methods for data verification, then the check data (eg hashes) are stored by the file system in its directory tables (which are also subject to rare data corruption).

Far less rare than actual data corruption is file system inconsistency, where the file system indexing loses track (typically because of being turned off mid-update). Journalling file systems such as ext4 are designed to prevent this kind of corruption by keeping a log of pending update operations so that they can be re-run if necessary. Copy-on-write file systems such as BTRFS and ZFS get the same protection a different way: they write updates to new blocks and only then switch over to them.

> what else can I and the zfs and/or btrfs do to both bit rot?

I'm convinced you're making too much of this. All you need to do is maintain at least two copies (which you should anyway) and refresh those copies periodically (eg annually).
While refreshing, you scan all the hashes to check data integrity, and if a hash fails you scrap that copy and use the other one.

> And to minimize the risk of crashes: Replace the RAIDed HDD array every 3 (consumer) to 5 (enterprise grade) years.

Nooooooo! If and when a HDD in a RAID fails, you just replace that one and let the RAID rebuild. THAT IS THE POINT OF RAID. Failure might be total hardware failure, the HDD running out of spare sectors, or a hard failure of a sector where the data on it is not recoverable by re-reading.

If a bit really has flipped (which, as I said, is incredibly rare – not least because that's not really a "thing" in analogue such as a HDD... but *is* a thing for digital storage such as SSD), the sector CRC will fail and the sector will be unrecoverable, which should trigger RAID fallback (HDD marked as failed). More likely (for HDD) a bit will be stuck rather than flipped (which results in a write error rather than a read error, and the HDD swaps out the sector).

Modern HDDs are not prone to "crashes" if you do not exceed their physical specifications. A disk crash is when the head makes contact with the platter, thereby damaging the magnetic coating (normally the heads ride on a thin film of whatever gas is sealed in the HDD casing). Crashes occur if you subject the running HDD to jolts, or kill the power without it parking its heads first (modern drives automatically retract their heads on power loss), or if the mechanism becomes so worn that fragments get caught under the head. 3-5 years is way too short for that, unless the HDD was faulty in the first place. Never moving the HDD with the power on, and performing a proper system shutdown, will pretty much ensure a crash never happens.

You cannot rely on RAID to preserve your data. A meteorite might fall on it. A power glitch might destroy the whole RAID (that's happened to me).
Data is never truly safe unless it is replicated in at least three locations, and the storage medium is refreshed periodically.

If you have a backup (kept in more than one physical location, just in case a meteorite falls on it), then you don't really need RAID at all – just the means to check your working data isn't corrupt so you can substitute from backup if it is. Hashing does that (see below). If it's digital media (audio, video etc), then usually you'll know if it's corrupt anyway – and if you don't notice, does it matter? IMO it's not worth going to extremes to hide corruption at all costs, because then you generally won't know whether it's happening in the background, and therefore how much of a problem corruption really is. All you really need is to be able to restore the data from KNOWN GOOD VERIFIED backups if and when you notice a corruption.

> And where are those codes stored? Inside of the file’s own container? Or are all user file hash codes stored someplace else? In a “hash table” and/or on a drive partition on the RAID drive array? ... But as hash functions are never perfect, and while rare, data “collisions” are inevitable.

Where the hashes are stored depends on what generated them:

The sector CRC (a form of hash) is physically stored on the HDD platter by the HDD's built-in controller, invisibly to the user/OS/file system.

A checksumming file system (such as BTRFS or ZFS) stores its hashes within the indexing structure (a file system is essentially a database).

If you use a utility to create and check hashes yourself, the hashes will be stored in the data files for that utility.

Personally, I like a separate utility because I can control what happens when and can see what's going on. "Are any files in my working set corrupt?" Run the hash checker to find out. "Is this backup valid?" Run the hash checker to find out.

Hashes are not perfect, no... but near enough for anybody.
The question is not: "is it impossible that another combination of bits will produce an identical hash?", but rather "is it possible for any non-contrived corruption of the data to produce an identical hash?" – and the answer to that is, for all practical purposes, "NO!".

And, just as a reminder (I wouldn't mention this again if you hadn't started a new thread): RAID management etc etc etc is a HUGE OVERHEAD. It would be far easier just to have good backups, and if one of those backups were on a subscription cloud service you'd have the best of both worlds (and probably cheaper).
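
To make the "separate hash checker" idea concrete, here is a minimal Python sketch of what such a utility does. It is not any particular product: the function names (`file_sha256`, `build_manifest`, `verify`) and the choice of SHA-256 are my own assumptions for illustration. It records one hash per file, and later reports any file that is missing or whose hash no longer matches.

```python
import hashlib
from pathlib import Path

CHUNK = 1024 * 1024  # read in 1 MiB pieces so large videos don't fill RAM


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a single file."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (path relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or file_sha256(target) != expected:
            bad.append(rel)
    return bad
```

You would call `build_manifest` once on a known-good copy, save the result somewhere outside that folder (for example as JSON), and then run `verify` against the working set or against each backup during the periodic refresh. Any path it returns is the copy to scrap and restore from the other one. Note that it only checks files listed in the manifest, so files added later need the manifest rebuilt.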