First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

In pursuit of building (or rather, having my local IT guy build) my first NAS, I've sunk my newbie brain as deep as it can go into learning how best to use it after my builder does all the hardware and OS installs and then walks me through the GUI.

Of course, beyond basic storage capacity and drive redundancy to prevent user file loss, a NAS (or any server) and its file system (ZFS or Btrfs) are only as useful as their ability to prevent data corruption. Save for the crazy maths (and terms like "pool," which seems to have multiple meanings in the data storage biz), these articles were helpful for learning about hash functions and the tables of hash codes ("hashes") they (apparently?) create for each document, photo, audio, or video file: https://en.wikipedia.org/wiki/Hash_function https://en.wikipedia.org/wiki/Checksum

But please help me with these questions:

Is a hash code automatically created for every user file (e.g., document, photo, audio, video) the first time it gets written to the NAS? Or do you have to use some kind of app or NAS utility and enable it to generate and assign a hash code to every one of your files?

And where are those codes stored? Inside of the file’s own container? Or are all user file hash codes stored someplace else? In a “hash table” and/or on a drive partition on the RAID drive array?

Are these hash codes used by the ZFS and Btrfs file systems for routine data scrubbing?
 
https://blog.synology.com/how-data-scrubbing-protects-against-data-corruption
https://www.qnap.com/en/how-to/tutorial/article/how-to-prevent-silent-data-corruption-by-using-data-scrubbing-schedule
https://par.nsf.gov/servlets/purl/10100012

Then, as mentioned in the above links, following data scrubbing, are these hash codes also usually used for routine RAID scrubbing?

But for both data and RAID scrubbing, is data integrity ensured by comparing each file's current hash code against its original (first-ever-created) hash code, stored wherever? And if the system's comparison shows that the codes differ, then one or more of the file's bits have flipped, so it knows that the file is corrupt?

If yes, then at that point will it flag me and ask if I want the system to attempt to repair it?

If I say yes, then it will try to overwrite the corrupt file with the mirrored copy stored on a redundant (e.g., RAID 5) drive.

CAUTION: As RAID scrubbing puts mechanical stress and heat on HDDs, the rule of thumb seems to be to schedule it once monthly, and only when drives are idle, so no user-triggered read/write errors can occur.
https://arstechnica.com/civis/threads/probably-a-dumb-question-raid-data-scrubbing-bad-for-disks-if-done-too-frequently.1413781/

Beyond scrubbing, what else can I and/or ZFS/Btrfs do to combat bit rot?

And to minimize the risk of crashes:

Replace the RAIDed HDD array every 3 (consumer) to 5 (enterprise grade) years.

Do not install any software upgrade for the NAS until it's been around long enough for the NAS brand and the user community forum to declare it bug-free.

What else can I do to minimize the risk of crashes?


Finally, when backing up from my (main) NAS to an (ideally identical??) NAS, Kunzite says here “…and I'm check summing my backups...”
https://forum.qnap.com/viewtopic.php?t=168535

But hash functions are never perfect, and, while rare, data "collisions" are inevitable. https://en.wikipedia.org/wiki/Hash_collision So, as those hash algorithms are used for data and RAID scrubbing, they are evidently also used for checksumming to ensure that data transfers from the NAS to a backup device happen without file corruption.

Apparently, CRC-32 is among the least collision-resistant hash algorithms. https://en.wikipedia.org/wiki/Hash_collision#CRC-32

Thus, for backups from the main NAS to a backup NAS, how much more worthwhile than MD5 is the SHA-256 hash function (algorithm) for preventing collisions and verifying user files' data integrity via checksumming, given that it uses twice the number of bits?

But if it's not much more advantageous even for potentially large audio files https://www.hdtracks.com/ , would SHA-256 be a lot more so than MD5 for checksumming during backups of DVD movie rips saved as uncompressed MKV and/or ISO files, given that video bitrates are so much bigger than audio's?

And what would be a recommended checksum calculator app? https://www.lifewire.com/what-does-checksum-mean-2625825#toc-checksum-calculators

But if the app returns a checksum error between the file on my main NAS and the copy to be updated on my backup NAS, how then to repair the corrupt file?

Again, by using the file’s original hash code (stored some place) created the first time that it was ever stored in the NAS?

If yes, would that app then prompt me to choose to have the system repair the file?



Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #1
If you're asking about the hash codes for the file system, they're transparent to the user.  For ZFS, hash codes are stored in a Merkle tree.
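
If it helps to see the shape of it, here's a toy Python sketch of a Merkle tree (illustrative only, nothing like ZFS's actual on-disk code): every data block gets a hash, pairs of hashes get hashed together, and so on up to a single root hash that covers everything, so one flipped bit anywhere changes the root.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash each block, then repeatedly hash adjacent pairs up to one root."""
    level = [sha256(b) for b in blocks]                  # leaf hashes
    while len(level) > 1:
        if len(level) % 2:                               # odd count: carry last
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block0", b"block1", b"block2", b"block3"]
root = merkle_root(blocks)
blocks[2] = b"block2 with a flipped bit"                 # simulate corruption
assert merkle_root(blocks) != root                       # root no longer matches
```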

Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #2
Is a hash code automatically created for every user file (e.g., document, photo, audio, video) the first time it gets written to the NAS? Or do you have to use some kind of app or NAS utility and enable it to generate and assign a hash code to every one of your files?
When enabled, hashes are calculated and stored when the files are first created and updated any time the files are edited. Hashes are enabled by default in both ZFS and BTRFS.

And where are those codes stored? Inside of the file’s own container? Or are all user file hash codes stored someplace else? In a “hash table” and/or on a drive partition on the RAID drive array?
What do you mean by "the file’s own container"? Whatever storage space you're using to hold your files, some of it will be set aside to hold metadata. Metadata includes the hashes as well as things like file names and modification times.

Are these hash codes used by the ZFS and Btrfs file systems for routine data scrubbing?
Yes.

Then, as mentioned in the above links, following data scrubbing, are these hash codes also usually used for routine RAID scrubbing?
Plain RAID does not have hashes. That's a feature specific to ZFS and BTRFS.

But for both data and RAID scrubbing, is data integrity ensured by comparing each file's current hash code against its original (first-ever-created) hash code, stored wherever? And if the system's comparison shows that the codes differ, then one or more of the file's bits have flipped, so it knows that the file is corrupt?
ZFS and BTRFS ensure integrity by reading the data, calculating the hash, and comparing the calculated hash against the previously-stored hash. If the hashes don't match, bits have flipped and therefore the file is corrupt.

Ordinary RAID does not have hashes; scrubbing involves reading every disk's copy of the data and comparing them. If something doesn't match, bits have flipped and the file is corrupt.

If yes, then at that point will it flag me and ask if I want the system to attempt to repair it?
ZFS and BTRFS will automatically repair the file without asking you, as long as there's enough redundancy available. Since it's completely automatic, you won't know it happened unless you set up something like email alerts to inform you.

Most flavors of ordinary RAID can't repair corrupt data.
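
If it helps, here's a toy Python sketch of what ZFS/BTRFS-style self-healing boils down to (illustrative only; the real thing works per block, inside the kernel): recompute each copy's hash, find a copy that still matches the stored hash, and quietly rewrite the bad copies from it.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(copies: list[bytearray], stored_hash: str) -> bool:
    """Verify mirrored copies against a stored hash; repair bad copies
    from a good one. Returns False if no good copy survives."""
    good = next((c for c in copies if sha256(bytes(c)) == stored_hash), None)
    if good is None:
        return False                       # unrecoverable: every copy is bad
    for c in copies:
        if sha256(bytes(c)) != stored_hash:
            c[:] = good                    # silent, automatic repair
    return True

data = bytearray(b"important file contents")
stored = sha256(bytes(data))
mirror = bytearray(data)
mirror[3] ^= 0x01                          # simulate a flipped bit on one disk
assert scrub([data, mirror], stored)
assert bytes(mirror) == bytes(data)        # mirror repaired from the good copy
```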

Thus, for backups from the main NAS to a backup NAS, how much more worthwhile than MD5 is the SHA-256 hash function (algorithm) for preventing collisions and verifying user files' data integrity via checksumming, given that it uses twice the number of bits?

But if it's not much more advantageous even for potentially large audio files https://www.hdtracks.com/ , would SHA-256 be a lot more so than MD5 for checksumming during backups of DVD movie rips saved as uncompressed MKV and/or ISO files, given that video bitrates are so much bigger than audio's?
The odds of random corruption causing a collision in either of those hash functions are so ridiculously low that you're better off using MD5 because it's faster.
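
If you want to see the speed difference (and the digests) for yourself, here's a small sketch using Python's standard hashlib; movie.mkv is just a placeholder for any large file:

```python
import hashlib
import time

def file_hash(path: str, algo: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through a hash function without loading it all into RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

path = "movie.mkv"  # placeholder: any large file
for algo in ("md5", "sha256"):
    start = time.perf_counter()
    digest = file_hash(path, algo)
    print(f"{algo}: {digest}  ({time.perf_counter() - start:.1f}s)")
# Either way, the chance of random corruption producing the same digest is
# astronomically small (on the order of 2**-128 for MD5, 2**-256 for SHA-256).
```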

But if the app returns a checksum error between the file on my main NAS and the copy to be updated on my backup NAS, how then to repair the corrupt file?

Again, by using the file’s original hash code (stored some place) created the first time that it was ever stored in the NAS?
Yep.

Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #3
Is a hash code automatically created for every user file (e.g., document, photo, audio, video) the first time it gets written to the NAS? Or do you have to use some kind of app or NAS utility and enable it to generate and assign a hash code to every one of your files?
You seem to have a misunderstanding: RAID is for protecting against disk failure, not data corruption.  Data corruption is extremely rare, because of the defences built into the HDD itself and the way the file system exploits that.  For example (and this is what happens on any HDD made in the last 30 years or so): when the HDD reads a sector of data off the platter, it is compared with the CRC (stored with the data) and if it disagrees the read is retried multiple times to obtain (if possible) a valid CRC.  Chances are this will succeed, because a HDD is essentially analogue and if a sector is going a bit marginal it won't be actually dead.  Marginal is good – the HDD's built-in controller can then swap that sector out (invisible to the user) for a sector in the pool of spares.
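
To make the CRC idea concrete, here's a trivial sketch using Python's standard zlib.crc32 (purely illustrative; the drive does this in firmware, per sector): one flipped bit changes the checksum, and that mismatch is what triggers the re-read.

```python
import zlib

sector = bytearray(b"x" * 512)            # one 512-byte sector's worth of data
stored_crc = zlib.crc32(sector)           # CRC recorded alongside the sector

sector[100] ^= 0x04                       # simulate a single flipped bit
assert zlib.crc32(sector) != stored_crc   # read fails CRC -> retry, then remap
```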

The file system (not RAID, although the file system may be stored on a RAID) is responsible for ensuring further data integrity, and if a particular file system uses enhanced methods for data verification then the check data (eg hashes) are stored by the file system in its directory tables (which are also subject to rare data corruption).

Far less rare than actual data corruption is file system inconsistency, where the file system indexing loses track (typically because of being turned off mid-update).  Journalling file systems such as BTRFS are designed to prevent this kind of corruption by keeping a log of pending update operations so that they can be re-run if necessary.
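
Here's a toy Python sketch of the journalling idea (illustrative only; a real file system journals block-level metadata updates, not whole files): the intent is made durable before the update happens, so a crash in the middle can always be replayed.

```python
import json
import os

LOG = "journal.log"

def journaled_write(path: str, data: str) -> None:
    """Record the intended update durably, apply it, then clear the journal."""
    with open(LOG, "w") as log:
        json.dump({"path": path, "data": data}, log)
        log.flush()
        os.fsync(log.fileno())            # journal entry hits disk first
    with open(path, "w") as f:
        f.write(data)                     # the actual update
        f.flush()
        os.fsync(f.fileno())
    os.remove(LOG)                        # update complete: clear the journal

def replay() -> None:
    """Run at mount time: re-apply any pending operation left in the journal.
    Re-applying a completed update is harmless (the operation is idempotent)."""
    if os.path.exists(LOG):
        with open(LOG) as log:
            entry = json.load(log)
        with open(entry["path"], "w") as f:
            f.write(entry["data"])
        os.remove(LOG)
```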

Quote
what else can I and/or ZFS/Btrfs do to combat bit rot?
I'm convinced you're making too much of this.  All you need to do is maintain at least two copies (which you should anyway) and refresh those copies periodically (eg annually).  While refreshing, you scan all the hashes to check data integrity and if a hash fails scrap that copy and use the other one.

Quote
And to minimize the risk of crashes:

Replace the RAIDed HDD array every 3 (consumer) to 5 (enterprise grade) years.
Nooooooo!  If and when a HDD in a RAID fails, you just replace that one and let the RAID rebuild.  THAT IS THE POINT OF RAID.  Failure might be total hardware failure, or if a HDD runs out of spare sectors, or a hard failure of a sector where the data on it is not recoverable by re-reading.  If a bit really has flipped (which, as I said, is incredibly rare – not least because that's not really a "thing" in analogue such as a HDD... but *is* a thing for digital storage such as SSD), the sector CRC will have failed and the sector be unrecoverable, which should trigger RAID fallback (HDD marked as failed).  More likely (for HDD) a bit will be stuck rather than flipped (which results in a write error rather than a read error, and the HDD swaps out the sector).

Modern HDDs are not prone to "crashes" if you do not exceed their physical specifications.  A disk crash is when the head makes contact with the platter, thereby damaging the magnetic coating (normally they are riding on a thin film of whatever gas is sealed in the HDD casing).  Crashes occur if you subject the running HDD to jolts, or kill the power without it parking its heads first (modern drives automatically retract their heads on power loss), or if the mechanism becomes so worn that fragments get caught under the head.  3-5 years is way too short for that, unless the HDD was faulty in the first place.  Never moving the HDD with the power on, and performing a proper system shutdown, will pretty much ensure a crash never happens.

You cannot rely on RAID to preserve your data.  A meteorite might fall on it.  A power glitch might destroy the whole RAID (that's happened to me).  Data is never truly safe unless it is replicated in at least three locations, and the storage medium is refreshed periodically.  If you have a backup (kept in more than one physical location, just in case a meteorite falls on it), then you don't really need RAID at all – just the means to check your working data isn't corrupt so you can substitute from backup if it is.  Hashing does that (see below).

If it's digital media (audio, video etc), then usually you'll know if it's corrupt anyway – and if you don't notice, does it matter?  IMO it's not worth going to extremes to hide corruption at all costs, because you generally won't know whether it's happening in the background and therefore how much of a problem corruption really is.  All you really need is to be able to restore the data from KNOWN GOOD VERIFIED backups if and when you notice a corruption.

And where are those codes stored? Inside of the file’s own container? Or are all user file hash codes stored someplace else? In a “hash table” and/or on a drive partition on the RAID drive array?
...
But hash functions are never perfect, and, while rare, data "collisions" are inevitable.
Where the hashes are stored depends on what generated them.  The sector CRC (a form of hash) is physically stored on the HDD platter by the HDD's built-in controller, invisibly to the user/OS/file system.  A journalling file system stores its hashes within the indexing structure (a file system is essentially a database).  If you use a utility to create and check hashes yourself, the hashes will be stored in the data files for that utility.  Personally, I like a separate utility because I can control what happens when and can see what's going on.

"Are any files in my working set corrupt?"  Run the hash checker to find out.

"Is this backup valid?"  Run the hash checker to find out.

Hashes are not perfect, no... but near enough for anybody.  The question is not: "is it impossible that another combination of bits will produce an identical hash", but rather "is it possible for any non-contrived corruption of the data to produce an identical hash?" – and the answer to that is "NO!".

And, just as a reminder (I wouldn't mention this again if you hadn't started a new thread): RAID management etc etc etc is a HUGE OVERHEAD.  It would be far easier just to have good backups, and if one of those backups were on a subscription cloud service you've got the best of both worlds (and probably cheaper).
It's your privilege to disagree, but that doesn't make you right and me wrong.

Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #4
What do you mean by "the file’s own container"? Whatever storage space you're using to hold your files, some of it will be set aside to hold metadata. Metadata includes the hashes as well as things like file names and modification times.
"Container" being the file's FLAC, WAV, MKV, ISO, et al.

So glad my presumptions were correct about the answers to the rest of my questions. Thanks.

Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #5
You seem to have a misunderstanding: RAID is for protecting against disk failure, not data corruption.  Data corruption is extremely rare, because of the defences built into the HDD itself and the way the file system exploits that.  For example (and this is what happens on any HDD made in the last 30 years or so): when the HDD reads a sector of data off the platter, it is compared with the CRC (stored with the data) and if it disagrees the read is retried multiple times to obtain (if possible) a valid CRC.  Chances are this will succeed, because a HDD is essentially analogue and if a sector is going a bit marginal it won't be actually dead.  Marginal is good – the HDD's built-in controller can then swap that sector out (invisible to the user) for a sector in the pool of spares.

The file system (not RAID, although the file system may be stored on a RAID) is responsible for ensuring further data integrity, and if a particular file system uses enhanced methods for data verification then the check data (eg hashes) are stored by the file system in its directory tables (which are also subject to rare data corruption).

Far less rare than actual data corruption is file system inconsistency, where the file system indexing loses track (typically because of being turned off mid-update).  Journalling file systems such as BTRFS are designed to prevent this kind of corruption by keeping a log of pending update operations so that they can be re-run if necessary.
 
And, just as a reminder (I wouldn't mention this again if you hadn't started a new thread): RAID management etc etc etc is a HUGE OVERHEAD.  It would be far easier just to have good backups, and if one of those backups were on a subscription cloud service you've got the best of both worlds (and probably cheaper).
I understood ~90% of everything you said about RAID, though what I read here about "RAID scrubbing", as opposed to "data scrubbing", had apparently confused me.
https://blog.synology.com/how-data-scrubbing-protects-against-data-corruption

It seems what Synology calls "RAID scrubbing" is what a mirrored drive array does to ensure data consistency (but not bit accuracy) among the drives, so that if a drive does die, its data (however bit-accurate it may be) will be recovered and ready to be redistributed across the rebuilt array when the dead drive is replaced.

As far as installing and maintaining in-home hardware for data storage and protection against loss and corruption goes, rather than adding a second in-home RAID-array NAS box to use for backing up the main NAS, you seem to suggest only using a cloud service for backup.

If yes, and assuming their hardware and software configurations and routines are much better than what I can do on my own to store and protect my data, which would you say are among the top five most reliable cloud service brands?

And how safe would going with them be if my only internet connection is a wi-fi hotspot via my iPhone?

Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #6
"Container" being the file's FLAC, WAV, MKV, ISO, et al.
Then no, that is not where ZFS/BTRFS hashes are stored. Files may contain their own hashes; for example, FLAC has an optional (but usually present) MD5 hash of the stored audio, so you can easily verify whether the audio portion of a FLAC file has become corrupt for some reason.
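
For instance, assuming the third-party mutagen library is available (pip install mutagen), you can read that stored MD5 straight out of the STREAMINFO block; actually verifying it means decoding the audio, which the reference flac tool does with flac -t:

```python
from mutagen.flac import FLAC  # third-party: pip install mutagen

f = FLAC("song.flac")          # placeholder path to any FLAC file
stored_md5 = format(f.info.md5_signature, "032x")
print(f"MD5 of the unencoded audio, as stored in STREAMINFO: {stored_md5}")
# An all-zero value means the encoder didn't store one. Verifying the hash
# requires decoding the audio, e.g. with the reference tool: flac -t song.flac
```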

And how safe would going with them be if my only internet connection is a wi-fi hotspot via my iPhone?
Internet protocols have sufficient error detection and retransmission built in to handle any connection you use.

Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #7
As far as installing and maintaining in-home hardware for data storage and protection against loss and corruption goes, rather than adding a second in-home RAID-array NAS box to use for backing up the main NAS, you seem to suggest only using a cloud service for backup.
I didn't say "only"!  I think I said that data is not truly safe unless replicated in at least three different physical locations.  Whether you choose cloud storage as one of those physical locations is up to you, but it comes with the advantage of very good data integrity.

However, what I was querying was the need for a NAS at all!  Let cloud storage provide the NAS, that's what you would be paying the subscription for.  It saves all the bother.

Your main working copy would simply be a ZFS/BTRFS partition, which you diligently spawn backups from (say weekly) and rotate your backups to your other physical locations.  Auto-syncing with cloud storage eliminates the routine of backing up to cloud, but not to your third backup (which needs to be physically disconnected anyway).

Any unrecoverable error reported by ZFS/BTRFS which damages a file, and you simply swap in the file from cloud/backup.  If the ZFS/BTRFS partition (or the whole drive) dies, you rebuild from backup (for security you would probably only have the one partition on the HDD).  Any corruption caused by (say) a virus, or operator error, and you ditch the file or the entire set and recover from backup (a NAS would not protect you from that!).  All these events are rare (except operator error), so it doesn't seem worth going to extremes to auto-repair errors/corruption.


Quote
If yes, and assuming their hardware and software configurations and routines are much better than what I can do on my own to store and protect my data, which would you say are among the top five most reliable cloud service brands?
I have no idea.  All cloud storage services offer far greater levels of data security than can be conveniently managed at home, it's just a question of who you think most deserves your dollars.


Quote
And how safe would going with them be if my only internet connection is a wi-fi hotspot via my iPhone?
"Safety" is not an issue.  The data (in cloud storage) is just as safe regardless of your means to access it, so I take it you are concerned about your ability to deliver and recover it.  The Internet is designed to operate robustly over unreliable connections.  If a received packet does not pass its CRC, retransmission is requested until the packet is received accurately (I'm talking about TCP, not UDP which is for speed rather than accuracy).  That's more reliable than USB!

I don't mean to pry, but as you are so concerned about your data that you are contemplating/willing to pay for subcontracted NAS building and the overheads of running a NAS including pre-emptive HDD swapping, is it a false economy not to have broadband?
It's your privilege to disagree, but that doesn't make you right and me wrong.

Re: First NAS: Newbie Q&A on Hashing, Data & RAID Scrubbing, Check Summing

Reply #8
  I don't mean to pry, but as you are so concerned about your data that you are contemplating/willing to pay for subcontracted NAS building and the overheads of running a NAS including pre-emptive HDD swapping, is it a false economy not to have broadband?.........

All cloud storage services offer far greater levels of data security than can be conveniently managed at home, it's just a question of who you think most deserves your dollars. 
In that case, which might be among the top five most reputable cloud storage services that individuals, such as members of forums like this, would typically hire to back up their personal data?