Topic: deduplicate files -> and create symlinks / softlinks instead

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #25
You can use the HardLinkShellExt tool on Windows to find duplicates
Huh. I have it installed, but ... does it find duplicates?
Anyway, there is no chance it will work on tagged files. Solutions to deduplicate identical audio are already posted here.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #26
The reason I have "wishlisted" that feature is not deduplication though - it is to expand padding to take up all the space that the file is going to use on disk anyway. Bigger padding and fewer full rewrites, at no cost.
But when you do have deduplication, aligning the start of the audio makes the most sense because it ensures the audio will never share a block with metadata. ZFS won't waste an entire recordsize block for the end of the file as long as you enable some form of compression.
If you have an "ideal" file system that can sneak in more blocks "midway" in a file - and a tagger that can do it and thus "maintain deduplication" - then that is the way, especially if you don't know the audio length upon encoding. (If you knew, you might as well write an encoder that starts at the end.)

But if you don't have that luxury, then starting audio with z bytes zeroes and a bytes audio shouldn't be much different to the file system than ending it with a bytes audio and z bytes zeroes. You can always keep those z bytes if you want, but you have the additional advantage of writing tag data into it if and when you are in a situation where doing so saves a full tag rewrite. You simply have z bytes more padding, and that is not a downside.

Alternatively, of course, you can use audio formats with (APEv2) tags at the end. There are advantages and disadvantages to either solution.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #27
If you have an "ideal" file system that can sneak in more blocks "midway" in a file - and a tagger that can do it and thus "maintain deduplication" - then that is the way, especially if you don't know the audio length upon encoding. (If you knew, you might as well write an encoder that starts at the end.)
There is no such ideal filesystem. Which is fine, because audio files are relatively small, and tags are seldom modified. With ZFS, you maintain deduplication as long as the metadata and padding at the start of the file grow (or shrink) in exact multiples of the recordsize, even if you have to rewrite the entire file to do it. The default recordsize is 128kiB.
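
A minimal sketch of that alignment arithmetic, assuming the tagger is free to choose its padding length. The names padding_for_alignment and audio_offset are hypothetical, and the padding is counted as a plain byte run (a real FLAC PADDING block also has its own small header, which this sketch lumps into the padding). Only the 128kiB default recordsize comes from the post.

```python
RECORDSIZE = 128 * 1024  # ZFS default recordsize in bytes, as stated in the post


def padding_for_alignment(metadata_len: int, recordsize: int = RECORDSIZE) -> int:
    """Return the padding bytes needed so the audio starts on a record boundary.

    metadata_len is the size in bytes of everything that precedes the padding
    (hypothetical input; a real tagger would measure its own headers and tags).
    """
    if metadata_len < 0:
        raise ValueError("metadata_len must be non-negative")
    # Python's modulo of a negative number gives the distance up to the
    # next multiple of recordsize (or 0 if already on a boundary).
    return -metadata_len % recordsize


def audio_offset(metadata_len: int, recordsize: int = RECORDSIZE) -> int:
    """Offset of the first audio byte once the padding has been added."""
    return metadata_len + padding_for_alignment(metadata_len, recordsize)
```

With this scheme the audio offset only ever moves in whole multiples of the recordsize, which is the condition described above for the audio blocks to stay deduplicated after a retag.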

But if you don't have that luxury, then starting audio with z bytes zeroes and a bytes audio shouldn't be much different to the file system than ending it with a bytes audio and z bytes zeroes. You can always keep those z bytes if you want, but you have the additional advantage of writing tag data into it if and when you are in a situation where doing so saves a full tag rewrite. You simply have z bytes more padding, and that is not a downside.
When you're using ZFS deduplication, the difference is that if your tags grow into the same 128kiB block as the audio, that entire 128kiB block can no longer be deduplicated. Padding compresses away to nothing, so you'll save more space by keeping the audio blocks entirely separate from the metadata blocks.
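
To make the block-sharing argument concrete, here is a toy simulation, not ZFS itself: it hashes fixed-size blocks of two in-memory "files" that carry the same audio but different tag lengths, and counts how many blocks match. The layout (tags, zero padding, audio), the helper names, and the small 4096-byte block size are all assumptions for illustration.

```python
import hashlib
import os


def block_hashes(data: bytes, recordsize: int) -> list[bytes]:
    """Hash each fixed-size block, roughly how block-level dedup compares data."""
    return [
        hashlib.sha256(data[i:i + recordsize]).digest()
        for i in range(0, len(data), recordsize)
    ]


def shared_blocks(a: bytes, b: bytes, recordsize: int) -> int:
    """Count blocks of b whose content also occurs as a block of a."""
    seen = set(block_hashes(a, recordsize))
    return sum(1 for h in block_hashes(b, recordsize) if h in seen)


def build_file(tags: bytes, audio: bytes, recordsize: int, aligned: bool) -> bytes:
    """Toy layout: tags, optional zero padding up to a block boundary, then audio."""
    pad = (-len(tags) % recordsize) if aligned else 0
    return tags + bytes(pad) + audio


RECORDSIZE = 4096                     # small toy value; the ZFS default is 128kiB
audio = os.urandom(10 * RECORDSIZE)   # random bytes as a stand-in for encoded audio
tags_a = b"A" * 100
tags_b = b"B" * 300                   # same audio, different tag length

aligned = shared_blocks(
    build_file(tags_a, audio, RECORDSIZE, True),
    build_file(tags_b, audio, RECORDSIZE, True),
    RECORDSIZE,
)
unaligned = shared_blocks(
    build_file(tags_a, audio, RECORDSIZE, False),
    build_file(tags_b, audio, RECORDSIZE, False),
    RECORDSIZE,
)
print("aligned:", aligned, "unaligned:", unaligned)
```

In the aligned layout only the first block (tags plus padding) differs between the two copies, so all ten audio blocks match. Without alignment the audio is shifted by a different amount in each file, so no block lines up.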

I'm not familiar with APEv2, but a quick look at the wiki suggests using it at the end of the file would prevent deduplication of the last block of the audio.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #28
When you're using ZFS deduplication, the difference is that if your tags grow into the same 128kiB block as the audio,
Which you never have to if you don't want to. You can always choose to rewrite the entire file (like your suggestion would force you to).
But when padding to EOF, there will be situations where you can choose to do so - and instead of a full rewrite, take a size penalty of a block size (per file-pair).
Big deal? If you are running compression, like you suggest below, it is going to be less, since part of that un-deduplicated block will be padding.
And if you are still adamant about that size, you can always run nightly metaflac jobs. They would do precisely the same full file rewrites and recoup the space then - except not while you are sitting there waiting.
Also, since you will average to the same number of non-deduplicated blocks, you have to use compression to get the size gain anyway.

Let's get real here: A typical FLAC track is like 25 to 30 MB. That is about two hundred ZFS blocks. So with two audio duplicates, there is a chance that retagging one of them will cost half a percent minus the padding - if you use compression; if you don't, then there is a chance that you started out with one block less and lost it.
And when it does cost you a block, you save time while working - and you can reclaim that by a nightly metaflac job.
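
A quick check of those numbers, assuming "MB" means 10^6 bytes and the default 128kiB recordsize. The function names are made up for this sketch.

```python
RECORDSIZE = 128 * 1024  # bytes, ZFS default recordsize


def blocks_in(file_size_bytes: int, recordsize: int = RECORDSIZE) -> int:
    """Number of records needed to hold the file (ceiling division)."""
    return -(-file_size_bytes // recordsize)


def one_block_cost_percent(file_size_bytes: int, recordsize: int = RECORDSIZE) -> float:
    """Share of the file, in percent, represented by a single record."""
    return 100 / blocks_in(file_size_bytes, recordsize)


for size_mb in (25, 30):
    size = size_mb * 10**6  # assumption: decimal megabytes
    print(size_mb, "MB:", blocks_in(size), "blocks,",
          round(one_block_cost_percent(size), 2), "% per block")
```

That gives 191 blocks for 25 MB and 229 blocks for 30 MB, so losing deduplication on one block costs roughly 0.44 to 0.52 percent of the file, in line with "half a percent".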

that entire 128kiB block can no longer be deduplicated. Padding compresses away to nothing, so you'll save more space by keeping the audio blocks entirely separate from the metadata blocks.
Enabling compression on a file system where most of the content is incompressible? How good is ZFS at giving up on compressing a block? This could become more expensive in CPU, than half a percent disk size?
(I never fiddled with ZFS compression, it was more than heavy enough anyway back in the day.)

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #29
situations where you can choose to do so
"not" to do so.
Update the metadata section, and if it goes into the block that ends with the first audio - do that.

But, but: If FS deduplication spends effectively no additional space on a block of zeroes, cannot you just pad megabytes between tags and audio and they will be deduplicated away?
But then "copying" to a different file system (w/o compression) should likely be done with metaflac instead of copy.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #30
ZFS has supported block cloning (reflinks) since v2.2. My understanding is that there's far less overhead than with deduplication. I know XFS and BTRFS also support reflinks. Any time a file is copied with cp --reflink, identical blocks should only be referenced, without taking up additional space.

I believe ZFS uses LZ4 to test a block for compressibility. If the block can be compressed by more than 12.5%, ZFS will compress that block with the chosen compression algorithm. There will probably be no noticeable overhead with lz4 or zstd, but the compression ratio won't be much (it's only 1.02 on my media files).
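
A rough sketch of that "give up if it doesn't save enough" rule, purely to illustrate the idea. It is not how ZFS is implemented: zlib from the Python standard library is used as a stand-in compressor, the function names are hypothetical, and the 12.5% threshold is taken from the description above.

```python
import os
import zlib

MIN_SAVING = 0.125  # the 12.5% threshold mentioned above


def keep_compressed(raw_len: int, compressed_len: int,
                    min_saving: float = MIN_SAVING) -> bool:
    """True if compression saved more than min_saving of the original size."""
    return compressed_len < raw_len * (1 - min_saving)


def store_block(raw: bytes) -> bytes:
    """Return the bytes that would be stored for this block in the toy model."""
    candidate = zlib.compress(raw)  # stand-in compressor, NOT what ZFS uses
    return candidate if keep_compressed(len(raw), len(candidate)) else raw


padding_block = bytes(128 * 1024)       # all zeroes, like tag padding
audio_block = os.urandom(128 * 1024)    # random bytes as a stand-in for encoded audio

print(len(store_block(padding_block)), len(store_block(audio_block)))
```

The zero-filled block shrinks to a tiny fraction of its size, while the random block fails the threshold and is kept as-is, which matches the low overall ratio reported for media files.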

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #31
Big deal?
If you've reached the point where this deduplication scheme makes more sense than buying more storage, it probably is a big deal to you.

Enabling compression on a file system where most of the content is incompressible? How good is ZFS at giving up on compressing a block? This could become more expensive in CPU, than half a percent disk size?
ZFS gives you the choice of several different compression strategies, so you can tune it according to how much CPU time you're willing to spend. For this particular setup of only compressed lossless audio files, ZLE is probably the best choice; it'll compress the padding without attempting to compress anything else.

But, but: If FS deduplication spends effectively no additional space on a block of zeroes, cannot you just pad megabytes between tags and audio and they will be deduplicated away?
But then "copying" to a different file system (w/o compression) should likely be done with metaflac instead of copy.
You could do that, but I'm not sure why you'd put so much effort into speeding up tagging when you'll probably spend more time listening to your music than tagging it.