HydrogenAudio

Hydrogenaudio Forum => General Audio => Topic started by: TheJJJ-42 on 2024-02-04 18:52:53

Title: deduplicate files -> and create symlinks / softlinks instead
Post by: TheJJJ-42 on 2024-02-04 18:52:53
Hi

I've found several tools and ways to find and delete duplicate files.
But I don't want to just delete them; then they'd be missing from, e.g., compilations.

With a filesystem like btrfs, ZFS or ReFS, deduplication of identical data should be handled automatically at the filesystem level.
But I'm using NTFS on Windows (non-Pro).

Does anyone know a way to find (near) duplicates and replace them with softlinks / symlinks / mklink?
Any tool, script, plug-in, ...?
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: SimBun on 2024-02-04 20:53:13
Does anyone know a way to find (near) duplicates and replace them with softlinks / symlinks / mklink?
What do you mean by nearly?
If the tags don't match then you'll have incomplete albums, so are you looking for tags and a fuzzy audio match?
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: itisljar on 2024-02-05 07:55:56
It's not only the tags: compilations often have different mastering than the albums, so the songs themselves are different.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: magicgoose on 2024-02-05 10:48:45
"nearly" is not going to work, too complicated.
and for exact matches (identical files), you'd probably want (an equivalent of) hardlinks.
otherwise one of the files in the group (or all but one, depending on how you look at it) becomes "special", and that complicates things; not worth it.
anyway, the size savings will probably be so small in the grand scheme that I bet it's not worth spending time on, as time is finite! -- unless you make a lot of compilations yourself, in which case the easiest approach is probably to use hardlinking instead of copying as the primary operation when assembling a compilation (rough sketch at the end of this post).
for windows, you might be interested in "Link Shell Extension", but I don't recall where you can safely install it from (so as not to get a malware look-alike); it isn't obvious.
I guess it would really be a lot simpler to pull off on Linux t.b.h.

soft links for self-made compilations can possibly work too, but beware: some software will likely act strange when it sees a soft link instead of a "real file".
especially backup tools.
the main benefit of soft links is that they can point anywhere (e.g. a different filesystem/disk).
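
not a polished tool, just a rough Python sketch of the "hard link exact duplicates" idea (only bit-identical files qualify anyway); the root path and extensions are placeholders, and os.link does work on NTFS:

Code: [Select]
# Rough sketch: find bit-identical files under a root and replace all but one
# of each group with NTFS hard links. Root path and extensions are examples.
import hashlib
import os
from collections import defaultdict
from pathlib import Path

ROOT = Path(r"D:\Music")          # placeholder, point at your library
EXTS = {".flac", ".mp3"}

def file_hash(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

groups = defaultdict(list)
for p in ROOT.rglob("*"):
    if p.is_file() and not p.is_symlink() and p.suffix.lower() in EXTS:
        groups[(p.stat().st_size, file_hash(p))].append(p)

for (_size, _digest), paths in groups.items():
    keep, *dupes = sorted(paths)
    for d in dupes:
        tmp = Path(str(d) + ".tmp")
        os.link(keep, tmp)        # hard link; both names now share one file
        os.replace(tmp, d)        # atomically swap the duplicate for the link
        print(f"hardlinked {d} -> {keep}")

(usual caveat: hard links only work within one volume, and editing the tags of one name edits them for all names.)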
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: cid42 on 2024-02-05 11:32:38
For lossless formats like FLAC it wouldn't be too hard to create a FUSE filesystem that de-dupes only the audio, allowing tags to differ. FUSE works everywhere AFAIK, but it's easiest on Linux. Still, maintaining a custom filesystem just to save a few GB per TB isn't worth it.

edit: As for automatic dedupe via btrfs/ZFS etc., they're all block-based, so that won't work except in the rare case where the tags for an otherwise matching audio portion have the same length. btrfs/ZFS dedupe could work for formats that store the audio before the tags (so, not FLAC).
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: magicgoose on 2024-02-05 12:11:07
flac files typically have some padding.
if tags are updated without overflowing the padding and without getting rid of the remaining padding, then the audio part is not moved (relative to the start of the file).
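
for anyone curious where that padding sits, here's a small Python sketch (stdlib only, just walking the FLAC metadata block layout, and assuming no ID3 tag glued onto the front; "example.flac" is a placeholder) that lists the blocks and the offset where the audio frames start:

Code: [Select]
# Minimal FLAC metadata walker: prints block types/sizes and where audio starts.
BLOCK_TYPES = {0: "STREAMINFO", 1: "PADDING", 2: "APPLICATION", 3: "SEEKTABLE",
               4: "VORBIS_COMMENT", 5: "CUESHEET", 6: "PICTURE"}

with open("example.flac", "rb") as f:
    assert f.read(4) == b"fLaC", "not a FLAC file (or it has a prepended ID3 tag)"
    last = False
    while not last:
        header = f.read(4)
        last = bool(header[0] & 0x80)                 # last-metadata-block flag
        btype = header[0] & 0x7F
        length = int.from_bytes(header[1:4], "big")   # 24-bit block length
        print(f"{BLOCK_TYPES.get(btype, btype):<15} {length:>8} bytes")
        f.seek(length, 1)                             # skip the block body
    print("audio frames start at byte offset", f.tell())

as long as a tag edit only grows the VORBIS_COMMENT block at the expense of the PADDING block, that final offset (and everything after it) stays exactly where it was.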
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: cid42 on 2024-02-05 12:23:45
That's true. So then the main hurdle to btrfs/ZFS deduping is that audio which could otherwise match needs to be encoded with the same settings and encoder version to maximise the chance of a match.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: TheJJJ-42 on 2024-02-05 17:20:10
Thank you all for your thoughts!
- I can recommend Link Shell Extension; that's where my inspiration for soft links came from. But it's not a tool for creating thousands of links.
- 'Nearly': there was a plugin, maybe for foobar2000, that found quite similar files; other tools can handle this too.

I just used DupeGuru, which found 4500 files worth 162 GB, so there would be some gain.
Sadly DupeGuru does not show any filesystem structure or the paths in the results, so it's hard to continue. Not even to mention softlinks.

It seems that this is a use case for special software that works as a layer on top of the data (files or a database) and handles things like this.
Roon, Picard, Helium, Lexicon ... there are some that handle music management at the application layer.
But that is not quite my use case - I would prefer something application-agnostic, at a deeper system layer.

Anyway, thank you all for your knowledge and thoughts!
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: fooball on 2024-02-05 19:52:54
I've written a system of .BAT and .AWK scripts which scans a tree of MP3s and extracts the tags and paths into a file using FFmpeg, then processes the results file with AWK to build another BAT which, when run, sets up another tree with the MP3s sorted into folders by genre (and other criteria).  Those folders are generated and named on the fly, and contain hard links to the original MP3s rather than duplicated files.  This is all on NTFS on a USB stick, and a variety of players work with the hard links just fine (even on macOS – which surprised me!).

I'm not saying this is necessarily what you want, just illustrating what's possible.  I don't understand what you're saying about modifying tags: any modification to a hard-linked file modifies the file itself.  As for fuzzy matching, that's very difficult.
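
Not my actual BAT/AWK scripts, but the same idea re-sketched in Python, using ffprobe for the tag dump and hard links for the generated tree; the source and destination paths are placeholders:

Code: [Select]
# Dump tags with ffprobe, then build a genre-sorted tree of hard links to the
# originals. Paths are placeholders; this is a sketch, not the original tooling.
import json
import os
import subprocess
from pathlib import Path

SRC = Path(r"E:\MP3")       # original tree (example)
DST = Path(r"E:\ByGenre")   # generated tree of hard links (example)

def read_tags(path: Path) -> dict:
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", str(path)],
        capture_output=True, text=True, check=True).stdout
    return json.loads(out).get("format", {}).get("tags", {}) or {}

for mp3 in SRC.rglob("*.mp3"):
    tags = {k.lower(): v for k, v in read_tags(mp3).items()}
    genre = tags.get("genre", "Unknown").strip() or "Unknown"
    genre = "".join(c for c in genre if c not in '<>:"/\\|?*')  # keep NTFS happy
    target_dir = DST / genre
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / mp3.name
    if not target.exists():
        os.link(mp3, target)    # hard link, not a copy: no duplicated data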
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: SimBun on 2024-02-05 21:47:33
I've written a system of .BAT and .AWK scripts which scans a tree of MP3s and extracts the tags and paths into a file using FFmpeg, then processes the results file with AWK to build another BAT which, when run, sets up another tree with the MP3s sorted into folders by genre (and other criteria).  Those folders are generated and named on the fly, and contain hard links to the original MP3s rather than duplicated files.  This is all on NTFS on a USB stick, and a variety of players work with the hard links just fine (even on macOS – which surprised me!).
I have to ask why? It looks like you might be creating playlists.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: magicgoose on 2024-02-06 08:30:10
maybe because every player has its own playlist format, sometimes unclear/undocumented
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: fooball on 2024-02-06 09:54:31
I have to ask why? It looks like you might be creating playlists.
It doesn't matter why, but: ease of locating any particular track when the user (not me) accesses the collection primarily via the directory tree.

I also came into all this with a misunderstanding of what a playlist is: I thought it was literally the pre-programmed sequence of tracks to play; I didn't realise it could be a presentation of potential tracks to play... because I personally need it to be the former, and I still have difficulty understanding the relationship between playlists and the actual play buffer (I forget what it's called now).

Nonetheless, I thought my experience might be of some help to the OP.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-06 10:41:40
For lossless formats like FLAC it wouldn't be too hard to create a FUSE filesystem that de-dupes only the audio, allowing tags to differ. FUSE works everywhere AFAIK, but it's easiest on Linux. Still, maintaining a custom filesystem just to save a few GB per TB isn't worth it.

edit: As for automatic dedupe via btrfs/ZFS etc., they're all block-based, so that won't work except in the rare case where the tags for an otherwise matching audio portion have the same length. btrfs/ZFS dedupe could work for formats that store the audio before the tags (so, not FLAC).

Not worth it, no - but if for the sake of argument we consider a scenario where most of the data in the world were PCM audio, so that filesystem developers would do that sort of stuff:
I guess most likely one would de-dupe the uncompressed audio and then have an audio compression algorithm as part of the file system. A block size of 4096 stereo samples of 4 bytes each (16 KiB) before compression isn't far off for file system deduplication (toy illustration after this post). The deduplicator would have to scan for different offsets in CD rips, so it would have to use either a variable block size or change padding on the fly - and with that in place, front tags wouldn't be an issue.
Of course, for cloud audio storage the users would have to settle for not getting their files back, just the audio. But "matching" cloud services have offered that for a while anyway.


(As an aside, a wishlist item for tagging software: if a retag triggers a full file rewrite because the front tags expanded too much, then - knowing the size of the audio chunk - round up the padding so that the file becomes an integer multiple of 4096 bytes.)
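
Purely to make the thought experiment concrete, here's a toy Python sketch that decodes two files to raw PCM with ffmpeg and counts how many fixed-size PCM blocks they share; the 16 KiB block (4096 stereo 16-bit samples) is just the number from above, and the file names are placeholders:

Code: [Select]
# Toy block-level dedup on decoded PCM: decode to raw 16-bit stereo, hash
# fixed 16 KiB blocks (4096 stereo samples * 4 bytes), count shared blocks.
import hashlib
import subprocess

BLOCK = 4096 * 4   # 4096 stereo samples, 2 channels * 16 bits = 4 bytes each

def pcm_block_hashes(path: str) -> list[str]:
    pcm = subprocess.run(
        ["ffmpeg", "-v", "quiet", "-i", path, "-f", "s16le", "-ac", "2", "-"],
        capture_output=True, check=True).stdout
    return [hashlib.sha1(pcm[i:i + BLOCK]).hexdigest()
            for i in range(0, len(pcm), BLOCK)]

a = pcm_block_hashes("album_version.flac")
b = pcm_block_hashes("compilation_version.flac")
print(f"{len(set(a) & set(b))} of {len(a)} blocks identical")

Note it only finds matches when the two streams are cut on the same boundaries; a rip shifted by even a few samples would share nothing, which is exactly the offset problem mentioned above.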
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: SimBun on 2024-02-06 12:10:51
I have to ask why? It looks like you might be creating playlists.
It doesn't matter why, but: ease of locating any particular track when the user (not me) accesses the collection primarily via the directory tree.
I just thought in this instance the why might be more interesting than the how. I thought it was for use in a car that only allows browsing by folder.


Given the OP states DupeGuru found 4500 duplicates, I'm beginning to think they're creating "playlists" (folders of tracks) by copying tracks from other albums. If that's the case then it's a much simpler problem to solve.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: cid42 on 2024-02-06 12:37:27
...
Not worth it, no - but if for the sake of argument we consider a scenario where most of the data in the world were PCM audio, so that filesystem developers would do that sort of stuff:
...
For the sake of argument then.
Anything I've missed? Other than glossing over hideously complex filesystem things.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-06 13:25:57
Anything I've missed? Other than glossing over hideously complex filesystem things.
* It is not at all clear that on-the-fly encoding is bad for speed when you use spinning drives. Here, someone tested ZFS uncompressed vs ZFS compressed vs gzip vs lz4 vs zstd, on retired hardware two years ago:
https://www.reddit.com/r/zfs/comments/svnycx/a_simple_real_world_zfs_compression_speed_an/
(Hm, how does flac -0 relate to those speed-wise?)
* Such a beast might want an option to hash not just a block, but a sub-block per channel. There are cases where single channels match (say, 2.0 extracted from 5.0 - or multitrack files in the works where only a few channels are altered in each save).
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: cid42 on 2024-02-06 14:54:16
Quote
It is not at all clear that on-the-fly encoding is bad for speed when you use spinning drives.
But a filesystem should know the size of the files it contains. With wav this is trivial, with verbatim wasted_bits=0 flac this is possible, with flac -# it's not (short of storing the encoded filesize on the initial write and being super careful with versioning - a nightmare). An ls within a directory triggering a full re-encode of every audio file in that directory just to populate the stat request is not smart.

I've reconsidered exposing a custom compression algorithm to the user; it's unnecessary and dangerous, as you don't really want the user copying the file elsewhere without built-in checksumming. On closer inspection I'm struggling to see a benefit at all.

Quote
Such a beast might want an option to hash not just a block, but a sub-block per channel. There are cases where single channels match (say, 2.0 extracted from 5.0 - or multitrack files in the works where only a few channels are altered in each save).
Sounds like quite a lot of extra work across the board to accommodate a niche, but sure why not.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-06 19:13:05
But a filesystem should know the size of the files it contains.
So ... variable file block size then? Or are you thinking of something else?

Quote
Such a beast might want an option to hash not just a block, but a sub-block per channel. There are cases where single channels match (say, 2.0 extracted from 5.0 - or multitrack files in the works where only a few channels are altered in each save).
Sounds like quite a lot of extra work across the board to accommodate a niche, but sure why not.
There is of course an argument why not: if the end result is popular enough to be spread around, then many GB will be saved world-wide - but not so many on "the studio copy".
Then on the other hand, the number of file versions in the works could be quite big.
Then on the other other hand, this could be up to a DAW to simply keep track of pointers to tracks (modified or not) in a project.
Then on the other other other hand, ... it means someone's got to do it, so that's not an argument against doing it right first time.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-06 20:39:11
I think this case can - in a way I would call "slightly fragile" - be handled by hardlinks and foobar2000, assuming you are on Windows and on NTFS:

I just used DupeGuru, which found 4500 files worth 162 GB, so there would be some gain.
I wonder what kind of collection you have. DupeGuru only identifies bit-identical audio, not sound-alikes - heck, I think version 4.3.1 still has trouble identifying MP3s that differ only in tag scheme.

So I guess those are true identicals? If they are FLAC/WavPack, they should also be easy to identify with foobar2000: put up a ReFacets panel with statistics, add a column for %__md5% and sort by the number of items (or script it, see the sketch at the end of this post):
(https://snipboard.io/IMsfaD.jpg)

Sadly DupeGuru does not show any filesystem structure or the paths in the results, so it's hard to continue.
It does - under Columns, select "Folder".

Not even to mention softlinks.
But it can hardlink.

So here is a possible workflow - assuming everything is on one NTFS-formatted volume:

(1) foobar2000
(2) ... with https://www.foobar2000.org/components/view/foo_external_tags
Create the tag sets you like - you'll find it under Advanced Preferences. Don't use ADS.
For starters, make an external-tags SQL database and create a backup copy of it. Then use either folder.tags or filename.tag.
But those don't handle embedded art, so be ready to export all art to the album's folder.
(3) Now fb2k can read album art from the file folder, and all other tags from the external tags component. Then run DupeGuru.
Rather than deleting, create hardlinks. (For testing, do not use direct deletion - the recycle bin can come in handy if something goes wrong.)
(4) Now the dupes are replaced by hardlinks to something else. And half of them have the wrong tags inside the files, with the correct tags readable only by fb2k.
(5) I think that if you do tag commits that trigger a full file rewrite, you will "merely lose the deduplication": it will write a new file, delete the old one (hence just breaking the hardlink) and rename the new one. But don't take my word for it - test it while you still have the old files in your recycle bin.
Which makes a case for migrating from FLAC to WavPack then ... or, on the other hand, if you are only losing the deduplication and don't mind running DupeGuru again when the drive gets full, you may still be happy.

Personally I would not do this unless I had a full backup with actual files. But if the 162 GB savings can make it all fit on your SSD, while your terabytes-sized spinning external drive can hold the full copy, then by all means. Also, if your collection is big: I think the largest USB-powered 2.5" external drives are still 5 TB, so if your collection is around that size and you prefer to have your working set on such a drive (with the spinning drives in the attic where you don't hear them), you might consider it.
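
And for those who'd rather script the identification than click through ReFacets, a minimal Python sketch that groups FLACs by the audio MD5 stored in STREAMINFO (the same value fb2k shows as %__md5%), so files whose audio is bit-identical but whose tags differ still end up in the same group; the root path is a placeholder and the destructive hardlinking step is left commented out on purpose:

Code: [Select]
# Group FLAC files by the STREAMINFO audio MD5 (fb2k's %__md5%), so duplicates
# are matched on the audio alone, ignoring tags. Root path is a placeholder.
from collections import defaultdict
from pathlib import Path
import os

ROOT = Path(r"D:\Music")

def audio_md5(path: Path) -> bytes:
    with open(path, "rb") as f:
        if f.read(4) != b"fLaC":
            return b""
        f.read(4)                 # skip the STREAMINFO block header
        return f.read(34)[18:34]  # last 16 bytes of STREAMINFO = audio MD5

groups = defaultdict(list)
for p in ROOT.rglob("*.flac"):
    md5 = audio_md5(p)
    if md5 and md5 != bytes(16):  # an all-zero MD5 means "unset", skip those
        groups[md5].append(p)

for md5, paths in groups.items():
    if len(paths) > 1:
        print(md5.hex(), *paths, sep="\n  ")
        # Destructive step - only after backups and moving tags out of the files:
        # keep, *dupes = sorted(paths)
        # for d in dupes:
        #     tmp = Path(str(d) + ".tmp")
        #     os.link(keep, tmp)
        #     os.replace(tmp, d)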
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: cid42 on 2024-02-07 10:44:08
But a filesystem should know the size of the files it contains.
So ... variable file block size then? Or are you thinking of something else?
...
In the scenario I presented, the filesystem is essentially a chunked PCM database stored compressed, where a file is defined by a list of chunks. An input file matches an output file in contents but not necessarily in file structure - lossless PCM, but not bit-identical to the file. So we have a choice of how the output is presented to the user. The user needs to do things like statting files/directories, seeking, etc., which, without re-encoding everything up to the seek point every time, requires being able to figure out where a sample is off the bat. With a wav/verbatim-flac we can immediately seek anywhere because we can compute the size and location of every frame independently of all the others (header + index*verbatim_base_size + utf8_fiddle(index)).

When I talked about exposing a custom algorithm with variable frame size, that's a halfway measure that allows a filesize and sample location to be calculated much more cheaply than a full re-encode (but not free: compressed blocks summed, with partial blocks re-encoded), all to present the user with a smaller file in some custom format no one wants to deal with. But where's the benefit, really? Better to emit flac/wav so that the user can benefit from existing tooling, everything is transparently protected by checksums, and decode to PCM is transparent. It doesn't matter that the files presented to the user appear large and unprotected; they're virtual files.

To be robust and implementable without going insane, it'd likely have to be flac input only (a single tag format, seektables can be handled by the filesystem, etc.; any other file, like an accompanying jpeg, would simply be stored as-is), with tagged-verbatim-flac or tagless-wav output only. Complexity would be a killer.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Octocontrabass on 2024-02-08 04:07:39
You can get something almost as good with much less effort using ZFS with deduplication and careful tuning of the recordsize to match the padding used by your tagging program (and careful tuning of your tagging program to keep the padding aligned to the ZFS recordsize).

Want compression on top of deduplication? Pick your FLAC version and settings, compress everything, and use your tagging program to make sure the padding is still aligned afterwards. It's not perfect, but it'll work.

(If you really want to build your automatic compressed-and-deduplicated PCM filesystem, maybe patch ZFS to support FLAC. It won't be completely automatic since you'll still have to configure FLAC according to the uncompressed PCM format, but it's a lot less work than reinventing ZFS from scratch.)
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-09 17:59:13
You can get something almost as good with much less effort using ZFS with deduplication and careful tuning of the recordsize to match the padding used by your tagging program (and careful tuning of your tagging program to keep the padding aligned to the ZFS recordsize).
The "aside" at the end of reply #12.

Make sure the audio ends at the last byte of a file system block. Then two files with the same audio and the same compression parameters will have identical last blocks, identical second-to-last blocks, etc. - and maybe even the block containing the end of the padding and the beginning of the audio.
Rewrite new and bigger metadata to one of the files? Even if the application cannot access the file system to the extent that it can rewrite the first few blocks and hook up the remaining ones, and has to write the full file anew: it knows the source file, it knows the length of the (encoded) audio chunk, it can pad up accordingly, and again the file system blocks can be deduplicated starting from the last one.

The reason I have "wishlisted" that feature is not deduplication though - it is to expand padding to take up all the space that the file is anyway going to use on disk. Bigger padding and fewer full rewrites, costlessly.
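
The padding arithmetic itself is trivial; as a tiny illustration (assuming a 4096-byte block for the example; for the ZFS recordsize case you would use 131072), this is all a tagger would have to compute:

Code: [Select]
# How much extra padding makes the audio end exactly on a block boundary?
BLOCK = 4096  # example block size; for the ZFS recordsize case use 131072

def extra_padding(metadata_bytes: int, audio_bytes: int, block: int = BLOCK) -> int:
    total = metadata_bytes + audio_bytes
    return (-total) % block   # bytes to add so the total is a multiple of block

# e.g. 8 KiB of metadata plus 27,991,931 bytes of audio wants 133 more bytes
print(extra_padding(8_192, 27_991_931))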
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Octocontrabass on 2024-02-10 04:04:28
The reason I have "wishlisted" that feature is not deduplication though - it is to expand padding to take up all the space that the file is anyway going to use on disk. Bigger padding and fewer full rewrites, costlessly.
But when you do have deduplication, aligning the start of the audio makes the most sense because it ensures the audio will never share a block with metadata. ZFS won't waste an entire recordsize block for the end of the file as long as you enable some form of compression.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Jason Hennry on 2024-02-22 13:09:49
You can use the HardLinkShellExt tool on Windows to find duplicates and create hard links, conserving space without losing them from compilations. It integrates into the right-click context menu for easy access.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-22 13:26:06
You can use the HardLinkShellExt tool on Windows to find duplicates
Huh. I have it installed, but ... does it find duplicates?
Anyway, it has no chance of working on tagged files. Solutions for deduplicating identical audio have already been posted here.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-22 15:14:04
The reason I have "wishlisted" that feature is not deduplication though - it is to expand padding to take up all the space that the file is anyway going to use on disk. Bigger padding and fewer full rewrites, costlessly.
But when you do have deduplication, aligning the start of the audio makes the most sense because it ensures the audio will never share a block with metadata. ZFS won't waste an entire recordsize block for the end of the file as long as you enable some form of compression.
If you have an "ideal" file system that can sneak in more blocks "midway" in a file - and a tagger that can do it and thus "maintain deduplication" - then that is the way, especially if you don't know the audio length upon encoding. (If you knew, you might as well write an encoder that starts at the end.)

But if you don't have that luxury, then starting the file with z bytes of zeroes followed by a bytes of audio shouldn't look much different to the file system than ending it with a bytes of audio followed by z bytes of zeroes. You can always keep those z bytes if you want, but you have the additional advantage of being able to write tag data into them if and when doing so saves a full tag rewrite. You simply have z bytes more padding, and that is not a downside.

Or instead, of course, you can use audio formats with (APEv2) tags at the end. There are disadvantages and advantages to either solution.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Octocontrabass on 2024-02-23 06:17:47
If you have an "ideal" file system that can sneak in more blocks "midway" in a file - and a tagger that can do it and thus "maintain deduplication" - then that is the way, especially if you don't know the audio length upon encoding. (If you knew, you might as well write an encoder that starts at the end.)
There is no such ideal filesystem. Which is fine, because audio files are relatively small and tags are seldom modified. With ZFS, you maintain deduplication as long as the metadata and padding at the start of the file grow (or shrink) in exact multiples of the recordsize, even if you have to rewrite the entire file to do it. The default recordsize is 128 KiB.

But if you don't have that luxury, then starting the file with z bytes of zeroes followed by a bytes of audio shouldn't look much different to the file system than ending it with a bytes of audio followed by z bytes of zeroes. You can always keep those z bytes if you want, but you have the additional advantage of being able to write tag data into them if and when doing so saves a full tag rewrite. You simply have z bytes more padding, and that is not a downside.
When you're using ZFS deduplication, the difference is that if your tags grow into the same 128 KiB block as the audio, that entire 128 KiB block can no longer be deduplicated. Padding compresses away to nothing, so you'll save more space by keeping the audio blocks entirely separate from the metadata blocks.

I'm not familiar with APEv2, but a quick look at the wiki suggests using it at the end of the file would prevent deduplication of the last block of the audio.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-23 07:51:33
When you're using ZFS deduplication, the difference is that if your tags grow into the same 128 KiB block as the audio,
Which you never have to if you don't want to. You can always choose to rewrite the entire file (like your suggestion would force you to).
But when padding to EOF, there will be situations where you can choose to do so - and instead of a full rewrite, take a size penalty of a block size (per file-pair).
Big deal? If you are running compression, like you suggest below, it is going to be less, since part of that un-deduplicated block will be padding.
And if you are still adamant about that size, you can always run nightly metaflac jobs - which would do precisely the same full file rewrites and maybe recoup it then; except not while you are sitting there waiting.
Also, since you will average to the same number of non-deduplicated blocks, you have to use compression to get the size gain anyway.

Let's get real here: a typical FLAC track is like 25 to 30 MB. Two hundred ZFS blocks. So with two audio duplicates, there is a chance that retagging one of them will cost half a percent minus the padding - if you use compression; if you don't, then there is a chance that you started out with one block less and lost it.
And when it does cost you a block, you save time while working - and you can reclaim it with a nightly metaflac job.

that entire 128 KiB block can no longer be deduplicated. Padding compresses away to nothing, so you'll save more space by keeping the audio blocks entirely separate from the metadata blocks.
Enabling compression on a file system where most of the content is incompressible? How good is ZFS at giving up on compressing a block? This could become more expensive in CPU than half a percent of disk size?
(I never fiddled with ZFS compression, it was more than heavy enough anyway back in the day.)
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Porcus on 2024-02-23 11:44:42
situations where you can choose to do so
"not" to do so.
Update the metadata section, and if it goes into the block that ends with the first audio - do that.

But, but: if FS deduplication spends effectively no additional space on a block of zeroes, can't you just pad megabytes between tags and audio and have them deduplicated away?
But then "copying" to a different file system (w/o compression) should likely be done with metaflac instead of copy.
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Replica9000 on 2024-02-23 14:51:39
ZFS has supported block cloning (reflinks) since v2.2.  My understanding is that it has far less overhead than deduplication.  I know XFS and btrfs also support reflinks.  Any time a file is copied (cp --reflink), identical blocks should only be referenced rather than taking up additional space.

I believe ZFS uses LZ4 to test a block for compressibility.  If the block can be compressed by more than 12.5%, ZFS will compress that block with the chosen compression algorithm.  There will probably be no noticeable overhead with lz4 or zstd, but the compression ratio won't be much (it's only 1.02 on my media files).
Title: Re: deduplicate files -> and create symlinks / softlinks instead
Post by: Octocontrabass on 2024-02-23 17:25:42
Big deal?
If you've reached the point where this deduplication scheme makes more sense than buying more storage, it probably is a big deal to you.

Enabling compression on a file system where most of the content is incompressible? How good is ZFS at giving up on compressing a block? This could become more expensive in CPU than half a percent of disk size?
ZFS gives you the choice of several different compression strategies, so you can tune it according to how much CPU time you're willing to spend. For this particular setup of only compressed lossless audio files, ZLE is probably the best choice; it'll compress the padding without attempting to compress anything else.

But, but: if FS deduplication spends effectively no additional space on a block of zeroes, can't you just pad megabytes between tags and audio and have them deduplicated away?
But then "copying" to a different file system (w/o compression) should likely be done with metaflac instead of copy.
You could do that, but I'm not sure why you'd put so much effort into speeding up tagging when you'll probably spend more time listening to your music than tagging it.