deduplicate files -> and create symlinks / softlinks instead

2024-02-04 18:52:53

Hi

I've found several tools and ways to find and delete duplicate files.
But I don't want to just delete them, because then they'd be missing from e.g. compilations.

With a filesystem like btrfs, ZFS or ReFS, deduplication of identical data can be handled at the filesystem level.
But I'm using NTFS on Windows (non-pro).

Does anyone know a way to find (nearly) duplicates and replace them with softlinks / symlinks / mklink?
Any tool, script, plug-in, ...?

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #1 – 2024-02-04 20:53:13

Quote from: TheJJJ-42 on 2024-02-04 18:52:53

Does anyone know a way to find (nearly) duplicates and replace them with softlinks / symlinks / mklink?

What do you mean by nearly?
If the tags don't match then you'll have incomplete albums, so are you looking for tags and a fuzzy audio match?

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #2 – 2024-02-05 07:55:56

Not only tags, compilations often have different mastering than albums. Songs are different.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #3 – 2024-02-05 10:48:45

"nearly" is not going to work, too complicated.
and for exact matches (identical files), you'd probably want (an equivalent of) hardlinks.
otherwise one of the files in the group (or all but one, depending on how you look) will become "special" and that will complicate things, not worth it.
anyway, the size savings will be probably so small in the grand scheme, I bet it's not worth spending time on, as time is finite! -- unless you make a lot of compilations yourself, then the easiest way is probably to use hardlinking instead of copying as the primary operation used for making a compilation.
for windows, you might be interested in "link shell extension" but I don't recall where one can install it safely from (to not get a malware look-alike), it doesn't look obvious.
I guess it would really be a lot simpler to pull off on Linux t.b.h.

soft links for self-made compilations can possibly work too, but beware, some software will likely act strange when it sees a soft link instead of a "real file".
especially backup methods.
the main benefit from soft links is that they can point to anywhere (e.g. a different filesystem/disk)

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #4 – 2024-02-05 11:32:38

For lossless like flac it wouldn't be too hard to create a FUSE filesystem that de-dupes only the audio, allowing tags to be different. FUSE can work everywhere AFAIK but easiest on Linux. But maintaining a custom filesystem just to save a few GBs per TB, not worth it.

edit: As for automatic dedupe via btrfs/zfs etc, they're all block based so that won't work except in the rare case where the tags for an otherwise matching audio portion have the same length. BTRFS/ZFS dedupe could work for formats that store the audio before the tags (so, not flac).

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #5 – 2024-02-05 12:11:07

flac files typically have some padding.
if tags are updated without overflowing the padding and without getting rid of the remaining padding, then the audio part is not moved (relatively to the start of the file).
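
To check this on a real file, a small sketch that walks the FLAC metadata block headers and reports where the audio starts and how much padding is present. It assumes the data begins with the "fLaC" marker (so no ID3v2 tag prepended), and the function names are my own.

```python
def flac_layout(data: bytes):
    """Return ([(block_type, length), ...], audio_offset) for a FLAC byte string.

    Each metadata block has a 4-byte header: 1 bit "last block" flag,
    7 bits block type, 24 bits big-endian length of the block body.
    """
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (or something is prepended)")
    pos = 4
    blocks = []
    while True:
        if pos + 4 > len(data):
            raise ValueError("truncated metadata")
        is_last = bool(data[pos] & 0x80)
        block_type = data[pos] & 0x7F
        length = int.from_bytes(data[pos + 1:pos + 4], "big")
        blocks.append((block_type, length))
        pos += 4 + length
        if is_last:
            return blocks, pos  # pos is now the offset of the first audio frame


def padding_bytes(blocks):
    """Total size of PADDING blocks (block type 1)."""
    return sum(length for block_type, length in blocks if block_type == 1)
```

Reading the first few hundred KB of the file is normally enough unless there is large embedded art. If the audio offset is the same before and after a retag, the audio was not moved.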

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #6 – 2024-02-05 12:23:45

That's true. So then the main hurdle to btrfs/zfs deduping is that audio that could match needs to be encoded with the same settings and encoder version to maximise match chance.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #7 – 2024-02-05 17:20:10

Thank you all, for your thoughts!
- I can recommend the Link Shell Extension; that's where my inspiration for soft links originated. But it's not a tool for creating thousands of links.
- 'Nearly': there was a plugin, maybe for foobar, that found quite similar files; other tools can handle this too.

I just used DupeGuru, that found 4500 files worth 162 GB, so there would be some gain.
Sadly DupeGuru does not show any filesystem structure or the paths in the results, so it's hard to continue. Not even to mention softlinks.

It seems that this is a use case for special software that works as a layer on top of the data (files or database) and handles stuff like this.
roon, picard, helium, lexicon ... there are some that handle music management at the application layer.
But that is not quite my use case: I would prefer something application-agnostic, on a deeper system layer.

Anyway, thank you all, for your knowledge and thoughts!

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #8 – 2024-02-05 19:52:54

I've written a system of .BAT and .AWK which scans a tree of .MP3s and extracts the tags and paths into a file using FFMPEG, then processes the results file with AWK to build another BAT, which when run sets up another tree with the MP3s sorted into folders by genre (and other criteria). Those folders are generated and named on the fly, and contain hard links to the original MP3s rather than duplicated files. This is all on NTFS on a USB stick, and a variety of players work with the hard links just fine (even in MacOS – which surprised me!).

I'm not saying this is necessarily what you want, just illustrating what's possible. I don't understand what you're saying about modifying tags, any modification to a hard-link referenced file modifies the file itself. As for fuzzy matching, that's very difficult.
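
For illustration only, the same idea as a short Python sketch (not the BAT/AWK setup described above). It assumes the tags have already been extracted into a mapping of file path to genre; how that mapping is produced is left out. The names `build_link_tree` and `safe_folder_name` are made up.

```python
import os
import re
from pathlib import Path


def safe_folder_name(name):
    """Replace characters Windows does not allow in folder names."""
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", name).strip(" .")
    return cleaned or "Unknown"


def build_link_tree(tracks, dest_root):
    """Create dest_root/<genre>/<filename> as hard links to the originals.

    tracks: dict mapping source path -> genre string (assumed to be
    extracted beforehand). Source and destination must be on one volume.
    """
    dest_root = Path(dest_root)
    created = []
    for src, genre in tracks.items():
        src = Path(src)
        folder = dest_root / safe_folder_name(genre)
        folder.mkdir(parents=True, exist_ok=True)
        link = folder / src.name
        if not link.exists():
            os.link(src, link)  # hard link: no extra space used for the audio
            created.append(link)
    return created
```

Two tracks with the same filename and genre would collide here; a real version would need to rename one of them.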

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #9 – 2024-02-05 21:47:33

Quote from: fooball on 2024-02-05 19:52:54

I've written a system of .BAT and .AWK which scans a tree of .MP3s and extracts the tags and paths into a file using FFMPEG, then processes the results file with AWK to build another BAT, which when run sets up another tree with the MP3s sorted into folders by genre (and other criteria). Those folders are generated and named on the fly, and contain hard links to the original MP3s rather than duplicated files. This is all on NTFS on a USB stick, and a variety of players work with the hard links just fine (even in MacOS – which surprised me!).

I have to ask why? It looks like you might be creating playlists.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #10 – 2024-02-06 08:30:10

maybe because every player has its own playlist format, sometimes unclear/undocumented

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #11 – 2024-02-06 09:54:31

Quote from: SimBun on 2024-02-05 21:47:33

I have to ask why? It looks like you might be creating playlists.

It doesn't matter why, but ease of locating any particular track when the user's (not me) primary means of access is via the directory tree.

I also came into all this with a misunderstanding of what a playlist is: I thought it was literally the pre-programmed sequence of tracks to play, I didn't realise it could be a presentation of potential tracks to play... because I personally need it to be the former, and I still have difficulty understanding the relationship between playlists and the actual play buffer (I forget what it's called now).

Nonetheless, I thought my experience might be of some help to the OP.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #12 – 2024-02-06 10:41:40

Quote from: cid42 on 2024-02-05 11:32:38

For lossless like flac it wouldn't be too hard to create a FUSE filesystem that de-dupes only the audio, allowing tags to be different. FUSE can work everywhere AFAIK but easiest on Linux. But maintaining a custom filesystem just to save a few GBs per TB, not worth it.

edit: As for automatic dedupe via btrfs/zfs etc, they're all block based so that won't work except in the rare case where the tags for an otherwise matching audio portion have the same length. BTRFS/ZFS dedupe could work for formats that store the audio before the tags (so, not flac).

Not worth it no - but if for the sake of the argument we consider a scenario where most of the data in the world were PCM audio so file system developers would do that sort of stuff:
I guess most likely one would de-dupe the uncompressed audio and then have an audio compression algorithm as part of the file system. A block size of 4096 stereo samples each of 4 bytes before compression isn't much wrong for file system deduplication. The deduplicator would have to scan for different offsets in CD rips, so it would have to use either variable block size or change padding on the fly - and with that in place, front tags wouldn't be an issue.
Of course for cloud audio storage the users would have to settle for not getting their files back, just the audio. But "matching" cloud services have offered that for a while anyway.

(As an aside, a wishlist for tagging software: if a retag triggers full file rewrite due to front-tags expanding too much, then - knowing the size of the audio chunk - round up padding so that the file becomes an integer multiple of 4096 bytes.)
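
The arithmetic behind that wish is tiny. A sketch, assuming a 4096-byte block and treating "metadata" as everything in front of the audio (the function name is hypothetical):

```python
def padding_to_fill_blocks(metadata_size, audio_size, block_size=4096):
    """Extra padding bytes needed so metadata + padding + audio
    ends exactly on a block boundary."""
    return (-(metadata_size + audio_size)) % block_size


# Example: 9000 bytes of metadata in front of 30,000,000 bytes of audio.
extra = padding_to_fill_blocks(9000, 30_000_000)
total = 9000 + extra + 30_000_000  # now a whole number of 4096-byte blocks
```

In FLAC a new padding block costs a 4-byte header, so a result of 1 to 3 bytes could only be used by growing an existing padding block (or by adding another full 4096 bytes).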

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #13 – 2024-02-06 12:10:51

Quote from: fooball on 2024-02-06 09:54:31

Quote from: SimBun on 2024-02-05 21:47:33
I have to ask why? It looks like you might be creating playlists.
It doesn't matter why, but ease of locating any particular track when the user's (not me) primary means of access is via the directory tree.

I just thought in this instance the why might be more interesting than the how. I thought it was for use in a car that only allows browsing by folder.

Given the OP states DupeGuru found 4500 duplicates, I'm beginning to think they're creating "playlists" (folders of tracks) by copying tracks from other albums. If that's the case then it's a much simpler problem to solve.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #14 – 2024-02-06 12:37:27

Quote from: Porcus on 2024-02-06 10:41:40

...
Not worth it no - but if for the sake of the argument we consider a scenario where most of the data in the world were PCM audio so file system developers would do that sort of stuff:
...

For the sake of argument then.

A block would be a chunk of PCM or a chunk of "other" data
Contents of a PCM block is the raw encoded bitstream, decodeable by anything that knows the minimal metadata required to decode it (algorithm/samples/depth/channels, stored elsewhere)
Every PCM block has data checksummed and the minimal metadata required to decode stored in the filesystem: algorithm/samples/depth/channels
A file is a set of blocks concatenated, with a PCM block it would also be possible to define a range within the block so a file might be b0, b17345, b2[34:456], b6, ...
Deduping offset matching aka unaligned matching is intensive and fuzzy so it wouldn't be done by default, instead there'd be a user program that could be run whenever convenient to look for unaligned matching "out-of-band"
Files could be presented to the user as wav/flac (at minimal compression, on-the-fly re-encode is a painful prospect, probably a good use case for verbatim-only wasted_bits=0 encoding as filesize can be pre-determined) for compatibility, or as a suitable custom compressed format that re-uses as much of the internal structure of the filesystem verbatim. An ideal compressed format would use non-PCM blocks as-is, fully-used PCM blocks as-is (the majority of both a deduped and unique track, minimising on-the-fly re-encode), partially used PCM blocks would have to be re-encoded into their own custom frame on-the-fly (potentially just verbatim PCM for speed). Implying the format is variable frame and doesn't store location in-frame. The compressed format presented to the user wouldn't need checksumming itself, the filesystem has already checksummed the data
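
As a toy model of the "file = list of blocks, optionally with a range" idea, here is a minimal in-memory sketch. Plain bytes stand in for PCM blocks, SHA-256 stands in for the block checksum, and all class and function names are invented for the example; none of the hard filesystem parts are covered.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlockRef:
    """Reference to a stored block, optionally to a byte range inside it."""
    digest: str
    start: int = 0
    stop: Optional[int] = None


class BlockStore:
    """Content-addressed store: identical blocks are kept only once."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blocks.setdefault(digest, data)  # no-op if already stored
        return digest

    def get(self, ref: BlockRef) -> bytes:
        return self._blocks[ref.digest][ref.start:ref.stop]

    def __len__(self):
        return len(self._blocks)


def read_file(store, refs):
    """Reassemble a virtual file from its list of block references."""
    return b"".join(store.get(r) for r in refs)


def file_size(store, refs):
    """Size of the virtual file (here computed by slicing; a real design
    would keep block lengths in the filesystem metadata instead)."""
    return sum(len(store.get(r)) for r in refs)
```

A file like "b0, b17345, b2[34:456]" would then be `[BlockRef(b0), BlockRef(b17345), BlockRef(b2, 34, 456)]`.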

Anything I've missed? Other than glossing over hideously complex filesystem things.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #15 – 2024-02-06 13:25:57

Quote from: cid42 on 2024-02-06 12:37:27

Anything I've missed? Other than glossing over hideously complex filesystem things.

* It is not at all clear that on-the-fly encoding is bad for speed, when you use spinning drives. Here, someone tested ZFS uncompressed vs ZFS compressed vs gzip vs lz4 vs zstd, on retired hardware two years ago:
https://www.reddit.com/r/zfs/comments/svnycx/a_simple_real_world_zfs_compression_speed_an/
(Hm, how does flac -0 relate to those speed-wise?)
* Such a beast might want an option to hash not just a block, but subblock per channel. There are cases where you got single channels matching (say 2.0 extracted from 5.0 - or a "multitrack files in the works where only a few channels are altered each save").

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #16 – 2024-02-06 14:54:16

Quote

It is not at all clear that on-the-fly encoding is bad for speed, when you use spinning drives.

But a filesystem should know the size of the files it contains. With wav this is trivial, with verbatim wasted_bits=0 flac this is possible, with flac -# it's not (short of storing the encode filesize on initial write and being super careful with versioning, nightmare). An ls within a directory triggering every audio file in that directory to be fully re-encoded in order to populate the stat request, not smart.

I've reconsidered exposing a custom compression algorithm to the user, it's unnecessary and dangerous as you don't really want the user copying the file elsewhere without built-in checksumming. On closer inspection I'm struggling to see a benefit at all.

Quote

Such a beast might want an option to hash not just a block, but subblock per channel. There are cases where you got single channels matching (say 2.0 extracted from 5.0 - or a "multitrack files in the works where only a few channels are altered each save").

Sounds like quite a lot of extra work across the board to accommodate a niche, but sure why not.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #17 – 2024-02-06 19:13:05

Quote from: cid42 on 2024-02-06 14:54:16

But a filesystem should know the size of the files it contains.

So ... variable file block size then? Or are you thinking of something else?

Quote from: cid42 on 2024-02-06 14:54:16

Quote
Such a beast might want an option to hash not just a block, but subblock per channel. There are cases where you got single channels matching (say 2.0 extracted from 5.0 - or a "multitrack files in the works where only a few channels are altered each save").
Sounds like quite a lot of extra work across the board to accommodate a niche, but sure why not.

There is of course an argument why not: If the end-result is popular enough to be spread around, then there will world-wide be many GBs saved - but not so many on "the studio copy".
Then on the other hand, the number of file versions in the works could be quite big.
Then on the other other hand, this could be up to a DAW to simply keep track of pointers to tracks (modified or not) in a project.
Then on the other other other hand, ... it means someone's got to do it, so that's not an argument against doing it right first time.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #18 – 2024-02-06 20:39:11

I think this case can - in a way I would call "slightly fragile" - be handled by hardlinks and foobar2000, assuming you are on Windows and on NTFS:

Quote from: TheJJJ-42 on 2024-02-05 17:20:10

I just used DupeGuru, that found 4500 files worth 162 GB, so there would be some gain.

I wonder what kind of collection you have. DupeGuru only identifies bit-identical audio, not sound-alikes - heck, I think version 4.3.1 still has trouble identifying MP3 with two different tag schemes.

So I guess, those are true identicals? If they are FLAC/WavPack they should also be easy to identify with foobar2000: Put up a ReFacets panel with statistics, a column for %__md5% and sort by the number of items.
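
Outside foobar2000, roughly the same grouping can be sketched in Python for FLAC by reading the MD5 of the decoded audio that the encoder stores in the STREAMINFO block (presumably the same value, but I have not verified how %__md5% is populated). This assumes files start with the "fLaC" marker (no ID3v2 prepended); the function names are made up.

```python
from collections import defaultdict
from pathlib import Path


def flac_audio_md5(path):
    """Return the audio MD5 from STREAMINFO as hex, or None if unavailable.

    Layout: 4-byte marker, 4-byte block header, then the 34-byte STREAMINFO
    block whose last 16 bytes are the MD5 of the unencoded audio.
    """
    with open(path, "rb") as f:
        head = f.read(42)
    if len(head) < 42 or head[:4] != b"fLaC" or (head[4] & 0x7F) != 0:
        return None
    md5 = head[26:42]
    if md5 == bytes(16):
        return None  # all zeros: the encoder did not store an MD5
    return md5.hex()


def group_by_audio_md5(root):
    """Map audio MD5 -> list of .flac paths, keeping only MD5s seen 2+ times."""
    groups = defaultdict(list)
    for p in sorted(Path(root).rglob("*.flac")):
        digest = flac_audio_md5(p)
        if digest:
            groups[digest].append(p)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```

Files in one group have the same decoded audio but may still differ in tags or compression settings, so they are not necessarily byte-identical.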

Quote from: TheJJJ-42 on 2024-02-05 17:20:10

Sadly DupeGuru does not show any filesystem structure or the paths in the results, so it's hard to continue.

It does? Under Columns, select "Folder".

Quote from: TheJJJ-42 on 2024-02-05 17:20:10

Not even to mention softlinks.

But it can hardlink.

So here is a possible workflow - assuming everything is on one NTFS formatted volume

(1) foobar2000
(2) ... with https://www.foobar2000.org/components/view/foo_external_tags
Create the tagsets you like - you find it under Advanced preferences. Don't use ADS.
For starters, make an external tags SQL database and create a backup copy of it. Then either folder.tags or filename.tag.
But those don't handle embedded art, so be ready to export all art to the album's folder.
(3) Now fb2k can read album art from the file folder, and all other tags from the external tags component. Then run DupeGuru.
Rather than deleting, create hardlinks. (For testing, do not use direct deletion - that recycle bin can come in handy if something goes wrong.)
(4) Now the dupes are replaced by hardlinks to something else. And half of them have wrong tags in the files, and "correct tags only fb2k-readable".
(5) I think that if you do tag commits that trigger full file rewrite, you will "merely lose the deduplication" in that it will write a new file, delete the old (hence just breaking the hardlink) and rename the new. But don't take my word for it, test it while you still have the old ones in your recycle bin.
Which makes a case for migrating from FLAC to WavPack then ... or on the other hand, if you are only losing the deduplication and don't mind running DupeGuru again when the drive gets full, you are maybe happy still.

Personally I would not do this unless I had a full backup with actual files. But if the 162 GB savings can make it all fit on your SSD, while your terabyte-sized spinning external drive can fit it all, then by all means. Also if your collection is biggg, then I think the largest USB-powered 2.5" external drives are still 5TB, so if your collection is around that size and you prefer to have your working set on such a drive (and spinning drives in the attic where you don't hear them), you might consider it.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #19 – 2024-02-07 10:44:08

Quote from: Porcus on 2024-02-06 19:13:05

Quote from: cid42 on 2024-02-06 14:54:16
But a filesystem should know the size of the files it contains.
So ... variable file block size then? Or are you thinking of something else?
...

In the scenario I presented the filesystem is essentially a chunked PCM database stored compressed, where a file is defined by a list of chunks. An input file matches an output file in contents but not necessarily in file structure, lossless PCM but not bit-identical to-the-file. So we have a choice of how the output is presented to the user. The user needs to do things like statting files/directories, seeking etc, which without re-encoding everything up to the seek point every time requires being able to figure out where a sample is off-the-bat. With a wav/verbatim-flac we can immediately seek anywhere because we can compute the size and location of every frame independent of all the others (header + index*verbatim_base_size+utf8_fiddle(index)).
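
For the wav case the "compute it off-the-bat" part is just multiplication. A sketch, assuming a canonical 44-byte PCM wav header (real files can carry extra chunks, so the header size is an assumption) and hypothetical function names:

```python
def wav_frame_offset(frame_index, channels=2, bits_per_sample=16, header_size=44):
    """Byte offset of a given sample frame in an uncompressed PCM wav."""
    bytes_per_frame = channels * (bits_per_sample // 8)
    return header_size + frame_index * bytes_per_frame


def wav_file_size(total_frames, channels=2, bits_per_sample=16, header_size=44):
    """File size follows from the frame count alone, so answering a stat
    request needs no encoding work."""
    return wav_frame_offset(total_frames, channels, bits_per_sample, header_size)
```

So one second into a 44.1 kHz 16-bit stereo file is `wav_frame_offset(44100)`, and the size reported for a three-minute track is `wav_file_size(180 * 44100)`.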

When I talked about exposing a custom algorithm with variable frame size that's a halfway measure that allows a filesize and sample location to be calculated much cheaper than a full re-encode (but not free, compressed blocks summed with partial blocks re-encoded), all to present the user with a smaller file in some custom format no-one wants to deal with. But where's the benefit really? Better to emit flac/wav so that the user can benefit from existing tooling, everything is protected by checksums transparently and decode to PCM is transparent. It doesn't matter that the files presented to the user appear large and unprotected, they're virtual files.

To be robust and implementable without going insane it'd likely have to be flac input only (single tag format, seektables can be handled by filesystem etc, any other format like accompanying jpeg would simply be stored as-is), tagged-verbatim-flac or tagless-wav output only. Complexity would be a killer.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #20 – 2024-02-08 04:07:39

You can get something almost as good with much less effort using ZFS with deduplication and careful tuning of the recordsize to match the padding used by your tagging program (and careful tuning of your tagging program to keep the padding aligned to the ZFS recordsize).

Want compression on top of deduplication? Pick your FLAC version and settings, compress everything, and use your tagging program to make sure the padding is still aligned afterwards. It's not perfect, but it'll work.

(If you really want to build your automatic compressed-and-deduplicated PCM filesystem, maybe patch ZFS to support FLAC. It won't be completely automatic since you'll still have to configure FLAC according to the uncompressed PCM format, but it's a lot less work than reinventing ZFS from scratch.)

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #21 – 2024-02-09 17:59:13

Quote from: Octocontrabass on 2024-02-08 04:07:39

You can get something almost as good with much less effort using ZFS with deduplication and careful tuning of the recordsize to match the padding used by your tagging program (and careful tuning of your tagging program to keep the padding aligned to the ZFS recordsize).

The "aside" at the end of reply #12.

Make sure the audio ends at the last byte of a file system block. Then two files with same audio and same compression parameters will have last block identical, then second to last block identical, etc, and maybe the block containing the end of the padding and the beginning of the audio.
Rewrite new and bigger metadata to one of the files? Even if the application cannot access the file system to the extent that it can rewrite the first few blocks and hook up the remaining ones, and has to write the full file anew: It knows the source file, it knows the length of the (encoded) audio chunk, it can pad up accordingly, and again the file system blocks can be deduplicated starting from the last one.
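
A small simulation of that argument, under simplifying assumptions: fixed 4096-byte blocks hashed from the start of the file (roughly what a block-level deduplicator sees), zero bytes as padding, and pseudo-random bytes standing in for encoded audio. All names are invented for the example.

```python
import hashlib


def fake_audio(n):
    """Deterministic, non-repeating bytes standing in for encoded audio."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def build_file(metadata, audio, block=None):
    """metadata + padding + audio. With block set, pad so that the audio
    ends exactly on a block boundary; with block=None, add no padding."""
    pad = (-(len(metadata) + len(audio))) % block if block else 0
    return metadata + bytes(pad) + audio


def block_hashes(data, block=4096):
    """Hashes of the fixed-size blocks a block-level deduplicator compares."""
    return {hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)}


def shared_blocks(file_a, file_b, block=4096):
    """Number of distinct blocks the two files have in common."""
    return len(block_hashes(file_a, block) & block_hashes(file_b, block))
```

Comparing two files with the same `fake_audio(40000)` but 100 vs 5000 bytes of metadata, once built with `block=4096` and once with `block=None`, shows the effect: with end alignment every block that lies wholly inside the audio matches, without it essentially nothing does.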

The reason I have "wishlisted" that feature is not deduplication though - it is to expand padding to take up all the space that the file is anyway going to use on disk. Bigger padding and fewer full rewrites, costlessly.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #22 – 2024-02-10 04:04:28

Quote from: Porcus on 2024-02-09 17:59:13

The reason I have "wishlisted" that feature is not deduplication though - it is to expand padding to take up all the space that the file is anyway going to use on disk. Bigger padding and fewer full rewrites, costlessly.

But when you do have deduplication, aligning the start of the audio makes the most sense because it ensures the audio will never share a block with metadata. ZFS won't waste an entire recordsize block for the end of the file as long as you enable some form of compression.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #23 – 2024-02-22 13:09:49

You can use the HardLinkShellExt tool on Windows to find duplicates and create hard links, conserving space without losing them from compilations. It integrates into the right-click context menu for easy access.

Re: deduplicate files -> and create symlinks / softlinks instead

Reply #24 – 2024-02-22 13:26:06

Quote from: Jason Hennry on 2024-02-22 13:09:49

You can use the HardLinkShellExt tool on Windows to find duplicates

Huh. I have it installed, but ... does it find duplicates?
Anyway, no chance to work on tagged files. Solutions to deduplicate identical audio are already posted here.
