Skip to main content

Notice

Please note that most of the software linked on this forum is likely to be safe to use. If you are unsure, feel free to ask in the relevant topics, or send a private message to an administrator or moderator. To help curb the problems of false positives, or in the event that you do find actual malware, you can contribute through the article linked here.
Topic: Filtering Duplicate FLACs Using MD5 Hash (Read 4193 times) previous topic - next topic
0 Members and 1 Guest are viewing this topic.

Filtering Duplicate FLACs Using MD5 Hash

I am trying to identify duplicate FLAC files/albums from my collection using the MD5 hash in the header.

I would like to do this not through the library (I haven’t set one up) but using a playlist (.fpl) I have created of all tracks with the tracknumber tag value “1”.

What I want to achieve is a filter that contains only files and every file where two or more contain identical MD5 hash values [$info(md5)].

I’m not good technically so would really appreciate if someone could talk me through exactly what I would need to do to filter my playlist accordingly. I’m happy to explain why I’d like to do it in the way I’ve described if it is not obvious.

Thanks.

Filtering Duplicate FLACs Using MD5 Hash

Reply #1
Is it even possible  to do what I'm asking?

 

Filtering Duplicate FLACs Using MD5 Hash

Reply #2
It's not possible with a playlist. You need to enable your media library and then install foo_facets. Configure with it with a custom column containing [$info(md5)]. Then right click the column header to enable statistics>Items.. You'll want to reverse sort this so anything with more than one match appears first. Then in the next pane to the right, you can configure it with a custom %path% column so you can see the location of your duplicate files.

I don't have any duplicate files but I made a copy just to show it as an example:


Filtering Duplicate FLACs Using MD5 Hash

Reply #3
It's not possible with a playlist. You need to enable your media library and then install foo_facets. Configure with it with a custom column containing [$info(md5)]. Then right click the column header to enable statistics>Items.. You'll want to reverse sort this so anything with more than one match appears first. Then in the next pane to the right, you can configure it with a custom %path% column so you can see the location of your duplicate files.

Thanks very much - that is very helpful. I'll give this a go but I don't think that my computer can handle my entire media library as it is extremely large, hence why I was trying the playlist route with only the first tracks included. If that fails, I'll have to go through the playlist manually.

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #4
It's not possible with a playlist. You need to enable your media library and then install foo_facets.
I just wanted to thank you again for your help. That solution worked perfectly and highlighted loads of bit for bit identical files that needed flushing out.

Also, I thought I would correct you on one small point which is that you can use foo_facets on a playlist although in the event, I was surprised to find my server coped with the entire collection, so happily I now have a workable and dynamic music library.

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #5
Yes you are right. I had been aware of that in the past but I completely forgot.

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #6
I just create column in playlist, in pattern I set $info(md5)
Spoiler (click to show/hide)
I have another question. Is it possible to automatically remove lines from the playlist that do not have duplicates with the same values ​ ​ in a specific column? In this case, I would like only groups of duplicates to remain in the playlist. Like this:
Spoiler (click to show/hide)

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #7
That's what facets as described above will give you.

If it were me, I'd be using foo_sqlite. Using this you can run sql against your media library/playlist to identify duplicates (based on md5) and then push the resulting tracks out to a playlist.

Code: [Select]
drop table if exists Playlist_Updatable;
create virtual table Playlist_Updatable USING MetaDB_Module(no_multivalue_split,playlist);

insert into Playlist_Updatable (path, playlist_name)
  select path,
         'Duplicate Tracks'
  from Playlist
  where playlist_index = active_playlist()
            and md5 in (select md5
                  from Playlist
                  where playlist_index = active_playlist()
                  group by md5
                  having count(*)>1
               );

drop table Playlist_Updatable;

If you want to run it against your entire media library, just use
Code: [Select]
drop table if exists Playlist_Updatable;
create virtual table Playlist_Updatable USING MetaDB_Module(no_multivalue_split,playlist);

insert into Playlist_Updatable (path, playlist_name)
  select path,
         'Duplicate Tracks'
  from mediaLibrary
  where md5 in (select md5
                  from mediaLibrary
                  group by md5
                  having count(*)>1
               );

drop table Playlist_Updatable;

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #8
With the latest version of foo_sqlite you should omit
Code: [Select]
drop table if exists Playlist_Updatable;
create virtual table Playlist_Updatable USING MetaDB_Module(no_multivalue_split,playlist);
at the beginning and
Code: [Select]
drop table Playlist_Updatable;
at the end of your scripts and use PlaylistUpdatable instead of Playlist_Updatable. PlaylistUpdatable is now automatically created.

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #9
Thank you. I tried execute both commands, but SQLite console says "SQLite Error:(1) no such column: md5"

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #10
Thank you. I tried execute both commands, but SQLite console says "SQLite Error:(1) no such column: md5"
foo_sqlite comes with a bunch of default tags/columns set up by default; to add additional tags that exist in your media library go to 'File > Preferences > Media Library > SQLite viewer' and select 'MediaLibrary table', then click 'Add defaults' and 'Apply'. Does this add md5 to the list of columns available? Repeat for 'Playlist table'.

If that doesn't add it, can you let me know which version of foobar you're using, and if you have foobar configured with a media library of if you're just adding files to foobar via playlists?

I'm out for a few hours now, but will take a look when I'm back later if that doesn't work.

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #11
foo_sqlite comes with a bunch of default tags/columns set up by default; to add additional tags that exist in your media library go to 'File > Preferences > Media Library > SQLite viewer' and select 'MediaLibrary table', then click 'Add defaults' and 'Apply'. Does this add md5 to the list of columns available?
Yes, now there are tags in the MediaLibrary table, including md5. I executed this script:
Quote
insert into Playlist_Updatable (path, playlist_name)
  select path,
         'Duplicate Tracks'
  from mediaLibrary
  where md5 in (select md5
                  from mediaLibrary
                  group by md5
                  having count(*)>1
               );
Now I understand, that SQLite takes the list of tracks from medialibrary, I configured it too, it has status monitoring folders wich have duplicates.
Spoiler (click to show/hide)
Result can see:
Spoiler (click to show/hide)
Thank you very much for help, plugin and scripts  :)

P.S. I also tried execute this script:
Quote
insert into Playlist_Updatable (path, playlist_name)
  select path,
         'Duplicate Tracks'
  from Playlist
  where playlist_index = active_playlist()
            and md5 in (select md5
                  from Playlist
                  where playlist_index = active_playlist()
                  group by md5
                  having count(*)>1
               );
But it still says "no such column: md5":
Spoiler (click to show/hide)

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #12
foo_sqlite comes with a bunch of default tags/columns set up by default; to add additional tags that exist in your media library go to 'File > Preferences > Media Library > SQLite viewer' and select 'MediaLibrary table', then click 'Add defaults' and 'Apply'. Does this add md5 to the list of columns available? Repeat for 'Playlist table'.
Did you also 'Add defaults' to the 'Playlist table' too?


Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #14
Glad you got it sorted.

It's a great plugin, and is the only tool I know of that enables you to perform queries across groups of tracks e.g. at the album level.

I use it a lot for QA purposes, things like checking that there's only one artistsort per artist, but have also used it for updating tags. There was a guy on the MinimServer forum that wanted to create a tag that contained the album duration so he could show it in his control point, and this plugin made that trivial. I also added DISCTOTAL to my library as I hadn't included it originally but found a use for it when I was performing conversions.


Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #15
With the latest version of foo_sqlite you should omit
Code: [Select]
drop table if exists Playlist_Updatable;
create virtual table Playlist_Updatable USING MetaDB_Module(no_multivalue_split,playlist);
at the beginning and
Code: [Select]
drop table Playlist_Updatable;
at the end of your scripts and use PlaylistUpdatable instead of Playlist_Updatable. PlaylistUpdatable is now automatically created.
In my recent scripts I had moved over to using PlaylistUpdatable, but I hadn't realised it was automatically available so that simplifies the scripts a bit.

Thanks again for the plugin.


Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #16
Hello again. Is it possible to set a term so that duplicates are allocated (marked) only in a certain directory and its subdirectories? For example, there are already files in different folders and on different disks and you need to compare them with files that are in a certain directory and its subdirectories, in this case in the F :\! Labels. After finding duplicates need only to mark them or to delete to the recycle. Need a script to select duplicate in a ready-made list with duplicates found

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #17
I'm not entirely sure I understand.
If you're just trying to run the code against a subset of your collection then the easiest way to do that is just to filter the active playlist rather than change the sql.
Or are you asking instead to compare directories, so find files from Path A that also exist in path B?

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #18
There is already a ready list of duplicates. I need just select all duplicates that are located in a specific folder and its subfolders in order to remove them to the recycle bin. Like this:

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #19
Given it's a playlist can't you just filter it by the path?
I use facets, and if I type 'H:\' into the filter bar it'll automatically select everything that resides under that path, at which point I can delete them.

You could change the query so it only kept one, but how would it know which one to keep, what if there were duplicates under T:\ AND under H:\? Alternatively, if you just wanted to find files in H:\ that were already in T:\ then that's possible, but that's what I suggested before.

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #20
Hello. Need help again. How do I select all tracks with the same tag value except one?
Example: https://i.postimg.cc/sxbbKkV1/2022-11-05-121442.png
In this case, it does not matter from which folder to remove duplicates, and in which folder to leave one track.

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #21
Hello. Need help again. How do I select all tracks with the same tag value except one?
Example: https://i.postimg.cc/sxbbKkV1/2022-11-05-121442.png
In this case, it does not matter from which folder to remove duplicates, and in which folder to leave one track.
You can get them with the following SQL statement for the md5 tag:
Code: [Select]
INSERT INTO Playlist_Updatable (path, playlist_name)
  SELECT path,
         'Duplicate Tracks'
  FROM (
    SELECT first_value(path) OVER w_md5 first_path,
           count(*) OVER w_md5 track_count,
           path
    FROM MediaLibrary
    WHERE md5 is not null
    WINDOW w_md5 AS (PARTITION BY md5)
  )
  WHERE track_count>1
    AND path<>first_path

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #22
Lots of CDs have silent tracks and lots of them are one of a short list of lengths (sample counts)
How should the WHERE's be amended to exclude those in a [given list of sample counts]?

I don't know SQL but I understand this one: https://www.zazzle.co.uk/select_from_users_where_clue_0_0_rows_returned_t_shirt-235876639241498000

Re: Filtering Duplicate FLACs Using MD5 Hash

Reply #23
Lots of CDs have silent tracks and lots of them are one of a short list of lengths (sample counts)
How should the WHERE's be amended to exclude those in a [given list of sample counts]?
I don't think you have access to sample counts but you do have access to attributes like length_seconds, bitrate and filesize so you could use a combination of those, but the best way I've found to identify silent tracks is to look for missing replaygain_track_gain (assuming you calculate that of course).
Code: [Select]
select path,
       album,
       "album artist",
       artist,
       title,
       md5,
       length_seconds,
       bitrate,
       filesize,
       replaygain_track_peak,
       replaygain_track_gain
  from Playlist
  where     playlist_index = active_playlist()
        and replaygain_track_gain is null
If you wanted to exclude these from the dedupe process just change the where clause to:
Code: [Select]
WHERE md5 is not null and replaygain_track_gain is not null