Topic: Audio similarity (Read 10062 times)

Audio similarity

Another post related to audio similarity, but I feel that posting in my original thread from years ago won't result in anything productive anymore. That approach was naive, although it can still give interesting information about a set of tracks.

I was using phash some days ago (yep, no big deal). Although image processing and audio share the same general DSP procedures (and, in the case of bare statistics, the same ones with totally different variable names!?), it is generally unusable for audio, so I thought I'd look into this again.

I don't have deep DSP knowledge, otherwise I would have solved this myself, but from what I've read I assume that MFCC is the right way to describe audio files perceptually.
Here on HA we had fooID, developed by Garf, which in fact used MFCC extracts but was unusable for general comparison (i.e. foo_biometric by musicmusic). Maybe the differentiation equation wasn't good enough, or maybe the underlying algorithm didn't allow easy enough comparison. It also seems to me that there was some kind of database back when that module was created (I wasn't a member back then).
Web services (e.g. echonest, last.fm, ...) use MFCC extracts for comparison and, IMHO, provide fair results.
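For readers who haven't seen MFCC extraction written out, here is a minimal NumPy-only sketch of the textbook pipeline (windowed frames, power spectrum, mel filterbank, log, DCT-II), followed by a crude whole-track comparison. It is not what fooID, echonest or last.fm do internally. The function names (`mfcc`, `track_descriptor`, `cosine_similarity`) and the parameter values (1024-sample frames, 26 filters, 13 coefficients) are assumptions chosen for illustration only.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (illustrative)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc(signal, sr, n_fft=1024, hop=512, n_filters=26, n_coeffs=13):
    """Return an (n_frames, n_coeffs) array of MFCCs for a mono signal."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)   # small floor avoids log(0)
    # Unnormalized DCT-II over the filter axis, written out by hand.
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_mel @ dct.T

def track_descriptor(signal, sr):
    """Assumed descriptor: mean MFCC per track, dropping coefficient 0
    (overall level) so that playback gain does not dominate."""
    return mfcc(signal, sr)[:, 1:].mean(axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sr = 22050
    noise = rng.standard_normal(sr)                      # 1 s of white noise
    tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
    print(cosine_similarity(track_descriptor(noise, sr),
                            track_descriptor(tone, sr)))
```

Averaging over all frames throws away timing, so this can only say that two tracks have a broadly similar timbre. It cannot tell whether they are the same recording.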

I'm posting this because I think that, beyond the general audio topics, some forum members could take on this popular topic and come up with a solution, promoting HA through example and collaborative work.

High hopes, thanks

Reply #1
>  it is generally unusable for audio

Why is that? Is a row of pixels fundamentally different from a stream of samples?

Reply #2
> > it is generally unusable for audio
>
> Why is that? Is a row of pixels fundamentally different from a stream of samples?

FWIW, yes. Pixels represent a raster, which is data with 4 dimensions (X, Y, intensity and time), while audio samples represent a simple data stream in just two dimensions, intensity and time.


Reply #3
Usually intensity/amplitude are not considered dimensions, since they are the dependent values (e.g. intensity at sample 10), whereas dimensions are usually indexable (e.g. pixel 10,10,2).  You could say video is 4 dimensional though if it has color channels (x,y,t,lam).
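The "dimensions are what you index by" convention is the one array libraries use, and a small NumPy sketch makes it concrete. The array sizes and variable names below are arbitrary assumptions, picked only to keep the example small.

```python
import numpy as np

# Arbitrary illustrative sizes; only the number of axes matters here.
audio = np.zeros(44100)             # mono audio: f(t)
image = np.zeros((480, 640))        # grayscale image: f(y, x)
video = np.zeros((24, 48, 64, 3))   # color video: f(t, y, x, lam)

for name, arr in [("audio", audio), ("image", image), ("video", video)]:
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)

# The stored value (intensity) is what comes back from a lookup;
# it is not one of the indices.
print(audio[10])            # intensity at sample 10
print(video[2, 10, 10, 2])  # intensity at frame 2, pixel (10, 10), channel 2
```

Under this convention the intensity is the value stored in the array and never counts as one of its axes.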

Most often, optical engineers do treat image data as essentially a stream of samples, just one that spans a higher-dimensional space. Things like the Nyquist-Shannon sampling theorem still hold, with spatial aliasing set by the point spread function (the 2D impulse response) and temporal aliasing set the usual way, as in audio. Color is a little different, though, because you're sampling a color space rather than the actual spectrum (due to how weird human perception of color is), so thinking about aliasing in color is uncommon, at least until you go up to larger numbers of color channels in hyperspectral systems.
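As a reminder of what "temporal aliasing set the usual way" means, here is a small NumPy sketch. The 10 Hz sampling rate and the 7 Hz and 3 Hz tones are assumed numbers chosen to make the arithmetic easy. A tone above the Nyquist frequency produces exactly the same samples as a lower one.

```python
import numpy as np

fs = 10.0                 # sampling rate in Hz, so Nyquist is fs / 2 = 5 Hz
n = np.arange(50)         # sample indices
t = n / fs                # sample times in seconds

above_nyquist = np.cos(2 * np.pi * 7.0 * t)   # 7 Hz tone, above fs / 2
alias = np.cos(2 * np.pi * 3.0 * t)           # 3 Hz tone, i.e. fs - 7 Hz

# Once sampled, the two tones cannot be told apart.
print(np.allclose(above_nyquist, alias))      # prints True
```

The same argument applies along each spatial axis of an image, with pixel pitch playing the role of the sample period.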

Reply #4
> I don't have deep DSP knowledge, otherwise I would have solved this myself, but from what I've read I assume that MFCC is the right way to describe audio files perceptually.

I recently developed a new technology that, among other things, finds audio similarity.

The technology is a new lossless audio compressor combined with search capabilities.

The compressor, as part of the format, stores high-accuracy spectral information. The search capability uses this information to speed up searching. In addition, the compressor divides a song into blocks, based on content.

One of the search options searches an audio file for blocks similar to a given block. This option utilizes a variant of MFCC. It is very fast since it relies on the stored spectral information. It searches a song in a fraction of a second.

Those who want to evaluate the technology can contact me at gringya (at) gmail.

Reply #5
> Usually intensity/amplitude are not considered dimensions, since they are the dependent values (e.g. intensity at sample 10), whereas dimensions are usually indexable (e.g. pixel 10,10,2).  You could say video is 4 dimensional though if it has color channels (x,y,t,lam).


My approach to this area is based on undergraduate and postgraduate work in multivariate Calculus and its applications to systems analysis.

I'm not sure what the above is based on.


Reply #6
> > Usually intensity/amplitude are not considered dimensions, since they are the dependent values (e.g. intensity at sample 10), whereas dimensions are usually indexable (e.g. pixel 10,10,2).  You could say video is 4 dimensional though if it has color channels (x,y,t,lam).
>
> My approach to this area is based on undergraduate and postgraduate work in multivariate Calculus and its applications to systems analysis.


Not sure what you mean. A 2D array detector measures power over two dimensions per exposure. You can get it up to 3 if you color-coat the pixels, or 4 if you take multiple exposures. Usually this is expressed in functional notation as f(x,y,t), or f(x,y,t,lam) if multispectral.

Reply #7
> > > Usually intensity/amplitude are not considered dimensions, since they are the dependent values (e.g. intensity at sample 10), whereas dimensions are usually indexable (e.g. pixel 10,10,2).  You could say video is 4 dimensional though if it has color channels (x,y,t,lam).
> >
> > My approach to this area is based on undergraduate and postgraduate work in multivariate Calculus and its applications to systems analysis.
>
> Not sure what you mean?


What I mean is that in multivariate calculus, each variable in the vector description of a sampled data point represents a dimension in an n-space, whether they are independent variables or dependent variables.

In audio, each channel or signal path is in 2-dimensional space, with the independent variable usually being time. Therefore, we almost always characterize a signal as f(t).

In the case of video the signal has additional attributes, but we were talking about audio, right? Bringing up video in a discussion of audio is usually a deflection or an entry into a rabbit hole.

An audio signal passing down a channel or other signal path has only two attributes: time plus some representation of signal intensity, usually either power or voltage.
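The two ways of counting in this thread can be put side by side in a short NumPy sketch. The 8 kHz rate, the 440 Hz tone and the variable names are assumptions for illustration. The same sampled signal is a 1-axis array when you count only the independent variable, and a set of points in 2-space when you count the (t, f(t)) pairs.

```python
import numpy as np

sr = 8000                              # assumed sampling rate in Hz
t = np.arange(16) / sr                 # independent variable: time
x = np.sin(2 * np.pi * 440.0 * t)      # dependent variable: amplitude

signal = x                             # indexed view: one axis (time)
points = np.column_stack((t, x))       # graph view: one (t, f(t)) row per sample

print(signal.ndim, signal.shape)
print(points.shape)
```

Both views hold the same data. The difference is only whether the dependent variable is counted as a coordinate.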

Reply #8
So you would consider f(t) to define a 2 dimensional function?