Utilizing machine learning (ML) to distinguish talk from music in audio streams

I started down this path because I enjoy listening to and archiving FM radio and Internet music program streams from local stations. These are great for discovering new music, learning about the local music scene, and listening to interviews with artists and others in the music community. I find that these programs provide better music curation than any “personalized” algorithm or "genre" stream I have listened to. One local station's catch phrase is "Don't Let The Robots Win!" and that sums it up pretty well.

The problem is that if I listen to these archived programs more than once or twice, the dialog starts to get repetitive and sometimes even irrelevant and outdated (e.g., past concerts or long-finished pledge drives). Also, there are times when the dialog is simply unwanted, such as when I'm using the stream as background music for exercise or reading, or when I have guests over.

For these situations I have even gone so far as to edit particularly good archived streams and remove the dialog by hand. Unfortunately this is time-consuming and not at all practical on a regular basis. However, while doing it I noticed that I could fairly easily distinguish the music from the dialog just by glancing at the waveforms, and I figured this was something that could be automated without too much difficulty (famous last words).

What I noticed when looking at the waveforms is that the short-term dynamic range of talk is much greater than that of music, partly because unvoiced sounds have far less energy than voiced sounds, and partly because people pause during speech. And since broadcast speech usually has little or no reverb, the level briefly drops very low during those gaps and stays low longer than it does with most music.
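
To make that more concrete, here is a rough sketch (Python, not the actual Skipper code) of the kind of short-term measurements I mean. The 20 ms window, 4-second block, and 40 dB "low" threshold are just illustrative values, not the ones I actually settled on.

Code: [Select]
# Compute a short-term RMS envelope for 16-bit mono samples, then summarize each
# multi-second block by its envelope dynamic range and the fraction of windows
# sitting far below the block's peak (talk tends to score high on both).
import numpy as np

def block_features(samples, rate=44100, win_ms=20, block_s=4.0):
    win = int(rate * win_ms / 1000)
    n_wins = len(samples) // win
    frames = samples[:n_wins * win].astype(np.float64).reshape(n_wins, win)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-9         # short-term envelope
    db = 20 * np.log10(rms / 32768.0)                        # dB re 16-bit full scale

    per_block = int(block_s * 1000 / win_ms)
    feats = []
    for i in range(0, len(db) - per_block, per_block):
        blk = db[i:i + per_block]
        feats.append({
            "dyn_range_db": blk.max() - blk.min(),                  # talk: typically larger
            "low_fraction": float((blk < blk.max() - 40).mean()),   # time spent well below peak
        })
    return feats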

So I wrote some code to quantify some specific parameters along those lines and created training samples by hand-editing hours of audio recorded from FM and Internet radio. Once the audio had been analyzed and parameterized, I searched for the most reliable combination of predictors and ended up with a 4-dimensional tensor that achieves about a 95% success rate on my training audio, and seems to translate well to new streams.
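
I obviously can't share the real predictors or the trained values here, but purely as an illustration of the general shape of the approach (quantize four per-block features, use them to index a table of statistics gathered from the hand-labeled blocks), it's something along these lines:

Code: [Select]
# Illustration only: the actual predictors, bin counts, and table contents in
# Skipper are different. Each labeled training block contributes a count to a
# 4-D table; classification is just a lookup plus a comparison.
import numpy as np

BINS = 8                                            # hypothetical resolution per dimension
counts = np.zeros((2, BINS, BINS, BINS, BINS))      # [class][f0][f1][f2][f3]

def quantize(feature_vec, lo, hi):
    # Map each of the four features into an integer bin 0..BINS-1.
    f = np.clip((np.asarray(feature_vec) - lo) / (hi - lo), 0, 0.999)
    return tuple((f * BINS).astype(int))

def train(blocks, labels, lo, hi):                  # labels: 0 = music, 1 = talk
    for vec, lab in zip(blocks, labels):
        counts[(lab,) + quantize(vec, lo, hi)] += 1

def classify(vec, lo, hi, bias=0.0):
    idx = quantize(vec, lo, hi)
    music, talk = counts[(0,) + idx], counts[(1,) + idx]
    # 'bias' stands in for the kind of configurable threshold mentioned later.
    return 1 if talk > music + bias else 0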

It's pretty trivial to distinguish most music and talk in audio streams, but it's essentially impossible to be 100% accurate. Why? Well, consider the situation where the DJ is talking during and over the music. Depending on the relative levels and how much of the time the talking covers, this can make talk detection essentially impossible. Or consider a cappella singing (which can range from operatic singing to essentially talking), or music that includes people actually talking (my brother used to like putting preachers in his songs). In addition, some music genres simply have a temporal acoustic profile very similar to talking, with little reverb.

Still, once I had the detection reasonably reliable, I implemented code to selectively edit either the talk or the music out of the stream (with cross-fading) and configured it as a simple filter with raw 16-bit audio in and out. FFmpeg, LAME, and of course WavPack all work well like this to pre-process the audio, and I’ve been very happy with the results so far. I also added a configurable threshold to bias the detection for situations where the default is not ideal (for example, if too much talk is getting through, or too much music is being eliminated). The application is called Skipper.
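
For anyone curious what the editing step looks like conceptually, here is a minimal Python sketch of splicing detected talk regions out of a mono 16-bit buffer with a short crossfade at each join. The region list, the 50 ms fade length, and the in-memory buffer are simplifications; the real Skipper operates as a streaming filter as described above.

Code: [Select]
# audio: mono int16 NumPy array; talk_regions: sorted (start, end) sample ranges to drop.
import numpy as np

def cut_with_crossfade(audio, talk_regions, rate=44100, fade_ms=50):
    fade = int(rate * fade_ms / 1000)
    fade_out = np.linspace(1.0, 0.0, fade)
    fade_in = 1.0 - fade_out

    # Collect the segments we keep (everything between the talk regions).
    keeps, pos = [], 0
    for start, end in talk_regions:
        keeps.append(audio[pos:start].astype(np.float64))
        pos = end
    keeps.append(audio[pos:].astype(np.float64))

    # Join them, overlapping 'fade' samples at every splice point.
    out = keeps[0]
    for seg in keeps[1:]:
        n = min(fade, len(out), len(seg))
        joined = out[len(out) - n:] * fade_out[fade - n:] + seg[:n] * fade_in[fade - n:]
        out = np.concatenate([out[:len(out) - n], joined, seg[n:]])
    return np.clip(out, -32768, 32767).astype(np.int16)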

I’ve never been great at guessing what other music lovers / audio enthusiasts might or might not be interested in, and maybe this application has value only to me (with my specific quirks), but in case someone else is interested I’ve put the whole thing on GitHub, including detailed usage examples and a Windows executable. For obvious reasons I cannot include the training audio, but the executable does have the resulting tensor embedded (in compressed form).

“Skipper” application on GitHub

Any and all feedback welcome!