Why does Opus encoder have a speech/music detector?

...rather than trying both modes for each frame and using the lower distortion one?

Re: Why does Opus encoder have a speech/music detector?

Reply #1
That's an interesting question, and I've been curious about it for a while.

According to the documentation, opus_encoder_create() has three application options:
* voip
* audio
* restricted_lowdelay

And then the opus_encoder_ctl() interface can do OPUS_SET_SIGNAL(signal_type), which can be:
* auto
* voice
* music

But this is just supposed to be a bias, more a hint for the encoder than a restriction. Furthermore, from the Opus 1.1 demo page I can quote: "As of 1.1, libopus analyzes the audio content in realtime and dynamically selects the correct encoding mode on the fly." And I wonder how these options behave as of 1.2.

But I've always wondered (per version) how mixing the options works out:
VOIP application with signal bias set to music?
Music application with signal bias set to voice?

 

Re: Why does Opus encoder have a speech/music detector?

Reply #2
The VOIP mode option in opusenc "gives best quality at a given bitrate for voice signals. It enhances the input signal by high-pass filtering and emphasizing formants and harmonics. Optionally it includes in-band forward error correction to protect against packet loss. Use this mode for typical VoIP applications. Because of the enhancement, even at high bitrates the output may sound different from the input." It is deliberately not transparent, to improve the clarity of speech at low bitrates. SILK is preferred because it enables some of these features. One of the features is a 50 Hz high pass filter.

restricted-lowdelay forces CELT and forces the lowest possible latency. This disables some of the features that are used to enhance speech and it may reduce clarity regardless of bitrate.

The speech mode is essentially SILK and the music mode is essentially CELT, although there are hybrid modes and extreme high or low bitrates can override that decision.

To answer the original question, Opus (as of v1.1) does seamlessly switch between speech and music modes unless you force one or the other. It doesn't generally do so on every frame, since the algorithm tends to take some time to change its decision, but the encoder continuously makes instantaneous decisions about the type of audio. It (libopus, but not yet opusenc) can also be configured to use lookahead to improve the decision for each frame.
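As an illustration of why a detector that "takes some time to switch" ends up changing mode less often than a frame-by-frame decision, here is a minimal sketch of smoothing plus hysteresis. It is not libopus's actual algorithm: the function names, the smoothing factor, and the two thresholds are all assumptions made up for this example.

```python
def hysteresis_modes(speech_probs, up=0.65, down=0.35, alpha=0.5):
    """Map per-frame speech probabilities to modes, switching reluctantly.

    All parameters are hypothetical: `alpha` smooths the probability over
    time, and the mode only flips once the smoothed value crosses `up`
    (music -> speech) or `down` (speech -> music).
    """
    mode = "music"
    smoothed = 0.5
    modes = []
    for p in speech_probs:
        smoothed = alpha * smoothed + (1 - alpha) * p
        if mode == "music" and smoothed > up:
            mode = "speech"
        elif mode == "speech" and smoothed < down:
            mode = "music"
        modes.append(mode)
    return modes


def naive_modes(speech_probs):
    """Frame-by-frame decision with a single threshold, for comparison."""
    return ["speech" if p > 0.5 else "music" for p in speech_probs]


def count_switches(modes):
    """Number of frame boundaries at which the mode changes."""
    return sum(1 for a, b in zip(modes, modes[1:]) if a != b)


# A noisy detector output: music, then speech, then music, with outliers.
probs = [0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.1]
smooth = hysteresis_modes(probs)
naive = naive_modes(probs)
print(count_switches(naive), count_switches(smooth))
```

With this made-up input the naive decision flips on every outlier, while the hysteresis version switches once into speech and once back.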

So, one last thing: VoIP mode with music? Essentially a bad idea, since VoIP mode deliberately does things that reduce music transparency. VoIP application with the music setting is less bad, but clarity of speech may be reduced. Music application with the voice setting is less than ideal, since you may end up using SILK instead of CELT (although with sufficient bitrate it may not matter). Note that you can't actually force music or speech mode from opusenc, only give it hints, and even that has been well hidden since v1.1 because the automatic detection is usually good enough.

Re: Why does Opus encoder have a speech/music detector?

Reply #3
...rather than trying both modes for each frame and using the lower distortion one?

My guess is that it is fairly inefficient to switch modes, so you want to pick one and stick with it for more than a frame.  Usually MDCT codecs work like this at least.  Probably it gets pretty complex to try the brute force solution over a long enough time interval to be worthwhile. 

Re: Why does Opus encoder have a speech/music detector?

Reply #4
...rather than trying both modes for each frame and using the lower distortion one?

My guess is that it is fairly inefficient to switch modes, so you want to pick one and stick with it for more than a frame.  Usually MDCT codecs work like this at least.  Probably it gets pretty complex to try the brute force solution over a long enough time interval to be worthwhile. 

I don't think it is expensive. There is even the hybrid mode, which encodes simultaneously in CELT and SILK, although I think the crossover frequency between the two is fixed. As I mentioned, the decision on speech or music is analysed continuously during encoding and can change at any time, potentially switching from MDCT to LP (or hybrid) on any frame (but usually staying there for multiple frames).

The "brute force" approach of analysing every encoded frame to see which mode is best might be tricky. Is it even possible? Who defines what is "lower distortion" in a lossy codec? Speech in Opus is deliberately encoded with "distortion", or at least modifications that are intended to enhance clarity rather than aiming for strict transparency, so deciding on either speech or music is something that needs to be done independently of how effective the encoding will be.

Although I'm heavily interpreting some of this, there is a fairly extensive description out there, including in the Opus spec itself. This link might be particularly relevant:
https://people.xiph.org/~xiphmont/demo/opus/demo3.shtml


Re: Why does Opus encoder have a speech/music detector?

Reply #6
...rather than trying both modes for each frame and using the lower distortion one?

There are many reasons, but a fundamental one that prevents us from trying the two and picking the best is that I'm not aware of any distortion metric that's good enough to make that decision. The ones I looked at would always prefer one mode, regardless of the content.
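To make the point concrete, here is a toy sketch of the "try both and keep the lower-distortion one" idea with plain mean squared error as the metric. The two "codecs" are hypothetical stand-ins invented for this example, not SILK or CELT: one simply quantizes the waveform coarsely, the other deliberately reshapes the signal (a pre-emphasis step) before quantizing finely.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def waveform_codec(frame, step=0.125):
    """Toy waveform-matching mode: coarse uniform quantizer."""
    return [round(x / step) * step for x in frame]


def enhancing_codec(frame, step=0.03125, coeff=0.5):
    """Toy 'enhancing' mode: alters the signal on purpose, then quantizes finely."""
    out, prev = [], 0.0
    for x in frame:
        y = x - coeff * prev  # deliberate modification (pre-emphasis)
        prev = x
        out.append(round(y / step) * step)
    return out


def pick_by_mse(frame):
    """Brute force: encode with both toy modes and keep the lower-MSE one."""
    candidates = {
        "waveform": waveform_codec(frame),
        "enhancing": enhancing_codec(frame),
    }
    return min(candidates, key=lambda name: mse(frame, candidates[name]))


def tone(freq, n=480, rate=48000, amp=0.8):
    """One 10 ms frame of a sine tone."""
    return [amp * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


frames = {
    "low tone": tone(200),
    "high tone": tone(3000),
    "mix": [0.5 * (a + b) for a, b in zip(tone(200), tone(3000))],
}
picks = {name: pick_by_mse(frame) for name, frame in frames.items()}
print(picks)
```

In this toy setup the waveform-matching mode wins for every frame, because MSE penalizes any intentional change to the signal no matter how it would sound. That is the kind of one-sided preference described above; a real comparison would need a perceptual metric that is trustworthy across both modes.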

Re: Why does Opus encoder have a speech/music detector?

Reply #7
One question is whether OPUS_SET_SIGNAL is meant to be deprecated in future builds, or whether it still has some particular use cases (on top of setting the application option).

For one, I know opusenc.exe has it disabled (or very well hidden).

Re: Why does Opus encoder have a speech/music detector?

Reply #8
One question is whether OPUS_SET_SIGNAL is meant to be deprecated in future builds, or whether it still has some particular use cases (on top of setting the application option).
For one, I know opusenc.exe has it disabled (or very well hidden).
It will not be deprecated and in fact we're planning on exposing it in opusenc in a future (next?) version. That being said, with the speech/music detector it should usually not be needed.

Re: Why does Opus encoder have a speech/music detector?

Reply #9
Am I correct in assuming that voip + voice is ideal for, well, erm, voice chat,
and that audio + music is best suited for music and live streaming?
In particular I'm thinking of an internet radio DJ who plays music and then talks between songs (sometimes with background music).

Re: Why does Opus encoder have a speech/music detector?

Reply #10
Am I correct in assuming that voip + voice is ideal for, well, erm, voice chat,
and that audio + music is best suited for music and live streaming?
In particular I'm thinking of an internet radio DJ who plays music and then talks between songs (sometimes with background music).


For something like radio, I would use "audio" and let the detector figure out when there's voice vs music. Think of it as "voip" trying to optimize for communication/intelligibility and "audio" trying to optimize for fidelity. For example, voip will filter out the very low frequencies (e.g. < 50 Hz), while audio will leave them there (but still remove the DC).
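To get a feel for the difference between filtering below 50 Hz and only removing DC, here is a rough sketch using a generic one-pole high-pass filter. This is not the filter libopus uses; the one-pole design and the 3 Hz cutoff chosen for the "DC removal" case are assumptions picked purely for illustration.

```python
import math


def one_pole_highpass(samples, cutoff_hz, rate=48000):
    """Generic first-order high-pass filter (illustrative, not libopus code)."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / rate
    a = rc / (rc + dt)
    out = []
    prev_x = prev_y = 0.0
    for x in samples:
        y = a * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def rms(samples):
    """Root mean square level of a sample list."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


rate = 48000
# One second of a 20 Hz tone (amplitude 0.5) riding on a DC offset of 0.5.
signal = [0.5 + 0.5 * math.sin(2 * math.pi * 20 * i / rate) for i in range(rate)]

voip_like = one_pole_highpass(signal, 50.0)   # "filter out the very low frequencies"
audio_like = one_pole_highpass(signal, 3.0)   # assumed cutoff: only remove DC

# Skip the first half second so the start-up transient has died away.
tail = slice(rate // 2, None)
tone_rms = 0.5 / math.sqrt(2)
voip_ratio = rms(voip_like[tail]) / tone_rms
audio_ratio = rms(audio_like[tail]) / tone_rms
audio_dc = sum(audio_like[tail]) / len(audio_like[tail])
print(round(voip_ratio, 2), round(audio_ratio, 2), round(audio_dc, 4))
```

Under these assumptions the 50 Hz filter removes well over half of the 20 Hz tone's level, while the DC-only filter passes it almost untouched and still strips the constant offset.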

Re: Why does Opus encoder have a speech/music detector?

Reply #11
I can see the value of audio+voice and audio+music switching if going from music to talking only then music again.

But what about talking with background music?

And how much of a quality advantage/disadvantage could there be?
Could an audio+mixed (voice & music) setting be a future possibility?

At 96 kbit/s I'm assuming that the encoder has plenty of bits to encode voice even if it's set to audio+music?
Music quality is 1st priority and voice 2nd in this case.

Also, is there a risk of quality drop if the codec decides to switch to voice during a song?

I can't help but feel that the default for OPUS_SET_SIGNAL
should be voice for voip and music for audio, rather than auto.
Or simply make that the exact behaviour of auto, and allow music and voice to be used as overrides for voip or audio (for the rare cases where somebody does music over voip, for example a jam session).


Re: Why does Opus encoder have a speech/music detector?

Reply #12
Keep in mind that there's a lot more than just speech/music detection that goes into the mode decision. Bitrate plays an even bigger role. You give the example of 96 kb/s... at that rate Opus will use CELT no matter what the detection says because that's the best mode anyway. The detection is mostly used in a relatively narrow range of bitrates and it's generally reliable (and in version 1.3 it's going to be near-perfect). It's already more reliable than letting users choose manually without understanding the details of the encoder.
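A rough sketch of how bitrate can dominate the decision, with the detector only mattering in a middle band. The function name and both thresholds are hypothetical values invented for this example (the thread gives no numbers other than 96 kb/s landing in CELT), the choice of SILK below the lower threshold is an assumption, and hybrid mode is left out to keep it short.

```python
def choose_mode(bitrate_bps, speech_prob, low_bps=16000, high_bps=40000):
    """Pick a coding mode from bitrate first, detector second.

    `low_bps` and `high_bps` are made-up thresholds, not libopus values.
    """
    if bitrate_bps >= high_bps:
        return "CELT"  # plenty of bits: detector output is ignored
    if bitrate_bps <= low_bps:
        return "SILK"  # assumption: very low rates favour the speech mode
    # Only in the band in between does the speech/music detector decide.
    return "SILK" if speech_prob >= 0.5 else "CELT"


for rate, prob in [(96000, 0.99), (24000, 0.9), (24000, 0.1)]:
    print(rate, prob, choose_mode(rate, prob))
```

Even with a detector that is very sure it hears speech, the 96 kb/s case stays in CELT here, which matches the behaviour described above.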

Re: Why does Opus encoder have a speech/music detector?

Reply #13
The detection is mostly used in a relatively narrow range of bitrate and it's generally reliable (and in version 1.3 it's going to be near-perfect).
Sorry for the off-topic question: what features are in the development plan for 1.3, in general terms?

Thank You.

Re: Why does Opus encoder have a speech/music detector?

Reply #14
Sorry for the off-topic question: what features are in the development plan for 1.3, in general terms?
I'm aiming to release 1.3 before the end of 2017, so there's going to be much less than in 1.2. The main thing of interest will likely be a much better speech/music detector based on recurrent neural networks, and which makes very very few errors. There's also going to be improvements to stereo music around 24-48 kb/s.

Re: Why does Opus encoder have a speech/music detector?

Reply #15
I'm aiming to release 1.3 before the end of 2017, so there's going to be much less than in 1.2. The main thing of interest will likely be a much better speech/music detector based on recurrent neural networks, and which makes very very few errors. There's also going to be improvements to stereo music around 24-48 kb/s.

Looking back, are there any big mistakes in the codec itself or any easy gains you wish you could implement now into the finished specification? Or is Opus pretty much still perfect both in design/architecture as well as the possible features that are built in, leaving only the encoder with possible refinements? Just curious here.

Re: Why does Opus encoder have a speech/music detector?

Reply #16
Looking back, are there any big mistakes in the codec itself or any easy gains you wish you could implement now into the finished specification? Or is Opus pretty much still perfect both in design/architecture as well as the possible features that are built in, leaving only the encoder with possible refinements? Just curious here.

Opus is by no means perfect, but there is also nothing I would consider as a "big mistake" either. There's a few details I might have done differently, but the encoder can reasonably work around most of those.

Re: Why does Opus encoder have a speech/music detector?

Reply #17

Opus is by no means perfect, but there is also nothing I would consider as a "big mistake" either. There's a few details I might have done differently, but the encoder can reasonably work around most of those.

That's good to hear :)

Re: Why does Opus encoder have a speech/music detector?

Reply #18
I'm aiming to release 1.3 before the end of 2017, so there's going to be much less than in 1.2. The main thing of interest will likely be a much better speech/music detector based on recurrent neural networks, and which makes very very few errors. There's also going to be improvements to stereo music around 24-48 kb/s.
Any news about it? :)

Re: Why does Opus encoder have a speech/music detector?

Reply #19
Are there any plans to support different bitrate settings for speech and music (e.g. 32 kbps for speech and 64 kbps for music)?
This would save data traffic when streaming mixed content over the net.
Speech sounds fine at a much lower data rate than music does.