SoundStream: An End-to-End Neural Audio Codec. How exciting is it?


SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

2021-09-24 07:51:33

I'm a bit of a nerd when it comes to audio codecs, so I suppose my interest in this is a bit more than average. Anyway, I recently found this blog post, and I was curious about what people here would think.

My own admittedly uninformed thoughts are that this is promising, but it's too early to really know what to expect. For very low bit rates at least, this could be setting a new bar in compression efficiency. For speech, I'm definitely interested; I've never heard a codec produce such clear speech at such low rates.

For music, the examples on the site show that Opus and perhaps everything else just can't compete. Still, Silk was never designed for music so it has a huge disadvantage. But despite the promise I see from SoundStream (and I might as well lump Satin into this as well even though its claims for great music compression have yet to be proven afaik), there is still precious little to show how gracefully or otherwise these codecs could handle music. There is nothing at all to show how a complex track might fare.

At the lowest bit rates < 4 kbps, SoundStream artifacts seem quite invasive on polyphonic signals. Of course that's to be expected at such low rates. Nobody in their right mind would expect otherwise. But at higher rates I'm left wondering if there is a quality ceiling or other glaring weaknesses. My educated guess is that things do clean up, but it's far too early to specify the extent.

Of course, all we can do is speculate, which really isn't useful I suppose, but it can lead to good discussion. The main points of speculation I can come up with right now are:
Could a codec targeted at such a low bit rate still benefit everyone, whether you're simply using compression for music storage, or you want to stream decent but not necessarily perfect quality audio? Opus has shown just how wide a scope an audio codec can cover and excel in, so I'm hoping newer codecs can follow its example. If advancing technology really holds promise, I'd hate to see it remain niche forever.
Could the bit rate for transparency get considerably lower in the near future due to stuff like this?
Could audible artifacts get to a point where they sound eerily natural, tempting people who don't care about transparency to put the bit rate to ridiculous lows without complaining? I mean, low bit rates are becoming less and less important so it doesn't really matter to most people if 16 kbps sounds good anymore, but if you're a nerd like me or you actually have a need, then why not entertain the thought? Lol

Anyway that's enough from me. Eager to see what others have to say!

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #1 – 2021-09-24 13:21:45

This format will never be popular for normal people.
It will only be used if they offer transparent delivery, just like Opus on YouTube.
Not a single person knows what Opus is, but they all know what YouTube is.
For normal people, MP3/AAC/FLAC will remain standard FOREVER...
It's not just the size, you also need to take encoding speed and battery life into consideration.
Why are people still using MP3 instead of Opus?
Why are people still using H.264 instead of H.265?
Why are people still using JPEG instead of WebP?
With the advance of 5G, I don't really see a need for these very low bitrate formats...

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #2 – 2021-09-24 14:57:11

Quote from: musicalman on 2021-09-24 07:51:33

But at higher rates I'm left wondering if there is a quality ceiling or other glaring weaknesses.

Since it runs at a 24 kHz sampling rate, it will never be transparent on hard samples.

To be a fully versatile codec, it needs full 48 kHz support. Strong protection against packet loss and data corruption is definitely needed as well: a streaming codec should recover its state quickly even when some data is lost forever. Existing codecs such as Opus and AAC are built with fast recovery from partially corrupted streams in mind.
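
To put rough numbers on the 24 kHz point, here is a minimal sketch. It assumes "24 kHz" means the sampling rate and uses the conventional 20 kHz figure as the upper edge of hearing; the function names are made up for illustration and are not part of any codec's API.

```python
# Rough illustration only: assumes "24 kHz" is the sampling rate and takes
# 20 kHz as the nominal upper edge of human hearing. Names are hypothetical.

def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2.0


def covers_band(sample_rate_hz: float, upper_edge_hz: float = 20_000.0) -> bool:
    """True if the representable bandwidth reaches the given upper edge."""
    return nyquist_limit_hz(sample_rate_hz) >= upper_edge_hz


if __name__ == "__main__":
    for rate in (24_000, 48_000):
        print(rate, nyquist_limit_hz(rate), covers_band(rate))
```

At a 24 kHz sampling rate nothing above 12 kHz can be represented at all, no matter how many bits are spent, which is why a sample with audible content above that will give the codec away. At 48 kHz the limit is 24 kHz, which covers the whole nominal audible band.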

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #3 – 2021-09-24 15:11:41

Quote

Lyra is a high-quality, low-bitrate speech codec that makes voice communication available even on the slowest networks. To do this it applies traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data to create a novel method for compressing and transmitting voice signals.

~ https://github.com/google/lyra

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #4 – 2021-09-24 15:31:47

Quote

While these codecs [Opus and EVS] leverage expert knowledge of human perception as well as carefully engineered signal processing pipelines to maximize the efficiency of the compression algorithms, there has been recent interest in replacing these handcrafted pipelines by machine learning approaches that learn to encode audio in a data-driven manner.

They just forgot to mention by whom.

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #5 – 2021-11-07 07:32:09

Quote from: 2tec on 2021-09-24 15:11:41

Quote
Lyra is a high-quality, low-bitrate speech codec that makes voice communication available even on the slowest networks. To do this it applies traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data to create a novel method for compressing and transmitting voice signals.
~ https://github.com/google/lyra

How do we use that to encode?

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #6 – 2021-11-07 07:33:29

Quote from: musicalman on 2021-09-24 07:51:33

I'm a bit of a nerd when it comes to audio codecs, so I suppose my interest in this is a bit more than average. Anyway, I recently found this blog post, and I was curious about what people here would think.

My own admittedly uninformed thoughts are that this is promising, but it's too early to really know what to expect. For very low bit rates at least, this could be setting a new bar in compression efficiency. For speech, I'm definitely interested; I've never heard a codec produce such clear speech at such low rates.

For music, the examples on the site show that Opus and perhaps everything else just can't compete. Still, Silk was never designed for music so it has a huge disadvantage. But despite the promise I see from SoundStream (and I might as well lump Satin into this as well even though its claims for great music compression have yet to be proven afaik), there is still precious little to show how gracefully or otherwise these codecs could handle music. There is nothing at all to show how a complex track might fare.

At the lowest bit rates < 4 kbps, SoundStream artifacts seem quite invasive on polyphonic signals. Of course that's to be expected at such low rates. Nobody in their right mind would expect otherwise. But at higher rates I'm left wondering if there is a quality ceiling or other glaring weaknesses. My educated guess is that things do clean up, but it's far too early to specify the extent.

Of course, all we can do is speculate, which really isn't useful I suppose, but it can lead to good discussion. The main points of speculation I can come up with right now are:
Could a codec targeted at such a low bit rate still benefit everyone, whether you're simply using compression for music storage, or you want to stream decent but not necessarily perfect quality audio? Opus has shown just how wide a scope an audio codec can cover and excel in, so I'm hoping newer codecs can follow its example. If advancing technology really holds promise, I'd hate to see it remain niche forever.
Could the bit rate for transparency get considerably lower in the near future due to stuff like this?
Could audible artifacts get to a point where they sound eerily natural, tempting people who don't care about transparency to put the bit rate to ridiculous lows without complaining? I mean, low bit rates are becoming less and less important so it doesn't really matter to most people if 16 kbps sounds good anymore, but if you're a nerd like me or you actually have a need, then why not entertain the thought? Lol

Anyway that's enough from me. Eager to see what others have to say!

It's really useless until we have the encoder. I don't understand the point of releasing the format without giving people the means to encode.

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #7 – 2021-11-10 07:18:29

Quote from: MinPower on 2021-11-07 07:32:09

Quote from: 2tec on 2021-09-24 15:11:41
Quote
Lyra is a high-quality, low-bitrate speech codec that makes voice communication available even on the slowest networks. To do this it applies traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data to create a novel method for compressing and transmitting voice signals.
~ https://github.com/google/lyra

How do we use that to encode?

https://github.com/google/lyra#api

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #8 – 2021-11-13 08:54:30

Quote from: doccolinni on 2021-11-10 07:18:29

Quote from: MinPower on 2021-11-07 07:32:09
Quote from: 2tec on 2021-09-24 15:11:41
Quote
Lyra is a high-quality, low-bitrate speech codec that makes voice communication available even on the slowest networks. To do this it applies traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data to create a novel method for compressing and transmitting voice signals.
~ https://github.com/google/lyra

How do we use that to encode?

https://github.com/google/lyra#api

Thanks, do you have an online tutorial on how to use it?

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #9 – 2021-11-14 20:02:03

Quote from: MinPower on 2021-11-13 08:54:30

Quote from: doccolinni on 2021-11-10 07:18:29
https://github.com/google/lyra#api

Thanks, do you have an online tutorial on how to use it?

That, being the API documentation, is the online tutorial on how to use it.

At least if you're a programmer. Otherwise, given that the documentation uses C++, I suppose the closest thing to "an online tutorial on how to use it" is here.

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #10 – 2021-11-16 00:59:42

I'm incredibly skeptical until an actual independent test shows it to be better than existing alternatives. Codec manufacturers have been known to use poor choices of settings/encoders for the competition (and presumably potentially cherry-picked samples) to prove superiority. Think "64 kbps WMA is better than 128 kbps MP3", which is conceivably true if you use an early MP3 encoder.

Re: SoundStream: An End-to-End Neural Audio Codec. How exciting is it?

Reply #11 – 2022-01-06 01:38:10

The code and PDF for the paper can be found through here: https://paperswithcode.com/paper/soundstream-an-end-to-end-neural-audio-codec

The Python code is a little meh as it's very much down to the basic algorithmic implementation without any practical tooling. Which is fine for a paper application, but I'm not sure I want to make a wrapper tool for it.

The codec will most likely see application in internet video telephony or internet telephony in general - of which we've seen a lot in the last two years or so.

I'm reading the paper right now, and it is quite interesting, although I'm not an ML expert. I'll see if I can get the code to run and produce some examples of my own.
