
LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

This codec apparently appeared in late March. Like most ML audio codecs, it primarily targets extremely low bitrates (all the examples are below 3 kbps). Most of the people involved are Googlers, and they compare it with SoundStream (to which it is superior), so maybe they are part of the same team and it will eventually be included in Lyra? No source-code release yet, though.

Quote
We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates.

Project page with speech examples

Paper @ Arxiv
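To get a feel for the residual vector quantization the abstract describes, here is a minimal toy sketch in Python/numpy. The codebooks here are random (in the real codec they are learned), and the sizes are made up for illustration; the point is just the coarse-to-fine mechanism, where each stage quantizes the residual left over by the previous stage:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 quantizer stages, 16 codewords each, 8-dim feature vectors.
# (Hypothetical sizes; the paper's codebooks are learned, not random.)
n_stages, n_codes, dim = 4, 16, 8
codebooks = rng.normal(size=(n_stages, n_codes, dim))

def rvq_encode(x, codebooks):
    """Residual VQ: each stage picks the codeword nearest to the current
    residual, then subtracts it, so later stages refine earlier ones."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, residual

def rvq_decode(tokens, codebooks):
    """Reconstruction is just the sum of the chosen codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

x = rng.normal(size=dim)
tokens, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(tokens, codebooks)
```

The first tokens carry the coarse shape and the later ones the fine detail, which is what lets LMCodec transmit only the coarse tokens and have a language model predict the fine ones.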

Re: LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

Reply #1
So LMCodec in its 3/4 setting (0.85 kbps) provides roughly the same audio quality as Opus at 14 times higher bitrate (12 kbps)? I'll have to see and hear that with my own eyes and ears to believe it... on my own set of speech samples.

Chris
If I don't reply to your reply, it means I agree with you.

Re: LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

Reply #2
Quote from: Chris
So LMCodec in its 3/4 setting (0.85 kbps) provides roughly the same audio quality as Opus at 14 times higher bitrate (12 kbps)? I'll have to see and hear that with my own eyes and ears to believe it... on my own set of speech samples.

There are Audio Examples at the bottom of the page which you could check  ;)

Re: LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

Reply #3
Yes, very impressive indeed. But I don't think it is too good to be true.

After all, Opus' SILK still focuses on waveforms, and its LPC basis is already ~60 years old: take the waveform, extract a predictor, quantize the residual, and so on.

It seems to me like the approach for these machine learning codecs is more like programming a voice synthesizer. In its simplest form you would need to encode a message, timbre, phrasing and attack, more or less. 500 bit/s = 62.5 bytes/s. The actual message that is being transferred is closer to 10 bytes per second. At least, I cannot think of a word of more than 10 characters that I can utter in that time, and even then there is a lot of redundancy in those 10 bytes. Having another 50 bytes to encode vocal characteristics does not seem impossible.
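The back-of-the-envelope budget above can be written out explicitly (the 10 bytes/s for the text itself is the post's own rough guess, not a measured figure):

```python
# Bit budget at a hypothetical 500 bit/s speech codec.
bitrate_bps = 500
bytes_per_s = bitrate_bps / 8          # 500 / 8 = 62.5 bytes/s total

message_bytes_per_s = 10               # rough guess for the words themselves
voice_bytes_per_s = bytes_per_s - message_bytes_per_s

# Leaves ~52.5 bytes/s for timbre, phrasing, attack, etc.
print(bytes_per_s, voice_bytes_per_s)
```

So even at 500 bit/s there is, in principle, several times more budget for vocal characteristics than for the message itself.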

The machine learning part is in finding out what parameters the voice synthesis needs, and how to extract them, I think.
Music: sounds arranged such that they construct feelings.


Re: LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

Reply #5
Ah, yes, I didn't think about the latency. According to the paper, the algorithmic delay is 20 ms, so that pretty much rules out using voice recognition indeed.

Still, the paper says
Quote
Transmission of continuous speech features over low-bandwidth channels is achieved via vector quantizers (VQs) [10], where the features are turned into discrete representations while introducing minimal distortion.
So it is not really turning the speech into words and synthesizing it back, but something at an intermediate level, recognizing speech 'features'. I guess that means vowels and consonants, pretty much? The machine tries to encode phonetics instead of samples?
Music: sounds arranged such that they construct feelings.