I am developing a free lossy zero-delay audio codec (mainly for fun).
Maybe it is just of academic interest, but I can see possible applications where raw PCM data is currently streamed because latency and quality take priority. Such a codec could be an alternative to raw PCM when the priorities are latency, bandwidth, and quality, in exactly that order.
The outline of the idea is: "Compress and encode audio data of arbitrary sample rates with no reference to future samples, using (linear) predictive coding, adaptive quantization of the residue steered by a masking model, and entropy coding."
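As a rough illustration of that outline (not the actual implementation), here is a toy per-sample loop in Python: a fixed second-order predictor on past reconstructed samples and a uniform residual quantizer. The masking model, adaptive bit allocation, and entropy coder are omitted, and the `step` constant is a made-up stand-in for the adaptive quantizer.

```python
import numpy as np

def encode_decode_zero_delay(samples, step=64):
    """Toy zero-delay loop: fixed 2nd-order predictor on past *reconstructed*
    samples, uniform quantization of the residual. No future samples used."""
    prev1 = prev2 = 0.0          # reconstructed history (shared decoder state)
    reconstructed = []
    for x in samples:
        prediction = 2.0 * prev1 - prev2     # strictly causal prediction
        residual = x - prediction
        q = round(residual / step)           # this integer is what would get
                                             # entropy-coded and transmitted
        x_hat = prediction + q * step        # decoder-side reconstruction
        reconstructed.append(x_hat)
        # Encoder and decoder both update history with the reconstructed
        # value, so they stay in sync without transmitting any state.
        prev2, prev1 = prev1, x_hat
    return np.array(reconstructed)
```

Because only the current residual's quantization error enters the output, the per-sample reconstruction error stays within step/2.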
I have already invested a few night shifts and written a (slow but working) reference implementation in Octave.
The idea and implementation is described in more detail on the GitHub page.
While I am still tinkering with the parameters, audio data can already be encoded and decoded to evaluate audio quality and bit rates.
Also, there is a demo script (play_demo.m) which generates some statistics and graphical representations of the encoded data, e.g. the attached colored bitmap which shows the encoded data of a short sweep.
I plan to write a basic implementation in C to evaluate the performance in terms of needed CPU-cycles.
It is a fun project for learning to work with audio data from a strictly causal perspective and for exploring the possibilities and limits of audio compression under these (artificial?) constraints.
At a 32 kHz sample rate, the approach currently achieves an average variable bit rate of 150-160 kbit/s per channel (no joint channel coding) without, as far as I can tell, any perceptible loss (evaluated on a corpus of speech and music samples; more details on the GitHub page).
There is no restriction on sample rates, but 32kHz seemed like a good target for parameter optimization.
I am happy about any constructive feedback as well as ideas on possible applications of such a zero-delay audio codec.
Also, if you are interested in sound samples but don't want to use the scripts, I can process any legally shareable database on demand.
What about something like NICAM?
Yes, as far as I understand it, the proposed approach can be seen as a much more sophisticated variant of NICAM.
NICAM uses 10 significant bits; we use a variable number (between 2 and 12) depending on the output of the masking model (i.e., on whether the quantization noise will be audible or not).
Also, NICAM does not seem to use entropy coding (i.e. lossless compression).
I am not sure if NICAM can be used with arbitrary small block sizes (down to 1 sample).
I think the limit is 32 samples, i.e., 1 ms at 32 kHz, which is good, but might still be a lot in some environments.
It's not really possible to do zero-delay audio unless each sample is exactly one byte or bit of incoming data. Your delay will always be the audio duration of a single audio packet, at bare minimum. If this is smaller than the size of the network buffer you are using, then your delay will be at minimum the number of packets you assemble before you start sending across the wire. For most audio codecs currently in use, this is, in practice, considerably less than the usual network latency between two given points, but the better you do with your compression scheme, the more you can reduce this.
Network delay is not much of an issue when it stays within your own network boundaries.
New switches are able to do cut-through switching, which means that a packet is forwarded before it has been received entirely, reducing latency by a lot compared to store-and-forward switching.
The fastest (latency-wise) active networking element remains the venerable repeater/hub. But who uses those anyway?
Yes, I know that for practical very-low-latency audio-over-IP applications, WAVPACK-stream or even raw PCM data is already a very good solution, mainly due to the significant overhead of the Internet Protocol. There is a paper on the WAVPACK-stream use case.
There are several (big) components in the audio-delay-chain which can be optimized independently.
The approach here only targets the delay due to buffering future samples for compression purposes.
In many setups, 2 ms might not make a big difference, but in some it does.
A colleague of mine is working on a solution for highly latency-critical acoustic interactions (https://github.com/gisogrimm/ovbox).
He told me that less than 10ms (analog->analog) delay can be achieved with fiber in our city.
And he is still using USB sound cards, which account for 5 of the 10ms.
In this context 2 ms can make a difference, which is why he streams raw PCM data.
I know that very high-quality, near-natural remote acoustic interaction is still a niche, but I think it will gain more traction in the future (e.g. the digital stage project).
I imagine connecting distant rooms with several microphones, where the IP-overhead then would not be as large.
You probably wouldn't even need feedback suppression if the latency is low enough.
Also, if a server is involved, the audio streams of all participants can be transmitted and mixed locally to individual spatial positions, which is what the ovbox project is about.
There might be digital systems where other protocols are used, or where air-time is just very expensive (battery power of connected hearing devices?). In the context of hearing devices, every fraction of a millisecond counts.
Maybe there are other use cases for it? I am still thinking about it.
But maybe it's only interesting to explore which compression rates can be achieved without any look-ahead.
On average, the current benefit for 32 kHz, 16-bit audio is a reduction to less than one third of the original size, i.e., from 512 to 160 kbit/s, without perceptible loss as far as I can tell.
I looked a bit further and found no attempts at zero-delay audio coding apart from the various ADPCM variants. After thinking about this problem for some time, I now see a big difference from other approaches to audio compression, which I will briefly illustrate.
The main difference to block compression schemes is that every decision in the encoder needs to be taken for an almost completely uncertain future.
The only thing that can be predicted with some certainty is post-masking.
If you know the next 100 samples you can use their statistics to pick the best encoding/description for them.
No decision is ever wrong, because you can pre-calculate a lower bound on how well it will pay off. You could be more aggressive and re-use that information in future blocks too, but you can also be conservative and optimize each block on its own, so that it is decodable independently (which is what most codecs do).
This is not possible with a zero-delay constraint.
The block size is one sample.
If you want it to be independently decodable, you cannot reference anything; all you can do is replace the sample value with another (cheaper) value.
The only way to remove redundancy is to allow reference to past values.
This does not introduce delay, but a dependency on the availability of these past samples.
But, reference to past samples is limited by the need to re-start decoding in a timely manner if any data was lost (which is likely in an ultra-low-delay real-time application).
Also, if you don't know the next sample value but you need to guess how to best describe the future data (e.g. select a codebook), your decision might need to be reverted after getting to know the next sample value.
This increases the frequency and cost of signaling mode changes to the decoder.
Hence, the pressure to encode this metadata efficiently rises to a maximum.
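One standard way to keep that signalling cost down is backward adaptation, as used in IMA ADPCM: derive the quantizer state only from codes the decoder has already seen. A sketch with made-up constants (not the parameters of the codec discussed here):

```python
# Backward-adaptive step size: grow after large codes, shrink after small
# ones. Because the update depends only on already-transmitted codes,
# encoder and decoder adapt in lock-step with zero side information.
def adapt_step(step, code, grow=1.6, shrink=0.9, lo=1.0, hi=2048.0):
    factor = grow if abs(code) >= 2 else shrink
    return min(hi, max(lo, step * factor))

# Both sides run the same update over the same code stream:
enc_step = dec_step = 16.0
for code in [0, 3, -2, 1, 5, 0, 0, -4]:
    enc_step = adapt_step(enc_step, code)
    dec_step = adapt_step(dec_step, code)
assert enc_step == dec_step   # no mode-change signalling needed
```

The trade-off is that backward adaptation reacts more slowly to signal changes than forward (signalled) adaptation, which is exactly the tension described above.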
I think this is a completely different problem from that of traditional audio codecs, and only a few takes on it (if any) have been made.
Even ADPCM requires a sample or two of delay to achieve high quality. IMA ADPCM is one of the most computationally efficient zero-delay codecs, apart from the good practice of filtering with the next sample or two for better index prediction and less noise, but in terms of quality it's not as good as aptX Low Latency.
NICAM sounds interesting. You could get away with 8 significant bits for most audio, considering psychoacoustic codecs generally aren't losslessly accurate to 8 significant bits.
My mind still goes back to Opus as better than all of the above in quality, bit-rate efficiency, and low latency (as low as 2.5 ms). PCM is the only way to achieve lossless with zero latency.
For internet use, an extra 2.5 ms per connection usually seems acceptable, and then Opus works fine.
Nonetheless, I find the problem of high-fidelity zero-delay lossy audio compression an interesting one.
I was looking for exactly something between PCM and Opus: (almost) imperceptible compression artefacts, no additional delay, but a considerably lower bit rate than raw PCM.
For local wireless transmissions it might have some relevance, for XoIP probably not that much.
Still, I am satisfied with the achieved average reduction to 190 kbit/s (4.3 bits per sample), compared to the 705.6 kbit/s needed for 44.1 kHz, 16-bit PCM.
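For reference, the rates quoted above follow directly from average bits per sample times sample rate:

```python
# Bit rate in kbit/s from average bits per sample and sample rate.
def bitrate_kbps(bits_per_sample, sample_rate_hz):
    return bits_per_sample * sample_rate_hz / 1000.0

print(bitrate_kbps(16, 44100))    # raw 16-bit PCM at 44.1 kHz: 705.6 kbit/s
print(bitrate_kbps(4.3, 44100))   # ~4.3 bits/sample average: ~189.6 kbit/s
```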
Now the question is whether it's feasible to run it in real time on a Raspberry Pi (it should be, but who knows).
Thanks to everyone for your opinions!
From the discussion I take away that, for the internet community, there is no need for such a (free) zero-delay audio codec, because there are hardly any internet-related use cases.
I will do it just for fun then :)
From "Low Latency 5G for Professional Audio Transmission":
Using a proprietary method, individual samples were compressed to a 6.3 bit / sample while maintaining high-end audio quality.
After compression, multiple samples were combined into individual packets for wireless transfer. The size of such packets is determined by the periodicity of transfers. The rate at which an audio packet is sent out is a parameter not necessarily derived by the audio application and could be adjusted for optimized data transmission, e.g. for matching the wireless system transfer interval. As the 5G testbed was working with mini-slots of 0.5 ms the periodicity was set to this value, leading to 2,000 audio IP packets per second at the Ethernet interface at a data rate of 360 kbit/s including some overhead. The system allowed for each packet to be injected with a start timestamp just before sending it over Ethernet.