
Topic: Near-lossless / lossy FLAC (Read 124199 times)

  • Nick.C
  • Developer
Near-lossless / lossy FLAC
Reply #200
If you want a similar preprocessing for FLAC or WavPack you'd do something like this:
- estimate LPC filter coeffs (H(z)) and temporarily filter the block to get the residual
- check the residual's power and select "wasted_bits" accordingly
- quantize original (unfiltered) samples so that the "wasted_bits" least significant bits are zero
- use 1/H(z) as noise shaping filter.
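For reference, the last two steps can be sketched like this (a minimal Python sketch; the coefficient convention H(z) = 1 - sum a_k z^-k and all names here are assumptions, not anyone's released code):

```python
import numpy as np

def zero_lsbs_with_shaping(x, a, wasted_bits):
    # Quantize samples so the `wasted_bits` least significant bits are zero,
    # feeding the total error back through the predictor so the rounding
    # noise is shaped with 1/H(z), where H(z) = 1 - sum_k a[k] * z^-(k+1).
    step = 1 << wasted_bits
    t = np.zeros(len(a))          # past total errors y[n-k] - x[n-k]
    y = np.empty_like(x, dtype=float)
    for n in range(len(x)):
        d = float(np.dot(a, t))   # feedback term: sum a_k * t[n-k]
        y[n] = round((x[n] + d) / step) * step   # LSBs forced to zero
        t = np.roll(t, 1)
        t[0] = y[n] - x[n]        # total error, not just the raw rounding error
    return y
```

Feeding back the total error t[n] gives t[n] = q[n] + sum a_k t[n-k], i.e. T(z) = Q(z)/H(z), which is exactly the 1/H(z) shaping described above.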
Sebastian,

This is the second time that David has pointed me in the direction of your suggestion - unfortunately, I am unable to take these concepts and convert to code as I have no idea where to start as to the algorithms that are required. If you have any second-hand code which you would be willing to share, I would gratefully receive it and attempt to implement it in the lossyWAV Delphi project.

Best regards,

Nick.
lossyWAV -q X -a 4 -s h -A --feedback 2 --limit 15848| FLAC -5 -e -p -b 512 -P=4096 -S-

  • 2Bdecided
  • Developer
Near-lossless / lossy FLAC
Reply #201
Nick,

Sorry, not that bit. I've already done that bit, but haven't released it due to concern over a Sony patent.

The part I meant was this...

If you further check what psychoacoustic models usually do you'll notice that they allocate more bits to lower frequencies than to higher frequencies (higher SNR for lower freqs) most of the time. You then can tweak the noise shaping filter to W(z)/H(z) where W(z) is some fixed weighting so that you have a higher SNR for lower freqs.
(I derived W(z) by feeding OggEnc with mono pink noise).

...where you can use that weighting for exactly what you're doing now.

So it's "just": feed noise into Ogg, subtract input from output, check noise (implies SNR) at given frequencies using, say, spectral view in Cool Edit, and simulate that rough spectral shape in your code.
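That subtract-and-look procedure can be sketched numerically (a Python stand-in for the spectral view; the frame length and windowing are arbitrary choices):

```python
import numpy as np

def noise_spectrum_db(original, decoded, fft_len=4096):
    # Power spectrum (dB) of (decoded - original), averaged over frames --
    # a numerical stand-in for eyeballing the coder's noise in a spectral view.
    # Assumes the two signals are time-aligned.
    noise = np.asarray(decoded, dtype=float)[:len(original)] - original
    frames = len(noise) // fft_len
    win = np.hanning(fft_len)
    acc = np.zeros(fft_len // 2 + 1)
    for i in range(frames):
        seg = noise[i * fft_len:(i + 1) * fft_len] * win
        acc += np.abs(np.fft.rfft(seg)) ** 2
    return 10.0 * np.log10(acc / max(frames, 1) + 1e-30)
```

The rough spectral shape of that curve is what would then be simulated as W(z) in the encoder.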

Just an idea. I keep meaning to try it but have other things to do!

Cheers,
David.
  • Last Edit: 03 October, 2007, 06:16:49 AM by 2Bdecided

  • Nick.C
  • Developer
Near-lossless / lossy FLAC
Reply #202
Doh! <slaps forehead> That sounds like a plan to me.... I'll get onto it tonight.

  • Nick.C
  • Developer
Near-lossless / lossy FLAC
Reply #203
Didn't get round to the Ogg noise analysis last night; however, reading the Ars Technica MP3 explanation, it struck me that there may be some merit in the following:

Instead of a spreading function where values are averaged (in the default case over 4 bins), why not take the max of (last_bin, this_bin, next_bin), moving progressively along the FFT bins?

I have made a test implementation, and the difference in average bits_to_remove between the 4-bin average and this 3-bin max seems to be small.
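The two spreading variants being compared can be sketched like this (Python/numpy; the window placement of the 4-bin average is an assumption):

```python
import numpy as np

def spread_max3(spec):
    # max of (last_bin, this_bin, next_bin) for every bin, edges clamped
    padded = np.concatenate(([spec[0]], spec, [spec[-1]]))
    return np.maximum(np.maximum(padded[:-2], padded[1:-1]), padded[2:])

def spread_avg4(spec):
    # 4-bin moving average (the default spreading; alignment is assumed)
    return np.convolve(spec, np.ones(4) / 4.0, mode="same")
```

The max variant can never report a lower level than the bin itself, which is why it is the more conservative (smaller bits_to_remove) of the two on peaky spectra.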

[edit] Well, that was my impression, but when I ran my 52 sample set at default quality, 4 bin averaging = 39.48MB, 3 bin max = 38.83MB; [/edit]

[edit2] For Guru's 150 sample set at default quality, 4 bin averaging = 89.56MB, 3 bin max = 87.99MB;

Maybe, averaging the two highest values, disregarding the minimum value would be better - I'll try it. [/edit2]

[edit3] For Guru's 150 sample set, at default quality, 2-highest-of-3-average = 90.86MB; [/edit3]

[edit4] For my 52 sample set, at default quality, 2-highest-of-3-average = 40.23MB; [/edit4]
  • Last Edit: 04 October, 2007, 09:20:04 AM by Nick.C

  • Nick.C
  • Developer
Near-lossless / lossy FLAC
Reply #204
Looking at the way that bit reduction / dither noise is calculated for each of the dither options, it appears that I neglected to ensure that the rounded value remained within the permissible sample limits when calculating the noise from rounding and dithering. I have re-written my noise calculation subroutine and will revise the constants used in the code to recreate the dither noise surfaces (1..32 bits x 6..15 bit fft length x 3 dither options).
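The fix amounts to clamping after rounding, something like (a sketch assuming 16-bit samples and round-to-nearest):

```python
def round_and_clamp(sample, bits_to_remove, sample_bits=16):
    # Round away the LSBs, then clamp to the largest/smallest representable
    # multiple of the step, so the noise estimate reflects what is actually
    # stored rather than an out-of-range value.
    step = 1 << bits_to_remove
    q = ((sample + (step >> 1)) >> bits_to_remove) << bits_to_remove
    lo = -(1 << (sample_bits - 1))        # -32768, already a multiple of step
    hi = (1 << (sample_bits - 1)) - step  # largest multiple of step <= 32767
    return max(lo, min(hi, q))
```

Without the clamp, a full-scale sample can round upward past the 16-bit limit and the noise calculation sees an error that the stored file can never contain.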

On the experimental spreading function front, I am looking at a spreading function which changes from averaging at small fft lengths to simple maximum at long fft lengths as follows:

Code: [Select]
  begin
    pcll := low_frequency_bin[analysis_number] - 1;
    pchl := high_frequency_bin[analysis_number] - 1;

    for pci := 0 to pchl - pcll + 1 do
    begin
      v1 := fft_result[pci];
      v2 := fft_result[pci + 1];
      v3 := fft_result[pci + 2];

      vMax := max(v1, max(v2, v3));
      vMin := min(v1, min(v2, v3));
      vTot := v1 + v2 + v3;
      vMid := vTot - vMax - vMin; { the middle of the three values }
      vAvg := vTot / 3;

      { blend from 3-bin average at short FFT lengths to 3-bin max at long ones }
      case fft_bit_length[analysis_number] of
         0.. 6 : fft_result2[pci + 1] := vAvg;
         7     : fft_result2[pci + 1] := (vMax * 1.50 + vMid + vMin * 0.5) / 3;
         8     : fft_result2[pci + 1] := (vMax * 2.00 + vMid) / 3;
         9     : fft_result2[pci + 1] := (vMax * 2.50 + vMid * 0.5) / 3;
        10..15 : fft_result2[pci + 1] := vMax;
      end;
    end;
  end;
  • Last Edit: 08 October, 2007, 07:42:06 AM by Nick.C

  • SebastianG
  • Developer
Near-lossless / lossy FLAC
Reply #205
Hi, Nick, David!

From what I understand you are looking for some kind of weighting to determine the wasted_bits count, right? I'm not sure whether the weighting trick I described is appropriate here since I used this filter for noise shaping. I calculated the amount of bits to use for steganography (in your case wasted_bits) solely based on the power of the linear prediction residual. Combined with the fixed non-recursive part of the noise shaper the effect was quantization noise with a more or less constant (constant over time) SNR for a specific frequency region.
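In code, that power-to-wasted_bits mapping might look like this (a sketch; the threshold and headroom values are assumptions, not the original implementation):

```python
import numpy as np

def wasted_bits_from_residual(residual, headroom_bits=1):
    # Derive the wasted_bits count solely from the RMS power of the LPC
    # residual: the louder the residual, the more LSBs can be zeroed while
    # keeping the quantization noise below it.
    rms = np.sqrt(np.mean(np.asarray(residual, dtype=float) ** 2))
    if rms < 2.0:
        return 0
    return max(0, int(np.floor(np.log2(rms))) - headroom_bits)
```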

To be honest I really don't understand why you guys insist on introducing white-only noise. It's like travelling from A (lossless) to B (perceptual lossy) and stopping right in the middle where both disadvantages are combined: lossy encoding (B) + high bitrate (A) necessary due to lack of noise shaping.

IMHO the best thing to do here is following Edler, Faller and Schuller: Perceptual Audio Coding Using a Time-Varying Linear Pre- and Post-Filter. Their psychoacoustic analysis results in a "pre-filter" and a "post-filter". The post filter acts like a noise shaper. To make it work for lossy FLAC just
  • skip the prefilter, we don't need it.
  • derive wasted_bits according to the first sample of the post-filter's impulse response. This first sample tells you the optimal quantizer step size.
  • use the ("normalized") post-filter as noise shaping filter. (Normalized: A noise shaping filter's impulse response must start with the coefficient '1' and has an average log response of 0 dB on a linear frequency scale.)
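The second and third bullets amount to something like this (a sketch assuming the post-filter is available as an impulse response h with h[0] > 0; the exact gain-to-wasted_bits mapping is an assumption):

```python
import numpy as np

def split_gain(h):
    # Split the post-filter impulse response into a gain (its first sample)
    # and a normalized response whose first coefficient is 1.
    gain = float(h[0])
    return gain, np.asarray(h, dtype=float) / gain

def wasted_bits_from_gain(gain):
    # The first sample gives the optimal quantizer step size; take its log2
    # as the wasted_bits count.
    return max(0, int(np.floor(np.log2(gain)))) if gain >= 1.0 else 0
```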
About sharing code: I'd have to locate the source code first. It's been a while since I touched it. Exactly what are you interested in? The "complicated" part of it was the Levinson-Durbin algorithm. I could share a Java version if you like. It's not hard to find other source code for it with the help of Google, I suppose. If you want to follow the "Edler et al. type approach" you could borrow a lot of Speex code for handling the filters.
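For reference, the Levinson-Durbin recursion itself is short (a Python sketch of the standard algorithm, not the code mentioned above):

```python
import numpy as np

def levinson_durbin(r, order):
    # Solve the Toeplitz normal equations given autocorrelation r[0..order].
    # Returns predictor coefficients a and the final prediction error, so
    # that xhat[n] = sum_k a[k] * x[n-1-k] and the analysis (residual)
    # filter is H(z) = 1 - sum_k a[k] * z^-(k+1).
    a = np.zeros(order)
    err = float(r[0])
    for i in range(order):
        # reflection coefficient from the current prediction error
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a[:i] = a[:i] - k * a[:i][::-1]   # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

On the autocorrelation of a first-order process (r[k] = 0.5^k) this correctly returns a single effective coefficient of 0.5, with higher-order terms vanishing.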

Cheers!
SG
  • Last Edit: 09 October, 2007, 05:06:24 AM by SebastianG

  • 2Bdecided
  • Developer
Near-lossless / lossy FLAC
Reply #206
Hi Seb,

To be honest I really don't understand why you guys insist on introducing white-only noise.


1. It works.
2. I didn't stop there. I've done a noise shaping version. See the previous page!
(It's not truly psychoacoustic though)


Quote
It's like travelling from A (lossless) to B (perceptual lossy) and stopping right in the middle where both disadvantages are combined: lossy encoding (B) + high bitrate (A) necessary due to lack of noise shaping.


I see both advantages being combined: no problem samples, little or no transcoding issues, lower bitrate than lossless.

You could probably use Vorbis at high bitrates instead, with possibly slightly more transcoding worries. Also I'm not sure you could be so confident with multi-generation coding; set the threshold correctly, and lossyFLAC seems to go many generations (e.g. 50) without issue.


You could, of course, make this a proper psychoacoustic codec, but I'd only do this for fun - what would be the practical point? You'd be forcing the underlying issues of FLAC onto a psychoacoustic codec - why would you do that? Surely it would be much better to use Vorbis or something without these issues? I don't think Nick or I are up for designing a new psychoacoustic model(!), though I guess we could "borrow" one.


Quote
IMHO the best thing to do here is following Edler, Faller and Schuller: Perceptual Audio Coding Using a Time-Varying Linear Pre- and Post-Filter. Their psychoacoustic analysis results in a "pre-filter" and a "post-filter". The post filter acts like a noise shaper. To make it work for lossy FLAC just
  • skip the prefilter, we don't need it.
  • derive wasted_bits according to the first sample of the post-filter's impulse response. This first sample tells you the optimal quantizer step size.
  • use the ("normalized") post-filter as noise shaping filter. (Normalized: A noise shaping filter's impulse response must start with the coefficient '1' and has an average log response of 0 dB on a linear frequency scale.)
About sharing code: I'd have to locate the source code first. It's been a while since I touched it. Exactly what are you interested in? The "complicated" part of it was the Levinson-Durbin algorithm. I could share a Java version if you like. It's not hard to find other source code for it with the help of Google, I suppose. If you want to follow the "Edler et al. type approach" you could borrow a lot of Speex code for handling the filters.


Thank you for this. All pointers gratefully received!

Does it have any IP attached?
What form is the "post-filter" in?

The reason for the first question is obvious! I ask the second because I know what a noise shaping filter should be like (you missed minimum phase off your list) and it's not trivial getting exactly what you want - the LPC-based method delivers filters which check all the boxes - does this one? If not, is "normalization"/conversion easy?

Cheers,
David.

  • SebastianG
  • Developer
Near-lossless / lossy FLAC
Reply #207
Hi Dave,

2. I didn't stop there. I've done a noise shaping version. See the previous page!

Sorry, I wasn't aware of that.

You could, of course, make this a proper psychoacoustic codec, but I'd only do this for fun - what would be the practical point? You'd be forcing the underlying issues of FLAC onto a psychoacoustic codec - why would you do that? Surely it would be much better to use Vorbis or something without these issues?

It depends. Why were you tackling "lossy FLAC" again?
I just wanted to mention the benefits of noise shaping. Given a specific target bitrate one can maximize the lowest MNR (mask-to-noise ratio) via noise shaping. This means higher quality at the same rate. To guarantee a certain minimum MNR if only white quantization noise is introduced you have to raise the bitrate -- sometimes by a great amount.

I don't think Nick or I are up for designing a new psychoacoustic model(!), though I guess we could "borrow" one.

Me, neither.

Thank you for this. All pointers gratefully received!

Does it have any IP attached?
What form is the "post-filter" in?

The reason for the first question is obvious! I ask the second because I know what a noise shaping filter should be like (you missed minimum phase off your list) and it's not trivial getting exactly what you want - the LPC-based method delivers filters which check all the boxes - does this one? If not, is "normalization"/conversion easy?

I don't know about the IP issue. The pre- and post-filters are minimum-phase IIR filters and each other's inverse. They are just a "frequency-warped" version of the LPC/autocorrelation method, where the autocorrelation coefficients are determined by the output of the psychoacoustic model. Frequency warping is used to match the varying bandwidths of the critical bands.

Regarding the missing "minimum phase" property: it may not be obvious, but it follows from the two properties I mentioned. If a filter's impulse response starts with the sample X and its average log response is log(X), then the filter is also a minimum-phase filter. By normalizing I just meant scaling the impulse response so that X = 1.

The difference between what Edler et al. did and how it can be applied to FLAC is that their time-varying "post filter" does both shaping in frequency and shaping in time, whereas the noise shaping filter for FLAC can only shape in frequency; shaping in time is done by varying the wasted_bits count. To isolate these you have to extract the "gain" of the post filter, which in this case is equal to the first sample. The post filter (including gain) is supposed to represent the masking curve, so it makes sense to use it as the noise shaper.

Edit: You asked about the form of the post filter:
H(z) = 1 / [1 + a1 D(z) + a2 D^2(z) + a3 D^3(z) + ... + an D^n(z)]  (a frequency-warped all-pole filter)
where D(z) is a non-linear-phase all-pass used as a replacement for the unit delay z^-1.
Using it as a noise shaper is no more difficult than using it as a synthesis filter for linear predictive coding. However, it is a bit tricky because in general this form contains a delay-free loop. Edler et al. point to another paper that describes how to resolve that.
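For anyone wanting to plot it, the magnitude response of that warped all-pole form can be evaluated numerically as below (a sketch; the common first-order all-pass D(z) = (z^-1 - lam)/(1 - lam*z^-1) is assumed, which may differ in sign convention from the paper):

```python
import numpy as np

def warped_allpole_response(a, lam, n_points=512):
    # |H(e^jw)| for H(z) = 1 / (1 + a1*D(z) + ... + an*D(z)^n), with the
    # unit delay replaced by the all-pass D(z) = (z^-1 - lam)/(1 - lam*z^-1).
    # lam = 0 gives the ordinary (unwarped) all-pole response.
    w = np.linspace(0.0, np.pi, n_points)
    zinv = np.exp(-1j * w)
    D = (zinv - lam) / (1.0 - lam * zinv)
    denom = np.ones_like(D)
    Dk = np.ones_like(D)
    for ak in a:
        Dk = Dk * D              # D(z)^k
        denom = denom + ak * Dk
    return w, np.abs(1.0 / denom)
```

Because |D| = 1 on the unit circle, warping only redistributes the response along the frequency axis; the DC value is unchanged by lam.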

Cheers,
SG
  • Last Edit: 10 October, 2007, 04:04:27 AM by SebastianG

  • jmvalin
  • Developer
Near-lossless / lossy FLAC
Reply #208
The idea is simple: lossless codecs use a lot of bits coding the difference between their prediction and the actual signal. The more complex (hence unpredictable) the signal, the more bits this takes up. However, the more complex the signal, the more "noise like" it often is. It seems silly spending all these bits carefully coding noise / randomness.

So, why not find the noise floor, and dump everything below it?

This isn't about psychoacoustics. What you can or can't hear doesn't come into it. Instead, you perform a spectrum analysis of the signal, note what the lowest spectrum level is, and throw away everything below it. (If this seems a little harsh, you can throw in an offset to this calculation, e.g. -6dB to make it more careful, or +6dB to make it more aggressive!).


Sounds like you're trying to get the worst of standard lossy and lossless codecs. What you have now is a *lossy* codec that just uses a really crappy psychoacoustic model *and* is stuck with time-domain linear prediction instead of frequency transforms. BTW, the main reason lossless codecs use time-domain linear prediction is not because it's better; it's because that's the only sane way of getting back *exactly* what you encoded without numerical errors or having to code irrelevant information. Once you go lossy anyway, that advantage of LP no longer applies. I can't see any advantage of your idea compared to a lossy codec at a very high rate (e.g. Vorbis q10 or something like that).

  • 2Bdecided
  • Developer
Near-lossless / lossy FLAC
Reply #209
That's what I like - real encouragement!

Cheers,
David.

  • halb27
Near-lossless / lossy FLAC
Reply #210
Please don't feel discouraged.
I think it's okay if somebody thinks there is no use in this approach.
Pure practically-minded people won't consider using it anyway. It's a way to encode for perfectionists or near-perfectionists, and even they are free to prefer a transform codec at a high-quality setting if they like. Ask 5 perfectionists what they prefer and you'll get (nearly) 5 different answers, possibly with strong underlying emotions. BTW, it's the same with lossless codecs, where the differences between many codecs are very small. And for the practically minded it's no different: everybody loves his champion, though in an overall sense the differences between codecs and encoders may be rather small (looking, for instance, at AAC, Vorbis and MPC; at high bitrates even MP3 is competitive most of the time).
  • Last Edit: 11 October, 2007, 06:49:47 AM by halb27
lame3995o -Q1

  • Nick.C
  • Developer
Near-lossless / lossy FLAC
Reply #211
Ran my 52 sample set through Ogg aoTuv 4.51 @ 10 and lossyWAV -2 -spread. The lossyWAV output is smaller when compressed to FLAC with the corresponding codec_block_size (in this case 1152 samples): 485 kbps vs 488 kbps.

Using fb2k bit compare as a quick way to "see" differences, lossyWAV has fewer samples which are different to the lossless original than OGG and a smaller maximum magnitude of difference than OGG.

  • j7n
Near-lossless / lossy FLAC
Reply #212
Using fb2k bit compare as a quick way to "see" differences, lossyWAV has fewer samples which are different to the lossless original than OGG and a smaller maximum magnitude of difference than OGG.

What happened to the strong argument that audio quality should not be "seen" and codecs not evaluated by subtracting...

  • Nick.C
  • Developer
Near-lossless / lossy FLAC
Reply #213
Using fb2k bit compare as a quick way to "see" differences, lossyWAV has fewer samples which are different to the lossless original than OGG and a smaller maximum magnitude of difference than OGG.
What happened to the strong argument that audio quality should not be "seen" and codecs not evaluated by subtracting...
Yes, I know, sorry, I won't do it again. However, as lossyWAV only ever rounds a sample to fewer bits, the sample value barely changes. Surely fewer changed samples has some merit?
  • Last Edit: 11 October, 2007, 07:07:26 AM by Nick.C

  • 2Bdecided
  • Developer
Near-lossless / lossy FLAC
Reply #214
Please don't feel discouraged.
I think it's okay if somebody thinks there is no use in this approach.


Thanks halb27. I wasn't discouraged though. It's there for whoever wants to use it, and I'm fully aware of the strengths and weaknesses.

FWIW there are circumstances where a real psychoacoustic model (even backed off from the assumed threshold of audibility by several dB via the use of "insane" quality settings) is still inferior to having no psychoacoustic model at all. The places should be obvious: where the psychoacoustic model is wrong, where it is crippled by the format, and where it will interact (unpredictably) with something downstream.

LossyFLAC is there for those instances, and for those people who would like to use lossless, but recognise that sometimes you're wasting 1000 kbps+ on making a "perfect" copy of something that has been smashed to pieces before it reached you.


It still surprises me that LossyFLAC works as well as it does. I'm very grateful (we should all be very grateful!) to Nick for all the experimenting he's done. He probably felt deflated by the positive ABX results against some of his changes, but what they showed is that lossyFLAC is hitting more or less exactly the right bitrate for the technique to work. A higher bitrate doesn't add anything, and a lower bitrate rapidly falls apart. It seems to have a very sharp "sweet spot".

I don't think for one minute that all possible issues are ironed out. This low-frequency thing has to be nailed properly, in a way that makes some sense, so that it will also fix problem samples we haven't found yet! Then there is the question of what happens with M/S (surround) decoding. It's easy to add something to prevent problems - but no one has even looked for problems here yet, AFAIK. Finally, there are times when dither is necessary, but in the vast majority of cases it isn't. I'm wondering if there could be a check for this? It would probably slow encoding down, but I'll think about it anyway.

Anyway, thank you programmers for all your hard work, and thank you Nick too for spotting some bugs and implementing genuine improvements.

Cheers,
David.


Using fb2k bit compare as a quick way to "see" differences, lossyWAV has fewer samples which are different to the lossless original than OGG and a smaller maximum magnitude of difference than OGG.
What happened to the strong argument that audio quality should not be "seen" and codecs not evaluated by subtracting...
Yes, I know, sorry, I won't do it again. However, as lossyWAV only ever rounds a sample to fewer bits, the sample value barely changes. Surely fewer changed samples has some merit?


You can't draw any conclusions about perceived audio quality from this, but there are obvious reasons to test and report this behaviour, e.g. to understand something (not everything) about what the algorithm is doing.

It tells you what you already know though: Ogg makes no attempt to preserve the original samples numerically, while lossyFLAC will, on average, keep exactly the original value in 1 of every 2^bits_removed samples. This doesn't tell you anything about what it sounds like. Neither does the maximum difference.
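That 1-in-2^bits_removed figure is easy to sanity-check numerically (a sketch assuming uniformly distributed 16-bit samples and plain round-to-nearest, not lossyWAV's exact rule):

```python
import numpy as np

rng = np.random.default_rng(1)
bits_removed = 4
step = 1 << bits_removed
x = rng.integers(-32768, 32768, size=200000)   # uniform 16-bit samples
y = (np.round(x / step) * step).astype(int)    # round to a multiple of 2^4
unchanged = np.mean(y == x)
# A sample survives rounding exactly when its low 4 bits are already zero,
# i.e. with probability 1/16 for uniformly distributed samples.
```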

Cheers,
David.

  • Nick.C
  • Developer
Near-lossless / lossy FLAC
Reply #215
I've got to the pre-alpha test stage of the Bark-related bin averaging - I haven't managed to listen to anything yet at a high enough volume (everyone else in the house is sleeping!), but on size of output alone this is an interesting development.

My 52 sample set: WAV: 121.5 MB; FLAC: 68.2 MB; lossyWAV -2: 39.5 MB; lossyWAV -2 -spread: 35.3 MB.

Late now, must sleep - will listen to the samples in the morning.

[edit] Sounds promising (pardon the pun!) Will post as alpha v0.3.7 [/edit]


Development of Bark-related bin averaging has stopped in favour of a frequency-dependent, variable-length spreading function.
  • Last Edit: 16 October, 2007, 05:45:48 PM by Nick.C

  • SebastianG
  • Developer
Near-lossless / lossy FLAC
Reply #216
Sounds like you're trying to get the worst of standard lossy and lossless codecs. What you have now is a *lossy* codec that just uses a really crappy psychoacoustic model *and* is stuck with time-domain linear prediction instead of frequency transforms. [...] I can't see any advantage of your idea compared to a lossy codec at a very high rate (e.g. Vorbis q10 or something like that).


There ARE some advantages, though:
  • Decoding FLAC is really simple. This is not true for transform based methods -- especially if you can't use floating point math.
  • The decoder doesn't need to know anything about how (spectral) noise shaping has been done. Spectral noise shaping is completely in the hands of the encoder, and no extra side information needs to be transmitted. In the MP3/AAC case you need to code scalefactors and codebook indices for each scalefactor band.
LPC based methods for perceptual lossy coding can't compete with AAC/MP3 at low bitrates, on that we agree. But at higher bitrates the advantages of MP3/AAC-like methods are probably close to insignificant and outweighed by the LPC method's decoding simplicity, I suppose.

FWIW there are circumstances where a real psychoacoustic model (even backed off from the assumed threshold of audibility by several dB via the use of "insane" quality settings), is still inferior to having no psychoacoustic model at all.

I totally disagree. Having no model at all is for sure inferior to having a model that's a bit off. Also, even if you don't trust the raw output of a psy model you can still enforce some safety conditions like it's possible with MusePack (--minSMR so_and_so).

Maybe we interpret "having no/some psychoacoustic model" differently. Let's say we do 2-pass VBR to achieve some target bitrate. How can an encoder without an idea of how we perceive things perform better than an encoder that knows about psychoacoustics?

Cheers!
SG
  • Last Edit: 22 October, 2007, 05:36:01 AM by SebastianG

  • 2Bdecided
  • Developer
Near-lossless / lossy FLAC
Reply #217

FWIW there are circumstances where a real psychoacoustic model (even backed off from the assumed threshold of audibility by several dB via the use of "insane" quality settings), is still inferior to having no psychoacoustic model at all.

I totally disagree. Having no model at all is for sure inferior to having a model that's a bit off. Also, even if you don't trust the raw output of a psy model you can still enforce some safety conditions like it's possible with MusePack (--minSMR so_and_so).

Maybe we interpret "having no/some psychoacoustic model" differently. Let's say we do 2-pass VBR to achieve some target bitrate. How can an encoder without an idea of how we perceive things perform better than an encoder that knows about psychoacoustics?
You can't shoot for a given bitrate (CBR or VBR) with lossyFLAC. You can only shoot for a given quality. Even there, options are limited!

As for "backing off a psychoacoustic model" - well, yes, and at some point you will hit/match lossyFLAC. The idea here is to have a codec which delivers transparency, or transparency plus resilience to anything upstream/downstream. What settings should people use to get that with Vorbis or MPC? I have some ideas, but with lossyFLAC it will be -2 and -1 - that's it. If it works!

Cheers,
David.

  • SebastianG
  • Developer
Near-lossless / lossy FLAC
Reply #218
(*)
Quote
FWIW there are circumstances where a real psychoacoustic model (even backed off from the assumed threshold of audibility by several dB via the use of "insane" quality settings), is still inferior to having no psychoacoustic model at all.

Let's assume it's true. How would you explain it?

You can't shoot for a given bitrate (CBR or VBR) with lossyFLAC. You can only shoot for a given quality. Even there, options are limited!

I know. I was just being hypothetical. In any case (2-pass VBR with a target bitrate, or quality-controlled VBR) an encoder would benefit from a component that estimates the optimal distribution of distortions in the time/frequency plane. Without such a component you'll get highly varying MNRs. What good is a high mask-to-noise ratio in some time/frequency region when in another it's too low? The goal needs to be to maximize the minimum mask-to-noise ratio.

By saying (*), aren't you implying that the benefit of a psychoacoustic model is outweighed by its uncertainty? I don't think current models are that bad.

The idea here is to have a codec which delivers transparency, or transparency plus resilience to anything upstream/downstream.

<=> high min(MNR).


Cheers!
SG
  • Last Edit: 07 November, 2007, 07:43:16 AM by SebastianG