As promised, here is the answer that I got from Frank Klemm today. I translated it from German.
The calculated masked threshold does indeed depend on the level. It changes as lower levels
are approached. This modification was made sometime between encoder versions 1.06 and 1.1.
At high levels, the NMR (noise-to-mask ratio) was raised by 0.5 dB; at low levels,
it was lowered. The masked threshold (ATH) was lowered by 6 dB in total.
The original behavior was that, down to a certain threshold, the signal was coded with full NMR,
and below that it was suddenly muted. A signal around that switching threshold
produced audible artifacts, despite the fact that many bits were used for coding.
The current behavior is that the coding gradually gets worse at very low levels until there is
almost no usable signal left. Only once this point is reached is the coding stopped.
If you plot the error signal against the signal strength, you get a slowly declining
function that approaches the ATH from above. The old behavior first caused the error signal to
fall about 20 dB below the ATH and only rise to the ATH level again when the coding was stopped.
Extensive listening tests with headphones were conducted (headphones because of the high
listening level). The listening material included, among others, the Bolero by M. Ravel.
Volume was adjusted to about 114 dB SPL at -0 dB signal strength.
At this volume, noise in the recording and quantization artifacts are already an issue with
many 16-bit recordings. As long as this level is not (clearly) exceeded, the quality of the
coding was clearly better, despite the lower bitrate (even though the NMR was raised by 0.5 dB
and the ATH lowered by 6 dB, there are spare bits with almost every kind of music).
The fluctuation in the coding - which was caused by activation and deactivation of subbands -
disappears.
But if you turn up the volume clearly above this level (from about 120 dB SPL at -0 dB signal
strength), you hear the coding errors, which are then quite different from those of the older versions.
Now, if you disregard the question "what good are replaygains above +10 dB?" (with classical
music, only album-based replaygain should be used anyway), the problem can be solved by
lowering the ATH. It will result in a slightly higher bitrate.
Is this problem relevant for daily use in any way? I dare say "no".
For most pop titles, you can increase the ATH by 30 dB and still not notice anything.
Even with classical music, 10 dB is often possible.
A clean solution is not possible with a one-pass encoder; you would first need a rough
volume estimate of the whole song to estimate the maximum position of the volume knob -
and even then, the listener could still readjust it during the title.
Furthermore, I would recommend corrections within Replaygain. A "quick-to-hack" solution
would be that the title-based replaygain values of neighboring tracks in an album must not
differ by more than 6 dB.
From these (calculated) values:
-7.81 dB
-6.41 dB
-7.61 dB
+4.81 dB
-8.11 dB
-6.12 dB
+1.12 dB
-9.12 dB
you will then get:
-7.81 dB
-6.41 dB
-7.61 dB
-2.11 dB // limited to -8.11 + 6
-8.11 dB
-6.12 dB
-3.12 dB // limited to -9.12 + 6
-9.12 dB
Then, short voice tracks/interludes/preludes etc. no longer get boosted to +40 dB.
At the moment, that is the only limit: Replaygain values of more than +40 dB are
simply reduced to 0 dB (not really that clean either). This limit should also be
reduced to +12 dB (corresponds to K-26).
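
As a rough illustration of the 6 dB neighbor rule (this is not Frank's code, just a minimal sketch in C), the function below limits every title gain to the lower of its two direct neighbors plus 6 dB. The function name, the constant, and the choice to compute the limits from the unmodified input values are assumptions made for this example; the post itself only gives the rule and the eight sample values.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum allowed excess over a direct neighbor, as proposed in the post. */
#define MAX_NEIGHBOR_DIFF_DB 6.0

/* Hypothetical helper (name and signature are assumptions for this sketch).
 * Each title gain is limited to "neighbor gain + 6 dB" for both direct
 * neighbors. Gains are only ever lowered, never raised, and the limits are
 * taken from the original input values, not from already corrected ones. */
static void limit_title_gains(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        double g = in[i];

        /* Compare with the previous track, if there is one. */
        if (i > 0 && g > in[i - 1] + MAX_NEIGHBOR_DIFF_DB)
            g = in[i - 1] + MAX_NEIGHBOR_DIFF_DB;

        /* Compare with the next track, if there is one. */
        if (i + 1 < n && g > in[i + 1] + MAX_NEIGHBOR_DIFF_DB)
            g = in[i + 1] + MAX_NEIGHBOR_DIFF_DB;

        out[i] = g;
    }
}
```

Fed with the eight calculated values from the example, this leaves six gains untouched and lowers the +4.81 dB and +1.12 dB titles to -2.11 dB and -3.12 dB, matching the list above.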
If this proposal is taken up, I could send some reasonably tuned example code.
Somewhere in the depths of my hard disk there should be something.
In that code, the increase of these "holes" also depends on the album replaygain,
the title length and sometimes on more distant neighboring tracks.
A "1 second digital null" before the first title gets approximately the value of the
first track, a "2 second digital null" in between two tracks gets the mean value
of both tracks.
static const Profile_Setting_t Profiles [16] = {
  { 0 },
  { 0 },
  { 0 },
  { 0 },
  { 0 },
/*  Short  MinVal  EarModel  Ltq_                min  Ltq_  Band-  tmpMask  CVD_  varLtq  MS       Comb   NS_          Trans */
/*  Thr    Choice  Flag      offset  TMN   NMT   SMR  max   Width  _used    used          channel  Penal  used  PNS    Det   */
  { 1.e9f, 1,      300,      30,     3.0,  -1.0, 0,   106,  4820,  1,       1,    1.,     3,       24,    6,    1.09f, 200 }, //  0: pre-Telephone
  { 1.e9f, 1,      300,      24,     6.0,  0.5,  0,   100,  7570,  1,       1,    1.,     3,       20,    6,    0.77f, 180 }, //  1: pre-Telephone
  { 1.e9f, 1,      400,      18,     9.0,  2.0,  0,   94,   10300, 1,       1,    1.,     4,       18,    6,    0.55f, 160 }, //  2: Telephone
  { 50.0f, 2,      430,      12,     12.0, 3.5,  0,   88,   13090, 1,       1,    1.,     5,       15,    6,    0.39f, 140 }, //  3: Thumb
  { 15.0f, 2,      440,      6,      15.0, 5.0,  0,   82,   15800, 1,       1,    1.,     6,       10,    6,    0.27f, 120 }, //  4: Radio
  { 5.0f,  2,      550,      0,      18.0, 6.5,  1,   76,   19980, 1,       2,    1.,     11,      9,     6,    0.00f, 100 }, //  5: Standard
  { 4.0f,  2,      560,      -6,     21.0, 8.0,  2,   70,   22000, 1,       2,    1.,     12,      7,     6,    0.00f, 80  }, //  6: Xtreme
  { 3.0f,  2,      570,      -12,    24.0, 9.5,  3,   64,   24000, 1,       2,    2.,     13,      5,     6,    0.00f, 60  }, //  7: Insane
  { 2.8f,  2,      580,      -18,    27.0, 11.0, 4,   58,   26000, 1,       2,    4.,     13,      4,     6,    0.00f, 40  }, //  8: BrainDead
  { 2.6f,  2,      590,      -24,    30.0, 12.5, 5,   52,   28000, 1,       2,    8.,     13,      4,     6,    0.00f, 20  }, //  9: post-BrainDead
  { 2.4f,  2,      599,      -30,    33.0, 14.0, 6,   46,   30000, 1,       2,    16.,    15,      2,     6,    0.00f, 10  }, // 10: post-BrainDead
};
The Ltq_offset entry is the offset of the masked threshold relative to the standard model.
A reduction by 6 dB lowers the ATH by 6 dB over the whole frequency range.
The value to the left of it (EarModel) can be used for fine-tuning the ATH at higher frequencies.
An increase of 20 results in an ATH decrease of 1.5 dB at 10 kHz and 6 dB at 20 kHz.
With these settings, --quality 6 differs from --quality 5 in the ATH as follows:
 -6.0 dB for low frequencies
 -6.5 dB for 8 kHz
 -7.0 dB for 11 kHz
 -8.0 dB for 16.3 kHz
 -9.0 dB for 20 kHz
-10.0 dB for 23 kHz
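
The numbers in this list can be roughly reproduced from the two table columns. The sketch below is only an approximation under one assumption that is not stated in the post: the EarModel contribution is taken to grow with the square of the frequency, which is what the two quoted points (1.5 dB at 10 kHz, 6 dB at 20 kHz for an increase of 20) suggest. The function name and parameters are made up for this example and do not come from the encoder source.

```c
#include <assert.h>

/* Hypothetical helper: estimated ATH change in dB at freq_hz when
 * Ltq_offset changes by ltq_offset_delta (in dB) and EarModel changes by
 * earmodel_delta.
 *
 * Assumption: the EarModel term scales with (f / 20 kHz)^2, normalized so
 * that an EarModel increase of 20 lowers the ATH by 6 dB at 20 kHz (and
 * therefore by 1.5 dB at 10 kHz). */
static double ath_change_db(double ltq_offset_delta,
                            double earmodel_delta,
                            double freq_hz)
{
    assert(freq_hz >= 0.0);

    double rel = freq_hz / 20000.0;                 /* frequency relative to 20 kHz */
    double ear_term = -(earmodel_delta / 20.0) * 6.0 * rel * rel;

    return ltq_offset_delta + ear_term;             /* broadband + frequency-dependent part */
}
```

Between --quality 5 and --quality 6, Ltq_offset goes from 0 to -6 and EarModel from 550 to 560, so the call would be ath_change_db(-6.0, 10.0, f). Under the quadratic assumption this gives about -6.5 dB at 8 kHz, -9.0 dB at 20 kHz and -10.0 dB at 23 kHz, within roughly 0.1 dB of the values listed above.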
If there are further questions or if something was unintelligible, just keep asking.
I still have no time, but when I have 15 minutes of quiet, I can answer such things.
Motto of the day: The ingeniousness of a construction lies within its simplicity.
Everyone can build something complicated. (Sergeij P. Koroljow)