tonal problems
I'm not really sure what those are. Could you explain that for me please?
OK, let's do this rather thoroughly...
You can consider sounds to be of I guess three types: tonal, transient and continuous noise.
Tonal is like a whistle - a pure note, with a sharp peak in the frequency spectrum and quite often overtones (harmonics) at multiples of the base frequency, which give the timbre or character of the instrument's note. Vocally, vowel sounds are tonal and can be sung.
A chord - multiple notes played at once - is also tonal.
A transient is like a click, a cymbal or hi-hat hit, or the breathy or plucked onset of a note from an instrument. (As an aside: the type of onset often gives the listener as much information about the instrument as the timbre does, which is why, although first-generation synthesizers tried mainly to reproduce timbre and overtones, later generations improved onset transients for more realism.) In the frequency domain, most transients are spread over a wide spectrum (like noise), but in the time domain they are of short duration. Vocally, plosive consonant sounds such as p, b, k, t are transient in nature.
Continuous noise is largely uncorrelated with previous samples: it's an essentially random signal that has components over a broad frequency spectrum. While transients are noiselike in the frequency domain - a broad spectrum with little in the way of frequency peaks - they last only a short time. A brushed snare drum or tape hiss is a good example of continuous noise. Vocally, breath sounds such as sh, ss, ff, dh/th are continuous noiselike sounds.
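To put rough numbers on those three categories, here's a minimal Python sketch. It is only an illustration under my own assumptions: the 256-sample test signals, the spectral flatness measure (geometric mean over arithmetic mean of the power spectrum) and the crest factor (peak over RMS) are choices made for this example, not something a particular codec is claimed to use. A tone should come out spectrally peaky, a click spectrally flat but very peaky in time, and noise spectrally fairly flat without being peaky in time.

```python
import cmath
import math
import random

N = 256  # samples per test signal (arbitrary illustrative length)

def power_spectrum(x):
    """Naive DFT power spectrum for bins 1..N/2-1 (DC and Nyquist skipped)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(1, n // 2)
    ]

def spectral_flatness(x):
    """Geometric mean / arithmetic mean of the power spectrum (near 0 = peaky, 1 = flat)."""
    p = [v + 1e-12 for v in power_spectrum(x)]  # tiny floor avoids log(0)
    geometric = math.exp(sum(math.log(v) for v in p) / len(p))
    return geometric / (sum(p) / len(p))

def crest_factor(x):
    """Peak / RMS in the time domain (large = energy packed into a short burst)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

rng = random.Random(0)  # fixed seed so the noise is repeatable
tone = [math.sin(2 * math.pi * 16 * t / N) for t in range(N)]  # whistle-like
click = [1.0 if t == N // 2 else 0.0 for t in range(N)]        # single impulse
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]                # hiss-like

for name, sig in (("tone", tone), ("click", click), ("noise", noise)):
    print(f"{name:5s} flatness={spectral_flatness(sig):.3f} crest={crest_factor(sig):.1f}")
```

Real signals are of course mixtures of all three, which is exactly where the trouble described further down starts.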
So the word COSINE for example
starts with a sharp transient C with a clicking noise
the long O is a tonal, singable vowel
the S is noiselike and lasts longer than the transient C without a particularly sharp onset
the I is a tonal, singable vowel
and the N is mostly tonal with a fairly abrupt but not really transient ending.
(the E is silent and modifies the vowel sound represented by letter I)
To accurately match the frequency or pitch of a slowly varying tonal signal, a long block in a transform codec provides more points in the frequency domain, each representing a narrower frequency band, so the frequency resolution is good. Because of the long duration, the time resolution is poor.
Imagine a grossly oversimplified example pretty much plucked from thin air:
If you imagine a bunch of frequency components in a tonal signal represented as decimal integer numbers, in a long block lasting 12 ms a small selection of them (12 by coincidence only) might be:
160Hz  240Hz  320Hz  400Hz  480Hz  560Hz  640Hz  720Hz  800Hz  880Hz  960Hz  1040Hz
   30    119    475    879   3049  10234   4520    960    214     53    178     422
If you need to encode them only to the nearest 100, say, you might then get
160Hz  240Hz  320Hz  400Hz  480Hz  560Hz  640Hz  720Hz  800Hz  880Hz  960Hz  1040Hz
    0    100    500    900   3000  10200   4500   1000    200    100    200     400
Since these are all zero in the units and tens digits, we only need to send the higher digits (hundreds, thousands, ten-thousands, etc.). This is similar to how lossy encoding saves bitrate compared to lossless.
This gives a pretty good match for the frequency and amplitude of that tone when reconstructed, which our psychoacoustic model tells us is indistinguishable from the original on this occasion.
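The rounding step can be sketched in a few lines of Python. This is only the toy arithmetic from the example above, not how a real encoder quantizes: the function name `quantize` and the fixed step of 100 are my own illustrative choices.

```python
def quantize(values, step):
    """Round each non-negative integer to the nearest multiple of step.

    Returns the multiples themselves, i.e. only the "higher digits" to send.
    """
    return [(v + step // 2) // step for v in values]

# The twelve long-block values from the toy example
long_block = [30, 119, 475, 879, 3049, 10234, 4520, 960, 214, 53, 178, 422]

sent = quantize(long_block, 100)      # what would be transmitted
decoded = [q * 100 for q in sent]     # what the decoder reconstructs

print(sent)     # [0, 1, 5, 9, 30, 102, 45, 10, 2, 1, 2, 4]
print(decoded)  # [0, 100, 500, 900, 3000, 10200, 4500, 1000, 200, 100, 200, 400]
```

The transmitted numbers are much smaller than the originals, which is where the bitrate saving comes from.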
In a short block, there might be only a third of the number of samples and a third of the number of frequency bins, each representing a band three times wider but over a shorter time (e.g. 4 ms). So while the frequency resolution is poor, the time resolution for representing rapid changes in the signal is good. The same bunch of frequencies over the same 12 ms is now divided into three short blocks lasting 4 ms each; instead of 12 frequency components in one long block, each short block has just 4 frequency components, each three times broader in bandwidth.
        first 4ms          |        second 4ms         |        third 4ms
240Hz  480Hz  720Hz  960Hz | 240Hz  480Hz  720Hz  960Hz | 240Hz  480Hz  720Hz  960Hz
  375   3849   2022    111 |   208   4721   1898    218 |   142   5431   1468    102
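For more realistic numbers behind this trade-off, here is a rough Python sketch. It assumes an MDCT-style transform in which a block of N coefficients spans 0 Hz up to half the sample rate and each block advances by N samples. The 44100 Hz rate and the 576/192 block sizes are MP3-style figures that I've picked for illustration; they are not the values from the toy example above, and `block_resolution` is just a name made up for this sketch.

```python
def block_resolution(sample_rate_hz, coefficients):
    """Return (bin width in Hz, block duration in ms) for one transform block.

    Assumes the coefficients evenly cover 0..sample_rate/2 and that the block
    advances by `coefficients` samples (an MDCT-style simplification).
    """
    bin_width_hz = (sample_rate_hz / 2) / coefficients
    duration_ms = 1000 * coefficients / sample_rate_hz
    return bin_width_hz, duration_ms

long_bw, long_ms = block_resolution(44100, 576)    # assumed long block
short_bw, short_ms = block_resolution(44100, 192)  # assumed short block

print(f"long : {long_bw:6.1f} Hz per bin, {long_ms:5.2f} ms per block")
print(f"short: {short_bw:6.1f} Hz per bin, {short_ms:5.2f} ms per block")
```

Under these assumptions the short block is three times shorter in time and each of its bins is three times wider, so whatever you gain in time resolution you lose in frequency resolution.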
If the psychoacoustic model has detected that there's a transient and requested a short block, it might well assume that the frequency spectrum is fairly broad, which is true for purely transient noiselike signals like hi-hat cymbals, and might calculate that rounding to the nearest 100, say, is enough:
        first 4ms          |        second 4ms         |        third 4ms
240Hz  480Hz  720Hz  960Hz | 240Hz  480Hz  720Hz  960Hz | 240Hz  480Hz  720Hz  960Hz
  400   3800   2000    100 |   200   4700   1900    200 |   100   5400   1500    100
However, there are cases where there is both a strong transient component and a strong tonal component. One example I've tested a few times is the problem sample Angels Fall First. This has a close microphone on the right-channel guitarist's pick, producing strong clicking sounds (transients) as each string is picked. The strings' notes are strongly tonal, and the first string continues to sound as the next string is picked.
My guess is that the click of the pick triggers a short block to capture the short-duration sound. However, the bandwidth of each frequency bin in these three short blocks is now a good deal broader, and if the same rounding accuracy (e.g. to the nearest 100 in the above example) is used, it sounds as though the frequency or amplitude of the continuous tones from the strings that sound throughout is wavering slightly.
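A small Python sketch, using only the toy short-block numbers from above, shows the per-block rounding error. It doesn't prove anything about what is audible; it only illustrates that with the same step of 100 the error lands on a much broader bin and changes from one 4 ms block to the next. The names `round_to_step` and `errors` are my own.

```python
STEP = 100  # same rounding step as in the long-block example

def round_to_step(value, step=STEP):
    """Round a non-negative integer to the nearest multiple of step."""
    return ((value + step // 2) // step) * step

# Toy values for the 240/480/720/960 Hz bins in three consecutive 4 ms blocks
short_blocks = [
    [375, 3849, 2022, 111],
    [208, 4721, 1898, 218],
    [142, 5431, 1468, 102],
]

decoded = [[round_to_step(v) for v in block] for block in short_blocks]
errors = [[d - v for d, v in zip(db, ob)] for db, ob in zip(decoded, short_blocks)]

# Error in the strongest (480 Hz) bin, block by block
print([block_err[1] for block_err in errors])  # [-49, -21, -31]
```

In the long block, one rounding error applies to a narrow bin for the whole 12 ms; here a different error hits a bin three times as wide every 4 ms.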
To capture the short duration of the transient while preserving the sharp frequency spectrum of the tonal part of the signal over the whole time, a very high bitrate is required, so that these broad-bandwidth frequency bins get finer-than-usual rounding accuracy and still deliver fine frequency precision. This is a large part of what halb27's lame3.99.5z version does in the -Vn+ and -V0+eco settings when a short block is triggered, and I think it's why it solves the Angels Fall First problem sample.
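To get a feel for why finer rounding costs bitrate, here's a crude Python count of the bits needed for the largest toy short-block value at different step sizes. It assumes plain fixed-width binary numbers, which is my simplification: real encoders use scalefactors and entropy coding, so treat the figures as a trend rather than actual MP3 bit counts. `bits_for_peak` is a made-up helper name.

```python
def bits_for_peak(peak, step):
    """Fixed-width bits needed to send `peak` rounded to the nearest `step`."""
    level = (peak + step // 2) // step  # the integer that would be transmitted
    return max(1, level.bit_length())

peak = 5431  # largest short-block value in the toy example
for step in (100, 10, 1):
    print(f"step {step:3d}: {bits_for_peak(peak, step)} bits per coefficient")
```

Each tenfold refinement of the step adds roughly three or four bits per coefficient in this simple model, and that has to be paid in every short block.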
(The fact that there's a trade-off between fine rounding accuracy and high time and frequency precision is a subtle mathematical point in the field of windowed overlapping Fourier Transforms, and it's too advanced to explain in this context. There's some hope that the new Opus codec's band-by-band time/frequency preference will allow some frequency ranges to encode tonal components at low bitrate with poor time resolution while simultaneously providing good time resolution at low bitrate to other frequency bands.)