I have a couple hundred hours of recorded speech (recorded in mono, 16-bit, mostly 48kHz but some 44.1kHz) I need to make available on the Internet, and as much as I'd like to use Opus for this, it's probably going to have to be MP3 for compatibility reasons. A bit of off-the-cuff testing shows that somewhere around 40-48 kbps seems to be fine for my purposes, but I'd like to be somewhat more confident about the choices I'm making before I start encoding all this. Thought I'd use the opportunity to educate myself a little while I'm at it too.
I know LAME automatically resamples the input when targeting low bitrates. I'd like to know exactly how it determines the output sample rate. I'm not fantastic with C (reading others' source takes me an inordinate amount of time) and I didn't find the relevant code with a little use of grep in the LAME sources. Could somebody point me to where in the sources this decision is made?
Also, how are the thresholds for switching to lower sample rates tuned? Might I be better off resampling to a lower rate than LAME would normally choose, since speech has so much less energy / useful information at high frequencies than music?
One thing that I did come across when looking for how the output rate is set was the following line from lame.c:
cfg->mode_gr = cfg->samplerate_out <= 24000 ? 1 : 2; /* Number of granules per frame */
I don't know much about the details of the MP3 format, but my initial guess is that using only one granule per frame increases the overhead from headers but allows for better accuracy in seeking etc. Since a granule at 24kHz is only 24ms this doesn't strike me as a very good guess. Could someone enlighten me about the reason for this switch? What other thresholds/decision points in either bitrate or sample rate are interesting or might be worth being informed about?
Could someone enlighten me about the reason for this switch?
MPEG-1 Layer III frames consist of two granules (you may call them "sub-frames" if that helps you). MPEG-2 LSF Layer III frames consist of only one granule.
So the reason is simply to follow the Layer III spec.
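One observable consequence (not necessarily the committee's reasoning) is that frame duration stays in the same range at the lower rates instead of doubling. A Layer III granule is 576 samples, so the arithmetic can be sketched as below. The function names are made up for this post and are not LAME functions; only the <= 24000 rule comes from the quoted lame.c line.

```python
SAMPLES_PER_GRANULE = 576  # Layer III granule length, in samples per channel


def granules_per_frame(samplerate_out):
    """Mirror of the quoted lame.c rule: 1 granule at <= 24 kHz, otherwise 2."""
    return 1 if samplerate_out <= 24000 else 2


def frame_duration_ms(samplerate_out):
    """Duration of one Layer III frame in milliseconds."""
    samples = SAMPLES_PER_GRANULE * granules_per_frame(samplerate_out)
    return 1000.0 * samples / samplerate_out


for rate in (48000, 44100, 32000, 24000, 22050, 16000):
    print(f"{rate:>6} Hz: {granules_per_frame(rate)} granule(s), "
          f"{frame_duration_ms(rate):.1f} ms per frame")
```

With two granules at 24kHz a frame would last 48ms; with one it lasts 24ms, the same as a two-granule frame at 48kHz.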
The --resample and --lowpass switches give you good control for fine-tuning the result the way you like it (within what's possible with mp3).
Use for instance --resample 11.025 --lowpass 4.5 for speech if you're not content with the defaults for your purposes.
Use lame --longhelp and watch out for the possible resampling frequencies given at the end of the help. I guess the MPEG-2.5 values are those that best suit your needs.
I suggest you use a rather good quality -V setting like -V5 together with --resample 11.025 --lowpass 4.5 or similar. Should give you very low bitrate for mono recordings in the range you consider (or even a bit lower) as well as a decent quality for speech. You can use a fractional -V value like -V5.5 in case that is useful to you.
An alternative to using -V5 is using --abr 35 or similar. In the low bitrate range ABR is sometimes considered to be superior to VBR.
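If you end up scripting the batch, a minimal sketch of building the two command lines might look like this. The helper names and the file names are hypothetical; the switches are only the ones mentioned above, and the defaults are my suggested values, not anything LAME picks itself. Nothing is executed here, the lists are just printed.

```python
import shlex


def lame_speech_abr_args(src, dst, abr_kbps=35, resample_khz=11.025, lowpass_khz=4.5):
    """Argument list for an ABR encode with explicit resampling and lowpass."""
    return ["lame", "--abr", str(abr_kbps),
            "--resample", str(resample_khz),
            "--lowpass", str(lowpass_khz),
            src, dst]


def lame_speech_vbr_args(src, dst, v_quality=5, resample_khz=11.025, lowpass_khz=4.5):
    """Same idea using a (possibly fractional) -V quality setting instead of ABR."""
    return ["lame", "-V", str(v_quality),
            "--resample", str(resample_khz),
            "--lowpass", str(lowpass_khz),
            src, dst]


# Hypothetical file names, shown only to illustrate the resulting command lines.
print(shlex.join(lame_speech_abr_args("talk01.wav", "talk01.mp3")))
print(shlex.join(lame_speech_vbr_args("talk01.wav", "talk01.mp3", v_quality=5.5)))
```

Each list could be handed to subprocess.run once you are happy with the settings.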
Thanks, lvqcl. I still wonder what the practical upshot is and why MPEG made it that way.
halb27- I don't think I need to use the <16kHz MPEG-2.5 rates. Though speech is certainly comprehensible in narrowband and 12kHz isn't bad, there is a noticeable quality difference, and since nothing's forcing me to use <32kbps bitrates I don't imagine I need to make that tradeoff. I'll probably stick with 16, 22, or 24.
Also, I gather that some mp3 players may not be able to play the MPEG-2.5 sample rates (MPEG 2.5 was never really standardized, it was a proprietary Fraunhofer extension). Probably not much of an issue but I don't really know.
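For reference, here is how the Layer III sample rates group by MPEG version, as a small lookup. The names are made up for illustration; whether any particular player actually rejects MPEG-2.5 files is not something this checks.

```python
# Layer III output sample rates (Hz) grouped by MPEG version.
MPEG_VERSION_BY_RATE = {
    48000: "MPEG-1", 44100: "MPEG-1", 32000: "MPEG-1",
    24000: "MPEG-2", 22050: "MPEG-2", 16000: "MPEG-2",
    12000: "MPEG-2.5", 11025: "MPEG-2.5", 8000: "MPEG-2.5",
}


def needs_mpeg25(samplerate_hz):
    """True if the given output rate is only available as MPEG-2.5."""
    return MPEG_VERSION_BY_RATE[samplerate_hz] == "MPEG-2.5"


for rate in (24000, 22050, 16000, 12000, 11025):
    print(rate, MPEG_VERSION_BY_RATE[rate], "MPEG-2.5 needed:", needs_mpeg25(rate))
```

So 16, 22.05 and 24kHz all stay within MPEG-2, and only going below 16kHz brings MPEG-2.5 into play.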
I have been using ABR; I had thought the recommendation to use it rather than VBR for sub-64kbps was pretty definite. Maybe there's more debate on that question than I realized.
Still looking for where in the source the default output sample rate is determined...
In lame.c, function int optimum_samplefreq(int lowpassfreq, int input_samplefreq):
/*
* Rules:
* - if possible, sfb21 should NOT be used
*
*/
...
if (lowpassfreq <= 15250)
    suggested_samplefreq = 32000;
if (lowpassfreq <= 11220)
    suggested_samplefreq = 24000;
if (lowpassfreq <= 9970)
    suggested_samplefreq = 22050;
...
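Since reading C is slow going for you, here is a partial transliteration of just the three quoted thresholds. It is a sketch, not the real function: the elided parts are not reproduced, and starting from the input rate is my assumption. Note the plain if statements (not else-if), so each later, lower threshold overrides the earlier result.

```python
def suggested_samplefreq_partial(lowpassfreq, input_samplefreq):
    """Partial sketch covering only the three thresholds quoted above."""
    # Assumption: start from the input rate; the real initialisation is elided above.
    suggested = input_samplefreq
    # Independent ifs: the last matching (lowest) threshold wins.
    if lowpassfreq <= 15250:
        suggested = 32000
    if lowpassfreq <= 11220:
        suggested = 24000
    if lowpassfreq <= 9970:
        suggested = 22050
    # The lower thresholds hidden behind the "..." are NOT reproduced here,
    # so results for lowpass values well below 9970 are not meaningful.
    return suggested


for lowpass in (15000, 11250, 10500, 8250):
    print(lowpass, "->", suggested_samplefreq_partial(lowpass, 48000))
```

For a 48kHz input this gives 32000, 32000, 24000 and 22050 for those four lowpass values. 11250 lands on 32000 because it is just above the 11220 threshold.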
About ABR: are you using LAME 3.99.x or an earlier version?
Well, that simply means we need to find out how the lowpass frequency gets set. For ABR that turns out to be fairly simple: the optimum_bandwidth function is called, giving a lowpass frequency which depends only on the target bitrate; the result is then multiplied by 1.5 for mono, giving us the following table of lowpass frequencies and resampling rates:
bitrate >= (kbps)   lowpass freq (Hz)   sampling rate (Hz)
60                  16500               48000
52                  15000               32000
44                  11250               32000
36                  10500               24000
28                  8250                22050
20                  5850                16000
12                  5550                16000
0                   3000                8000
This doesn't seem particularly carefully tuned. I see no reason why just multiplying the stereo lowpass frequencies by 1.5 should work all the way across this range of bitrates, and this completely skips the 44.1kHz, 12kHz, and 11.025kHz sampling rates.
I had thought that I'd be learning more about what makes sense from LAME's carefully tuned defaults. While I imagine the stereo ABR defaults have been carefully tuned, it may not be at all difficult to improve on the above for mono, and it would be simple to cobble together a patch implementing such improvements.
The Lame defaults have music in mind, not speech.
For comparison, here's the corresponding table for stereo ABR:
bitrate >= (kbps)   lowpass freq (Hz)   sampling rate (Hz)
120                 17000               48000
104                 15600               44100
88                  15100               32000
72                  13500               32000
60                  11000               24000
52                  10000               24000
44                  7500                22050
36                  7000                16000
28                  5500                16000
20                  3900                8000
12                  3700                8000
0                   2000                8000
12kHz and 11.025kHz are again skipped. Curious. BTW in both cases there are higher lowpass frequency cutoffs at higher bitrates, all the way up to 320kbps, but I only included the lower bitrates where there's more of a difference in lowpass frequency and where resampling comes into play.
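As a sanity check on the 1.5x relationship, the two tables I posted can be compared directly. This only cross-checks the numbers typed into this thread; it does not call LAME, and the dictionary and function names are made up for the check.

```python
# Lowpass frequencies (Hz) keyed by the "bitrate >=" column (kbps), copied from the tables.
STEREO_LOWPASS = {120: 17000, 104: 15600, 88: 15100, 72: 13500, 60: 11000,
                  52: 10000, 44: 7500, 36: 7000, 28: 5500, 20: 3900,
                  12: 3700, 0: 2000}
MONO_LOWPASS = {60: 16500, 52: 15000, 44: 11250, 36: 10500, 28: 8250,
                20: 5850, 12: 5550, 0: 3000}

MONO_FACTOR = 1.5  # multiplier applied to the stereo lowpass for mono input


def mono_lowpass_from_stereo(bitrate_kbps):
    """Scale the stereo-table lowpass for a bitrate row by the mono factor."""
    return int(STEREO_LOWPASS[bitrate_kbps] * MONO_FACTOR)


mismatches = [b for b, lp in MONO_LOWPASS.items()
              if mono_lowpass_from_stereo(b) != lp]
print("rows that do not match the 1.5x rule:", mismatches)
```

Every mono row shown comes out as exactly 1.5 times the stereo row for the same bitrate.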
Also, I am suddenly finding it annoying that HA doesn't support the BBCode for tables and the only option seems to be to go back to the fixed-width ascii-art past.
halb27, I'm aware of that- that's why I wanted to learn more about this in the first place, since if LAME had modes tuned for speech I would have just trusted it to make good decisions on its own. But some of these, esp in mono, don't seem at first glance like they would make sense for music either.
Of course I'm no expert, I have no idea what goes on with the psymodel, and the LAME devs have been at it for quite a while. Maybe there are very good reasons for every single odd-looking behavior. But I imagine most of the tuning effort has gone into higher-bitrate stereo VBR (the -V6 to -V2 "sweet spot" everybody wants to use for encoding their CDs) rather than low-bitrate mono ABR, and I've heard some people claim that some other mp3 encoders outperform LAME at low bitrates (though I've not seen this proven).
BTW, lvqcl- I've been using 3.99.4. Why do you ask? Were there ABR-related changes in 3.99 which didn't show up on the changelog?
The changes in CBR/ABR modes are mentioned in the ChangeLog and history.html files:
LAME 3.99 beta 0 not officially released
All encoding modes use the PSY model from new VBR code, addresses Bugtracker item [ 3187397 ] Strange compression behavior
btw, 3.99.5 was released several days ago.
jensend, as the Lame defaults have music in mind and probably are not excessively tuned for very low bitrate mono sources, why don't you just use a resampling frequency and lowpass according to your likings?
Obviously you know what you're doing and have reasonable settings in mind, and you want to use Lame for specific purposes. Do you expect miracles from the Lame defaults?
As for the quality: when I proposed -V5/--abr 35 --resample 11.025 --lowpass 4.5 I did a test with some speech from my smartphone, and to me the quality was very decent. If your speech source is of very good quality, it's of course better to use a higher sampling frequency and lowpass (and ABR setting), but as far as quality goes I think Lame is a good choice. I guess you tried it. What were your findings?
One of the advantages of Lame is that you can choose parameters like sampling frequency and lowpass according to your specific needs. You can't expect that from other encoders.