Best tweaks for encoding speech with Vorbis
2008-10-20 17:26:44
As the title says, I want to encode a lot of spoken messages (mainly plain human speech, mono) with Vorbis. I want to go as low in bitrate as possible without suffering too much from lossy-compression artifacts. My question is: has anybody found a reasonably optimal set of oggenc tweaks that gives nice-sounding audio at a low bitrate without audible artifacts?

In the past I have typically used 48 kbps compression, something like this:

    oggenc -o out.ogg --bitrate=48 --downmix src.wav

It yields sound that is better than MP3 (in my opinion; I may be wrong, since there are now so many more MP3 encoders). But the more I listen, the more I notice a strange "echo" here and there, especially on rich sounds like the American "are". The "echo" is somewhat like the "robot" voice in movies. I can upload a sample Vorbis stream to show it (please let me know how to upload it, I am new to this forum).

I have been using oggenc version 1.0.2 as provided by Ubuntu 7.04. The original stream has a 44.1 kHz sampling rate.

I tried a simple tweak: compiling aoTuV beta 5.5 (b5.5_20080330) and using its shared library in place of the stock libvorbis, by invoking oggenc through this kind of Bourne shell script:

    #!/bin/sh
    export LD_LIBRARY_PATH=/usr/local/aotuv-b5.5_20080330/lib
    exec oggenc "$@"

Still, the artifact is there.

As another attempt, I tried reducing the sampling rate using "ssrc" and then invoking oggenc. Here's what I got:

Encoding speech: TEST 04
Subdir: /data1/wirawan/test/vorbis/speech04
Sample: pet_30.flac

The sample was cut from LS Peter radio message #30 (1 minute length).
                         Sample  Bitrate                     File size
Filename                 rate    Nominal  Avg     Inflation  Actual   Inflation  Notes
                         (kHz)   (kbps)   (kbps)  (%)        (bytes)  (%)
16khz/oggenc-32kbps.ogg  16      32       29.81   -6.85      226969   -41.41
16khz/oggenc-48kbps.ogg  16      48       38.84   -19.09     294674   -23.93
16khz/oggenc-64kbps.ogg  16      64       48.52   -24.18     367650   -5.1
16khz/oggenc-80kbps.ogg  16      80       61.93   -22.59     468206   20.86
22khz/oggenc-32kbps.ogg  22      32       39.03   21.96      296110   -23.56
22khz/oggenc-48kbps.ogg  22      48       59.89   24.76      452540   16.82
22khz/oggenc-64kbps.ogg  22      64       75.60   18.12      570709   47.32
22khz/oggenc-80kbps.ogg  22      80       91.83   14.79      692450   78.74
32khz/oggenc-32kbps.ogg  32      32       37.83   18.22      287232   -25.86     Very robotic
32khz/oggenc-48kbps.ogg  32      48       55.48   15.59      419620   8.32       OK, but second man's voice is not great
32khz/oggenc-64kbps.ogg  32      64       65.38   2.16       493593   27.41
32khz/oggenc-80kbps.ogg  32      80       74.31   -7.12      560914   44.79
44khz/oggenc-32kbps.ogg  44      32       37.65   17.64      285854   -26.21
44khz/oggenc-48kbps.ogg  44      48       51.18   6.64       387396   Baseline
44khz/oggenc-64kbps.ogg  44      64       63.96   -0.06      482875   24.65
44khz/oggenc-80kbps.ogg  44      80       70.87   -11.41     534853   38.06

Bitrate inflation is the percent difference of the average kbps relative to the nominal (target) kbps. File size inflation is relative to the "baseline" 44khz/48kbps encoding.

Interesting! At the lower sampling rates (22 and 32 kHz), the file size is actually larger (at 48, 64, and 80 kbps). This could be a topic of its own, but my main question remains: how do I optimize compression versus quality?

One note that may be relevant: the original audio may not come directly from a raw source (that is, recorded directly, or from a faithful CD-quality recording). In the case above, it actually comes from a high-quality MP3 mono stream (which I guess is an 80 kbps mono stream).
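The two inflation columns in the table are plain percent differences, so they can be rechecked with a few lines of Python. This is only a sketch: the helper name inflation_percent is made up for illustration, and the numbers are copied from the 16khz/oggenc-32kbps.ogg row and the 44khz/48kbps baseline row.

```python
def inflation_percent(actual, reference):
    """Percent difference of `actual` relative to `reference` (hypothetical helper)."""
    return (actual - reference) / reference * 100.0


# Values copied from the 16khz/oggenc-32kbps.ogg row of the table.
nominal_kbps = 32
avg_kbps = 29.81
size_bytes = 226969

# File size of the baseline encoding, 44khz/oggenc-48kbps.ogg.
baseline_bytes = 387396

# Average bitrate versus the nominal (target) bitrate.
bitrate_inflation = inflation_percent(avg_kbps, nominal_kbps)

# File size versus the baseline file size.
size_inflation = inflation_percent(size_bytes, baseline_bytes)

print(f"bitrate inflation:   {bitrate_inflation:.2f} %")
print(f"file size inflation: {size_inflation:.2f} %")
```

The bitrate figure can differ from the table in the last digit, because the average kbps listed there is already rounded to two decimals.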
The Linux "file" utility reports the following for the original file (the filename is different, but the files are of the same kind):

    /data1/wirawan/test/vorbis/speech04 $ file /d/temp/ls/luk/Luke_01.mp3
    /d/temp/ls/luk/Luke_01.mp3: MPEG ADTS, layer III, v1, 160 kBits, 44.1 kHz, Monaural

Any help and pointers will be appreciated. Unfortunately I don't have time to study this matter deeply, so it would be best to get straight to the point and leave the deeper explanations (web pages, wiki) as side notes.

Wirawan