Best tweaks for encoding speech with Vorbis

2008-10-20 17:26:44

The title says it all: I want to encode a lot of spoken messages (mainly just human speech, mono) with Vorbis. I want to go as low in bitrate as possible without suffering too much from lossy-compression artifacts.

My question is: has anybody found relatively optimal oggenc settings that give nice-sounding audio at a low bitrate without audible lossy artifacts?

In the past I have typically used 48kbps compression:

oggenc -o out.ogg --bitrate=48 --downmix src.wav

Something like that. It yields sound that is better than MP3 (in my opinion; I could be wrong, since there are so many more MP3 encoders now), but the more I listen, the more I notice a strange "echo" here and there, especially on rich sounds like the American "are". The "echo" is somewhat like the "robot" voice in movies. I can upload a sample Vorbis stream to demonstrate it (please let me know how to upload it; I am new to this forum).

I have been using oggenc version 1.0.2 as provided by Ubuntu 7.04. The original stream has a 44kHz sampling rate. I tried a simple tweak: compiling aoTuV beta 5.5 (b5.5_20080330) and using its shared library in place of the stock Vorbis library, by invoking oggenc through this kind of Bourne shell script:

#!/bin/sh
export LD_LIBRARY_PATH=/usr/local/aotuv-b5.5_20080330/lib
exec oggenc "$@"

Still, the artifact is there.
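
One way to check that the wrapper really picks up the aoTuV build is sketched below. It assumes the library path from the script above and a previously encoded file named out.ogg (both are placeholders); vendor_line is a small helper made up for this example. The idea is to look at which libvorbis files the dynamic linker resolves, and at the vendor string that the encoding library writes into the stream header.

```shell
#!/bin/sh
# Path copied from the wrapper script above; adjust to your install.
AOTUV_LIB=/usr/local/aotuv-b5.5_20080330/lib

# Hypothetical helper: filter ogginfo output (on stdin) down to the vendor line.
vendor_line() {
    grep -i 'vendor'
}

if command -v oggenc >/dev/null 2>&1; then
    # Which libvorbis* files does the dynamic linker pick with the override set?
    LD_LIBRARY_PATH=$AOTUV_LIB ldd "$(command -v oggenc)" | grep -i vorbis \
        || echo "no libvorbis entry in ldd output"
fi

if command -v ogginfo >/dev/null 2>&1 && [ -f out.ogg ]; then
    # The encoding library records its identity in the stream header.
    ogginfo out.ogg | vendor_line
fi
```

If the vendor line still names the stock Xiph.Org encoder after re-encoding through the wrapper, the override is probably not taking effect.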

As another attempt, I tried reducing the sampling rate using "ssrc" before invoking oggenc. Here's what I got from oggenc-ing the resampled streams:

Encoding speech: TEST 04

Subdir: /data1/wirawan/test/vorbis/speech04
Sample: pet_30.flac
The sample was cut from LS Peter radio message #30 (1 minute long).

                         Sample          Bitrate                 File size
  Filename                 rate  Nominal     Avg  Inflation   Actual  Inflation  Notes
                          (kHz)   (kbps)  (kbps)        (%)  (bytes)        (%)
  16khz/oggenc-32kbps.ogg    16       32   29.81      -6.85   226969     -41.41
  16khz/oggenc-48kbps.ogg    16       48   38.84     -19.09   294674     -23.93
  16khz/oggenc-64kbps.ogg    16       64   48.52     -24.18   367650      -5.10
  16khz/oggenc-80kbps.ogg    16       80   61.93     -22.59   468206      20.86
  22khz/oggenc-32kbps.ogg    22       32   39.03      21.96   296110     -23.56
  22khz/oggenc-48kbps.ogg    22       48   59.89      24.76   452540      16.82
  22khz/oggenc-64kbps.ogg    22       64   75.60      18.12   570709      47.32
  22khz/oggenc-80kbps.ogg    22       80   91.83      14.79   692450      78.74
  32khz/oggenc-32kbps.ogg    32       32   37.83      18.22   287232     -25.86  Very robotic
  32khz/oggenc-48kbps.ogg    32       48   55.48      15.59   419620       8.32  OK, but second man's voice is not great
  32khz/oggenc-64kbps.ogg    32       64   65.38       2.16   493593      27.41
  32khz/oggenc-80kbps.ogg    32       80   74.31      -7.12   560914      44.79
  44khz/oggenc-32kbps.ogg    44       32   37.65      17.64   285854     -26.21
  44khz/oggenc-48kbps.ogg    44       48   51.18       6.64   387396   Baseline
  44khz/oggenc-64kbps.ogg    44       64   63.96      -0.06   482875      24.65
  44khz/oggenc-80kbps.ogg    44       80   70.87     -11.41   534853      38.06

  Inflation is the percent inflation of the average kbps relative to
  the nominal (target) kbps.

  File size inflation is against the "baseline" of 44khz/48kbps encoding.
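
For anyone wanting to reproduce the two inflation columns, both are the same percent-difference formula applied to different pairs of numbers. A minimal sketch (pct_change is a name made up for this example; results can differ from the table in the last digit, presumably because the table was computed from unrounded averages):

```shell
#!/bin/sh
# pct_change VALUE REFERENCE
# Prints how far VALUE lies above (+) or below (-) REFERENCE, in percent.
pct_change() {
    awk -v v="$1" -v r="$2" 'BEGIN { printf "%.2f\n", (v - r) / r * 100 }'
}

# Bitrate inflation: average kbps against nominal kbps (16 kHz, 32 kbps row).
pct_change 29.81 32

# File size inflation: actual bytes against the 44khz/48kbps baseline.
pct_change 226969 387396
```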

Interesting! At the lower sampling rates (22 and 32kHz), the file size is actually larger (at 48, 64, and 80 kbps). That could be a topic of its own, but my main question remains: how do I optimize the compression-vs-quality trade-off?

For reference, this may be relevant: the original audio may not come directly from a raw source (that is, recorded directly, or from a faithful CD-quality recording). In the case above, it actually comes from a high-quality MP3 mono stream (which I guess is an 80kbps mono stream).

The Linux "file" utility yields the following information for the original file (the filename is different, but the files are of the same kind):
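
A test matrix like the one above could be scripted roughly as follows. This is only a sketch under several assumptions: it uses oggenc's own --resample option rather than ssrc (so it is not the exact procedure used for the table), the exact target rates (16000, 22050, 32000, 44100 Hz) are a guess, and src.wav is a placeholder for an already-mono input. By default it only prints the commands; set RUN=1 to actually encode.

```shell
#!/bin/sh
# Placeholder input file; override with SRC=/path/to/file.wav
SRC=${SRC:-src.wav}

# Print (dry run) or execute one oggenc command per rate/bitrate combination.
encode_matrix() {
    for rate in 16000 22050 32000 44100; do
        dir="$((rate / 1000))khz"          # 16khz, 22khz, 32khz, 44khz
        for kbps in 32 48 64 80; do
            out="$dir/oggenc-${kbps}kbps.ogg"
            if [ "${RUN:-0}" = 1 ]; then
                mkdir -p "$dir"
                # Add --downmix here for stereo sources.
                oggenc --resample "$rate" -b "$kbps" -o "$out" "$SRC"
            else
                echo "oggenc --resample $rate -b $kbps -o $out $SRC"
            fi
        done
    done
}

encode_matrix
```

Keeping the output names identical to the table makes it easy to compare file sizes from a different resampler or encoder build side by side.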

/data1/wirawan/test/vorbis/speech04 $ file /d/temp/ls/luk/Luke_01.mp3
/d/temp/ls/luk/Luke_01.mp3: MPEG ADTS, layer III, v1, 160 kBits, 44.1 kHz, Monaural

Any help and pointers will be appreciated. Unfortunately I don't have time to study this matter deeply, so it would be best to get straight to the point and leave deeper explanations (web pages, wiki) as side notes.

Wirawan

Reply #1 – 2008-10-20 21:38:06

Is there a special reason why you want to use Vorbis?
Speex http://www.speex.org/ is specifically designed for voice recordings.
http://en.wikipedia.org/wiki/Speex

Reply #2 – 2008-10-21 02:46:55

I did try Speex a little, but I did not find it very satisfactory. Probably I wasn't trying seriously. Another problem, as many other members have already pointed out, is that Speex is not widely supported on anything other than computers. It is not yet supported on small hardware like portable audio players. I want to create Ogg files that can be played on computers and portable audio players alike.

Reply #3 – 2008-10-21 05:02:55

Quote

I did try speex a little bit, but I did not find it very satisfactory. probably I wasn't trying seriously.

Did you try ultra-wideband mode? Speex also has echo cancellation.

Quote

Another problem, as many other members already point out, is that speex is not widely available on systems other than "computer".

It is supported by the Rockbox open-source firmware, which runs on many DAPs. Take a look at the website:

http://www.rockbox.org/twiki/bin/view/Main/WhyRockbox

Reply #4 – 2008-10-21 22:28:33

If you're still open to the idea of using mp3 for your application, try LAME. I find the following parameters to provide amazingly small files that are transparent for me:

lame -V8 -m m --resample 24

If you have the time, try it and let us know what you think.

Reply #5 – 2008-12-02 15:48:00

FWIW, I have some Vorbis files. I don't recall the options used, but they show as mono, 44.1 kHz sampling, 30 kbps.

They play OK in the dBpoweramp player and on my Rockbox Sansa, but won't play in foobar2000 or Winamp.

If I recall correctly, when I first started playing with mono, dBpoweramp played it back at double speed (as if it split the available mono samples between the L and R channels), but Spoon fixed it promptly when I reported the problem.