Information on lossy audio coding and the MPEGplus project
Information on lossy audio compression
Information, software, links about the MPEGplus project
Why apply lossy audio compression:
Storing audio in Compact Disc
format requires a large amount of data (44.1 kHz * 2 channels * 16 bit
results in 1411.2 kbit/s, or about 10 MByte/min). This format therefore
needs high transfer or storage capacities. The goal of data reduction
is to reduce the required bitrate while retaining the perceived quality.
For example, 74 minutes of music at nearly CD quality
can be stored on Sony's MiniDisc at a compression ratio of 1:4.8. MP3 compresses at
about 1:11, but with audible distortion.
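The numbers above follow from simple arithmetic. The Python sketch below reproduces them; the coded bitrates used for MiniDisc (292 kbit/s, the commonly quoted ATRAC rate) and MP3 (128 kbit/s) are assumptions, since the text only gives the ratios.

```python
def pcm_bitrate_kbps(sample_rate_hz: float = 44100, channels: int = 2,
                     bits_per_sample: int = 16) -> float:
    """Uncompressed PCM bitrate in kbit/s."""
    return sample_rate_hz * channels * bits_per_sample / 1000.0


def megabytes_per_minute(bitrate_kbps: float) -> float:
    """Storage needed per minute in MByte (10**6 bytes) for a given bitrate."""
    bytes_per_second = bitrate_kbps * 1000.0 / 8.0
    return bytes_per_second * 60.0 / 1e6


def compression_ratio(coded_kbps: float, reference_kbps: float = 1411.2) -> float:
    """How many times smaller the coded stream is than CD-format PCM."""
    return reference_kbps / coded_kbps


if __name__ == "__main__":
    cd = pcm_bitrate_kbps()
    print(f"CD audio: {cd:.1f} kbit/s, {megabytes_per_minute(cd):.2f} MByte/min")
    # Assumed coded bitrates, not taken from the text:
    print(f"MiniDisc (assumed 292 kbit/s): 1:{compression_ratio(292):.1f}")
    print(f"MP3 (assumed 128 kbit/s): 1:{compression_ratio(128):.1f}")
```

This yields 1411.2 kbit/s (about 10.6 MByte per minute) and ratios of roughly 1:4.8 and 1:11 for the assumed bitrates.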
How does it work:
-
From the point of view of data reduction, the main disadvantage of PCM coding (Pulse Code Modulation) is its constant
quantization resolution over the full bandwidth of the audio signal.
Less complex compression algorithms (MPEG-1 Layer-1
and Layer-2) decompose the broadband signal into so-called subbands. In
each subband the time signal is quantized with its own quantization
resolution. The more complex algorithms (MPEG-1 Layer-3, AAC, ...)
transform the signal into the frequency domain and quantize its spectrum.
To quantize no more finely than the psychoacoustic model demands,
the time-signal samples (Layer-1 and -2), or the spectral
coefficients (Layer-3 and AAC), are represented
in a floating-point-like structure. Within this structure
the exponent (scalefactor) represents the energy of the subband
signal (or of the spectral coefficients), and quantization is applied to the mantissa.
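The floating-point-like representation can be illustrated with a small block-floating-point sketch. It is hypothetical and not the MPEG or MPEGplus scalefactor table: the scalefactor is simply the block peak rounded up to a 2 dB grid, and the mantissas are the normalized samples quantized uniformly with a chosen number of bits.

```python
import math


def encode_block(samples: list[float], mantissa_bits: int) -> tuple[float, list[int]]:
    """Block-floating-point coding of one subband block (illustrative only).

    The scalefactor ('exponent') carries the level of the block; the
    mantissas are the samples divided by the scalefactor and quantized.
    """
    if mantissa_bits < 2:
        raise ValueError("need at least 2 bits for a symmetric quantizer")
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return 0.0, [0] * len(samples)
    # Round the peak level up to a hypothetical 2 dB grid.
    level_db = 20.0 * math.log10(peak)
    scalefactor = 10.0 ** (math.ceil(level_db / 2.0) * 2.0 / 20.0)
    levels = 2 ** (mantissa_bits - 1) - 1  # integer mantissas in -levels..+levels
    mantissas = [round(s / scalefactor * levels) for s in samples]
    return scalefactor, mantissas


def decode_block(scalefactor: float, mantissas: list[int], mantissa_bits: int) -> list[float]:
    """Inverse of encode_block: mantissa times quantizer step size."""
    levels = 2 ** (mantissa_bits - 1) - 1
    return [m * scalefactor / levels for m in mantissas]
```

Fewer mantissa bits mean a coarser step (scalefactor / levels) and therefore more quantization noise; this is the resolution the psychoacoustic model controls per subband.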
-
The biggest reduction in bitrate can be achieved
through the exploitation of psychoacoustic effects. These effects were determined by measurements on human subjects and describe
the perceptual capabilities of human hearing. During encoding,
a model of human hearing estimates how much distortion can
be added to the audio signal (in the frequency and time domains) without the
distortion becoming audible. If distortion is inevitable, the encoder estimates
where it should be placed to be least disturbing.
-
If the quantization of time signal samples
or spectral coefficients is controlled by the results of the psychoacoustic
model, different resolutions of quantization can be assigned to the different
frequency regions. This results in adding (hopefully) inaudible noise to
the audio signal.
Psychoacoustics:
-
Threshold in quiet:
This is the sound pressure level (SPL in dB) below which most people
cannot perceive a sine tone
(dashed line). In most psychoacoustic models the threshold in quiet is
normalized so that a sine wave with an amplitude
of +/- 0.5 in 16-bit format corresponds to 0 dB sound pressure level. The problem
with this approach is that the exact SPL during playback is unknown.
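The shape of the threshold in quiet is often approximated by Terhardt's formula. The sketch below uses that approximation, which is not necessarily the curve used by the models described here, together with the 16-bit normalization from the text (an amplitude of +/- 0.5 maps to 0 dB SPL).

```python
import math


def threshold_in_quiet_db(f_hz: float) -> float:
    """Terhardt's approximation of the threshold in quiet, in dB SPL."""
    k = f_hz / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)


def amplitude_to_spl_db(amplitude: float) -> float:
    """Map a 16-bit sine amplitude to dB SPL using the normalization
    'amplitude +/- 0.5 -> 0 dB SPL'. This assumes a playback level; the
    real SPL during playback is unknown."""
    return 20.0 * math.log10(amplitude / 0.5)
```

Under this normalization a full-scale 16-bit sine (amplitude 32767) maps to about 96.3 dB SPL, and the threshold curve dips below 0 dB around 3 to 4 kHz and rises steeply above about 15 kHz.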

-
Simultaneous masking:
During acoustic excitation, the threshold of perception is raised depending
on the spectrum of the signal. All signal components below this threshold,
including quantization noise, are imperceptible. Simultaneous masking
is the strongest masking effect. The psychoacoustic models described here
calculate the masking threshold in several steps:
-
Short-time Fourier analysis of the
audio signal by FFT
-
The amplitude and phase of each spectral bin are
predicted; a small prediction error (unpredictability) indicates stationary
spectral components such as sine waves
-
Accumulating spectral energy in critical
bands (scale: Bark)
-
Spreading: The masking effect of each critical band
on the other critical bands is described by the spreading
function. The spreading function consists of two falling straight lines (in dB/Bark),
whose slopes depend on the center frequency
of the masking band and on the sound pressure level in this band.
-
Tonality measurement: From the unpredictability,
a factor is calculated that represents the tonality (a kind of sine-likeness)
of each spectral bin. This is necessary because sinusoidal maskers
produce a lower masking threshold than noise-like maskers. Sinusoidal
maskers therefore require a higher quantization resolution than noise-like
maskers.
-
Combining the spread signal with
the measured tonality yields the overall masking threshold. No
distortion below this threshold is perceived by human hearing.
-
Comparing the masking threshold with the threshold
in quiet yields the global masking threshold: the maximum of the masking
threshold and the threshold in quiet is taken as the global masking threshold.
-
Calculation of the signal-to-mask ratio
(SMR) in each subband, which is needed for quantization
control (bit allocation).
This is the minimum SNR the quantizer must deliver in each subband.
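The prediction and tonality steps of the list above can be sketched as follows. The sketch follows the approach of ISO/IEC 11172-3 psychoacoustic model 2 (linear extrapolation of magnitude and phase from the two previous frames, and the mapping -0.299 - 0.43 ln c), applied here per bin instead of per partition for brevity; whether the models described here use exactly these constants is an assumption.

```python
import cmath
import math


def unpredictability(x_now: complex, x_prev1: complex, x_prev2: complex) -> float:
    """Unpredictability of one FFT bin from the two previous frames.

    Magnitude and phase are extrapolated linearly; the normalized distance
    between the actual and the predicted bin is returned
    (0 = perfectly predictable, at most 1).
    """
    r_pred = 2.0 * abs(x_prev1) - abs(x_prev2)
    phi_pred = 2.0 * cmath.phase(x_prev1) - cmath.phase(x_prev2)
    predicted = r_pred * complex(math.cos(phi_pred), math.sin(phi_pred))
    denom = abs(x_now) + abs(r_pred)
    return abs(x_now - predicted) / denom if denom > 0.0 else 0.0


def tonality(c: float) -> float:
    """Map unpredictability to a tonality index in [0, 1] (1 = sine-like)."""
    c = max(c, 1e-12)  # avoid log(0) for perfectly predictable bins
    return min(1.0, max(0.0, -0.299 - 0.43 * math.log(c)))
```

For a stationary sine the phase advances by the same amount every frame, so the prediction is exact and the tonality is 1; noise-like bins with large prediction errors get a tonality near 0.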
Figure: global masking threshold of a 1 kHz sine wave
Figure: global masking threshold of 4 sine waves
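The remaining steps (Bark scale, spreading, tonality-dependent offset, comparison with the threshold in quiet, SMR) can be combined into a compact sketch. It is assembled from well-known textbook ingredients, not taken from MPEGplus: Zwicker's Bark formula, Terhardt's two-slope spreading function (27 dB/Bark towards lower frequencies, a level- and frequency-dependent slope towards higher ones) and Johnston's offsets of 14.5 + z dB for tonal and 5.5 dB for noise-like maskers. Tonality values are passed in as inputs.

```python
import math


def bark(f_hz: float) -> float:
    """Zwicker's approximation of the critical-band rate in Bark."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def spreading_db(dz: float, masker_freq_hz: float, masker_level_db: float) -> float:
    """Two straight slopes in dB/Bark (Terhardt-style); dz = z_maskee - z_masker."""
    if dz <= 0.0:
        return 27.0 * dz  # towards lower frequencies: fixed 27 dB/Bark
    upper = -24.0 - 230.0 / masker_freq_hz + 0.2 * masker_level_db
    return upper * dz     # towards higher frequencies: flatter for loud maskers


def masking_offset_db(tonality: float, z_masker: float) -> float:
    """Tonal maskers mask less than noise-like ones (Johnston-style offsets)."""
    return tonality * (14.5 + z_masker) + (1.0 - tonality) * 5.5


def global_threshold_db(maskers: list[tuple[float, float, float]],
                        f_hz: float, quiet_db: float) -> float:
    """maskers: (frequency in Hz, level in dB, tonality). The powers of the
    individual masking contributions are added, then the result is compared
    with the threshold in quiet and the maximum is returned."""
    z = bark(f_hz)
    power = 0.0
    for mf, ml, t in maskers:
        zm = bark(mf)
        level = ml + spreading_db(z - zm, mf, ml) - masking_offset_db(t, zm)
        power += 10.0 ** (level / 10.0)
    masked_db = 10.0 * math.log10(power) if power > 0.0 else -math.inf
    return max(masked_db, quiet_db)


def smr_db(signal_level_db: float, thresholds_in_band_db: list[float]) -> float:
    """SMR of a subband: signal level minus the minimum global threshold in it."""
    return signal_level_db - min(thresholds_in_band_db)
```

Adding contributions in the power domain is one common choice; real models often spread the band energies first and apply the tonality offset per partition afterwards.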
-
Temporal masking:
Simultaneous masking builds up (starting from the threshold in quiet)
within a few milliseconds before a sound event and decays within 100 -
200 ms after the end of the sound event. Here, too, no distortion
is audible below this temporal masking threshold.
-
Pre-masking cannot be exploited to
reduce irrelevancy. Rather, the encoder has to check whether so-called pre-echoes
could become audible. Pre-echoes occur because the analysis of the
input signal, which is designed for stationary conditions, has poor time resolution
(especially in transform coders), so that the quantization noise of a following attack
is smeared into the current frame, ahead of the attack itself. When encoding transient signals,
these encoders therefore have to switch to short blocks for better resolution in the
time domain.
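Whether to switch to short blocks is usually decided by a transient detector. A minimal sketch, assuming a simple energy-ratio criterion between consecutive sub-blocks; the sub-block count of 8 and the 10:1 ratio are hypothetical values, not parameters of any particular encoder.

```python
def detect_attack(frame: list[float], n_sub: int = 8,
                  ratio_threshold: float = 10.0, floor: float = 1e-9) -> bool:
    """Return True if the energy jumps by more than ratio_threshold
    between two consecutive sub-blocks of the frame."""
    size = len(frame) // n_sub
    energies = [sum(x * x for x in frame[i * size:(i + 1) * size])
                for i in range(n_sub)]
    for prev, cur in zip(energies, energies[1:]):
        if cur > ratio_threshold * max(prev, floor):  # floor avoids 0 * ratio
            return True
    return False


def choose_block_type(frame: list[float]) -> str:
    """Hypothetical block switching: short blocks only around attacks."""
    return "short" if detect_attack(frame) else "long"
```

A frame that is silent in its first half and loud in its second half is classified as "short"; a steady signal stays with long blocks and keeps the better frequency resolution.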
-
Post-masking can be exploited only to a small degree.
If the previous frame had a higher masking threshold than the current one,
this older masking threshold, although decaying, can dominate the masking
in the current frame.
The time constant of post-masking
(the steepness of the falling masking threshold) depends on the duration of
the previous sound event. For sound events lasting 200 ms or longer,
the graph below approximately describes the masking. After short sound events
the masking threshold falls off more quickly, so the psychoacoustic
model must estimate the duration of sound events to control the time constants
of temporal masking.
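The post-masking behaviour described above can be sketched as an exponential decay of the previous frame's threshold, with a time constant that grows with the duration of the preceding sound event. The exponential shape, the 10 ms and 40 ms time constants and the linear interpolation up to 200 ms are hypothetical choices for illustration.

```python
import math


def post_masking_tau_ms(event_duration_ms: float, tau_short_ms: float = 10.0,
                        tau_long_ms: float = 40.0, long_event_ms: float = 200.0) -> float:
    """Time constant of the decaying threshold: short events decay faster
    (hypothetical linear interpolation, saturating at long_event_ms)."""
    frac = min(max(event_duration_ms / long_event_ms, 0.0), 1.0)
    return tau_short_ms + frac * (tau_long_ms - tau_short_ms)


def post_masked_threshold_db(current_db: float, previous_db: float,
                             frame_ms: float, event_duration_ms: float) -> float:
    """Let the previous frame's threshold decay exponentially (in power)
    over one frame and keep whichever threshold is higher."""
    tau = post_masking_tau_ms(event_duration_ms)
    decayed_db = previous_db - 10.0 * math.log10(math.e) * frame_ms / tau
    return max(current_db, decayed_db)
```

After a long event the old threshold stays higher for longer than after a short one, which is exactly why the model has to estimate event durations.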
Quantization:
-
Quantization, using the example of a subband
coder:
The reduction of irrelevancy is based
on controlling the quantization through the calculated masking thresholds.
The following picture illustrates the principle of this processing.
- Skalenfaktor, Exponent, Mantisse, Signalspektrum,
Subband: scalefactor, exponent, mantissa, signal spectrum, subband
- Maskierungsschwelle: masking threshold
- Pegel: level (sound pressure level)
- Frequenzachse: frequency axis

With subband coders (like MPEG-1 Layer-1 and -2) the minimum of the masking threshold within a subband will
be the maximum allowed distortion. After quantization any noise
energy higher than this minimum masking threshold is not allowed.
Subband samples are stored in a floating-point-like
representation, in which the energy of the subband samples is
represented by the exponent. The signal-to-noise ratio (SNR) that has to
be achieved by quantization is calculated from the 'distance' between the
maximum level and the maximum allowed distortion in each subband. This distance
is called the SMR (Signal-to-Mask Ratio). For the noise to be imperceptible,
the quantization must provide SNR > SMR. The subsequent bit allocation routine
tries to find the minimum quantization resolution that fulfills
this condition.
The resulting bitrate can be estimated
easily: it is proportional to the sum of the areas SMR * subband bandwidth,
where only the 'positive' areas (SMR > 0) are summed. As
the figure shown above hopefully makes clear, this summed
area is only a fraction of the total visible area (which corresponds to PCM coding).
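A minimal sketch of the SNR > SMR condition and of the bitrate estimate, assuming the common rule of thumb of about 6.02 dB SNR per quantizer bit and 32 equally wide subbands at 44.1 kHz (1378.125 samples per second per subband). Real coders such as MPEG-1 Layer-2 use tabulated SNR values per quantizer instead; the function names are hypothetical.

```python
import math

DB_PER_BIT = 6.02  # rule-of-thumb SNR gain per quantizer bit


def allocate_bits(smr_db: list[float]) -> list[int]:
    """Smallest number of bits per subband with SNR = 6.02 * bits > SMR.
    Subbands with SMR <= 0 lie below the masking threshold and get 0 bits."""
    return [0 if smr <= 0.0 else math.floor(smr / DB_PER_BIT) + 1 for smr in smr_db]


def estimated_bitrate_bps(smr_db: list[float],
                          samples_per_s_per_band: float = 44100 / 32) -> float:
    """Bitrate proportional to the sum of the positive SMR areas: each subband
    contributes (SMR / 6.02) bits per sample times its sample rate."""
    return sum(max(smr, 0.0) / DB_PER_BIT * samples_per_s_per_band for smr in smr_db)
```

For example, SMRs of -3, 5, 10 and 20 dB lead to 0, 1, 2 and 4 bits; the masked subband costs nothing, which is where the saving over PCM comes from.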
-
Example of an audio coder controlled by
a psychoacoustic model:
The figure below shows both
the short-time spectrum of an audio signal (turquoise) and
the noise (pink) that was added to the signal during encoding
(MPEGplus at 128 kbit/s).
Above 17 kHz the added noise (coding error) is visibly equal to the input signal. This
means that these frequencies were assumed to be imperceptible and were therefore
not encoded. One can also easily see that the added distortion follows
the spectral envelope of the input signal (about 6 to 9 dB SNR). In lower
frequency regions the SNR is raised substantially, because less distortion
is allowed for sinusoidal signal components.
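The SNR values read off such a figure are the ratio of signal energy to coding-error energy. A minimal sketch, assuming the original and decoded samples of one frequency band are available as lists; it is not the tool used to produce the figure. A band that was not encoded at all (decoded output zero) gives 0 dB, i.e. noise equal to the signal, as seen above 17 kHz.

```python
import math


def snr_db(original: list[float], decoded: list[float]) -> float:
    """SNR of a decoded signal (or one band of it) relative to the original, in dB."""
    signal = sum(x * x for x in original)
    noise = sum((y - x) ** 2 for x, y in zip(original, decoded))
    if noise == 0.0:
        return math.inf   # lossless
    if signal == 0.0:
        return -math.inf  # pure noise added to silence
    return 10.0 * math.log10(signal / noise)
```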
For further information on psychoacoustics
please take a look at "Psychoacoustics, Facts and Models" (Zwicker, Fastl;
Springer Verlag, Berlin 1990).
MPEGplus is the project name of my audio coder.
I developed the encoder in my spare time during my studies. The motivation
was my dissatisfaction with the MP3 encoders available at the beginning of this project
(1997-1998). The encoder has now reached a very high quality that compares well with
the quality of MP3 encoders. MPEGplus decomposes the signal into narrow
frequency bands and therefore belongs to the group
of so-called subband coders.
The main advantage of this encoder is
its highly tuned psychoacoustic model, which allows pure VBR (variable bitrate) encoding.
MPEGplus aims to achieve transparent results with default parameters
on nearly all audio signals one wants to encode.
About MPEGplus in general:
Lossless coding:
-
The MPEGplus encoder uses several Huffman codes to achieve efficient lossless compression. Huffman coding is applied to quantized samples, scalefactors and frame headers.
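The MPEGplus Huffman tables themselves are not given here. As a generic illustration of the technique, the sketch below builds a Huffman code from the symbol counts of, for example, quantized subband samples; frequent values such as 0 get the shortest codewords.

```python
import heapq
from collections import Counter
from typing import Hashable, Iterable


def huffman_code(symbols: Iterable[Hashable]) -> dict:
    """Build a prefix-free Huffman code {symbol: bitstring} from symbol counts."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:  # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: partial code}).
    # The unique tie_breaker keeps heapq from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]
```

Usage: `huffman_code([0] * 10 + [1] * 4 + [-1] * 4 + [2, -2])` assigns a 1-bit codeword to 0 and longer ones to the rarer values, so these 20 samples need 38 bits instead of 60 with a fixed 3-bit code.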
Psychoacoustic Model:
-
Adaptive Noise Shaping ANS (still in German): Within the subbands, the encoder tries to give the added noise a spectral
envelope similar to the spectral shape of the
masking threshold. This allows more noise energy to be
added to ANS-quantized subbands. ANS uses filters of up to 5th order (the
filter with the largest coding gain is chosen for noise shaping).
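A common way to shape quantization noise is an error-feedback quantizer. The sketch below is a generic version, not the ANS implementation: the feedback coefficients h are taken as given (in ANS they would be chosen per subband so that the noise follows the masking threshold), and the resulting noise transfer function is N(z) = 1 - sum_k h_k z^-k.

```python
def noise_shaped_quantize(samples: list[float], step: float, h: list[float]) -> list[float]:
    """Uniform quantizer with error feedback.

    Output noise y[n] - x[n] equals e[n] - sum_k h[k] * e[n-1-k], where e is
    the error of the plain quantizer, so h shapes the noise spectrum.
    """
    errors = [0.0] * len(h)  # most recent quantization error first
    out = []
    for x in samples:
        v = x - sum(hk * ek for hk, ek in zip(h, errors))  # subtract fed-back error
        y = step * round(v / step)                          # plain uniform quantizer
        out.append(y)
        if h:
            errors = [y - v] + errors[:-1]  # shift in the newest error
    return out
```

With h = [1.0] the noise transfer is 1 - z^-1, which moves noise away from low frequencies: the accumulated noise over any run of samples stays within half a quantizer step. With h = [] the function is an ordinary quantizer.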
-
ClearVoiceDetection CVD (still
in German): During vocal activity (especially vowels) or similar harmonic
signals, a higher quantization resolution is assigned to the corresponding
spectral bins. This fixes some problems of the psychoacoustic
models when the fundamental frequency of harmonic signals changes.
-
Use of a self-measured threshold in quiet
(switch "-ltq ank" or "-ltq fil") since the ISO-listener
seems to be unable to perceive frequencies above 16 kHz - but I really
do!
-
Consideration of aliasing between subbands
when calculating the SMRs
-
Use of a non-linear spreading function that
depends on the center frequency and the sound pressure level of the
critical bands (assuming a maximum sound pressure level of 96 dB
during playback).
-
Exploitation of temporal masking with a variable
time constant (still in German)
Quality and performance:
-
Encoding with default parameters leads to
very high quality that generally exceeds the quality of well-known MP3 encoders
like Lame. MPEGplus produces almost no pre-echoes or flanging.
-
In terms of quality, MPEGplus is very consistent: I know of only very few signals on
which differences from the original audio signal are audible.
-
With the current version (encoder with StreamVersion
7, SV7) the average bitrates are about 160-170 kbit/s. Non-critical signals
drop to about 100-120 kbit/s; more critical signals may need more
than 200 kbit/s.
-
The encoder runs at 5.0x real time on a P3-800; the Winamp plugin needs less than 1% CPU load on this machine.
Download:
Sources:
- Source Code of decoder: mppdec_source (ca. 36 KB, v1.7.8c) (06/23/2001) (outdated)
- Source Code of Winamp plugin (English): in_mpp_source (ca. 54 KB, v1.7.9f) (outdated)
- Source Code of Winamp plugin (German): in_mpp_source (ca. 54 KB, v1.7.9f (deutsch)) (outdated)
-
Alternative encoder / decoder by Frank Klemm (unbelievably fast, Pentium III/K6-2/Athlon support)
Binaries:
Windows:
-
Decoder:
mppdec (ca. 50 KB, v1.7.8c) (06/23/2001) (outdated)
-
Encoder:
mppenc (ca. 78 KB, v1.7.9c) (07/10/2001) (outdated)
-
Winamp-plugin (English):
in_mpp (ca. 42 KB, v1.7.9f) (outdated)
-
Winamp-plugin (German):
in_mpp (ca. 42 KB, v1.7.9f (deutsch)) (outdated)
Linux:
-
Decoder: mppdec (ca. 28 KB, v1.7.8a) (outdated)
-
Encoder: mppenc (ca. 61 KB, v1.7.9a) (outdated)
OS2:
-
Player: mppplay (ca. 84 KB, v1.7.6) by Brian Harvard (http://silk.apana.org.au/utils.html)
Other OS:
-
Alternative decoder by Frank Klemm (unbelievably fast, Pentium III/K6-2/Athlon support)
Logos: logos.zip (ca. 78 KB)
Programs with MPEGplus support:
Playback:
Winamp via plugin (http://www.winamp.com)
MediaJukebox via plugin (http://www.musicex.com/mediajukebox/)
DeliPlayer via plugin (http://www.deliplayer.com)
XMMS via plugin (under development)
(http://sourceforge.net/projects/mpegplus/)
Batch-Encoding:
Easy CD-DA Extractor via external encoder (http://www.poikosoft.com)
CDex via external encoder (http://www.cdex.n3.net/)
EAC via external encoder
(http://www.exactaudiocopy.de)
Audiograbber via external
encoder (http://www.audiograbber.com-us.net/)
Monkey's Audio via external encoder
(http://www.monkeysaudio.com)
MediaJukebox via external encoder
(http://www.musicex.com/mediajukebox/)
WinDAC32 via script
MP+ Frontend by M.Spueler http://www.mpegplus.de
CD Copy for en-/decoding and burning
http://www.cdcopy.sk
FAQ:
When the equalizer is activated,
mp+/mpc files seem to lose high frequencies compared to mp3 files.
Why does this happen?
The option "EQ controlled by WinAMP" should be deactivated. WinAMP's built-in EQ leads to a loss of high frequencies (above 16 kHz) and causes a much higher CPU load than the mp+/mpc plugin's "fast EQ".
Even if I encode a file with the option "-bw 22050" (bandwidth set to 22.05 kHz) the mp+ file typically "cuts" off at about 18.5 kHz.
The switch "-bw x" only sets the
maximum possible bandwidth. The encoder only stores frequencies that
are assumed to be audible, and the psychoacoustic model chooses the encoded
bandwidth very carefully. Even if the encoded bandwidth does
not equal the maximum bandwidth, there should not be any audible difference.
For storing the full bandwidth one has
to set "-VBRmode 2 -minSMR x" (x>0) or simply use "-insane" (this profile
utilizes full bandwidth encoding).
What are the minimum and the maximum bitrates
supported by mp+/mpc?
The encoder can theoretically
use bitrates up to 1.32 Mbit/s, but such high bitrates do not normally occur.
The minimum bitrate is used for all-zero samples (digital silence) and is about 3.4 kbit/s.
Will forthcoming decoders/plugins support the current bitstream versions?
Of course! All forthcoming versions
will support the existing bitstreams (SV4 to SV7) as well as the file extension
".mp+" and ".mpc".
Will mp+/mpc reach much higher audio quality
with forthcoming versions or is the encoder near the final state in terms
of quality?
In terms of quality, the encoder
has almost reached its final state. Changes between versions
mostly concern bug fixes in file I/O and parsing, not the
encoder core.
The current encoder has proven its reliability
in terms of quality in several tests by many users on several hundred
tracks. Only direct A/B comparison using high-quality headphones and very
careful listening may reveal minor differences from the original on
a small number of tracks. None of the encoded files showed severe or annoying
artifacts.
Will the "-standard"-profile do for encoding
or should I use "-xtreme" or even "-insane"?
The encoder was tested intensively
and optimized in the "-standard" profile, the default setting. In this mode
the quality of the encoded tracks reaches, despite the profile's name,
a very high level!
The next profile "-xtreme" uses slightly
modified parameters to lower the quantization noise further below the masking
threshold - it offers even more headroom.
For the "-insane"-profile the parameters
are tweaked heavily. Using this mode will store the full bandwidth of the
input signal and lead to much higher bitrates than "-standard" or "-xtreme"
need. Storing the full bandwidth has no psychoacoustic justification;
it was implemented at the request of some users.
Summary: With the "-standard" profile
you will get high-quality audio files. If you want to push it a bit further,
use "-xtreme". The use of "-insane" is generally not necessary.
If I want to encode the file Track of my
favorite band.wav the message "ERROR: File not found!" is displayed.
File names containing spaces
must be enclosed in quotation marks,
e.g. mppenc -v "Track of my favorite band.wav"
Will there be a Windows ACM-codec for
mp+/mpc?
An ACM codec in the future is not out of the
question. At the moment, however, the main work
is focused on debugging and implementing new features; designing
an ACM codec has low priority.
Links:
Disclaimer:
In its judgment of 12 May 1998, the Hamburg Regional Court (Landgericht Hamburg) ruled that by placing a link one may, under certain circumstances, share responsibility for the content of the linked page. According to the court, this can only be prevented by expressly dissociating oneself from that content.
The following applies to all links listed on these pages:
I expressly emphasize that I have no influence whatsoever on the design and content of the linked pages. I therefore hereby expressly dissociate myself from all content of all linked pages on my homepage. This statement applies to all links placed on my homepage and to all content of the pages to which the banners and links lead.
Andree.Buschmann@web.de