Interesting Formal MP3-WAV Listening Test
Original German article:

Cross-examination test

The c't Readers' Listening Test: MP3 versus CD

After our controversial discussion of some fundamental issues of MP3
encoding in the March, 2000 issue (see [1]), c't asked our readers to
perform a listening test: Unbelievers should face the task of
identifying, in 'blind flight', the source of various music
selections. The results of our test surprised not only our reference
listeners; our editorial staff, too, was perplexed by some of what
they learned.

We had stirred up a hornet's nest. Long discussions on our Usenet
forum, harsh as well as constructive letters to the editor, and angry
messages to our hotline during business hours showed that the
battle between MP3 opponents and supporters was still undecided after
that test. Critics accused us of populist opinion making, argued with
great technical skill about the intricacies of HiFi/Audio
specifications, and damned MP3 compression as the work of the Devil;
others praised our enlightened explanations as worth reading and
useful to dispel all the esoteric and voodoo superstitions on matters
of audio and HiFi, or simply declared us correct with respect to the
audibility (or even inaudibility) of the effects of lossy audio
compression at different quality levels.

All this persuaded us to take an extraordinary step, which we made
public in the April, 2000 issue of c't. Our critical readers
themselves were asked to distinguish MP3-encoded samples of music from
the originals in a joint listening test. The participant with the
best hit rate would win a cash prize of 1000 DM (approx. US$600).
Initially we wanted to invite six readers, but we got so much response
(more than 300 serious applications within a week), that we decided
that twelve participants would be asked to come to Hanover. They were
screened initially by their qualifications and then a final selection
of that group was made randomly. We asked sound engineer Gernot von
Schultzendorff to participate and to be our assessor and 'reference
listener'. Mr. von Schultzendorff works for Deutsche Grammophon in Hanover,
and his primary activity is to prepare masters for the production of
classical recordings. Without wanting to anticipate the result of this
second test, we may say that the charts in the March 2000 issue are
still as valid as before, and we need not recommend that any of our
former participants visit an ear doctor.


This time our comparative listening test took place entirely in our
publishing house studio, where the damping, reflection, and resonance
conditions are comparable to those in an audiophile's living room.
Some readers may remember the studio from the time when the magazine
HIFI-Vision was sold to Heise. At that time, the ceiling had been
covered with diffusers (sand-filled plastic sacks), and had additional
damping elements on the walls, as well as a built-in filled bookshelf,
which made for dry acoustics. However, the former conditions of the
HIFI-Vision studio could not be completely reconstructed: instead of
the HiFi magazines in the bookshelves, we had to content ourselves
with telephone directories from the publisher's program to provide
effective acoustic lining. Our readers will have to forgive us for
this inaccuracy.

Our top class audio components were a pair of B&W Nautilus 803
speakers, connected to a Marantz CD-Player CD14 and a PM14 amplifier.
With the Straightwire-Pro cables and accessories, this combination
cost approximately 30,000 DM, an amount that few HiFi lovers could pay
for their hobby. The Nautilus speakers, of high-quality English
manufacture, are a first choice for studios and mastering rooms,
because of their balanced, analytic and neutral sound. Furthermore,
Axel Grell, from Sennheiser, (who is not related to our chief editor
and unofficial competitor Detlef Grell) provided us with the
electrostatic reference headphones Orpheus, along with the
corresponding tube amplifier - unfortunately only for the duration of
the test, because this exclusive limited-production set, priced at
20,000 DM, was the most expensive component we used.

Four minutes

We chose an arbitrary list of musical works (17 in all, see the list
below). From each of these a one-minute long passage would be played
to each listener from the original CD, as a reference. Then, three
samples of the same passage (at 128 kbps, at 256 kbps, and again from
the original) were to be played in a random sequence. The listeners
had to determine the correct source of the three samples and record
their answers on a questionnaire. For each piece, correctly identifying
the 128 kbps sample earned the listener one point, as did correctly
identifying the CD sample. For correctly identifying the source of all
three versions, the contestant got three points. But no points were
awarded at all if the 256 kbps sample was correctly identified but the
128 kbps and original CD samples were reversed. A maximum score of 51
points was therefore possible, and the statistical mean for pure
guessing (a consequence of the unequal weighting) was 14.1 points. Any
contestant who scored more than 14.1 would therefore have heard actual
differences in quality.
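
The guessing average quoted above can be checked with a short sketch.
It assumes (this is not stated in the article) that a guessing listener
picks each of the six possible orderings of the three labels with equal
probability; the names SOURCES and score are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations

# True sources of the three samples played after the reference
SOURCES = ("128", "256", "CD")

def score(guess):
    """Points for one piece. guess[i] is the label the listener gave
    to the sample whose true source is SOURCES[i]."""
    if guess == SOURCES:
        return 3          # all three identified correctly
    points = 0
    if guess[0] == "128":
        points += 1       # 128 kbps sample identified
    if guess[2] == "CD":
        points += 1       # CD sample identified
    return points         # only the 256 kbps sample right: 0 points

# Assumption: every ordering of the labels is equally likely when guessing
orderings = list(permutations(SOURCES))
expected_per_piece = Fraction(sum(score(g) for g in orderings), len(orderings))

PIECES = 17
expected_total = expected_per_piece * PIECES
max_total = 3 * PIECES

print("expected points per piece:", expected_per_piece)
print("expected total when guessing:", float(expected_total))
print("maximum total:", max_total)
```

Under that assumption the expectation is 5/6 of a point per piece, or
about 14.17 points over 17 pieces, which agrees with the 14.1 given in
the article, and the maximum is 51.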

In order to eliminate variations that could be caused by different
D-to-A characteristics between the CD and MP3 players, we had the test
samples encoded with MusicMatch 4.4 for Windows in joint-stereo,
converted into AIFF format with a Power Mac G3 for the Apple QuickTime
Player, and then burned onto a single Audio-CD in a random sequence
along with the extracted CD Audio files.

Listening Test

After the first half-hour of intense listening, some of the
contestants already wanted to quit. 'A lottery', was a comment heard
many times. Many of the listeners were surprised at how good an MP3
recording can sound through the outstanding Marantz player. People
chattered about technical issues such as phase relationships, the
influence of the (imperfect) room acoustics and their personal
listening habits. They argued about the importance of good cables and
praised the superiority of analog recordings on vinyl (which
unfortunately were not available for the listening test).

During the pause and after the official common part of the test,
several doubting contestants were allowed to use the Orpheus
headphones to help listen to and classify the individual pieces. They
were also then permitted to jump from one passage to another in direct
one-to-one comparisons between the individual versions, which
obviously could not be done in the common listening test.

First Place Winner

The unofficial winner, with 26 points in total, was our 'reference
listener' Gernot von Schultzendorff, who, after over an hour of
intensive listening, had to admit he was exhausted. 'That was hard. It
seemed to me almost as if some of the 256 kbps samples sounded
somewhat rounder and more pleasing than the originals from the CD. One
cannot let oneself be distracted by those characteristics', he said.
And, in fact, people often incorrectly chose the 256 kbps sample as
the original CD version.

Among the invited readers, Mirko Eßling from Schopp, a student
electronics developer, won first place. According to his own statement
on his application, he 'can predict the sound of an audio circuit by
the mere sight of it'. He won with 22 points. Given the test
conditions of foreign acoustics, performance stress, unfamiliar
equipment, and sub-optimal listening conditions, he achieved an
absolutely respectable score that garnered him the first place prize
of our competition: 1000 DM, in cash.

We were somewhat surprised when we found out about his musical
preferences. 'In fact I cheated a little in my application. I really
do have classical piano training, but as an active amateur musician I
prefer to play punk rock', he said. Prior to the test, he practiced
intensely by listening to different kinds of MP3s. He had a final
success rate of 90% with 128 kbps encoding, and that despite a severe
handicap. 'Since an accident involving an explosion I can hear on my
left-side only up to 8 kHz, and on the right side I had a stubborn
ringing until recently. However, I can catch the typical flanging
effects of the MP3 filters and maybe do that better than my
competitors because of my hearing impairment.'

There may be some truth in this. The psycho-acoustic model behind
MP3 encoding is based on a listener with normal hearing.
Someone who can perceive frequencies up to only 8 kHz will not hear a
bright cymbal or triangle crash, but will probably hear the
normalization noise of the filters in the lower frequencies, because
in this case the noise will not be appropriately masked by high
frequency sounds. Sharp notch filters, as implemented in the MP3
decoders, can generate a flanging (or jet effect) when the signal
changes rapidly.

So it isn't those with perfect hearing, but those whose hearing
deviates strongly from normal, who seem to be especially sensitive to
MP3 artifacts. Psycho-acoustic masking effects are the basis of the MP3
encoding algorithm (the alarm clock goes on ticking even when it rings
[but the algorithm doesn't encode the ticking because it will be
masked by the ringing anyway. G.]), and the algorithm also relies on
such effects for the normalization noise it generates, which is
generally supposed to be masked by the useful signal. But when a
hearing impairment causes this noise to surface, it is much easier to
detect.

A Shared Second Place

With 20 points each, Jochen Kähler and Tom Weidner from Nuremberg both
achieved second place, followed by Martin Eisenmann from Hamburg. Mr.
Eisenmann owns the big B&W Nautilus 801, and because of his 'deep
appreciation of music and desire to accept nothing but the best' he
spent 40,000 DM on his stereo system. Tom Weidner is an engineer who
develops hearing aids, works on audio signal processing algorithms,
and is used to participating 'in complex sound tests, mostly dealing
with finding artifacts and sound differences'. While previously
employed at the Fraunhofer IIS in Erlangen, Jochen Kähler had the
opportunity to work on Advanced Audio Coding and other MP3 successors.

Stefan Weiler from Hambühren, blind from birth and an ardent listener
of classical, jazz and of "serious light music", possesses perfect
pitch and has been actively involved in the development of the
'Kunstkopf' recording apparatus [a recording device in the form of a
human head with microphones in the place of the ears, used to obtain a
more realistic stereo effect in recording (G)]. Because of a mistake
in communicating his choice to his companion, he came in at an
undistinguished fourth place. If he had not inadvertently switched the
Brahms samples, he too would have amassed 20 points. As a consolation
we have promised him the opportunity to work on a campaign we are
launching for the visually impaired. Weiler
identified MP3 encodings chiefly by the lack of "spatiality of the
rustle in the silent passages", as he explained.

From a statistical point of view

It's true that the data we collected do not support watertight
conclusions, but they do provide interesting insights. We wanted to
find out which pieces of music were the hardest to distinguish from
the original and which ones were the easiest for the listeners to
detect. From the simple sum of all the scores obtained by all
participants for each title we can tell whether it was easy or
difficult for participants to distinguish the original and the
different MP3 encodings (see the scores in the results table).
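
As a rough illustration of that per-title sum, the sketch below adds up
the 14 listener scores for each piece. The numbers are transcribed from
the results table at the end of the article (listeners a to n, left to
right); the names SCORES, totals and ranked are made up for this
example.

```python
# Scores per title, transcribed from the results table (listeners a..n)
SCORES = {
    "Chic - Jusagroove":                     [3, 3, 3, 0, 3, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Brahms - Ungarische Tänze":             [1, 1, 1, 0, 3, 0, 0, 0, 1, 0, 3, 3, 1, 0],
    "Donald Fagen - IGY":                    [1, 1, 0, 0, 0, 1, 1, 3, 1, 0, 0, 1, 0, 3],
    "Anne S. von Otter - I'm a Stranger":    [0, 3, 0, 0, 0, 0, 0, 3, 0, 1, 0, 3, 3, 3],
    "Peter Gabriel - Steam":                 [3, 3, 0, 1, 0, 3, 1, 0, 3, 1, 0, 3, 3, 1],
    "Leonard Cohen - First We Take Manh.":   [1, 3, 0, 0, 0, 1, 3, 1, 3, 3, 0, 3, 0, 0],
    "Orff - Carmina/Gnomus":                 [1, 3, 0, 1, 0, 1, 0, 1, 3, 3, 1, 1, 1, 3],
    "Shostakovitch - Jazz/2 March":          [0, 1, 1, 3, 3, 1, 1, 1, 0, 0, 1, 0, 0, 3],
    "Bill Withers - Ain't No Sunshine":      [1, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "Adrian Legg - Norah Hanleys Waltz":     [0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 1],
    "Liszt - Après une lecture du Dante":    [1, 0, 0, 1, 3, 0, 1, 0, 0, 3, 1, 0, 0, 0],
    "Mussorgsky - Bilder einer Ausstellung": [1, 1, 0, 3, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    "Sara K. - Tell Me I'm Not Dreamin":     [3, 3, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
    "Grieg - Arabischer Tanz":               [0, 1, 3, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "Marla Glen - The Cost Of Freedom":      [1, 0, 1, 0, 1, 1, 3, 0, 1, 0, 3, 1, 0, 3],
    "Anne S. von Otter - Quello di Tito":    [0, 0, 0, 3, 0, 3, 0, 0, 0, 1, 0, 0, 1, 0],
    "Clair Marlo - All For The Feeling":     [3, 3, 3, 3, 0, 3, 3, 3, 3, 0, 3, 1, 1, 0],
}

# Sum over all listeners: a high total means the piece was easy to sort out
totals = {title: sum(row) for title, row in SCORES.items()}

# Easiest pieces first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for title, total in ranked:
    print(f"{total:3d}  {title}")
```

With these numbers 'Jusagroove' comes out on top with 31 points, while
the Grieg piece collects only 9, in line with the remarks in the text.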

By no means do classical recordings always have an advantage in this
respect, and in the case of some pieces, participants were
consistently wrong in their choices. For example, the Arabian Dance
from Edvard Grieg's Peer Gynt, encoded at 128 kbps, was preferred over the
original by more than half of our participants. The compression may
have eliminated some small weaknesses of the recording, perhaps a
roughness of the woodwind players. On the other hand, Chic's
'Jusagroove', a very dynamic and tight funk, was correctly identified
by most listeners.

In order to further understand this phenomenon we did some additional
investigation of the test results. We were particularly interested in
the causes of the difficulties. Did the testers have problems
distinguishing high-quality MP3s at 256 kbps from lower quality ones
at 128 kbps, or did the MP3s sound better to them than the original?

To determine this, we slightly modified the evaluation procedure.
Going by common prejudices about MP3 quality, one would expect that
128 kbps sounds the worst, that 256 kbps would be preferred next, and that
the original Audio-CD sample delivers the best sound. So, we re-scored
the test results; every test sample that was identified as 128 kbps
received one point, a sample identified as 256 kbps garnered two
points, and a sample identified as the original CD got three points.
This was done for each sample regardless of whether the listener's
identification of the sample source was correct or not. If a listener
could not hear any difference between any of the three sample
versions, we assessed all of them as 'CD quality' and gave each sample
three points.

Then we added up all the points for each sample over all listeners. If
all 14 people had always guessed correctly, then each of the pieces of
music would show the same distribution for its samples: 14 points for
a sample at 128 kbps, 28 points for a 256 kbps sample, and 42 points
for the original CD. But a completely different picture emerged. For
those pieces which our listeners most frequently guessed wrong, the
MP3 encoded samples were judged in general to be superior to the CD.
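
The re-scoring described above can be sketched as follows. This is only
an illustration of the rule as stated in the text, not the editors' own
procedure; the names POINTS and rescore, and the use of None for 'no
difference heard', are assumptions made for the example.

```python
from collections import Counter

# Points depend only on the label the listener assigned, not on correctness
POINTS = {"128": 1, "256": 2, "CD": 3}

def rescore(labels):
    """labels maps each true source to the label the listener gave it.
    None stands for 'no difference heard' and counts as CD quality."""
    return {
        true_source: POINTS["CD"] if label is None else POINTS[label]
        for true_source, label in labels.items()
    }

# One piece, 14 listeners who all identify every sample correctly
perfect = {"128": "128", "256": "256", "CD": "CD"}
totals = Counter()
for _ in range(14):
    totals.update(rescore(perfect))

print(dict(totals))

# A listener who takes the 128 kbps sample for the CD and vice versa
swapped = rescore({"128": "CD", "256": "256", "CD": "128"})
print(swapped)
```

For 14 perfect listeners this gives the 14, 28 and 42 points mentioned
in the text. The swapped case shows how a 128 kbps sample can end up
with more points than the CD under this scheme.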

Our biggest surprise, however, came when we added up all the points
achieved by all of the samples at each quality level: 128 kbps, 256
kbps, and CD. The samples at 256 kbps and the original CD samples
achieved precisely the same score of 501 points. The 128 kbps samples
clearly scored lower, with a total of 439 points. For those interested
in statistics, the difference between 501 and 439 is statistically
significant, with a probability of error of one percent (in
scientific investigations, differences are considered significant when
the probability of error is 5% or less). And between the 256
kbps and CD samples, which got exactly the same score, there was, of
course, no statistical difference.

Summing Up

In plain language, this means that our musically trained test
listeners could reliably distinguish the poorer quality MP3s at 128
kbps from either of the other higher-quality samples.
But when deciding between 256 kbps encoded MP3s and the original CD,
they could not, averaged over all the pieces, tell any difference. The
testers took the 256 kbps samples for the CD just as often as they
took the original CD samples themselves.

The fact that some of the 128 kbps samples were consistently judged to
be better than their original CD counterparts by this skilled group -
even by the best among them - stunned our editor (who participated in
the test although his results were not included in the evaluation, and
had to confess that he got only 15 points). It seems safe to declare
that there is no musical genre that is especially well-suited or
ill-suited to compression. Apparently quite different factors,
related to the technical aspects of the recording, are what later
adversely affect the results at low bit rates.

This article will not end the ongoing debate of whether the use of MP3
compression is a reasonable or unreasonable procedure. Audiophiles
who are brand- and status-conscious will never listen to MP3s, no
matter how many tests may prove that the sound experience is
equivalent in both cases. Skeptics ("They are all
sissies at c't; I would certainly have heard the difference") should
get encoders and CD burners and then submit themselves - perhaps even
using the same pieces and under similar conditions - to their own



[1] Carsten Meyer, Doppelt blind, MP3 gegen CD: Der Hörtest [Double
blind, MP3 versus CD: The Listening Test], c't March, 2000, p. 144

Results of Readers' Listening Test
(stat. random average: 11 points)

Test Listener                                    a (b)   c   d   e   f   g   h   i   j   k   l   m (n)
Chic - Jusagroove                                3   3   3   0   3   1   1   1   1   3   3   3   3   3
Brahms - Ungarische Tänze                        1   1   1   0   3   0   0   0   1   0   3   3   1   0
Donald Fagen - IGY                               1   1   0   0   0   1   1   3   1   0   0   1   0   3
Anne S. von Otter - I'm a Stranger Here...       0   3   0   0   0   0   0   3   0   1   0   3   3   3
Peter Gabriel - Steam                            3   3   0   1   0   3   1   0   3   1   0   3   3   1
Leonard Cohen - First We Take Manhattan          1   3   0   0   0   1   3   1   3   3   0   3   0   0
Orff - Carmina/Gnomus                            1   3   0   1   0   1   0   1   3   3   1   1   1   3
Shostakovitch - Jazz/2 March                     0   1   1   3   3   1   1   1   0   0   1   0   0   3
Bill Withers - Ain't No Sunshine                 1   0   3   1   0   0   0   0   0   0   0   1   0   1
Adrian Legg - Norah Hanleys Waltz                0   0   3   0   0   0   0   0   0   1   0   0   3   1
Liszt - Après une lecture du Dante               1   0   0   1   3   0   1   0   0   3   1   0   0   0
Mussorgsky - Bilder einer Ausstellung            1   1   0   3   0   0   0   0   0   0   1   1   0   1
Sara K. - Tell Me I'm Not Dreamin                3   3   1   1   0   0   1   1   1   0   0   1   1   1
Grieg - Arabischer Tanz                          0   1   3   3   1   0   0   1   0   0   0   0   0   0
Marla Glen - The Cost Of Freedom                 1   0   1   0   1   1   3   0   1   0   3   1   0   3
Anne S. von Otter - Quello di Tito è il volto    0   0   0   3   0   3   0   0   0   1   0   0   1   0
Clair Marlo - All For The Feeling                3   3   3   3   0   3   3   3   3   0   3   1   1   0
Points/Listener                                 20  26  19  20  14  15  15  15  17  16  16  22  17  23

  • ssamadhi97
Reply #1
wow, that once-famous article from June 2000.. OLD!

It's almost some kind of "Evergreen" among articles on lossy audio listening tests already 
A riddle is a short sword attached to the next 2000 years.