GstPEAQ: PEAQ done right, allegedly || Multiformat correlation

2023-08-10 14:20:35

GstPEAQ is the new (2015) PEAQ implementation that, for the first time, provides an open-source implementation of PEAQ's advanced model. I figured it would be good to repeat the EAqual test and see what has changed.

Results


Correlation of each column with Subj:

ODG Basic      0.920218616535201
DI Basic       0.92386326991674
ODG Advanced   0.935034941758886
DI Advanced    0.937994140557553

Materials

I tried recovering AACTest from the Wayback Machine, but too many torrents were lost. The ones that were available had too few seeders anyway.

Instead, data was downloaded from the 2014 multiformat listening test using wget -v -r -c -l 1 https://listening-test.coresv.net/results.htm -D listening-test.coresv.net -R '*.zip'.

A total of 184 MB was downloaded. Of this, 181 MB is the tracks directory.

Software versions:

ffmpeg 6.0-5, from Debian unstable package version 7:6.0-5.
GstPEAQ 0.6.1, freshly compiled

Methods
All audio was decoded to WAV using:

for i in *.{ogg,mp4,mp3,opus,mp4}; do basename=${i%.*}; ffmpeg -i "$i" -- "../decode/$basename".wav; done

Now that's 779 MB more stuff.

All reference tracks were moved to a "ref" directory.

For all files, four values were obtained: the basic mode ODG, DI, and the advanced mode ODG and DI. In a perfect world, ODG + 5 should be the subjective score. We will see if that's the case.

I used the "pea.sh" script to generate a CSV. The CSV was imported into LibreOffice for formatting and statistics (correlation, coefficient of determination).
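For anyone wanting to rebuild a similar CSV: the sketch below is not the pea.sh used here, just one way the per-file values could be collected. It assumes the tool prints lines of the form "Objective Difference Grade: <value>" and "Distortion Index: <value>"; check that against the output of your own build. The function name parse_peaq is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, NOT the pea.sh from this post.
# ASSUMPTION: the tool's text output contains the two lines
#   Objective Difference Grade: <number>
#   Distortion Index: <number>
# parse_peaq reads that text on stdin and prints "ODG,DI" as one CSV fragment.
parse_peaq() {
    awk -F': *' '
        /^Objective Difference Grade:/ { odg = $2 }
        /^Distortion Index:/           { di  = $2 }
        END { print odg "," di }
    '
}

# Canned text standing in for real tool output:
printf 'Objective Difference Grade: -2.007\nDistortion Index: -0.557\n' | parse_peaq
# prints: -2.007,-0.557
```

In a real run, the printf would be replaced by the actual GstPEAQ invocation, once per reference/test pair and once per mode, with the file name prepended to each row.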

Discussion

Between ODG Basic and ODG Advanced, the main difference occurs in the mid-to-high quality range. Midrange values do not seem to match the trend line either way, but the Advanced model clearly separates higher-quality samples better. As expected, Advanced has a better correlation with Subjective.

The "DI" metric is only included for curiosity -- the main metric is still ODG. Unexpectedly, DI seems to score better than ODG on correlation -- this could be due to it being immune to saturation at extreme values (DAFX-2). DI Advanced especially comes close to a slope of 1, huh!
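The saturation point can be made concrete. As far as I recall from ITU-R BS.1387, the ODG is just the DI pushed through a sigmoid and rescaled: ODG = bmin + (bmax - bmin) / (1 + e^(-DI)), with bmin = -3.98 and bmax = 0.22. A minimal sketch under that assumption (check the constants against the standard before relying on them; di_to_odg is a made-up name):

```shell
#!/bin/sh
# di_to_odg DI: map a Distortion Index to an Objective Difference Grade.
# ASSUMPTION: the BS.1387 output mapping with bmin = -3.98, bmax = 0.22.
di_to_odg() {
    awk -v di="$1" 'BEGIN {
        bmin = -3.98; bmax = 0.22
        printf "%.2f\n", bmin + (bmax - bmin) / (1 + exp(-di))
    }'
}

di_to_odg 0      # prints -1.88
di_to_odg 10     # prints 0.22
di_to_odg -10    # prints -3.98
```

Large positive or negative DI values all collapse to nearly the same ODG, while DI itself keeps moving, which is one plausible reason for DI correlating slightly better.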

There are some interesting outliers in the Subj vs ODG Advanced series. That could be worth looking into.

The same data pipeline should be applicable to other objective metrics such as Google's ViSQOL. ViSQOL has one additional parameter to pass, however: whether the sample to be compared is speech. (Well, we should not be feeding speech to PEAQ anyway. That's PESQ and POLQA's job.)

The much elevated correlation compared to the old EAqual is likely due to a wider range of sample qualities involved. The basic ODG calculation should be essentially the same as EAqual, after all. (Yes, I checked, CORREL() and PEARSON() are giving the same results.)
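The CORREL()/PEARSON() check can also be reproduced outside LibreOffice. A minimal sketch in POSIX shell and awk, assuming a comma-separated file with one header row and no quoted fields; the function name pearson, the file name results.csv and the column positions are hypothetical.

```shell
#!/bin/sh
# pearson FILE XCOL YCOL
# Print the Pearson correlation coefficient between two numeric CSV columns
# (1-based column indices). The first row is treated as a header and skipped.
pearson() {
    awk -F',' -v x="$2" -v y="$3" '
        NR > 1 {
            n++
            sx  += $x;      sy  += $y
            sxx += $x * $x; syy += $y * $y
            sxy += $x * $y
        }
        END {
            num = n * sxy - sx * sy
            den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
            printf "%.6f\n", num / den
        }
    ' "$1"
}

# Hypothetical usage: column 2 = Subj, column 3 = ODG Basic
# pearson results.csv 2 3
```

Squaring the printed value gives the coefficient of determination for a simple linear fit.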

Reply #1 – 2023-08-12 11:26:03

I looked through your post and found it very interesting.
My thought is that you should probably use the fully pre-decoded wavs instead, since they are already compensated for the delay that QAAC and LAME introduce (but not for the volume; volume is equalized by a built-in function of ABC/HR for Java, sorry).

sample01\sample01\sound\SinceAlways.16b48k.1.wav // SinceAlways.qaac.cvbr96k // 1088 samples delay compensated
sample01\sample01\sound\SinceAlways.16b48k.2.wav // SinceAlways.opus1.1.b96k
sample01\sample01\sound\SinceAlways.16b48k.3.wav // SinceAlways.aotuv.q2.2
sample01\sample01\sound\SinceAlways.16b48k.4.wav // SinceAlways.lame3.99.5.v5 // 1105 samples delay compensated
sample01\sample01\sound\SinceAlways.16b48k.5.wav // SinceAlways.faac.abr96k
sample01\sample01\sound\SinceAlways.16b48k.6.wav // SinceAlways.faac.vbr30q

These are the commands used to encode, decode and sample-rate convert the 44.1kHz lossless "%InputWavFile%" to each decoded "%RawFile%.16b48k.n.wav".

bin\sox_v1441 "%InputWavFile%" -b 16 "%RawFile%.16b48k.wav" gain -1.5 rate -v 48000

bin\qaac_2.41\qaac --cvbr 96 -o "%OutputFile%%AACName%" "%InputWavFile%"
bin\faad -b 2 -o "%TemporaryFile%%AACName%.24b44k.wav" "%OutputFile%%AACName%"
bin\sox_v1441 "%TemporaryFile%%AACName%.24b44k.wav" -b 16 "%RawFile%.16b48k.1.wav" trim 1088s gain -1.5 rate -v 48000

bin\opus-tools-0.1.9-win32\opusenc --bitrate 96 "%InputWavFile%" "%OutputFile%%OpusName%"
bin\opus-tools-0.1.9-win32\opusdec --rate 48000 --float --quiet "%OutputFile%%OpusName%" "%TemporaryFile%%OpusName%.flo48k.wav"
bin\sox_v1441 "%TemporaryFile%%OpusName%.flo48k.wav" -b 16 -e signed "%RawFile%.16b48k.2.wav" gain -1.5

bin\venc603 -q2.2 "%InputWavFile%" "%OutputFile%%OggName%"
bin\oggdec -q -b 3 "%OutputFile%%OggName%" --wavout "%TemporaryFile%%OggName%.24b44k.wav"
bin\sox_v1441 "%TemporaryFile%%OggName%.24b44k.wav" -b 16 "%RawFile%.16b48k.3.wav" gain -1.5 rate -v 48000

bin\lame3.99.5\lame -V5 -S "%InputWavFile%" "%OutputFile%%MP3Name%"
bin\madplay -b 24 -o "%TemporaryFile%%MP3Name%.24b44k.wav" "%OutputFile%%MP3Name%"
bin\sox_v1441 "%TemporaryFile%%MP3Name%.24b44k.wav" -b 16 "%RawFile%.16b48k.4.wav" trim 1105s gain -1.5 rate -v 48000

bin\faac-1.28-mod\faac -b 96 -o "%OutputFile%%FAACName%" "%InputWavFile%"
bin\faad -b 2 -q -o "%TemporaryFile%%FAACName%.24b44k.wav" "%OutputFile%%FAACName%"
bin\sox_v1441 "%TemporaryFile%%FAACName%.24b44k.wav" -b 16 "%RawFile%.16b48k.5.wav" trim 0s gain -1.5 rate -v 48000

bin\faac-1.28-mod\faac -q 30 -o "%OutputFile%%FAACLName%" "%InputWavFile%"
bin\faad -b 2 -q -o "%TemporaryFile%%FAACLName%.24b44k.wav" "%OutputFile%%FAACLName%"
bin\sox_v1441 "%TemporaryFile%%FAACLName%.24b44k.wav" -b 16 "%RawFile%.16b48k.6.wav" trim 0s gain -1.5 rate -v 48000

Reply #2 – 2023-08-16 09:02:06

Quote from: Kamedo2 on 2023-08-12 11:26:03

My thought is that you should probably use the fully pre-decoded wavs instead, since they are already compensated for the delay that QAAC and LAME introduce (but not for the volume; volume is equalized by a built-in function of ABC/HR for Java, sorry).

Welp. I didn't get the wavs because they would be a bigger download. Judging from how the listening-tests site does not want me to browse the index listings of, say, https://listening-test.coresv.net/tracks, I just assumed they really care about bandwidth.

There are indeed some eyebrow-raising file size differences among the wavs. I assumed that GstPEAQ would take care of it -- it has some thresholding function to ignore silence -- so I wasn't too worried. But yes, it's good to at least grab some of the WAVs and verify that the outputs are the same or close enough.


-rw-r--r-- 1 root root 1383078 Aug 10 19:42  12-German-male-speech.441.aotuv.q2.2.wav
-rw-r--r-- 1 root root 1384526 Aug 10 19:44  12-German-male-speech.441.faac.abr96k.wav
-rw-r--r-- 1 root root 1384526 Aug 10 19:44  12-German-male-speech.441.faac.vbr30q.wav
-rw-r--r-- 1 root root 1383078 Aug 10 19:43  12-German-male-speech.441.lame3.99.5.v5.wav
-rw-r--r-- 1 root root 1505386 Aug 10 19:43  12-German-male-speech.441.opus1.1.b96k.wav
-rw-r--r-- 1 root root 1384270 Aug 10 19:44  12-German-male-speech.441.qaac.cvbr96k.wav

For the sample above, we get four entirely distinct file sizes. ffprobe -i says:

1505386 is 48000 Hz, 00:00:07.84
1383078 is 44100 Hz, 00:00:07.84
1384270 is 44100 Hz, 00:00:07.85
1384526 is 44100 Hz, 00:00:07.85
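The size gaps translate directly into sample frames. A quick sketch, assuming all of the 44100 Hz files are 16-bit stereo PCM (4 bytes per frame, which fits the durations above) with identical header sizes; frame_diff is a made-up helper:

```shell
#!/bin/sh
# frame_diff SIZE_A SIZE_B [BYTES_PER_FRAME]
# Print how many more sample frames file A holds than file B.
# ASSUMPTION: both files have the same header size and the same PCM layout
# (default 4 bytes per frame = 16-bit stereo).
frame_diff() {
    echo $(( ($1 - $2) / ${3:-4} ))
}

frame_diff 1384270 1383078   # qaac vs aotuv: prints 298
frame_diff 1384526 1383078   # faac vs aotuv: prints 362
```

So, under those assumptions, the qaac and faac decodes carry a few hundred extra frames compared to the aotuv decode of the same track.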

Welp. I vaguely remember reading in the LAME manual that a responsible decoder will trim off the delay based on some metadata field, but ffmpeg here clearly isn't doing that. As for opus doing 48000: that's how the format works. Opusdec gives 44100 because it looks for a metadata field and does a downsample -- possibly losing some quality there. All these are possible causes for deviation from the actual tested WAV set -- I was just too consumed by my laziness and decided to ffmpeg everything.
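If the ffmpeg decodes were to be aligned by hand, something like the following dry run could generate the sox commands. It only prints them. The delays are the ones Kamedo2 quotes for the faad and madplay decodes (1088 samples for qaac, 1105 for lame), so they are an assumption here: ffmpeg may already strip part of the delay and the right values for its output could differ. The helper names and the output path are made up.

```shell
#!/bin/sh
# delay_for NAME: leading samples to trim, chosen from the encoder tag in the
# file name. ASSUMPTION: values taken from the faad/madplay pipeline quoted in
# Reply #1; they are not verified for ffmpeg-decoded files.
delay_for() {
    case "$1" in
        *qaac*) echo 1088 ;;
        *lame*) echo 1105 ;;
        *)      echo 0 ;;
    esac
}

# trim_cmd IN OUT: print (do not run) a sox command that drops the delay.
trim_cmd() {
    printf 'sox "%s" "%s" trim %ss\n' "$1" "$2" "$(delay_for "$1")"
}

trim_cmd 12-German-male-speech.441.qaac.cvbr96k.wav trimmed/qaac.wav
# prints: sox "12-German-male-speech.441.qaac.cvbr96k.wav" "trimmed/qaac.wav" trim 1088s
```

Comparing PEAQ scores before and after such a trim on a handful of files would show whether the misalignment matters at all.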

Reply #3 – 2023-08-16 09:27:01

In other news, I came across a work titled Can we still use PEAQ? A Performance Analysis of the ITU Standard for the Objective Assessment of Perceived Audio Quality. It does two things:

Compares MUSHRA results on the USAC VT1 dataset against PEAQ, PEMO-Q, and ViSQOL-Audio. Here we see PEAQ performing worse than the others, with a not-so-good Pearson quite similar to the old AAC test! We also see DI being better than ODG in terms of absolute error.
Retrains the final stage of PEAQ (a neural network mashing "model output variables" describing errors into the final DI and ODG scores) using the human dataset. PEAQ ends up better than everyone else. Some bootstrapping is used to eliminate the element of chance in choosing the training vs testing set.

The bad news is that they aren't releasing the retrained weights, so you gotta do what they did (play with MATLAB) to replicate this magick. Hah, hah, sad.
