
Topic: Testing multi-core optimized encoders for next BonkEnc release

  • enzo
  • Developer
Hi, I am currently testing encoders built and optimized using the Intel Compiler 11.1. I plan to include them in BonkEnc 1.0.13.

ICL 11 supports automatic parallelization and thus allows encoders like LAME or Vorbis to make use of multi-core processors.

Note that this approach to multi-core support does not introduce additional quality reduction into the encoding process. Other approaches encode multiple frames in parallel, but have to disable certain encoder features in exchange (e.g. LAME-MT, which disabled the bit reservoir). However, automatic parallelization does not scale as well as those other methods. Not all parts of the algorithms can be parallelized, so we will not see four times faster encoding on a quad core with this approach.
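The scaling limit described here is what Amdahl's law expresses: if only a fraction p of the work can run in parallel, the best-case speedup on n cores is 1 / ((1 - p) + p / n). The following is a minimal Python sketch of that formula. The fractions used are hypothetical and chosen only for illustration; the thread gives no measured value of p for any encoder.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical parallel fractions, not measurements from the encoders.
for p in (0.5, 0.8, 0.95):
    print(f"p={p:.2f}: at most {amdahl_speedup(p, 4):.2f}x on 4 cores")
```

Even with 80% of the work parallelized, the bound on a quad core is 2.5x rather than 4x.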

On my desktop system (Windows XP Pro x64, Phenom II X4 920) I still measured great performance improvements when ripping and converting. On a laptop (Windows Vista Home Premium, Athlon64 X2 QL-64) the improvement was not as impressive but still significant. Curiously, aoTuV was extremely slow on the Phenom and twice as fast on the actually inferior laptop. Here are the numbers (encoding time for a 44 min CD):
[blockquote]
System: Windows XP Pro x64, AMD Phenom II X4 920

Input  Output  GCC   ICL   ICL (single core)
-------------------------------------------------
CD     LAME    5:45  3:27  4:01
CD     Vorbis  9:26  2:39  7:27
CD     FLAC    2:25  2:22

WAV    FLAC    1:18  0:41

System: Windows Vista Home Premium, AMD Athlon64 X2 QL-64

Input  Output  GCC   ICL
-------------------------------------------------
CD     LAME    5:33  5:26
CD     Vorbis  4:48  3:50

WAV    FLAC    1:30  1:21
[/blockquote]
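To compare such timings across systems, it helps to turn the m:ss values into speedup factors. This is a small Python sketch that does so for the Phenom II numbers above. The helper names are made up for this example, and the times are copied from the table as GCC and ICL columns.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' string such as '5:45' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(baseline: str, optimized: str) -> float:
    """Ratio of baseline time to optimized time (above 1 means faster)."""
    return to_seconds(baseline) / to_seconds(optimized)


# (GCC, ICL) encoding times on the Phenom II X4 920, copied from the table.
phenom = {
    "CD -> LAME": ("5:45", "3:27"),
    "CD -> Vorbis": ("9:26", "2:39"),
    "CD -> FLAC": ("2:25", "2:22"),
    "WAV -> FLAC": ("1:18", "0:41"),
}

for job, (gcc, icl) in phenom.items():
    print(f"{job}: {speedup(gcc, icl):.2f}x")
```

Posting both the raw times and the ratio makes results from different machines easier to compare.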
I am using the following compiler options to build the encoders: /Q3 /Qparallel /Qipo /Qprof-use /arch:IA32 /QaxSSE4.1,SSSE3,SSE3,SSE2

You can find a preview release of BonkEnc 1.0.13 at bonkenc.org/updates/bonkenc-1.0.13-pre.zip. It includes ICL 11.1 compiles of LAME 3.98.3, FLAC 1.2.1, Vorbis aoTuV b5.7, FAAC 1.28 and Bonk 0.12.

It would be great to have some speed comparisons for other systems. So if you could compare BonkEnc 1.0.12 vs. this prerelease and post the numbers here, I would be very grateful.

What do you think about this idea in general? Are the compiler options safe or should I exclude any specific optimizations? Any other things you would like to mention?

Looking forward to reading your suggestions and opinions on this.

Robert

  • saratoga
Reply #1
How much do these settings change the output of the encoders?  Do you get the same speed up when the source is WAV?

  • Fandango
Reply #2
Quote
It would be great to have some speed comparisons for other systems. So if you could compare BonkEnc 1.0.12 vs. this prerelease and post the numbers here, I would be very grateful.

...is not so good when using a different set of audio files.

Also, there's a dll missing: libiomp5md.dll.

General question: aren't there any license-free music tracks out there we could use and redistribute for testing purposes? Nine Inch Nails comes to my mind... but that's just one genre unless the many remixes are license free, too. Of course they should be checked for lossy compression artefacts first.

Quote
What do you think about this idea in general?

It's great, I've been waiting for something like this since the day I got a quad core CPU.

Quote
Are the compiler options safe or should I exclude any specific optimizations?

I don't know.

Quote
Any other things you would like to mention?

What I think is important about MT transcoding: every system is different, and so is every codec implementation and each of their encoding/decoding settings.

Whether the CPU or the HDD is the bottleneck depends on all three. Ideally we want the CPU to be the bottleneck, but a slow HDD can actually lead to lower overall CPU usage when using just one more thread. For example, depending on the WavPack encoder settings, I can use either 3 or all 4 cores. If I don't use the extra switches, I can only use 3 cores, or else my HDD turns into a sub-machine gun and the overall CPU usage drops below 75%. Well, that's how it was with my old drive.

The actual user has to know his/her system's bottlenecks; the coder of the transcoding app can't possibly know them. And in order to get the most out of it for every user, i.e. the fastest batch transcode, there should be some automation to determine how many threads to use with the current codecs and their current settings.

Therefore an "easy" benchmark in an encoder suite would be nice. It could determine the overall bandwidth the current encoder settings need, and then let the user compare that to a simple I/O benchmark of his HDD, or the I/O output of 1 to n decoding threads using selected input files.
  • Last Edit: 29 June, 2009, 07:25:28 PM by Fandango

  • Fandango
Reply #3
Quote
How much do these settings change the output of the encoders?

Output should be identical.

Quote
Do you get the same speed up when the source is WAV?

Sure, because this automated parallelization seems to be applied only to the encoders.

Uh, yeah. What about MT support in the decoders?
  • Last Edit: 29 June, 2009, 07:33:48 PM by Fandango

  • enzo
  • Developer
Reply #4
Thank you for your answers so far!

I noticed I did not mention the encoder settings I used. All tests so far were made with BonkEnc's default settings.
[blockquote]LAME: Preset Standard
Vorbis: VBR, Q6.0
FLAC: -l 8 -r 3,3 -b 4608 -m -A "tukey(0.5)"[/blockquote]
Quote
Also, there's a dll missing: libiomp5md.dll.

Just added it. Thanks for the hint!

Quote
General question: aren't there any license-free music tracks out there we could use and redistribute for testing purposes? Nine Inch Nails comes to my mind... but that's just one genre unless the many remixes are license free, too. Of course they should be checked for lossy compression artefacts first.

Yes, it's a good idea to find a common selection of music to test with first.

Nine Inch Nails' free album would be a good candidate, I guess, as it's available in CD-quality FLAC. I also found Bach's Brandenburg Concertos available in FLAC format.

Archive.org has an Open Source Audio Collection with many entries available in FLAC as well.

Quote
What I think is important about MT transcoding: every system is different, and so is every codec implementation and each of their encoding/decoding settings.

Whether the CPU or the HDD is the bottleneck depends on all three. Ideally we want the CPU to be the bottleneck, but a slow HDD can actually lead to lower overall CPU usage when using just one more thread. For example, depending on the WavPack encoder settings, I can use either 3 or all 4 cores. If I don't use the extra switches, I can only use 3 cores, or else my HDD turns into a sub-machine gun and the overall CPU usage drops below 75%. Well, that's how it was with my old drive.

I guess you are using multiple instances of wavpack in parallel in this case, right? The HDD has to reposition the r/w-head a lot when encoding to multiple outputs at once, because the write operations are not sequential. With a parallelized version of wavpack you could probably use all 4 cores without any problems as the HDD writes would still be sequential.

Quote
The actual user has to know his/her system's bottlenecks; the coder of the transcoding app can't possibly know them. And in order to get the most out of it for every user, i.e. the fastest batch transcode, there should be some automation to determine how many threads to use with the current codecs and their current settings.

I think that determining the optimal number of threads automatically under such circumstances where the HDD is the limiting factor would be very difficult. The best guess probably is to use 1 thread per CPU core in most cases. However, I will add an option to let the user choose how many threads to use. Some may prefer to use only one or two cores even on a quad core system to not have the CPU fan speed up all the time.
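The policy described here (one thread per core by default, with a user-selectable limit) can be sketched in a few lines of Python. This is only an illustration, not BonkEnc's code; the function name and the `user_limit` parameter are made up for the example.

```python
import os
from typing import Optional


def choose_thread_count(user_limit: Optional[int] = None) -> int:
    """Default to one thread per logical core, optionally capped by the user."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    if user_limit is None:
        return cores
    # Clamp the user's choice to the range 1..cores.
    return max(1, min(user_limit, cores))
```

A user who wants a quiet fan on a quad core could pass `user_limit=2`; values above the core count fall back to the core count.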

Quote
How much do these settings change the output of the encoders?

In my tests, the output of parallel LAME and aoTuV was slightly smaller than the non-parallel output. FLAC files came out slightly bigger. I believe this is because of different floating point math commands/ordering used in the parallel builds. It is the same effect you get when comparing GCC to ICL builds or SSE optimized builds to plain IA32 builds. If that is the case, it's nothing to worry about. This will have to be further analyzed, though.
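The ordering effect mentioned here comes from floating-point addition not being associative, so a compiler that regroups or vectorizes a sum can change the last bits of a result. A minimal Python illustration with ordinary double-precision values (this is not code from any of the encoders):

```python
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # rounds after a + b, then again after + c
regrouped = a + (b + c)       # same numbers, different grouping

print(left_to_right == regrouped)      # False
print(abs(left_to_right - regrouped))  # about 1e-16
```

A difference this small can still flip a rounding or quantization decision inside an encoder, which would be one plausible way for file sizes to shift slightly between builds.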

Quote
Do you get the same speed up when the source is WAV?

Yes. In some cases the speedup is even greater because CD ripping is no longer the limiting factor. For example, WAV to FLAC went from 1:18 to 0:41 for the 44 min album I tested with.

Quote
Uh, yeah. What about MT support in the decoders?

The decoders are parallelized as well, except for the FAAD2 AAC decoder, where the compiler was obviously unable to find parallelizable parts in the algorithm (CPU usage stayed at 25% during decoding) and the GCC build was actually faster than the ICL one.

  • enzo
  • Developer
Reply #5
I updated the preview release again and removed support for the vbr-old LAME presets. So if you test LAME encoding speed with BonkEnc 1.0.12 vs. 1.0.13 you need to select the "Standard, Fast" preset in 1.0.12 first.

I also tested non-parallel ICL builds vs. parallel builds and they are producing exactly the same output. Furthermore, parallel FAAC and Bonk encoders produce exactly the same output as the GCC compiled builds included with BonkEnc 1.0.12.

  • saratoga
Reply #6
Quote
How much do these settings change the output of the encoders?

Output should be identical.

Yes, I realize that we would like the same code to produce the same output regardless of how it's compiled. Unfortunately, we don't always get what we would like.

Quote
In my tests, the output of parallel LAME and aoTuV was slightly smaller than the non-parallel output. FLAC files came out slightly bigger. I believe this is because of different floating point math commands/ordering used in the parallel builds. It is the same effect you get when comparing GCC to ICL builds or SSE optimized builds to plain IA32 builds. If that is the case, it's nothing to worry about. This will have to be further analyzed, though.


Maybe do a quick decode to WAV and compute the RMS error of the difference between the two tracks. As long as it's small, I wouldn't worry too much about differences in bitrate. Of course, just because it's large doesn't necessarily mean there's a problem either...
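The RMS comparison suggested here could look roughly like the following Python sketch. It assumes both files have already been decoded to 16-bit PCM WAV with the same channel layout and that the samples are aligned; lossy decoders can add delay or padding, which would have to be trimmed first. The function names are made up for this example.

```python
import array
import math
import sys
import wave


def read_pcm16(path: str) -> array.array:
    """Read all samples of a 16-bit PCM WAV file (channels interleaved)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError(f"{path}: expected 16-bit samples")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples


def rms_difference(a, b) -> float:
    """Root mean square of the sample-wise difference over the common length."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("no samples to compare")
    # zip() stops at the shorter sequence, so exactly n pairs are summed.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n)
```

Usage would be along the lines of `rms_difference(read_pcm16("gcc.wav"), read_pcm16("icl.wav"))`, with the two file names standing in for the decoded outputs of the builds being compared.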

  • Fandango
Reply #7
I did a test on the Brandenburg Concertos from the Czech radio website.

My system:
Intel Core 2 Quad, 2.6GHz (Q6700), 8GB RAM, SATA2 Western Digital Caviar Black HDD and running Windows 7 64bit.

So, for starters, all 4 cores were fully used; BonkEnc used 98-99% CPU time. There was one smooth.dll thread and three threads of the MT DLL. I guess that's how it's supposed to be.

But how do you get the encoding times? BonkEnc didn't write a log.

  • rpp3po
  • Developer
Reply #8
I think you would get the best performance (without hurting quality) by concurrently encoding as many files as you have CPU cores. Nothing will beat that. So writing an intelligent scheduler that optimizes disk I/O (large, just-in-time sequential reads instead of parallel reads) should be the best you could do. People usually only care about an encoder's performance for bulk conversion, so not showing any benefit for single file encodes wouldn't be a practical constraint most of the time.
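File-level parallelism of this kind can be sketched with a worker pool sized to the number of cores. In the Python example below, `encode_file` is a hypothetical stand-in that only computes an output name; a real version would launch an unmodified single-threaded encoder on the file. The disk I/O scheduling mentioned above is not modeled.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def encode_file(path: str) -> str:
    """Hypothetical stand-in for running one encoder on one input file."""
    return os.path.splitext(path)[0] + ".mp3"


def encode_all(paths, workers=None):
    """Encode files concurrently, one worker per logical core by default."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(encode_file, paths))


print(encode_all(["track01.flac", "track02.flac"], workers=2))
```

Threads are enough here on the assumption that each worker mostly waits for an external encoder process; for CPU-bound work done in Python itself, a process pool would be the usual choice.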
  • Last Edit: 01 July, 2009, 04:07:23 PM by rpp3po

  • enzo
  • Developer
Reply #9
Quote
Maybe do a quick decode to WAV and compute the RMS error of the difference between the two tracks. As long as it's small, I wouldn't worry too much about differences in bitrate. Of course, just because it's large doesn't necessarily mean there's a problem either...

OK, I will probably try that tomorrow.

Quote
But how do you get the encoding times? BonkEnc didn't write a log.

No, it does not support log writing, sorry. I timed it manually.

Quote
I think you would get the best performance (without hurting quality) by concurrently encoding as many files as you have CPU cores. Nothing will beat that.

Yes, I agree and I am planning something like that for BonkEnc 1.1. However, that won't help much when you are ripping CDs.

I updated the preview release again with slightly faster encoder builds. Disabling multifile IPO (the /Qipo switch) made the DLLs smaller and the encoders slightly faster.

  • enzo
  • Developer
Reply #10
So, I did some RMSD calculations (the RW versions are the binaries from RareWares):
[blockquote]LAME-GCC vs. LAME-ICL: 1.41 (0.004%)
LAME-ICL vs. LAME-ICL-RW: 0.00[/blockquote]
That looks just fine and leaves only aoTuV to worry about. The RMSD values for it are quite strange, though:
[blockquote]aoTuV-GCC vs. aoTuV-ICL: 100.42 (0.31%)
aoTuV-ICL vs. aoTuV-ICL-RW: 100.51 (0.31%)
aoTuV-GCC vs. aoTuV-ICL-RW: 7.68 (0.02%)[/blockquote]
I tried to find out what causes these differences and found that they are indeed triggered by the /Qparallel option. Compiled without /Qparallel, it looks like this:
[blockquote]aoTuV-GCC vs. aoTuV-ICL: 4.24 (0.01%)[/blockquote]
What's strange about this is that the RMSD stays high even if I manually disable parallelization of all affected loops with pragmas. It looks like /Qparallel is changing something else as well, which is not mentioned in the documentation.

Compared to the original, the RMSD is about the same as for the GCC and RW builds, though:
[blockquote]Original vs. aoTuV-GCC: 133.41 (0.41%)
Original vs. aoTuV-ICL: 133.33 (0.41%)
Original vs. aoTuV-ICL-RW: 133.40 (0.41%)[/blockquote]
So I guess we cannot know without a listening test whether quality is affected by this issue. Or what do you think?
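The post does not say how the percentages are derived. They are consistent with the RMSD being divided by 32768, the full-scale magnitude of 16-bit audio, but that is an inference from the numbers rather than something the author states. A short Python check under that assumption:

```python
FULL_SCALE_16BIT = 32768  # assumed reference: largest 16-bit sample magnitude


def rmsd_percent(rmsd: float, full_scale: int = FULL_SCALE_16BIT) -> float:
    """Express an RMSD in sample units as a percentage of full scale."""
    return 100.0 * rmsd / full_scale


# RMSD values copied from the post above.
for label, rmsd in [
    ("LAME-GCC vs. LAME-ICL", 1.41),
    ("aoTuV-GCC vs. aoTuV-ICL", 100.42),
    ("Original vs. aoTuV-GCC", 133.41),
]:
    print(f"{label}: {rmsd:.2f} ({rmsd_percent(rmsd):.3f}%)")
```

Rounded to the precision used in the post, these come out as 0.004%, 0.31% and 0.41%.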

Edit:

Found the cause of this problem and a work-around for it. The problem is in the loop initializing e->mdct_win[] in function _ve_envelope_init in envelope.c. Somehow the call to sin() seems to be optimized to use some faster version of it. As a work-around I disabled optimization for the whole function using pragmas. As this is just an init function, this work-around does not affect overall speed.

This now gives the following RMSD:
[blockquote]aoTuV-GCC vs. aoTuV-ICL: 5.48 (0.02%)[/blockquote]
That should be OK.

Edit 2:

Updated the preview with the new aoTuV build.
  • Last Edit: 02 July, 2009, 12:29:26 PM by enzo

  • enzo
  • Developer
Reply #11
Here are my speed comparisons for the Brandenburg Concertos and the NIN album. I also tested the most recent builds from RareWares as they have always been the reference of encoding speed for me in the past.

The numbers are for my Phenom II X4 system.

Brandenburg Concertos - 89:47 min
[blockquote]
Input  Output  GCC    ICL-RW  ICL
-------------------------------------
FLAC   LAME    6:02   6:48    4:10
FLAC   Vorbis  10:31  5:48    4:29
FLAC   FAAC    6:41   7:37    3:37
FLAC   FLAC    3:16   2:45    1:37
[/blockquote]
Nine Inch Nails - The Slip - 43:47 min
[blockquote]
Input  Output  GCC    ICL-RW  ICL
-------------------------------------
FLAC   LAME    3:19   3:12    1:57
FLAC   Vorbis  5:24   4:08    2:12
FLAC   FAAC    3:31   3:11    1:40
FLAC   FLAC    1:54   1:15    0:47
[/blockquote]
I will do some more speed comparisons on my laptop and then release the final BonkEnc 1.0.13 with these encoders in a few days.

  • gottkaiser
Reply #12
Hi Enzo,

will you release an updated development snapshot?

  • enzo
  • Developer
Reply #13
Yes, a new snapshot should be available next week. I will probably finish feature coding tomorrow, but need to do some additional testing and refactoring before the release.

Besides the optimized encoders, the snapshot will add WMA support and allow editing tags of existing files.

  • gottkaiser
Reply #14
Quote
Yes, a new snapshot should be available next week.

Anything new about the new snapshot?
Don't want to push you, I'm just excited about the new multicore support.

  • enzo
  • Developer
Reply #15
Quote
Anything new about the new snapshot?
Don't want to push you, I'm just excited about the new multicore support.

Sorry, I had to delay the snapshot because of problems with the new tag editor. I now plan to release it tomorrow.