I'd like to know what you think about this project, so contact me if you have any comments or feedback.
Out of curiosity, would the multithreaded one be better?
Otherwise, running two encodes in parallel will almost certainly be faster due to less overhead from threading, synchronization, etc.
Quote from: Mike Giacomelli on 02 August, 2009, 12:21:22 PM
> Otherwise, running two encodes in parallel will almost certainly be faster due to less overhead from threading, synchronization, etc.

Are you sure?
Doesn't foobar2000 already do this if you have more than one core in your computer? I was getting similar numbers on a quad-core machine when I had to do some encoding for my dad. My Core Duo goes about 43x if it's not running warm.
Running two encodes in parallel results in essentially perfect parallelization. The only overhead comes from disk contention, which is still a problem for the multithreaded single-process case anyway.
Running one process will encounter additional overhead from thread synchronization, lack of granularity in parallelism, inter-thread communication, etc. To make up for this, there would have to be additional work saved by running in one process, and I don't see what that would be for MP3.
Quote
> Running one process will encounter additional overhead from thread synchronization, lack of granularity in parallelism, inter-thread communication, etc. To make up for this, there would have to be additional work saved by running in one process, and I don't see what that would be for MP3.

As you mentioned above, concurrent disk access is a problem when you execute multiple encoder processes in parallel. If you run them as tasks in one process, you can use a file I/O scheduler that takes care of it, as I did.
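For anyone unfamiliar with the idea, here is a minimal sketch of such a scheduler (hypothetical code, not fpMP3Enc's actual implementation): encoder tasks run on their own threads, but every disk read goes through one shared lock, so the drive only ever serves one large sequential read at a time while the encoding itself stays parallel.

Code:
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex io_mutex;  // serializes all disk access across tasks

// Read one large chunk while holding the I/O lock, then release it so
// another task's read can proceed.
std::size_t scheduled_read(std::FILE* f, char* buf, std::size_t len) {
    std::lock_guard<std::mutex> lock(io_mutex);
    return std::fread(buf, 1, len, f);
}

void encode_task(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::vector<char> buf(8 * 1024 * 1024);  // big chunks keep each read sequential
    std::size_t n;
    while ((n = scheduled_read(f, buf.data(), buf.size())) > 0) {
        // encode_chunk(buf.data(), n);  // CPU-bound work happens outside the lock
    }
    std::fclose(f);
}

int main() {
    const char* files[] = {"a.wav", "b.wav", "c.wav", "d.wav"};
    std::vector<std::thread> workers;
    for (const char* p : files) workers.emplace_back(encode_task, p);
    for (auto& t : workers) t.join();
}

The key point is that the lock is held only for the read, not for the CPU-bound encoding, so contention stays low as long as the chunks are large.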
Why is there a speedup from scheduling sequential reads yourself vs. letting the OS's file cache handle the scheduling?
So you're getting a huge speed up by essentially just buffering the entire file into memory before processing?
In the more general case, an optimal scheduler would read just enough from one file before it needs to switch to reading from another file. This would lead to pretty big buffers, but not necessarily whole-file buffering.
No idea yet if this encoder is implementing something like that, but if it is, that sort of scheduler would be extremely valuable for other open-source applications.
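To make the "just enough" idea concrete, here is a back-of-the-envelope sizing rule (my own assumed numbers, nothing measured from this encoder):

Code:
#include <cstdio>

// Bytes one encoder consumes per second of wall time; CD audio at 80x
// realtime is 80 * 176400 B/s, i.e. roughly 14 MB/s.
const double consume_rate = 80.0 * 176400.0;

// With n_tasks files sharing the disk round-robin, a task waits about
// (n_tasks - 1) read slots before its next turn, so its buffer must
// cover that long a gap.
double readahead_bytes(int n_tasks, double slot_seconds) {
    return consume_rate * slot_seconds * (n_tasks - 1);
}

int main() {
    // e.g. 4 tasks and 0.25 s per read slot -> ~10.6 MB per task:
    std::printf("%.1f MB\n", readahead_bytes(4, 0.25) / 1e6);
}

So the buffers end up in the tens of megabytes per task, which is "pretty big" but still far from buffering a whole album side into memory.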
In "fpMP3Enc" I've not determined the optimal memory strategy yet because I'm still working on parallelizing the encoder. It will depend on the final performance.
Will it be optimised for fpMP3Enc or will it be more adaptable to different encoders or even decoders?
There are codec settings that decode very slowly, Monkey's Audio's Extra High for example. The encoding tasks might run idle if the I/O scheduler reads too much data from such files. Maybe when the scheduler adjusts the buffer size dynamically it should look in both directions, and even give the decoding tasks higher priority if decoding takes longer than encoding?
I know fpMP3Enc currently just supports WAV, but that might change in the future, or not be the case in other projects that might want to use your library.
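Roughly what I have in mind, as a purely hypothetical sketch (names and numbers made up): per task, compare the measured decode and encode throughput, and only favor read-ahead when the decoder can keep up.

Code:
#include <cstdio>

struct TaskRates {
    double decode_bytes_per_sec;  // measured PCM output rate of the decoder
    double encode_bytes_per_sec;  // measured PCM consumption rate of the encoder
};

enum class Priority { FavorReadAhead, FavorDecoding };

// If the decoder can't keep the encoder fed, reading further ahead just
// piles up compressed data it can't decode in time; give the decoding
// task the CPU instead.
Priority schedule(const TaskRates& t) {
    return t.decode_bytes_per_sec < t.encode_bytes_per_sec
               ? Priority::FavorDecoding
               : Priority::FavorReadAhead;
}

int main() {
    TaskRates ape_extra_high{2.0e6, 14.0e6};  // made-up example rates
    std::printf("%s\n", schedule(ape_extra_high) == Priority::FavorDecoding
                            ? "boost decoder"
                            : "read ahead");
}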
The benchmarks (Vista x64, Intel Q9450 @ 2.66 GHz, 8 GiB RAM):

LAME 3.98.2 (x32): 32.3x
fpMP3Enc (x64; single file encoding): 60.3x
fpMP3Enc (x64; multi file encoding): 109.7x

This means that fpMP3Enc is about 87% or 1.87x faster than LAME in single file encoding, while the speedup is 3.4x in multi file encoding.
80x encoding speed equates to about a 14 MB/s read from an uncompressed file (CD audio is ~176 kB/s, so 80 × 176 kB/s ≈ 14 MB/s); if the source is lossless, roughly half that. A modern HDD should be able to do 3x that without breaking a sweat, so I am not sure where the speed differentials come from when buffering to memory. Windows can have radically different read speeds depending on the transfer buffer size selected; off the top of my head, on XP 64 KB was the optimum size.
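If anyone wants to reproduce the buffer-size effect, a quick standalone test like this does the job (my own sketch, nothing to do with fpMP3Enc; note that repeat runs will mostly hit the OS file cache, so use a fresh file each time):

Code:
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Time one full sequential read of `path` using `buf_size` transfer buffers.
double read_mb_per_sec(const char* path, std::size_t buf_size) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 0.0;
    std::vector<char> buf(buf_size);
    std::size_t total = 0, n;
    auto start = std::chrono::steady_clock::now();
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) total += n;
    std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;
    std::fclose(f);
    return total / 1e6 / secs.count();
}

int main() {
    // 4 KB up to 1 MB transfer buffers; point the path at a real file.
    for (std::size_t kb : {4, 16, 64, 256, 1024})
        std::printf("%4zu KB: %.1f MB/s\n", kb, read_mb_per_sec("test.wav", kb * 1024));
}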
How many threads did you use for the presented fpMP3Enc results? I suspect the multi file encoding was done with 4 cores. If so, LAME encoding 4 or more files with one file per core would result in about 4 × 32.3 = 129.2 times encoding speed.
Also, should a well-designed native 64-bit encoder be nearly twice as fast as its 32-bit counterpart, or do the Core 2 Duo chips handle this well through some kind of emulation?