Multithreading

Multithreading

2022-10-25 12:52:21

It was brought up 15 years ago and someone made a toy example ( https://www.akalin.com/parallelizing-flac-encoding ). It seems like the right time to rehash the discussion: Intel and AMD both have 32-thread consumer parts and that number is going to keep growing.

As is tradition, instead of just starting the discussion I made a toy example of how I imagine a palatable interface in libFLAC would work. It boils down to exposing a function (with variants for input types) that lets a frontend feed a frame's worth of samples and receive an encoded frame in return. The PoC API changes are here ( https://github.com/chocolate42/flac ), and a PoC frontend (probably Linux only ATM) that uses it to multi-thread is here ( https://github.com/chocolate42/flaccid ).

tl;dr these are added, no existing function is altered:

Code:

typedef struct {
	FLAC__StreamEncoder *stream_encoder;/* should be protected in a proper implementation? */
	int dummy; /* guarantee that there is a distinction between stream and static encoding */
} FLAC__StaticEncoder;

FLAC_API FLAC__StaticEncoder *FLAC__static_encoder_new(void);
FLAC_API void FLAC__static_encoder_delete(FLAC__StaticEncoder *encoder);
FLAC_API FLAC__StreamEncoderInitStatus FLAC__static_encoder_init(FLAC__StaticEncoder *encoder);
FLAC_API FLAC__StreamEncoder *FLAC__static_encoder_get_underlying_stream(FLAC__StaticEncoder *encoder);/* for if stream protected */
FLAC_API FLAC__bool FLAC__static_encoder_process_frame_interleaved(FLAC__StaticEncoder *encoder, const FLAC__int32 buffer[], uint32_t samples, uint32_t current_frame, void *outbuf, size_t *outbuf_size);
/* other *_process_frame_* functions for flexible input types */

The API doesn't do threading, but with this a frontend can have multiple static encoder instances to encode frames in parallel. flaccid is hacked together with some questionable decisions (it is the definition of a toy: raw CDDA input only and just functional enough for an example). Even so, it trivially maxes out my quad core with a near-linear speedup wrt encode threads (note it always has a separate thread for output and another for MD5, which is why the 1 worker version sees a speedup). The _process_frame_ functions do not do MD5 hashing, as that is a serial operation; it is left to the frontend.

That's not the only way multithreading could be enabled, but it is by far the simplest while still being "safe" by keeping stream/static distinct. I'd be interested how others would do it.

Re: Multithreading

Reply #1 – 2022-10-25 19:32:49

Quote from: cid42 on 2022-10-25 12:52:21

I'd be interested how others would do it.

The most obvious way to multithread is processing multiple files at once of course. A front end could simply invoke several stream encoder instances. The upside of this is that it is very easy to make thread safe; a downside might be the scattered disk I/O and that it doesn't work when a single file is being processed.

What you propose is interesting, and it should be possible to get it thread safe. It will still need a lot of testing, for example with TSan.

edit: another possibility is through the subframe processing. FLAC brute-forces left-right decorrelation. This means that for stereo it processes 4 different audio signals: L, R, L+R and L-R. This could be another possibility to do multithreading.

Re: Multithreading

Reply #2 – 2022-10-25 19:38:37

There exists a multithreaded FLAC encoder using Fiber Pool (not that I know what Fiber Pool is) - https://www.rarewares.org/files/mp3/fpMP3Enc.zip (in an archive together with multithreaded LAME)

Re: Multithreading

Reply #3 – 2022-10-25 19:45:48

I'm somewhat skeptical this is a huge benefit given that FLAC encodes at many times realtime singlethreaded on anything remotely modern.
There's probably a very limited use case where this makes a decent amount of sense, when you're encoding a really long single file, otherwise

Quote from: ktf on 2022-10-25 19:32:49

The most obvious way to multithread is processing multiple files at once of course

Re: Multithreading

Reply #4 – 2022-10-26 13:00:12

Quote from: ktf on 2022-10-25 19:32:49

Quote from: cid42 on 2022-10-25 12:52:21
I'd be interested how others would do it.
The most obvious way to multithread is processing multiple files at once of course. A front end could simply invoke several stream encoder instances. The upside of this is that it is very easy to make thread safe, a downside might be the scattered disc I/O and that it doesn't work when a single file is being processed.

That could be made mildly better than a script that runs n instances of ./flac by load-balancing the input, probably using length as an estimate of encode time. A script could do that even if things like GNU parallel can't be coaxed into doing it for you.
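One way to picture the load-balancing idea is longest-first assignment: sort the files by length (standing in for encode time) and hand each one to the worker with the least work so far. The sketch below is only an illustration under that assumption; `balance_by_length` is a hypothetical helper, not something flac or GNU parallel provides.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: sort durations in descending order. */
static int cmp_desc(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x < y) - (x > y);
}

/*
 * Hypothetical helper: assign files to workers longest-first, always to the
 * currently least-loaded worker. lengths[] is sorted in place. Returns the
 * estimated makespan (load of the busiest worker), or -1.0 if allocation fails.
 */
double balance_by_length(double *lengths, size_t n_files, size_t n_workers)
{
	double *load, makespan = 0.0;
	size_t i, w;

	assert(n_workers > 0);
	load = calloc(n_workers, sizeof *load);
	if (!load)
		return -1.0;
	qsort(lengths, n_files, sizeof *lengths, cmp_desc);
	for (i = 0; i < n_files; i++) {
		size_t least = 0;
		for (w = 1; w < n_workers; w++)
			if (load[w] < load[least])
				least = w;
		load[least] += lengths[i];
	}
	for (w = 0; w < n_workers; w++)
		if (load[w] > makespan)
			makespan = load[w];
	free(load);
	return makespan;
}
```

This is a heuristic, not an optimal scheduler, and it cannot help when one file dominates: the makespan can never drop below the longest file, which is where per-frame parallelism would still matter.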

Quote from: ktf on 2022-10-25 19:32:49

edit: another possibility is through the subframe processing. FLAC brute-forces left-right decorrelation. This means for stereo it processed 4 different audio signals, L, R, L+R and L-R. This could be another possibility to do multithreading.

I thought about intra-frame SMT, but can't see how it can be done in a way that would be palatable. One way could be to expose a lot of the internals for a frontend to deal with piecemeal, which isn't ideal. Another way builds multithreading into libFLAC itself, which probably can't be done portably and would be a heavy burden to maintain (it would probably have to be fine-grained, as coarse is less efficient the smaller the unit is).

Quote from: ktf on 2022-10-25 19:32:49

What you propose is interesting, and it should be possible to get it thread safe. Will still need a lot of testing with for example TSan.

The main benefit of exposing frame encoding is that the burden of SMT is on the frontend and it's coarse. Coarse-grained SMT is much easier to do correctly: a few mutexes to ensure an encoder encodes when it can and the writer writes when it can, and that's about it. My implementation currently does this hackily with an atomic integer instead and is not a shining example of how to do it right.
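As a rough illustration of that coarse-grained pattern (not flaccid's actual code), the sketch below has workers claim frame numbers from an atomic counter while a writer consumes finished frames strictly in order, guarded by one mutex and one condition variable. `encode_frame_stub` is a hypothetical placeholder for a real per-frame encode call, the calling thread doubles as the writer, and MD5 is left out entirely.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_FRAMES 64
#define MAX_WORKERS 16

/* Shared state: workers claim frame numbers, the writer consumes in order. */
struct job {
	atomic_uint next_frame;      /* next unclaimed frame number */
	unsigned encoded[N_FRAMES];  /* stand-in for each frame's encoded bytes */
	int done[N_FRAMES];          /* guarded by lock */
	pthread_mutex_t lock;
	pthread_cond_t frame_ready;
};

/* Placeholder for a real per-frame encode call (hypothetical). */
static unsigned encode_frame_stub(unsigned frame)
{
	return frame * 2654435761u;
}

static void *worker(void *arg)
{
	struct job *j = arg;

	for (;;) {
		unsigned f = atomic_fetch_add(&j->next_frame, 1u);
		unsigned out;

		if (f >= N_FRAMES)
			break;
		out = encode_frame_stub(f); /* the expensive part runs unlocked */
		pthread_mutex_lock(&j->lock);
		j->encoded[f] = out;
		j->done[f] = 1;
		pthread_cond_broadcast(&j->frame_ready);
		pthread_mutex_unlock(&j->lock);
	}
	return NULL;
}

/* Encodes N_FRAMES frames with up to n_workers threads while the calling
 * thread acts as the writer. Returns the number of frames written, or -1
 * if no worker thread could be started. */
int encode_parallel(unsigned n_workers, unsigned written[N_FRAMES])
{
	struct job j;
	pthread_t tid[MAX_WORKERS];
	unsigned started = 0, i, f;

	assert(n_workers >= 1 && n_workers <= MAX_WORKERS);
	atomic_init(&j.next_frame, 0u);
	for (f = 0; f < N_FRAMES; f++)
		j.done[f] = 0;
	pthread_mutex_init(&j.lock, NULL);
	pthread_cond_init(&j.frame_ready, NULL);

	for (i = 0; i < n_workers; i++)
		if (pthread_create(&tid[started], NULL, worker, &j) == 0)
			started++;

	if (started > 0) {
		/* Writer: emit frames in order, sleeping until the next one is ready. */
		for (f = 0; f < N_FRAMES; f++) {
			pthread_mutex_lock(&j.lock);
			while (!j.done[f])
				pthread_cond_wait(&j.frame_ready, &j.lock);
			written[f] = j.encoded[f];
			pthread_mutex_unlock(&j.lock);
		}
		for (i = 0; i < started; i++)
			pthread_join(tid[i], NULL);
	}
	pthread_cond_destroy(&j.frame_ready);
	pthread_mutex_destroy(&j.lock);
	return started > 0 ? N_FRAMES : -1;
}
```

A real frontend would bound how far workers can run ahead of the writer so that finished frames don't pile up in memory; this sketch keeps every frame in a fixed array to stay short.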

Quote from: Bogozo on 2022-10-25 19:38:37

There exists multithreaded FLAC encoder using Fiber Pool (not that i know what Fiber Pool is) - https://www.rarewares.org/files/mp3/fpMP3Enc.zip (in archive together with multithreaded LAME)

Fibers are apparently user-space threading within a thread; at a guess, a fiber pool sounds like an implementation of M:N scheduling. I can't get the binaries to work in Wine to test. They may be using fibers to separate I/O/MD5/encoding for gains similar to the 1 worker flaccid in the graph. Beyond that I can't see how they can SMT without customising libFLAC (unless old versions of libFLAC exposed more than it does now).

Quote from: binaryhermit on 2022-10-25 19:45:48

I'm somewhat skeptical this is a huge benefit given that FLAC encodes at many times realtime singlethreaded on anything remotely modern.
There's probably a very limited use case where this makes a decent amount of sense, when you're encoding a really long single file, otherwise
Quote from: ktf on 2022-10-25 19:32:49
The most obvious way to multithread is processing multiple files at once of course

Parallel-files is basically multiprocessing (technically multithreading if you get a frontend to do it via threads instead of a script via processes; semantics IMO). It's fine but not ideal even when you have multiple files. You're right that FLAC is already quick on modern hardware so it's less beneficial to SMT than with a heavier codec, but if it's as simple to expose just enough functionality to make SMT possible as it appears, I don't see a compelling reason not to. Stronger compression levels become more enticing, and RISC-V and other low-power devices would see more of a benefit than x86_64 as the cores/power-per-core is skewed very differently.

Having a simple static option built into libFLAC could allow a frontend to do more than SMT. How about an --adapt option (with or without SMT) that tries to maximise throughput by balancing I/O speeds with encoder speed? A frontend could have multiple static encoders initialised with different compression levels (ensuring blocksize is identical), and choose which one to encode the next frame with based on which element was the biggest bottleneck in recent history (or whatever metric makes sense). Pair that with a minimum and/or maximum compression level to get some guarantees about the output. Another possibility would be --target bitrate for vbr-like adapting, again paired with min/max compression level thresholds and with or without SMT. Both could exist albeit with debatable merit.

Re: Multithreading

Reply #5 – 2022-10-26 21:20:31

Quote from: cid42 on 2022-10-26 13:00:12

it's fine but not ideal even when you have multiple files

Please elaborate. People have gotten great results with that kind of multithreading/multiprocessing.

Re: Multithreading

Reply #6 – 2022-10-27 09:14:11

As you say, disk I/O, particularly for spinning rust. Beyond that I'm mainly thinking of occupancy issues when the file count is low enough to leave cores sitting idle (say you're multiprocessing an album, or a discography that's ripped file-per-album). For example, multiprocess an album with 12 similar tracks on 8 cores and 4 cores sit idle while the last 4 tracks process. Or take an album with 8 tracks on an 8 core where the last track is 20 minutes long: most cores are idle most of the time. Very minor but not ideal.

A special case where single file is inevitable is ripping from disc and encoding on the fly.

Re: Multithreading

Reply #7 – 2022-10-27 11:12:03

Quote from: cid42 on 2022-10-27 09:14:11

A special case where single file is inevitable is ripping from disc and encoding on the fly.

Not sure if that is a big enough problem to bother much about, let's think aloud:
Disregarding those very few CD spinners that would read by multiple lasers (not sure if they even would for audio extraction!), CD spinners boast of up to 52x speed and reach that at the very end of an 80-minute disc. From experience, I rarely see ripping applications report anything like that in progress bars, but then I used secure ripping modes ... so let me just take that number at face value.
In the 2015 revision of ktf's performance tests, flac -8 would encode CDDA faster than that in a single thread of an AMD A4-3400 launched in 2011.

So OK, if you run -8pr8 and above upon ripping, sure.

As for DVD, that is a different can of worms. Looks like they can extract data at 9x this speed, but I have no idea how fast they can deliver audio content - nor whether the applications in question will encode on-the-fly or get the thing to HD first and extract later. (That is of course not a theoretical bound just because applications in practice do it such a way, so you can wave it off in a discussion on possible benefits.)

Re: Multithreading

Reply #8 – 2022-10-27 14:10:36

Quote from: Porcus on 2022-10-27 11:12:03

In the 2015 revision of ktf's performance tests, flac -8 would encode CDDA faster than that in a single thread of an AMD A4-3400 launched in 2011.

And isn't the A4-3400 not exactly particularly high-end for its day?

Re: Multithreading

Reply #9 – 2022-10-27 15:10:46

Not saying this is useless and more power to anyone who goes down the path of making this feasible, but I have to say even only using spinning rust I am very impressed with the speed/efficiency of multi-threading batch conversions using F2K. So I think this really ultimately has limited utility unless there is some other use case that hasn’t come to mind for me yet.

For encoding very lengthy singular files - maybe FLACCL might be a speedy option. If I’m not mistaken doesn’t this take advantage of multiple GPU cuda cores? Similar concept, different approach perhaps?

Re: Multithreading

Reply #10 – 2022-10-27 15:12:20

That is a fair characterization: https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_units#Lynx:_%22Llano%22_(2011)

But part of the cheepniz was irrelevant for the test: only two cores. The test was performed on one, so the number of inactive cores "would only contribute to the price tag".
True, it had half the L2 cache of the more expensive models and so-called "Turbo" - but as for the latter, the absence of Turbo might (for all that I know) only make results more reliable for a test run over long time where cooling would become a constraint.

Re: Multithreading

Reply #11 – 2022-10-27 15:13:56

Quote from: binaryhermit on 2022-10-27 14:10:36

And isn't the A4-3400 not exactly particularly high-end for its day?

That is a fair characterization: https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_units#Lynx:_%22Llano%22_%282011%29
Edit: Well it is a 65 watt thing, so compared to the "mobile" ones that were back in the day also used in desktops, it is not their bottom range. It seems to be a not-so-high-end in their sure-you-didn't-buy-this-to-save-power range.
So ... not that bad? And part of the cheepniz was irrelevant for the test: only two cores. The test was performed on one, so the number of inactive cores "would only contribute to the price tag".
True, it had half the L2 cache of the more expensive models and so-called "Turbo" - but as for the latter, the absence of Turbo might (for all that I know) only make results more reliable for a test run over long time where cooling would become a constraint.

As for FLACCL: yes, it is fast on a decently expensive GPU, and even faster on an indecently expensive one

Re: Multithreading

Reply #12 – 2022-10-27 17:55:12

I iterated on the PoC a little (again barely tested), flaccid can now do variable blocksize encoding in a brute-force kind of way. It's quite limited and non-optimal but that's to allow an easy implementation:

min_blocksize and max_blocksize are defined; max must be a power-of-two multiple of min
Input is processed in chunks; a chunk is the size of max_blocksize
The entire chunk is processed fully in each of the blocksizes used and the best combination of frames is chosen
The blocksizes used are max_blocksize, max_blocksize/2, max_blocksize/4, max_blocksize/8, ... , min_blocksize. So it's like partitioning a chunk based on the optimal nodes of a perfectly-shaped binary tree, where each node is one of the processed frames and each node's children partition the parent exactly in half.

For example a min max of 1024 4096 would encode the entire input 3 times to be able to pick the best out of 1024/2048/4096 blocksizes. Min max of 256 8192 encodes entire input 6 times to choose between best of 256/512/1024/2048/4096/8192.
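The binary-tree selection described above can be sketched as a small recursion: a span is stored either as one frame or as the best representation of its two halves, whichever is smaller. This is a guess at the shape of the idea rather than flaccid's code; `frame_cost_fn`, `best_chunk_cost`, `chunk_effort` and the made-up `toy_cost` model are all hypothetical names, and in practice the cost would come from a trial encode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: encoded size in bytes of one frame covering
 * samples [offset, offset + size). A real frontend would trial-encode. */
typedef size_t (*frame_cost_fn)(size_t offset, size_t size);

/* Made-up cost model for illustration only: 16 bytes of overhead per frame,
 * one byte per sample, and a 1000-byte penalty when a frame longer than
 * 1024 samples contains a pretend transient at sample 1500. */
static size_t toy_cost(size_t offset, size_t size)
{
	size_t penalty = 0;
	if (size > 1024 && offset <= 1500 && 1500 < offset + size)
		penalty = 1000;
	return 16 + size + penalty;
}

/* Smallest total size for the span, stored either as one frame or split in
 * half recursively, never going below min_blocksize. */
size_t best_chunk_cost(frame_cost_fn cost, size_t offset, size_t size, size_t min_blocksize)
{
	size_t whole = cost(offset, size);
	size_t half = size / 2;
	size_t split;

	if (half < min_blocksize)
		return whole;
	split = best_chunk_cost(cost, offset, half, min_blocksize)
	      + best_chunk_cost(cost, offset + half, half, min_blocksize);
	return split < whole ? split : whole;
}

/* Number of blocksizes tried, i.e. how many times the whole input gets
 * encoded. Assumes max_blocksize is min_blocksize times a power of two. */
unsigned chunk_effort(size_t min_blocksize, size_t max_blocksize)
{
	unsigned n = 1;

	assert(min_blocksize > 0 && min_blocksize <= max_blocksize);
	while (max_blocksize > min_blocksize) {
		max_blocksize /= 2;
		n++;
	}
	return n;
}
```

With the toy model, a 4096-sample chunk and a minimum of 1024 comes out as 1024/1024/2048, and `chunk_effort` reproduces the 3 and 6 full encodes from the examples in the text.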

The CDDA version of Nine Inch Nails' The Slip was used as a test (open license https://archive.org/details/nine_inch_nails_the_slip ), compared against flac 1.4.2. Both use -8p; flac uses blocksize 4096, flaccid uses min 512, max 8192. All files passed flac -t. flaccid did 5x the encoding work of flac in this test:

Code:

Size
 5141054 01_999999.raw.8p.min4096.max4096.flac
 5114191 01_999999.raw.8p.min512.max8192.flaccid.flac
30929952 02_1000000.raw.8p.min4096.max4096.flac
30898003 02_1000000.raw.8p.min512.max8192.flaccid.flac
30329800 03_letting_you.raw.8p.min4096.max4096.flac
30281386 03_letting_you.raw.8p.min512.max8192.flaccid.flac
33140258 04_discipline.raw.8p.min4096.max4096.flac
32937577 04_discipline.raw.8p.min512.max8192.flaccid.flac
27283598 05_echoplex.raw.8p.min4096.max4096.flac
27034633 05_echoplex.raw.8p.min512.max8192.flaccid.flac
29520427 06_head_down.raw.8p.min4096.max4096.flac
29432222 06_head_down.raw.8p.min512.max8192.flaccid.flac
 9770865 07_lights_in_the_sky.raw.8p.min4096.max4096.flac
 9714938 07_lights_in_the_sky.raw.8p.min512.max8192.flaccid.flac
25352359 08_corona_radiata.raw.8p.min4096.max4096.flac
25206304 08_corona_radiata.raw.8p.min512.max8192.flaccid.flac
23137304 09_the_four_of_us_are_dying.raw.8p.min4096.max4096.flac
23035775 09_the_four_of_us_are_dying.raw.8p.min512.max8192.flaccid.flac
37134638 10_demon_seed.raw.8p.min4096.max4096.flac
36945942 10_demon_seed.raw.8p.min512.max8192.flaccid.flac

flaccid might work on Windows now but I doubt it. I haven't figured out how to cross-compile libFLAC for Windows, but other groundwork has been done (mbedtls is implemented as an alternative to openssl, and a wrapper for mmap that uses the MS equivalent has been added, untested).

Re: Multithreading

Reply #13 – 2022-10-28 07:05:11

Quote from: cid42 on 2022-10-27 17:55:12

I iterated on the PoC a little (again barely tested), flaccid can now do variable blocksize encoding in a brute-force kind of way. It's quite limited and non-optimal but that's to allow an easy implementation:

Very interesting, rather high compression gains actually. Have you run the test suite with make check (or ctest if you're using cmake)?

Re: Multithreading

Reply #14 – 2022-10-28 10:12:31

make check passed fully, but that's to be expected. Nothing of the existing API or anything the API uses has been changed, specifically so that this hacky PoC couldn't break anything except potentially itself. There are two functions copied into *static_ variants which could be merged in a proper implementation: process_frame_ and process_subframes_.

process_frame_ replaces the default write with directly exposing the buffer, and in this hack calls the appropriate process_subframes_static_. Seems fine as a solution as the function is small.

Ideally we'd use process_subframes_ directly, but as the concept of variable blocksize isn't built in to FLAC__StreamEncoder (at least not that I could see) it was hacked in by copying process_subframes_ to process_subframes_static_ to take an extra arg is_variable_blocksize. The proper way to do it would be adding the variable flag to FLAC__StreamEncoder, which would be necessary anyway if StreamEncoder one day supported variable blocksize. The plumbing in the bitstream is already present to output current_sample instead of current_frame, this was the only change needed to use it.

Re: Multithreading

Reply #15 – 2022-10-28 16:06:54

Implemented non-two-stride to try and eke out some more gains: very minimal improvement, and mostly still from stride=2. It was only at effort level 3 (input encoded fully 3 times) that a stride other than 2 was the top overall winner (with blocksizes 1024 3072 9216, which just so happen to be close to the blocksizes that should be represented with this input and possibly more generally). Note the implementation is flawed in that the final chunk is not subdivided if it's partial, meaning something like min=128,max=16384,stride=2 can sometimes beat min=128,max=32768,stride=2 because the partial chunk of the latter is encoded large when the former splits the penultimate chunk for better gains.

But this is going full tangent, from now on if there's a development worth mentioning that isn't directly multithreading I'll make another thread.

Re: Multithreading

Reply #16 – 2022-10-28 16:44:35

Brute-forcing various block sizes, is that something that could be multi-threaded?
Assuming powers of two: work on the following 4096 samples is split into the "1x4096 thread", the "2x2048" and the "4x1024" (with the fourth thread to rule them all)?

Re: Multithreading

Reply #17 – 2022-10-28 19:51:49

Quote from: Porcus on 2022-10-28 16:44:35

Brute-forcing various block sizes, is that something that could be multi-threaded?
Assuming powers of two: work on the following 4096 samples is split into the "1x4096 thread", the "2x2048" and the "4x1024" (with the fourth thread to rule them all)?

Yes, the chunk encoding in flaccid does this and seems like a reasonable way to do it in an SMT-friendly way (at least to try and maximise occupancy, it's not smart and is still linearly chunked just into something possibly pretty big). The split you describe is defined in flaccid with blocksize_min=1024, blocksize_max=4096, blocksize_stride=2, the spreadsheet above has results for this set with the nine inch nails album, it's the first row. In flaccid terms the chunk size would be 4096, and the possible ways to store the chunk would be 4096, 2048/2048, 2048/1024/1024, 1024/1024/2048, 1024/1024/1024/1024. On a quad core especially with hyperthreading (I should stop using SMT for multithreading, really SMT is the generic term for hyperthreading), you'd be better off using 4 blocksizes, maybe add 512 or 8192 to the others.

The benefit of chunks is that the multithreading can be done at the chunk level, you'd do this for the same reason multithreading at the frame level is nice (you know where the chunk boundaries are and can queue up embarrassingly parallel work). A downside of chunks is (probably minor, TBD) inefficiencies because you aren't free to pick any blocksize for a given frame. Take the above example, if the first blocksize in a chunk is 1024 the next one must also be 1024, because we have only calculated blocksize 2048 at offsets 0 and 2048 (not the offset 1024 we are at).

The next thing to try is a greedy algorithm, where we try a set of blocksizes for the next frame and pick the one that has the best encoded bits per sample of input. This is an obvious algorithm but comes with drawbacks when it comes to multithreading. We don't know where the following frame starts so can't do that simultaneously, so we have to multithread the set of encodes for one frame at a time and wait for all to complete to decide how to proceed. The encodes by definition cannot take similar times to complete as they cover a spectrum of blocksizes, so you have a grab bag of work taking 1/2/4/8/16/.../8192 units of time and need to be careful about the choice of blocksizes and how they're executed if you want to maximise occupancy. It's not a dealbreaker, but if the difference to chunk encoding turns out to be a rounding error then a greedy algorithm is less interesting IMO.
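The greedy step can be sketched as a simple loop: at the current offset, price every candidate blocksize, keep the one with the fewest encoded bytes per input sample, then advance by that blocksize. This is a minimal illustration under assumptions, not the flaccid implementation; `frame_cost_fn`, `greedy_total_cost` and the made-up `toy_cost` model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: encoded size in bytes of one frame covering
 * samples [offset, offset + size). A real frontend would trial-encode. */
typedef size_t (*frame_cost_fn)(size_t offset, size_t size);

/* Made-up cost model for illustration only: 16 bytes of overhead per frame,
 * one byte per sample, and a 1000-byte penalty when a frame longer than
 * 1024 samples contains a pretend transient at sample 1500. */
static size_t toy_cost(size_t offset, size_t size)
{
	size_t penalty = 0;
	if (size > 1024 && offset <= 1500 && 1500 < offset + size)
		penalty = 1000;
	return 16 + size + penalty;
}

/* Greedy selection: repeatedly pick the candidate with the lowest cost per
 * sample at the current offset. Candidates are clipped at the end of input.
 * All entries of sizes[] must be non-zero. Returns the total encoded size. */
size_t greedy_total_cost(frame_cost_fn cost, size_t total_samples,
                         const size_t *sizes, size_t n_sizes)
{
	size_t offset = 0, total = 0;

	assert(n_sizes > 0);
	while (offset < total_samples) {
		size_t remaining = total_samples - offset;
		size_t best_size = 0, best_cost = 0, i;

		for (i = 0; i < n_sizes; i++) {
			size_t s = sizes[i] < remaining ? sizes[i] : remaining;
			size_t c = cost(offset, s);

			assert(s > 0);
			/* c / s < best_cost / best_size, compared without division */
			if (best_size == 0 || c * best_size < best_cost * s) {
				best_size = s;
				best_cost = c;
			}
		}
		total += best_cost;
		offset += best_size;
	}
	return total;
}
```

The inner loop is the only part that can run in parallel, and its candidates take very different amounts of time, which is the occupancy problem the paragraph above describes.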

Then there's adapting the blocksize smartly by predicting what the best blocksize will be and trying that first, but coming up with a good algorithm is probably beyond me. There's many different strategies that could work, anyone know offhand some standard ways it is done?

Re: Multithreading

Reply #18 – 2022-10-28 20:03:22

Quote from: cid42 on 2022-10-28 19:51:49

Then there's adapting the blocksize smartly by predicting what the best blocksize will be and trying that first, but coming up with a good algorithm is probably beyond me. There's many different strategies that could work, anyone know offhand some standard ways it is done?

Wild guess: Try the "previous" optimal one first, and its neighbours?

Re: Multithreading

Reply #19 – 2022-10-28 20:57:31

Quote from: cid42 on 2022-10-28 19:51:49

There's many different strategies that could work, anyone know offhand some standard ways it is done?

This is exactly the reason variable block sizes never took off in FLAC. It is a rather difficult problem.

One lead I'd like to follow at some point is this: https://www.mathworks.com/help/signal/ref/findchangepts.html Especially the Audio File Segmentation example looks very promising.

Re: Multithreading

Reply #20 – 2022-10-29 00:49:48

Maybe look at the way LAME determines whether to use long or short blocks and try to adapt the approach, or at least take some pointers.

After all, that is probably the most well-tuned approach for block length switching. (I know that that's lossy whereas FLAC is lossless, but there ought to be at least some relevance and connection, I think.)

Re: Multithreading

Reply #21 – 2022-10-29 10:27:31

In LAME it is to choose between better temporal or frequency resolution, I don't think it has much to do with FLAC block size...

Re: Multithreading

Reply #22 – 2022-10-29 17:47:23

Tried the most naive greedy algorithm possible for variable blocksize, here's the nin comparison with blocksizes 1024 2048 4096:

Code:

greed 5142283 30925019 30314397 32984879 27107991 29482757 9773746 25360038 23133439 37026662 total 251251211
chunk 5138754 30907511 30295140 32961194 27048226 29452718 9767257 25359384 23066576 36966063 total 250962823

The greedy algorithm has worse sizes, worse effort (>3 compared to 3 for the chunk algorithm thanks to overlap every time the biggest blocksize isn't chosen) and is harder to multithread. Seems like a failure all round, but it's a small sample. It's surprising that even chunk's paltry pseudo-lookahead behaviour (only to the end of the current chunk, which is just 4096 samples here) behaves better than greedy's ability to cross chunk boundaries (by not having chunks).

Quote from: ktf on 2022-10-28 20:57:31

Quote from: cid42 on 2022-10-28 19:51:49
There's many different strategies that could work, anyone know offhand some standard ways it is done?
This is exactly the reason variable block sizes never took off in FLAC. It is a rather difficult problem.

One lead I'd like to follow at some point is this: https://www.mathworks.com/help/signal/ref/findchangepts.html Especially the Audio File Segmentation example looks very promising.

Choosing a blocksize without brute force sounds very hard, heuristics of the audio itself seems like a dark art I'm not brave enough for.

Re: Multithreading

Reply #23 – 2022-11-01 14:19:46

Implemented a new algorithm that's optimal for a given set of blocksizes, from now on called peakset. When input, compression settings, libFLAC and the block set are equal it is the best representation possible, with the extremely minor caveat that the partial frame at the end could possibly be split saving a few extra bytes maybe.

The only limitation is that all blocksizes must be a multiple of blocksize_min, this means that all frames regardless of size will start on a boundary that's a multiple of blocksize_min. This works fine, but being optimal it does take a lot of effort as the number of blocksizes increases and how much bigger they are than blocksize_min. If you normalise all blocksizes by dividing by blocksize_min, the effort for a given set of blocks is their sum. This means blocksizes close to blocksize_min are relatively cheap to include, and if you include all multiples of blocksize_min up to blocksize_max the effort is given by (blocksize_count*(blocksize_count+1))/2. It's not novel, it's an obvious way you might constrain the set into something fully brute-forceable. It started off as a generalisation of chunk encoding with the window the size of a chunk or a bit bigger, but as the window size increases effort increases logarithmically so it made more sense to just go full brutus and have the window size be the size of the input.
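Because every frame starts on a multiple of blocksize_min, the optimum over the whole input can be found with a one-dimensional dynamic program over those boundaries. The sketch below is one plausible way to write that under stated assumptions, not necessarily how flaccid does it: `peakset_best_cost`, `peakset_effort`, `frame_cost_fn` and the made-up `toy_cost` model are hypothetical names, and blocksizes are passed as multiples of blocksize_min.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback: encoded size in bytes of one frame covering
 * samples [offset, offset + size). A real frontend would trial-encode. */
typedef size_t (*frame_cost_fn)(size_t offset, size_t size);

/* Made-up cost model for illustration only: 16 bytes of overhead per frame,
 * one byte per sample, and a 1000-byte penalty when a frame longer than
 * 1024 samples contains a pretend transient at sample 1500. */
static size_t toy_cost(size_t offset, size_t size)
{
	size_t penalty = 0;
	if (size > 1024 && offset <= 1500 && 1500 < offset + size)
		penalty = 1000;
	return 16 + size + penalty;
}

/* best[i] is the smallest encoded size of the first i units of blocksize_min
 * samples. mults[] lists the allowed blocksizes in those units. Returns
 * SIZE_MAX if allocation fails or no combination covers the input. */
size_t peakset_best_cost(frame_cost_fn cost, size_t total_samples, size_t blocksize_min,
                         const size_t *mults, size_t n_mults)
{
	size_t units, i, m, result;
	size_t *best;

	assert(blocksize_min > 0 && n_mults > 0);
	units = (total_samples + blocksize_min - 1) / blocksize_min;
	best = malloc((units + 1) * sizeof *best);
	if (!best)
		return SIZE_MAX;
	best[0] = 0;
	for (i = 1; i <= units; i++) {
		best[i] = SIZE_MAX;
		for (m = 0; m < n_mults; m++) {
			size_t k = mults[m], start, end, c;

			if (k == 0 || k > i || best[i - k] == SIZE_MAX)
				continue;
			start = (i - k) * blocksize_min;
			/* the last frame may be partial */
			end = i * blocksize_min < total_samples ? i * blocksize_min : total_samples;
			c = best[i - k] + cost(start, end - start);
			if (c < best[i])
				best[i] = c;
		}
	}
	result = best[units];
	free(best);
	return result;
}

/* Effort as defined in the text: the sum of the normalised blocksizes. */
size_t peakset_effort(const size_t *mults, size_t n_mults)
{
	size_t sum = 0, m;

	for (m = 0; m < n_mults; m++)
		sum += mults[m];
	return sum;
}
```

Each (start, blocksize) pair is priced exactly once, so a blocksize of k units costs roughly k full encodes of the input; for the contiguous sets 1..n that sum is n(n+1)/2, giving 3, 6, 10, 21, 36 and 45 for n = 2, 3, 4, 6, 8 and 9.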

Code:

These all test and encode using -8p
Algorithm blocksize_count blocksize_min blocksize_max Effort Track 1  Track 2  Track 3  Track 4  Track 5  Track 6  Track 7  Track 8  Track 9 Track 10 Album total
peakset                 2          2304          4608      3 5133496 30904160 30296881 33015062 27103528 29457496  9751163 25322968 23068610 37011732   251065096
peakset                 3          1536          4608      6 5130680 30892525 30282259 32967826 27043482 29431242  9747195 25319686 23041505 36960567   250816967
peakset                 4          1152          4608     10 5128519 30883325 30270267 32936824 27004215 29415022  9744598 25317502 23021762 36922440   250644474
peakset                 6           768          4608     21 5127185 30876851 30262990 32909669 26975406 29400776  9742235 25315041 23003911 36893864   250507928
peakset                 8           576          4608     36 5126236 30869818 30256647 32893712 26955466 29391818  9740845 25312890 22993305 36875683   250416420
peakset                 9           512          4608     45 5125504 30867471 30254046 32888554 26948670 29387396  9739849 25311263 22989475 36870104   250382332
peakset (non-subset)    9          1024          9216     45 5105631 30870712 30247986 32912334 26979518 29380381  9691365 25176978 22975422 36900952   250241279

A good strategy that builds on this might be to use peakset at some reasonable effort level to get a reasonable representation, then tweak that representation further with fractional blocksizes that weren't available to peakset. For example, tweaking where adjacent blocks are partitioned (a particular pair 1024,3072 might be better represented as 1360,2736) can be done without disrupting any other blocks. So can merging adjacent blocks that sum to greater than blocksize_max (but <=4608 if subset) if that turns out to be more efficient.

It's also not necessary to use the same compression settings when brute forcing the blocksizes as those used to encode the final output. It wouldn't be fully optimal, but if all compression settings that could be turned down without massively altering the permutation of frame sizes chosen were turned down, we could go from something like effort at "-8p"=45+1 to effort at "-5"=45 plus effort at "-8p"=1. So the question is which settings can be reduced without making the cheaper compression vastly misrepresentative of the expensive compression? LPC order must be important, whereas apodization, LPC quantization and rice partition order might not be so important if they perform similarly across the board, etc. Here's a rough test that hasn't been tuned but shows that it might work (the top one encodes 10 times with -8p for analysis and once more for output; the bottom one encodes 45 times at -5 for analysis and once at -8p for output):

Code:

test_setting encode_setting time_est bcount  bmin bmax effort                                                                                          Total
        -8p            -8p      1750      4  1152 4608     10  5128519 30883325 30270267 32936824 27004215 29415022 9744598 25317502 23021762 36922440 250644474
        -5             -8p       673      9   512 4608     45  5129211 30886850 30269013 32896086 26977882 29408331 9746598 25325718 23022185 36885304 250547178

Picking contiguous normalised blocksizes (256,512,768...) is probably not optimal. It's cheap to include those close to blocksize_min, and blocksizes around 4k are important as that's where most blocks want to be. Some input likes 6k to 8k and 1k to 2k so they should be represented well too, but we can probably be sparse at very high blocksizes without losing too much of value (maybe only go up to 6k and rely on a merge step as described above for the small amount of input that likes 8k+). Lots of ways to potentially optimise, so little time.

Re: Multithreading

Reply #24 – 2022-11-08 20:00:09

In the name of trying to find the right levers to pull for reasonably efficient somewhat-bruteforce settings:

n tweak passes as a final step before output. Each tweak pass tries successively smaller tweaks to the partition location between adjacent frames starting at half of blocksize_min. This may not be optimal
Different compression settings for analysis and final output
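One reading of the tweak step, for a single pair of adjacent frames, is a halving search on the shared boundary: try moving it left or right by delta, keep any move that shrinks the pair, then halve delta, starting from blocksize_min/2. The sketch below is an interpretation under that assumption and not flaccid's code; `tweak_boundary`, `frame_cost_fn` and the made-up `toy_cost` model (which simply prefers a frame boundary at sample 1360) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: encoded size in bytes of one frame covering
 * samples [offset, offset + size). A real frontend would trial-encode. */
typedef size_t (*frame_cost_fn)(size_t offset, size_t size);

/* Made-up cost model for illustration only: a frame containing a pretend
 * transient at sample 1360 pays more the deeper the transient sits inside. */
static size_t toy_cost(size_t offset, size_t size)
{
	const size_t transient = 1360;
	size_t end = offset + size, penalty = 0;

	if (offset < transient && transient < end) {
		size_t left = transient - offset, right = end - transient;
		penalty = left < right ? left : right;
	}
	return 16 + size + penalty;
}

/* One tweak pass over the frames [start, *boundary) and [*boundary, end).
 * Updates *boundary in place and returns the combined cost of the pair. */
size_t tweak_boundary(frame_cost_fn cost, size_t start, size_t *boundary, size_t end,
                      size_t blocksize_min, size_t blocksize_max)
{
	size_t b = *boundary, best, delta;

	assert(start < b && b < end);
	best = cost(start, b - start) + cost(b, end - b);
	for (delta = blocksize_min / 2; delta > 0; delta /= 2) {
		int dir;

		for (dir = 0; dir < 2; dir++) {
			size_t nb, c;

			if (dir == 0) {
				if (b - start <= delta)
					continue; /* left frame would vanish */
				nb = b - delta;
			} else {
				if (end - b <= delta)
					continue; /* right frame would vanish */
				nb = b + delta;
			}
			if (nb - start > blocksize_max || end - nb > blocksize_max)
				continue; /* keep both frames within the allowed size */
			c = cost(start, nb - start) + cost(nb, end - nb);
			if (c < best) {
				best = c;
				b = nb;
			}
		}
	}
	*boundary = b;
	return best;
}
```

With the toy model, starting from the pair 1024,3072 the pass settles on 1360,2736. A real implementation would also need to respect the format's minimum blocksize; each trial costs two frame encodes, which is presumably why the tables report fractional tweak effort.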

The following are the best subset results from a wide range of settings (all of them have no more-efficient competitor that's faster, except the starred result which is the previous best result from the post above). CPU is a single-threaded estimate using clock_t (tests actually done on quad core), all results encode track one of the album with -8p, effort is the number of times input was encoded for analysis and tweak (and once more for output) (analysis uses analysis compression setting, tweak and output use output compression setting).

Code:

     Size  CPU time    mode analysis tweak output effort_analysis effort_tweak Blocks used during analysis
  5124075  94.15798 peakset       8p    10     8p              36       11.942 576,1152,1728,2304,2880,3456,4032,4608
  5124583  84.53425 peakset       8p     4     8p              36        4.201 576,1152,1728,2304,2880,3456,4032,4608
  5124877  82.16994 peakset       8p     2     8p              36        1.979 576,1152,1728,2304,2880,3456,4032,4608
  5124881  68.70227 peakset       8p    10     8p              21       11.858 768,1536,2304,3072,3840,4608
  5125344  51.31762 peakset       8     10     8p              45       13.384 512,1024,1536,2048,2560,3072,3584,4096,4608
  5125376  38.69915 peakset       8     10     8p              36       11.373 576,1152,1728,2304,2880,3456,4032,4608
* 5125504  94.82512 peakset       8p     0     8p              45        0     512,1024,1536,2048,2560,3072,3584,4096,4608
  5125705  33.13277 peakset       8      4     8p              45        4.985 512,1024,1536,2048,2560,3072,3584,4096,4608
  5125798  31.44276 peakset       8      3     8p              45        3.672 512,1024,1536,2048,2560,3072,3584,4096,4608
  5125867  25.73276 peakset       8      4     8p              36        4.007 576,1152,1728,2304,2880,3456,4032,4608
  5125980  24.87177 peakset       8      3     8p              36        2.898 576,1152,1728,2304,2880,3456,4032,4608
  5126643  21.18103 peakset       8      1     8p              36        0.879 576,1152,1728,2304,2880,3456,4032,4608
  5126657  20.71826 peakset       8      4     8p              21        3.548 768,1536,2304,3072,3840,4608
  5126743  19.22586 peakset       8      3     8p              21        2.599 768,1536,2304,3072,3840,4608
  5126976  17.76938 peakset       8      2     8p              21        1.673 768,1536,2304,3072,3840,4608
  5127356  17.26785 peakset       5      3     8p              45        4.179 512,1024,1536,2048,2560,3072,3584,4096,4608
  5127473  16.18667 peakset       8      1     8p              21        0.771 768,1536,2304,3072,3840,4608
  5127622  14.59223 peakset       5      3     8p              36        3.722 576,1152,1728,2304,2880,3456,4032,4608
  5127681  12.57558 peakset       5      4     8p              21        4.404 768,1536,2304,3072,3840,4608
  5127790  11.35689 peakset       5      3     8p              21        3.22  768,1536,2304,3072,3840,4608
  5128042   9.23730 peakset       5      2     8p              21        2.084 768,1536,2304,3072,3840,4608
  5128588   6.51163 peakset       5      1     8p              21        0.99  768,1536,2304,3072,3840,4608
  5129239   5.17407 peakset       5      2     8p              10        1.549 1152,2304,3456,4608
  5129983   4.10213 peakset       5      1     8p              10        0.731 1152,2304,3456,4608
  5131055   2.91449 peakset       5      0     8p              10        0     1152,2304,3456,4608
  5132505   2.63151 peakset       5      1     8p               3        0.393 2304,4608
  5132757   2.32122 peakset       5      0     8p               6        0     1536,3072,4608
  5135107   1.93578 peakset       5      0     8p               3        0     2304,4608
  5138542   1.75692 peakset       0      0     8p               3        0     2304,4608
  5146517           flac                       8p                              4608
  5149423           flac                       8p                              4096

Greed and chunk don't get a look in mostly because tweak passes haven't been implemented there, I'd have expected at least one chunk result to make it lower down the list regardless but it appears peakset+tweak gets a more efficient result faster than pure chunk across the board. Chunk+tweak might make for a good mid-table result where tweak is limited to adjacent frames within chunks, TODO.

A merge pass after the tweak passes looking for efficient blocksizes >blocksize_max, would benefit lax encodes. It could mildly improve subset encodes IFF blocksize 4608 was omitted from the analysis stage (which probably shouldn't be omitted).
