Text only | Text with Images

HydrogenAudio

Lossless Audio Compression => FLAC => Topic started by: cid42 on 2022-10-25 12:52:21

Title: Multithreading
Post by: cid42 on 2022-10-25 12:52:21

It was brought up 15 years ago and someone made a toy example ( https://www.akalin.com/parallelizing-flac-encoding ). Seems the right time to rehash the discussion, intel/AMD both have 32 thread consumer parts and that number is going to keep growing.

As is tradition, instead of just starting the discussion I made a toy example of how I imagine a palatable interface in libFLAC would work. It boils down to exposing a function (with variants for input types) that lets a frontend feed a frames worth of samples and receive an encoded frame in return. The PoC API changes are here ( https://github.com/chocolate42/flac ), a PoC frontend (probably Linux only ATM) that uses it to multi-thread is here ( https://github.com/chocolate42/flaccid ).

tl;dr these are added, no existing function is altered:

typedef struct {
	FLAC__StreamEncoder *stream_encoder;/* should be protected in a proper implementation? */
	int dummy; /* guarantee that there is a distinction between stream and static encoding */
} FLAC__StaticEncoder;

FLAC_API FLAC__StaticEncoder *FLAC__static_encoder_new(void);
FLAC_API void FLAC__static_encoder_delete(FLAC__StaticEncoder *encoder);
FLAC_API FLAC__StreamEncoderInitStatus FLAC__static_encoder_init(FLAC__StaticEncoder *encoder);
FLAC_API FLAC__StreamEncoder *FLAC__static_encoder_get_underlying_stream(FLAC__StaticEncoder *encoder);/* for if stream protected */
FLAC_API FLAC__bool FLAC__static_encoder_process_frame_interleaved(FLAC__StaticEncoder *encoder, const FLAC__int32 buffer[], uint32_t samples, uint32_t current_frame, void *outbuf, size_t *outbuf_size);
/* other *_process_frame_* functions for flexible input types */

The API doesn't do threading, but with this a frontend can have multiple static encoder instances to encode frames in parallel. flaccid is hacked together with some questionable decisions (it is the definition of toy, raw cdda input only and just functional enough for an example), even so it trivially maxes out my quad core with a near-linear speedup wrt encode threads (note it always has a separate thread for output and another for MD5, which is why the 1 worker version sees a speedup). The _process_frame_ functions do not do MD5 hashing as that is a serial operation, that is left to the frontend.

That's not the only way multithreading could be enabled, but it is by far the simplest while still being "safe" by keeping stream/static distinct. I'd be interested how others would do it.

Title: Re: Multithreading
Post by: ktf on 2022-10-25 19:32:49

Quote from: cid42 on 2022-10-25 12:52:21

I'd be interested how others would do it.

The most obvious way to multithread is processing multiple files at once of course. A front end could simply invoke several stream encoder instances. The upside of this is that it is very easy to make thread safe, a downside might be the scattered disc I/O and that it doesn't work when a single file is being processed.

What you propose is interesting, and it should be possible to get it thread safe. Will still need a lot of testing with for example TSan (https://clang.llvm.org/docs/ThreadSanitizer.html).

edit: another possibility is through the subframe processing. FLAC brute-forces left-right decorrelation. This means for stereo it processed 4 different audio signals, L, R, L+R and L-R. This could be another possibility to do multithreading.

Title: Re: Multithreading
Post by: Bogozo on 2022-10-25 19:38:37

There exists multithreaded FLAC encoder using Fiber Pool (not that i know what Fiber Pool is) - https://www.rarewares.org/files/mp3/fpMP3Enc.zip (in archive together with multithreaded LAME)

Title: Re: Multithreading
Post by: binaryhermit on 2022-10-25 19:45:48

I'm somewhat skeptical this is a huge benefit given that FLAC encodes at many times realtime singlethreaded on anything remotely modern.
There's probably a very limited use case where this makes a decent amount of sense, when you're encoding a really long single file, otherwise

Quote from: ktf on 2022-10-25 19:32:49

The most obvious way to multithread is processing multiple files at once of course

Title: Re: Multithreading
Post by: cid42 on 2022-10-26 13:00:12

Quote from: ktf on 2022-10-25 19:32:49

Quote from: cid42 on 2022-10-25 12:52:21
I'd be interested how others would do it.
The most obvious way to multithread is processing multiple files at once of course. A front end could simply invoke several stream encoder instances. The upside of this is that it is very easy to make thread safe, a downside might be the scattered disc I/O and that it doesn't work when a single file is being processed.

That could be made mildly better than a script that runs n instances of ./flac by trying to load-balance the input probably by length as an estimate of encode time. A script could do that even if things like gnu parallel can't be coaxed into doing it for you.

Quote from: ktf on 2022-10-25 19:32:49

edit: another possibility is through the subframe processing. FLAC brute-forces left-right decorrelation. This means for stereo it processed 4 different audio signals, L, R, L+R and L-R. This could be another possibility to do multithreading.

I thought about intra-frame SMT, but can't see how it can be done in a way that would be palatable. One way could be to expose a lot of the internals for a frontend to deal with piecemeal which isn't ideal. Another way builds multithreading into libFLAC itself which probably can't be done portably, would be a heavy burden to maintain (probably fine-grained as coarse is less efficient the smaller the unit is).

Quote from: ktf on 2022-10-25 19:32:49

What you propose is interesting, and it should be possible to get it thread safe. Will still need a lot of testing with for example TSan (https://clang.llvm.org/docs/ThreadSanitizer.html).

The main benefit of exposing frame encoding is that the burden of SMT is on the frontend and it's coarse, coarse-grained SMT is much easier to do correctly (a few mutexes to ensure an encoder encodes when it can and the writer writes when it can and that's about it, my implementation currently does this hackily with an atomic integer instead and is not a shining example of how to do it right).

Quote from: Bogozo on 2022-10-25 19:38:37

There exists multithreaded FLAC encoder using Fiber Pool (not that i know what Fiber Pool is) - https://www.rarewares.org/files/mp3/fpMP3Enc.zip (in archive together with multithreaded LAME)

Fibers are apparently user-space threading within a thread, at a guess a fiber pool sounds like an implementation of M:N scheduling. I can't get the binaries to work in wine to test. They may be using fibers to separate I/O/MD5/encoding for gains similar to the 1 worker flaccid in the graph, beyond that I can't see how they can SMT without customising libFLAC (unless old versions of libFLAC exposed more than it does now).

Quote from: binaryhermit on 2022-10-25 19:45:48

I'm somewhat skeptical this is a huge benefit given that FLAC encodes at many times realtime singlethreaded on anything remotely modern.
There's probably a very limited use case where this makes a decent amount of sense, when you're encoding a really long single file, otherwise
Quote from: ktf on 2022-10-25 19:32:49
The most obvious way to multithread is processing multiple files at once of course

Parallel-files is basically multiprocessing (technically multithreading if you get a frontend to do it via threads instead of a script via processes, semantics IMO), it's fine but not ideal even when you have multiple files. You're right that FLAC is already quick on modern hardware so it's less beneficial to SMT than with a heavier codec, but if it's as simple to expose just enough functionality to make SMT possible as it appears I don't see a compelling reason not to. Stronger compression levels become more enticing, Risc-V and other low power devices would see more of a benefit than x86_64 as the cores/power-per-core is skewed very differently.

Having a simple static option built into libFLAC could allow a frontend to do more than SMT. How about an --adapt option (with or without SMT) that tries to maximise throughput by balancing I/O speeds with encoder speed? A frontend could have multiple static encoders initialised with different compression levels (ensuring blocksize is identical), and choose which one to encode the next frame with based on which element was the biggest bottleneck in recent history (or whatever metric makes sense). Pair that with a minimum and/or maximum compression level to get some guarantees about the output. Another possibility would be --target bitrate for vbr-like adapting, again paired with min/max compression level thresholds and with or without SMT. Both could exist albeit with debatable merit.

Title: Re: Multithreading
Post by: ktf on 2022-10-26 21:20:31

Quote from: cid42 on 2022-10-26 13:00:12

it's fine but not ideal even when you have multiple files

Please elaborate. People have gotten great results with that kind of multithreading/multiprocessing.

Title: Re: Multithreading
Post by: cid42 on 2022-10-27 09:14:11

As you say disk IO particularly for spinning rust, then I'm mainly thinking occupancy issues when the file count is low enough to have cores sitting idle (say you're multiprocessing an album or a discography that's ripped file-per-album). For example multiproc an album with 12 similar tracks on 8 core, 4 cores sit idle while the last 4 tracks process. Or an album with 8 tracks on an 8 core but the last track is 20 minutes long, most cores are idle most of the time. Very minor but not ideal.

A special case where single file is inevitable is ripping from disc and encoding on the fly.

Title: Re: Multithreading
Post by: Porcus on 2022-10-27 11:12:03

Quote from: cid42 on 2022-10-27 09:14:11

A special case where single file is inevitable is ripping from disc and encoding on the fly.

Not sure if that is a big enough problem to bother much about, let's think aloud:
Disregarding those very few CD spinners that would read by multiple lasers (not sure if they even would for audio extraction!), CD spinners boast of up to 52x speed and reach those at the very end of an 80 minutes disc. From experience, I rarely see ripping applications report anything like that in progress bars, but then I used secure ripping modes ... so let me just take that number at face value.
In the 2015 revision (http://audiograaf.nl/losslesstest/Lossless%20audio%20codec%20comparison%20-%20revision%204.pdf) of ktf's performance tests, flac -8 would encode CDDA faster than that in a single thread of an AMD A4-3400 launched in 2011.

So OK, if you run -8pr8 and above upon ripping, sure.

As for DVD, that is a different can of worms. Looks like they can extract data at 9x this speed, but I have no idea how fast they can deliver audio content - nor whether the applications in question will encode on-the-fly or get the thing to HD first and extract later. (That is of course not a theoretical bound just because applications in practice do it such a way, so you can wave it off in a discussion on possible benefits.)

Title: Re: Multithreading
Post by: binaryhermit on 2022-10-27 14:10:36

Quote from: Porcus on 2022-10-27 11:12:03

In the 2015 revision (http://audiograaf.nl/losslesstest/Lossless%20audio%20codec%20comparison%20-%20revision%204.pdf) of ktf's performance tests, flac -8 would encode CDDA faster than that in a single thread of an AMD A4-3400 launched in 2011.

And isn't the A4-3400 not exactly particularly high-end for its day?

Title: Re: Multithreading
Post by: MrRom92 on 2022-10-27 15:10:46

Not saying this is useless and more power to anyone who goes down the path of making this feasible, but I have to say even only using spinning rust I am very impressed with the speed/efficiency of multi-threading batch conversions using F2K. So I think this really ultimately has limited utility unless there is some other use case that hasn’t come to mind for me yet.

For encoding very lengthy singular files - maybe FLACCL might be a speedy option. If I’m not mistaken doesn’t this take advantage of multiple GPU cuda cores? Similar concept, different approach perhaps?

Title: Re: Multithreading
Post by: Porcus on 2022-10-27 15:12:20

That is a fair characterization: https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_units#Lynx:_%22Llano%22_(2011)

But part of the cheepniz was irrelevant for the test: only two cores. The test was performed on one, so the number of inactive cores "would only contribute to the price tag".
True, it had half the L2 cache of the more expensive models and so-called "Turbo" - but as for the latter, the absence of Turbo might (for all that I know) only make results more reliable for a test run over long time where cooling would become a constraint.

Title: Re: Multithreading
Post by: Porcus on 2022-10-27 15:13:56

Quote from: binaryhermit on 2022-10-27 14:10:36

And isn't the A4-3400 not exactly particularly high-end for its day?

That is a fair characterization: https://en.wikipedia.org/wiki/List_of_AMD_accelerated_processing_units#Lynx:_%22Llano%22_%282011%29
Edit: Well it is a 65 watt thing, so compared to the "mobile" ones that were back in the day also used in desktops, it is not their bottom range. It seems to be a not-so-high-end in their sure-you-didn't-buy-this-to-save-power range.
So ... not that bad? And part of the cheepniz was irrelevant for the test: only two cores. The test was performed on one, so the number of inactive cores "would only contribute to the price tag".
True, it had half the L2 cache of the more expensive models and so-called "Turbo" - but as for the latter, the absence of Turbo might (for all that I know) only make results more reliable for a test run over long time where cooling would become a constraint.

As for FLACCL: yes, it is fast on a decently expensive GPU, and even faster on an indecently expensive one :))

Title: Re: Multithreading
Post by: cid42 on 2022-10-27 17:55:12

I iterated on the PoC a little (again barely tested), flaccid can now do variable blocksize encoding in a brute-force kind of way. It's quite limited and non-optimal but that's to allow an easy implementation:

min_blocksize and max_blocksize are defined, min must be a power-of-two-multiple of max
Input is processed in chunks, a chunk is the size of max_blocksize
The entire chunk is processed fully in each of the blocksizes used and the best combination of frames are chosen
The blocksizes used are max_blocksize, max_blocksize/2, max_blocksize/4, max_blocksize/8, ... , min_blocksize. So it's like partitioning a chunk based on the optimal nodes of a perfectly-shaped binary tree, where each node is one of the processed frames and each nodes children partition the parent exactly in half.

For example a min max of 1024 4096 would encode the entire input 3 times to be able to pick the best out of 1024/2048/4096 blocksizes. Min max of 256 8192 encodes entire input 6 times to choose between best of 256/512/1024/2048/4096/8192.

cdda version of nine inch nail's the slip used as a test (open license https://archive.org/details/nine_inch_nails_the_slip ), compared to flac 1.4.2, -8p, flac uses blocksize 4096, flaccid min 512 max 8192. They all passed flac -t, flaccid did 5x the encoding of flac in this test:

Size
 5141054 01_999999.raw.8p.min4096.max4096.flac
 5114191 01_999999.raw.8p.min512.max8192.flaccid.flac
30929952 02_1000000.raw.8p.min4096.max4096.flac
30898003 02_1000000.raw.8p.min512.max8192.flaccid.flac
30329800 03_letting_you.raw.8p.min4096.max4096.flac
30281386 03_letting_you.raw.8p.min512.max8192.flaccid.flac
33140258 04_discipline.raw.8p.min4096.max4096.flac
32937577 04_discipline.raw.8p.min512.max8192.flaccid.flac
27283598 05_echoplex.raw.8p.min4096.max4096.flac
27034633 05_echoplex.raw.8p.min512.max8192.flaccid.flac
29520427 06_head_down.raw.8p.min4096.max4096.flac
29432222 06_head_down.raw.8p.min512.max8192.flaccid.flac
 9770865 07_lights_in_the_sky.raw.8p.min4096.max4096.flac
 9714938 07_lights_in_the_sky.raw.8p.min512.max8192.flaccid.flac
25352359 08_corona_radiata.raw.8p.min4096.max4096.flac
25206304 08_corona_radiata.raw.8p.min512.max8192.flaccid.flac
23137304 09_the_four_of_us_are_dying.raw.8p.min4096.max4096.flac
23035775 09_the_four_of_us_are_dying.raw.8p.min512.max8192.flaccid.flac
37134638 10_demon_seed.raw.8p.min4096.max4096.flac
36945942 10_demon_seed.raw.8p.min512.max8192.flaccid.flac

flaccid might work on windows now but I doubt it, haven't figured out how to cross-compile libFLAC for windows but other groundwork has been done (mbedtls is implemented as an alternative to openssl, and a wrapper for mmap that uses MS-equivalent has been added untested).

Title: Re: Multithreading
Post by: ktf on 2022-10-28 07:05:11

Quote from: cid42 on 2022-10-27 17:55:12

I iterated on the PoC a little (again barely tested), flaccid can now do variable blocksize encoding in a brute-force kind of way. It's quite limited and non-optimal but that's to allow an easy implementation:

Very interesting, rather high compression gains actually. Have you run the test suite with make check (or ctest if you're using cmake)?

Title: Re: Multithreading
Post by: cid42 on 2022-10-28 10:12:31

make check passed fully but that's to be expected. Nothing of the existing API or anything the API uses has been changed, specifically so that this hacky PoC couldn't break anything except potentially itself. There's two functions copied into *static_ variants which could be merged with a proper implementation, process_frame_ and process_subframes_.

process_frame_ replaces the default write with directly exposing the buffer, and in this hack calls the appropriate process_subframes_static_. Seems fine as a solution as the function is small.

Ideally we'd use process_subframes_ directly, but as the concept of variable blocksize isn't built in to FLAC__StreamEncoder (at least not that I could see) it was hacked in by copying process_subframes_ to process_subframes_static_ to take an extra arg is_variable_blocksize. The proper way to do it would be adding the variable flag to FLAC__StreamEncoder, which would be necessary anyway if StreamEncoder one day supported variable blocksize. The plumbing in the bitstream is already present to output current_sample instead of current_frame, this was the only change needed to use it.

Title: Re: Multithreading
Post by: cid42 on 2022-10-28 16:06:54

Implemented non-two-stride to try and eek out some more gains, very minimal improvement and mostly still from stride=2. It was only at effort level 3 (input encoded fully 3 times) that a stride other than 2 was the top overall winner (with blocksizes 1024 3072 9216, they just so happen to be close to the blocksizes that should be represented with this input and possibly more generally). Note the implementation is flawed in that the final chunk is not subdivided if it's partial, meaning something like min=128,max=16384,stride=2 can sometimes beat min=128,max=32768,stride=2 because the partial chunk of the latter is encoded large when the former splits the penultimate chunk for better gains.

But this is going full tangent, from now on if there's a development worth mentioning that isn't directly multithreading I'll make another thread.

Title: Re: Multithreading
Post by: Porcus on 2022-10-28 16:44:35

Brute-forcing various block sizes, is that something that could be multi-threaded?
Assuming powers of two: work on the following 4096 samples is split into the "1x4096 thread", the "2x2048" and the "4x1024" (with the fourth thread to rule them all)?

Title: Re: Multithreading
Post by: cid42 on 2022-10-28 19:51:49

Quote from: Porcus on 2022-10-28 16:44:35

Brute-forcing various block sizes, is that something that could be multi-threaded?
Assuming powers of two: work on the following 4096 samples is split into the "1x4096 thread", the "2x2048" and the "4x1024" (with the fourth thread to rule them all)?

Yes, the chunk encoding in flaccid does this and seems like a reasonable way to do it in an SMT-friendly way (at least to try and maximise occupancy, it's not smart and is still linearly chunked just into something possibly pretty big). The split you describe is defined in flaccid with blocksize_min=1024, blocksize_max=4096, blocksize_stride=2, the spreadsheet above has results for this set with the nine inch nails album, it's the first row. In flaccid terms the chunk size would be 4096, and the possible ways to store the chunk would be 4096, 2048/2048, 2048/1024/1024, 1024/1024/2048, 1024/1024/1024/1024. On a quad core especially with hyperthreading (I should stop using SMT for multithreading, really SMT is the generic term for hyperthreading), you'd be better off using 4 blocksizes, maybe add 512 or 8192 to the others.

The benefit of chunks is that the multithreading can be done at the chunk level, you'd do this for the same reason multithreading at the frame level is nice (you know where the chunk boundaries are and can queue up embarrassingly parallel work). A downside of chunks is (probably minor, TBD) inefficiencies because you aren't free to pick any blocksize for a given frame. Take the above example, if the first blocksize in a chunk is 1024 the next one must also be 1024, because we have only calculated blocksize 2048 at offsets 0 and 2048 (not the offset 1024 we are at).

The next thing to try is a greedy algorithm, where we try a set of blocksizes for the next frame and pick the one that has the best encoded bits per sample of input. This is an obvious algorithm but comes with drawbacks when it comes to multithreading. We don't know where the following frame starts so can't do that simultaneously, so have to multithread the set of encodes for one frame at a time and wait for all to complete to decide how to proceed. The encodes by definition cannot take similar times to complete as they cover a spectrum of blocksizes, so you have a grab bag of work taking 1/2/4/8/16/.../8192 of time and need to be careful about the choice of blocksizes and how they're executed if you want to minimise occupancy. It's not a dealbreaker but if the difference to chunk encoding turns out to be a rounding error then a greedy algorithm is less interesting IMO.

Then there's adapting the blocksize smartly by predicting what the best blocksize will be and trying that first, but coming up with a good algorithm is probably beyond me. There's many different strategies that could work, anyone know offhand some standard ways it is done?

Title: Re: Multithreading
Post by: Porcus on 2022-10-28 20:03:22

Quote from: cid42 on 2022-10-28 19:51:49

Then there's adapting the blocksize smartly by predicting what the best blocksize will be and trying that first, but coming up with a good algorithm is probably beyond me. There's many different strategies that could work, anyone know offhand some standard ways it is done?

Wild guess: Try the "previous" optimal one first, and its neighbours?

Title: Re: Multithreading
Post by: ktf on 2022-10-28 20:57:31

Quote from: cid42 on 2022-10-28 19:51:49

There's many different strategies that could work, anyone know offhand some standard ways it is done?

This is exactly the reason variable block sizes never took of in FLAC. It is a rather difficult problem.

One lead I'd like to follow at some point is this: https://www.mathworks.com/help/signal/ref/findchangepts.html Especially the Audio File Segmentation example looks very promising.

Title: Re: Multithreading
Post by: doccolinni on 2022-10-29 00:49:48

Maybe look at the way LAME determines whether to use long or short blocks and try to adapt the approach, or at least take some pointers.

After all, that is probably the most well-tuned approach for block length switching. (I know that that's lossy whereas FLAC is lossless, but there ought to be at least some relevance and connection, I think.)

Title: Re: Multithreading
Post by: rutra80 on 2022-10-29 10:27:31

In LAME it is to choose between better temporal or frequency resolution, I don't think it has much to do with FLAC block size...

Title: Re: Multithreading
Post by: cid42 on 2022-10-29 17:47:23

Tried the most naive greedy algorithm possible for variable blocksize, here's the nin comparison with blocksizes 1024 2048 4096:

greed 5142283 30925019 30314397 32984879 27107991 29482757 9773746 25360038 23133439 37026662 total 251251211
chunk 5138754 30907511 30295140 32961194 27048226 29452718 9767257 25359384 23066576 36966063 total 250962823

The greedy algorithm has worse sizes, worse effort (>3 compared to 3 for the chunk algorithm thanks to overlap every time the biggest blocksize isn't chosen) and is harder to multithread. Seems like a failure all round but it's a small sample. It's surprising that even chunk's paltry pseudo-lookahead behaviour (of to the end of the current chunk which is only 4096 samples large here) behaves better than greedy's ability to cross chunk boundaries (by not having chunks).

Quote from: ktf on 2022-10-28 20:57:31

Quote from: cid42 on 2022-10-28 19:51:49
There's many different strategies that could work, anyone know offhand some standard ways it is done?
This is exactly the reason variable block sizes never took of in FLAC. It is a rather difficult problem.

One lead I'd like to follow at some point is this: https://www.mathworks.com/help/signal/ref/findchangepts.html Especially the Audio File Segmentation example looks very promising.

Choosing a blocksize without brute force sounds very hard, heuristics of the audio itself seems like a dark art I'm not brave enough for.

Title: Re: Multithreading
Post by: cid42 on 2022-11-01 14:19:46

Implemented a new algorithm that's optimal for a given set of blocksizes, from now on called peakset. When input, compression settings, libFLAC and the block set are equal it is the best representation possible, with the extremely minor caveat that the partial frame at the end could possibly be split saving a few extra bytes maybe.

The only limitation is that all blocksizes must be a multiple of blocksize_min, this means that all frames regardless of size will start on a boundary that's a multiple of blocksize_min. This works fine, but being optimal it does take a lot of effort as the number of blocksizes increases and how much bigger they are than blocksize_min. If you normalise all blocksizes by dividing by blocksize_min, the effort for a given set of blocks is their sum. This means blocksizes close to blocksize_min are relatively cheap to include, and if you include all multiples of blocksize_min up to blocksize_max the effort is given by (blocksize_count*(blocksize_count+1))/2. It's not novel, it's an obvious way you might constrain the set into something fully brute-forceable. It started off as a generalisation of chunk encoding with the window the size of a chunk or a bit bigger, but as the window size increases effort increases logarithmically so it made more sense to just go full brutus and have the window size be the size of the input.

These all test and encode using -8p
Algorithm blocksize_count blocksize_min blocksize_max Effort Track 1  Track 2  Track 3  Track 4  Track 5  Track 6  Track 7 Track 8   Track 9 Track 10 Album total
peakset                 2          2304          4608      3 5133496 30904160 30296881 33015062 27103528 29457496  9751163 25322968 23068610 37011732   251065096
peakset                 3          1536          4608      6 5130680 30892525 30282259 32967826 27043482 29431242  9747195 25319686 23041505 36960567   250816967
peakset                 4          1152          4608     10 5128519 30883325 30270267 32936824 27004215 29415022  9744598 25317502 23021762 36922440   250644474
peakset                 6           768          4608     21 5127185 30876851 30262990 32909669 26975406 29400776  9742235 25315041 23003911 36893864   250507928
peakset                 8           576          4608     36 5126236 30869818 30256647 32893712 26955466 29391818  9740845 25312890 22993305 36875683   250416420
peakset                 9           512          4608     45 5125504 30867471 30254046 32888554 26948670 29387396  9739849 25311263 22989475 36870104   250382332
peakset (non-subset)    9          1024          9216     45 5105631 30870712 30247986 32912334 26979518 29380381  9691365 25176978 22975422 36900952   250241279

A good strategy that builds on this might be to use peakset at some reasonable effort level to get a reasonable representation, then tweak that representation further with fractional blocksizes that weren't available to peakset. For example tweaking where adjacent blocks are partitioned (a particular pair 1024,3072 might be better representated as 1360,2736) can be done without disrupting any other blocks, so can merging adjacent blocks that sum to greater than blocksize_max (but <=4608 if subset) if that turns out to be more efficient.

It's also not necessary to use the same compression settings when brute forcing the blocksizes as those used to encode the final output. It wouldn't be fully optimal, but if all compression settings that could be turned down without massively altering the permutation of frame sizes chosen were turned down, we could go from something like effort at "-8p"=45+1 to effort at "-5"=45 effort at "-8p"=1. So the question is which settings can be reduced without making the cheaper compression vastly misrepresentative of the expensive compression? LPC order must be important, and apodization, LPC quantization and rice partition order might not be so important if they perform similarly across the board, etc. Here's a rough test that hasn't been tuned but shows that it might work (the top one encodes 10 times with -8p for analysis, once more for output, the bottom one encodes 45 times at -5 for analysis, once at -8p for output):

test_setting encode_setting time_est bcount  bmin bmax effort                                                                                          Total
        -8p            -8p      1750      4  1152 4608     10  5128519 30883325 30270267 32936824 27004215 29415022 9744598 25317502 23021762 36922440 250644474
        -5             -8p       673      9   512 4608     45  5129211 30886850 30269013 32896086 26977882 29408331 9746598 25325718 23022185 36885304 250547178

Picking contiguous normalised blocksizes (256,512,768...) is probably not optimal. It's cheap to include those close to blocksize_min, and blocksizes around 4k are important as that's where most blocks want to be, some input likes 6k to 8k and 1k to 2k so they should be represented well too, but we can probably be sparse at very high blocksizes without loosing too much of value (maybe only go up to 6k and rely on a merge step as described above for the low number of input that likes 8k+). Lots of ways to potentially optimise, so little time.

Title: Re: Multithreading
Post by: cid42 on 2022-11-08 20:00:09

In the name of trying to find the right levers to pull for reasonably efficient somewhat-bruteforce settings:

n tweak passes as a final step before output. Each tweak pass tries successively smaller tweaks to the partition location between adjacent frames starting at half of blocksize_min. This may not be optimal
Different compression settings for analysis and final output

The following are the best subset results from a wide range of settings (all of them have no more-efficient competitor that's faster, except the starred result which is the previous best result from the post above). CPU is a single-threaded estimate using clock_t (tests actually done on quad core), all results encode track one of the album with -8p, effort is the number of times input was encoded for analysis and tweak (and once more for output) (analysis uses analysis compression setting, tweak and output use output compression setting).

     Size  CPU time    mode analysis tweak output effort_analysis effort_tweak Blocks used during analysis
  5124075  94.15798 peakset       8p    10     8p              36       11.942 576,1152,1728,2304,2880,3456,4032,4608
  5124583  84.53425 peakset       8p     4     8p              36        4.201 576,1152,1728,2304,2880,3456,4032,4608
  5124877  82.16994 peakset       8p     2     8p              36        1.979 576,1152,1728,2304,2880,3456,4032,4608
  5124881  68.70227 peakset       8p    10     8p              21       11.858 768,1536,2304,3072,3840,4608
  5125344  51.31762 peakset       8     10     8p              45       13.384 512,1024,1536,2048,2560,3072,3584,4096,4608
  5125376  38.69915 peakset       8     10     8p              36       11.373 576,1152,1728,2304,2880,3456,4032,4608
* 5125504  94.82512 peakset       8p     0     8p              45        0     512,1024,1536,2048,2560,3072,3584,4096,4608
  5125705  33.13277 peakset       8      4     8p              45        4.985 512,1024,1536,2048,2560,3072,3584,4096,4608
  5125798  31.44276 peakset       8      3     8p              45        3.672 512,1024,1536,2048,2560,3072,3584,4096,4608
  5125867  25.73276 peakset       8      4     8p              36        4.007 576,1152,1728,2304,2880,3456,4032,4608
  5125980  24.87177 peakset       8      3     8p              36        2.898 576,1152,1728,2304,2880,3456,4032,4608
  5126643  21.18103 peakset       8      1     8p              36        0.879 576,1152,1728,2304,2880,3456,4032,4608
  5126657  20.71826 peakset       8      4     8p              21        3.548 768,1536,2304,3072,3840,4608
  5126743  19.22586 peakset       8      3     8p              21        2.599 768,1536,2304,3072,3840,4608
  5126976  17.76938 peakset       8      2     8p              21        1.673 768,1536,2304,3072,3840,4608
  5127356  17.26785 peakset       5      3     8p              45        4.179 512,1024,1536,2048,2560,3072,3584,4096,4608
  5127473  16.18667 peakset       8      1     8p              21        0.771 768,1536,2304,3072,3840,4608
  5127622  14.59223 peakset       5      3     8p              36        3.722 576,1152,1728,2304,2880,3456,4032,4608
  5127681  12.57558 peakset       5      4     8p              21        4.404 768,1536,2304,3072,3840,4608
  5127790  11.35689 peakset       5      3     8p              21        3.22  768,1536,2304,3072,3840,4608
  5128042   9.23730 peakset       5      2     8p              21        2.084 768,1536,2304,3072,3840,4608
  5128588   6.51163 peakset       5      1     8p              21        0.99  768,1536,2304,3072,3840,4608
  5129239   5.17407 peakset       5      2     8p              10        1.549 1152,2304,3456,4608
  5129983   4.10213 peakset       5      1     8p              10        0.731 1152,2304,3456,4608
  5131055   2.91449 peakset       5      0     8p              10        0     1152,2304,3456,4608
  5132505   2.63151 peakset       5      1     8p               3        0.393 2304,4608
  5132757   2.32122 peakset       5      0     8p               6        0     1536,3072,4608
  5135107   1.93578 peakset       5      0     8p               3        0     2304,4608
  5138542   1.75692 peakset       0      0     8p               3        0     2304,4608
  5146517           flac                       8p                              4608
  5149423           flac                       8p                              4096

Greed and chunk don't get a look in mostly because tweak passes haven't been implemented there, I'd have expected at least one chunk result to make it lower down the list regardless but it appears peakset+tweak gets a more efficient result faster than pure chunk across the board. Chunk+tweak might make for a good mid-table result where tweak is limited to adjacent frames within chunks, TODO.

A merge pass after the tweak passes looking for efficient blocksizes >blocksize_max, would benefit lax encodes. It could mildly improve subset encodes IFF blocksize 4608 was omitted from the analysis stage (which probably shouldn't be omitted).

Title: Re: Multithreading
Post by: ktf on 2022-11-08 21:42:21

Very impressive. However, I've found out in the past that focussing optimization on one particular sample can give skewed results.

Could you perhaps compare flac -8p and on of the settings you find promising on some more material? I'm curious whether the gain we see here (~0.4% - ~0.5%) holds on other kinds of music.

Title: Re: Multithreading
Post by: Porcus on 2022-11-08 23:06:42

For reference, how much time do -8p and -8pb4608 take?
But anyway, the size savings from the 1.7 seconds run to the tenfold time, is around 0.225 percent. Which of course isn't miraculous, but quite impressive compared to the 0.09 percent space saved by going -8 to -8p, on this particular file set.

(For the the 96/24 version of the same album, -8 to -8p makes for only 0.066 percent. Still more than a "-e" saves.)

Title: Re: Multithreading
Post by: cid42 on 2022-11-09 23:35:08

Implemented tweak passes for chunk mode. Here's some results from different albums, all subset, there's some more in the pipe. Not as good a spread as perhaps there should be but my internet is slow so it's just what's available.

Chunk with a single blocksize does a fixed blocksize, so chunk:8p:8p:0:4096 and chunk:8p:8p:0:4608 are ./flac's -8p and -8pb4608
peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608 would be slower than realtime on a single core. It puts the 4 sliders tested (blocksize choice, analysis settings, mode choice, tweak passes) in the upper range, not fully off the deep end but a lot of effort (~90, varies give or take 10 by tweak effort)
There's no change in placement except for 4096 vs 4608, it's not a massive corpus but the results seem relatively stable so far
Input is fully read to the processes RAM before timing to minimise SSD as a factor, output is still written so ./flac -t can validate

Size      CPUTimeEst Album                            Settings(mode:analysis_comp:output_comp:tweak_passes:blocksizes)
317603594 5158.64833 ChemicalBrothers-NoGeography     peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
317685039 3434.47596 ChemicalBrothers-NoGeography     peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
317705060 1412.46291 ChemicalBrothers-NoGeography     peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
317739305 1138.52770 ChemicalBrothers-NoGeography     peakset:8:8p:4:768,1536,2304,3072,3840,4608
317980994  390.43533 ChemicalBrothers-NoGeography     peakset:8:8p:1:1152,2304,3456,4608
318038269  271.71686 ChemicalBrothers-NoGeography     peakset:5:8p:1:1152,2304,3456,4608
318256373  239.11328 ChemicalBrothers-NoGeography     chunk:8:8p:1:576,1152,2304,4608
319200096   69.77557 ChemicalBrothers-NoGeography     chunk:8p:8p:0:4096
319300930   68.78749 ChemicalBrothers-NoGeography     chunk:8p:8p:0:4608
523589456 8715.10619 DaftPunk-MusiqueVol1             peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
523835812 5692.51132 DaftPunk-MusiqueVol1             peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
523870247 2466.42877 DaftPunk-MusiqueVol1             peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
523993599 2058.40626 DaftPunk-MusiqueVol1             peakset:8:8p:4:768,1536,2304,3072,3840,4608
524629652  685.96927 DaftPunk-MusiqueVol1             peakset:8:8p:1:1152,2304,3456,4608
524731195  495.35884 DaftPunk-MusiqueVol1             peakset:5:8p:1:1152,2304,3456,4608
525063374  449.26300 DaftPunk-MusiqueVol1             chunk:8:8p:1:576,1152,2304,4608
528212165  112.73555 DaftPunk-MusiqueVol1             chunk:8p:8p:0:4096
528655833  111.28555 DaftPunk-MusiqueVol1             chunk:8p:8p:0:4608
257706353 4486.87945 Eels-DaisiesOfTheGalaxy          peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
257762825 3000.10531 Eels-DaisiesOfTheGalaxy          peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
257778499 1211.72235 Eels-DaisiesOfTheGalaxy          peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
257818027  964.54024 Eels-DaisiesOfTheGalaxy          peakset:8:8p:4:768,1536,2304,3072,3840,4608
257990037  339.80432 Eels-DaisiesOfTheGalaxy          peakset:8:8p:1:1152,2304,3456,4608
258020585  233.79144 Eels-DaisiesOfTheGalaxy          peakset:5:8p:1:1152,2304,3456,4608
258224956  208.56897 Eels-DaisiesOfTheGalaxy          chunk:8:8p:1:576,1152,2304,4608
259022020   63.11173 Eels-DaisiesOfTheGalaxy          chunk:8p:8p:0:4096
259110152   62.28333 Eels-DaisiesOfTheGalaxy          chunk:8p:8p:0:4608
477271623 7616.88662 GunsNRoses-ChineseDemocracy      peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
477348516 5138.17785 GunsNRoses-ChineseDemocracy      peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
477369279 1957.24402 GunsNRoses-ChineseDemocracy      peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
477424204 1502.32550 GunsNRoses-ChineseDemocracy      peakset:8:8p:4:768,1536,2304,3072,3840,4608
477658393  533.66029 GunsNRoses-ChineseDemocracy      peakset:8:8p:1:1152,2304,3456,4608
477703519  356.51980 GunsNRoses-ChineseDemocracy      peakset:5:8p:1:1152,2304,3456,4608
478051214  330.46694 GunsNRoses-ChineseDemocracy      chunk:8:8p:1:576,1152,2304,4608
479170287  110.70671 GunsNRoses-ChineseDemocracy      chunk:8p:8p:0:4096
479278372  108.72984 GunsNRoses-ChineseDemocracy      chunk:8p:8p:0:4608
122521118 2479.95507 MikeOldfield-TubularBellsPartOne peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
122548191 1683.96844 MikeOldfield-TubularBellsPartOne peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
122560169  643.47294 MikeOldfield-TubularBellsPartOne peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
122578013  484.59866 MikeOldfield-TubularBellsPartOne peakset:8:8p:4:768,1536,2304,3072,3840,4608
122648202  173.11262 MikeOldfield-TubularBellsPartOne peakset:8:8p:1:1152,2304,3456,4608
122670414  112.27985 MikeOldfield-TubularBellsPartOne peakset:5:8p:1:1152,2304,3456,4608
122774533  107.24483 MikeOldfield-TubularBellsPartOne chunk:8:8p:1:576,1152,2304,4608
123012084   36.47129 MikeOldfield-TubularBellsPartOne chunk:8p:8p:0:4608
123016735   37.33374 MikeOldfield-TubularBellsPartOne chunk:8p:8p:0:4096
250230777 4426.60597 NineInchNails-TheSlip            peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
250300683 3001.93029 NineInchNails-TheSlip            peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
250327830 1176.29593 NineInchNails-TheSlip            peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
250363924  925.24971 NineInchNails-TheSlip            peakset:8:8p:4:768,1536,2304,3072,3840,4608
250553480  327.97218 NineInchNails-TheSlip            peakset:8:8p:1:1152,2304,3456,4608
250607320  215.46922 NineInchNails-TheSlip            peakset:5:8p:1:1152,2304,3456,4608
250756438  206.58991 NineInchNails-TheSlip            chunk:8:8p:1:576,1152,2304,4608
251739815   63.01859 NineInchNails-TheSlip            chunk:8p:8p:0:4096
251832015   62.26452 NineInchNails-TheSlip            chunk:8p:8p:0:4608
423291336 6309.92615 prodigy-FatOfTheLand             peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
423387560 4166.49125 prodigy-FatOfTheLand             peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
423407230 1727.42195 prodigy-FatOfTheLand             peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
423470736 1394.59504 prodigy-FatOfTheLand             peakset:8:8p:4:768,1536,2304,3072,3840,4608
423746793  475.69761 prodigy-FatOfTheLand             peakset:8:8p:1:1152,2304,3456,4608
423804552  329.79729 prodigy-FatOfTheLand             peakset:5:8p:1:1152,2304,3456,4608
424132847  287.47611 prodigy-FatOfTheLand             chunk:8:8p:1:576,1152,2304,4608
425201004   84.63871 prodigy-FatOfTheLand             chunk:8p:8p:0:4096
425296934   83.59144 prodigy-FatOfTheLand             chunk:8p:8p:0:4608

There's probably a lot of efficiency left on the table sticking to subset. The histogram of best blocksizes is likely a bell curve with 4096/4608 near the peak, meaning subset likely cuts off too soon to have good coverage of the curve. Finding ways to extend into --lax without blowing out encode time with peakset TODO.

edit: Tests done on a different PC with 8 cores so timings not comparable with previous posts.

Title: Re: Multithreading
Post by: cid42 on 2022-11-10 07:45:33

Second part of album stats:

Size      CPUTimeEst Album                                Settings(mode:analysis_comp:output_comp:tweak_passes:blocksizes)
486754940 7401.92191 PunkAndNewWaveUltimateCollection-CD1 peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
486854649 4919.80083 PunkAndNewWaveUltimateCollection-CD1 peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
486876075 1984.64279 PunkAndNewWaveUltimateCollection-CD1 peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
486950076 1567.01817 PunkAndNewWaveUltimateCollection-CD1 peakset:8:8p:4:768,1536,2304,3072,3840,4608
487249654  543.47563 PunkAndNewWaveUltimateCollection-CD1 peakset:8:8p:1:1152,2304,3456,4608
487297154  373.30103 PunkAndNewWaveUltimateCollection-CD1 peakset:5:8p:1:1152,2304,3456,4608
487662645  334.13117 PunkAndNewWaveUltimateCollection-CD1 chunk:8:8p:1:576,1152,2304,4608
489080538  101.41232 PunkAndNewWaveUltimateCollection-CD1 chunk:8p:8p:0:4096
489290781  100.16961 PunkAndNewWaveUltimateCollection-CD1 chunk:8p:8p:0:4608
371061128 5677.33136 Radiohead-OKComputer                 peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
371129145 3793.16594 Radiohead-OKComputer                 peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
371146775 1487.96806 Radiohead-OKComputer                 peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
371195155 1163.05440 Radiohead-OKComputer                 peakset:8:8p:4:768,1536,2304,3072,3840,4608
371382589  409.29770 Radiohead-OKComputer                 peakset:8:8p:1:1152,2304,3456,4608
371416496  280.44687 Radiohead-OKComputer                 peakset:5:8p:1:1152,2304,3456,4608
371661749  254.26818 Radiohead-OKComputer                 chunk:8:8p:1:576,1152,2304,4608
372533424   80.60868 Radiohead-OKComputer                 chunk:8p:8p:0:4096
372625690   79.15295 Radiohead-OKComputer                 chunk:8p:8p:0:4608
329603351 4875.99008 RATM-EvilEmpire                      peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
329653372 3276.83557 RATM-EvilEmpire                      peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
329667407 1275.60028 RATM-EvilEmpire                      peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
329709473  989.10326 RATM-EvilEmpire                      peakset:8:8p:4:768,1536,2304,3072,3840,4608
329861813  348.56314 RATM-EvilEmpire                      peakset:8:8p:1:1152,2304,3456,4608
329879280  230.40419 RATM-EvilEmpire                      peakset:5:8p:1:1152,2304,3456,4608
330093058  216.85232 RATM-EvilEmpire                      chunk:8:8p:1:576,1152,2304,4608
330830099   68.85256 RATM-EvilEmpire                      chunk:8p:8p:0:4096
330915000   68.60566 RATM-EvilEmpire                      chunk:8p:8p:0:4608
332628659 4991.72321 SOAD-Toxicity                        peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
332701247 3291.49541 SOAD-Toxicity                        peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
332720306 1352.51852 SOAD-Toxicity                        peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
332779597 1084.18631 SOAD-Toxicity                        peakset:8:8p:4:768,1536,2304,3072,3840,4608
332989575  371.13667 SOAD-Toxicity                        peakset:8:8p:1:1152,2304,3456,4608
333014456  261.12648 SOAD-Toxicity                        peakset:5:8p:1:1152,2304,3456,4608
333259634  225.99539 SOAD-Toxicity                        chunk:8:8p:1:576,1152,2304,4608
334130781   66.84236 SOAD-Toxicity                        chunk:8p:8p:0:4096
334223800   66.30961 SOAD-Toxicity                        chunk:8p:8p:0:4608
412422849 7760.62776 ThinLizzy-TheCollection              peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
412505798 5182.30982 ThinLizzy-TheCollection              peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
412527983 2049.92532 ThinLizzy-TheCollection              peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
412595671 1600.67376 ThinLizzy-TheCollection              peakset:8:8p:4:768,1536,2304,3072,3840,4608
412872643  552.71229 ThinLizzy-TheCollection              peakset:8:8p:1:1152,2304,3456,4608
412912756  386.65311 ThinLizzy-TheCollection              peakset:5:8p:1:1152,2304,3456,4608
413370741  330.46464 ThinLizzy-TheCollection              chunk:8:8p:1:576,1152,2304,4608
414442451  109.35235 ThinLizzy-TheCollection              chunk:8p:8p:0:4096
414571508  108.20086 ThinLizzy-TheCollection              chunk:8p:8p:0:4608

Totals of all 12 albums:

Size       CPUTimeEst  Settings(mode:analysis_comp:output_comp:tweak_passes:blocksizes)
4304685184 69901.6021  peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
4305712837 46581.26799 peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
4305956860 18745.70384 peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
4306617780 14872.27901 peakset:8:8p:4:768,1536,2304,3072,3840,4608
4309563825  5151.83705 peakset:8:8p:1:1152,2304,3456,4608
4310095996  3546.86498 peakset:5:8p:1:1152,2304,3456,4608
4313307562  3190.43474 chunk:8:8p:1:576,1152,2304,4608
4326579415  968.38887  chunk:8p:8p:0:4096
4328113099  955.85215  chunk:8p:8p:0:4608

Title: Re: Multithreading
Post by: ktf on 2022-11-10 08:40:16

So, the CPU time is 'single-core time' I guess? So when using 4 CPU cores wall time would be a quarter of that?

I think there are plenty people on this forum that would be willing to use this, creating a subset, fully spec-compliant FLAC file that might play in their hardware system. Half a percent is quite an improvement, even at the cost of having the encoder run near real-time on a single core. So once again: very impressive.

Sadly non-standard blocksizes and variable blocksize streams are not well supported features in many hardware decoders, see here (https://wiki.hydrogenaud.io/index.php?title=FLAC_decoder_testbench), but I think many people here won't mind.

Title: Re: Multithreading
Post by: Porcus on 2022-11-10 08:51:23

The "peakset:5:8p:1:1152,2304,3456,4608" looks particularly interesting. But 4096 seems so good for CDDA that I might wonder what sizes you would obtain by just encoding higher multiples of 512.

To get an idea of benefit/cost (in that order), I took the total of 12 table and computed the file size difference per second extra taken. All relative to the next row. In order to pick the lower-hanging fruit first, you would want the following sequence to be increasing - it isn't far off:
44 (that means, the top "peakset:8p:8p:10:" row saves 44 bytes per extra second taken over the second row)
9 (less than the 44, so if you are willing to wait for this one, you should have gone to the top row)
171
303
332
9010 (this is how much the "peakset:5:8p:1:" improves over the next, which has to do with the next one being not-as-good)
5973
122335 (note the number of digits! The biggg difference is chunk:8p:8p:0:4096 relative to chunk:8p:8p:0:4608)
- (nothing for the last row, it has no next)

Bypassing the not particularly-good 5973, then "peakset:5:8p:1:" saves 6393 per extra second (edit: over "chunk:8p:8p:0:4096"). The benefit of going 4608 -> 4096 is 19x as big.

Title: Re: Multithreading
Post by: cid42 on 2022-11-10 16:58:15

Quote from: ktf on 2022-11-10 08:40:16

So, the CPU time is 'single-core time' I guess? So when using 4 CPU cores wall time would be a quarter of that?

I think there are plenty people on this forum that would be willing to use this, creating a subset, fully spec-compliant FLAC file that might play in their hardware system. Half a percent is quite an improvement, even at the cost of having the encoder run near real-time on a single core. So once again: very impressive.

Sadly non-standard blocksizes and variable blocksize streams are not well supported features in many hardware decoders, see here (https://wiki.hydrogenaud.io/index.php?title=FLAC_decoder_testbench), but I think many people here won't mind.

It should roughly translate to that yes, greed mode seems to be inflated but I think it's because greed has lower occupancy which seems to be measured by clock_t in a roundabout way, not necessarily a bad thing to measure IMO. Chunk/peakset have good occupancy and the numbers seem to make sense give or take. All core workloads run at lower clocks on a lot of hardware, especially low power parts like the laptop part this was tested with which might skew results a bit, also as you go for quicker encode settings IO and hashing take proportionally more time. Short of doing something like counting cycles by tapping into CPU performance registers I don't think there's an accurate way for a single benchmark to truly scale to a different number of cores.

I'm not against one day whipping this code into a usable frontend, right now it's just a testbed to scratch a benchmark itch. Is there a known example where variable blocksizes are supported but only within a narrower range than 16-65535? Currently tweak is somewhat limited, merge unlimited. Range limiters will be added soon anyway as extra levers to limit the excessive less-useful computation in tweak/merge, curious if there's some informal line in the sand between subset and "go nuts".

A preliminary test of merge passes indicates merging should happen before tweaking, and that merging alone is probably not all that should be done to expand into large blocksizes. Instead of doing 512:4608 or 576:4608 at the top end it's likely 768:6144 is the way to go, maybe 1024:8192 1152:9216 or more on some input. It might be fine to rely on tweak passes to shape the frames into lower sizes where beneficial if it means that the O(block_count^2) effort scaling of the peakset analysis phase doesn't get too out of control.

Quote from: Porcus on 2022-11-10 08:51:23

The "peakset:5:8p:1:1152,2304,3456,4608" looks particularly interesting. But 4096 seems so good for CDDA that I might wonder what sizes you would obtain by just encoding higher multiples of 512.

To get an idea of benefit/cost (in that order), I took the total of 12 table and computed the file size difference per second extra taken. All relative to the next row. In order to pick the lower-hanging fruit first, you would want the following sequence to be increasing - it isn't far off:
44 (that means, the top "peakset:8p:8p:10:" row saves 44 bytes per extra second taken over the second row)
9 (less than the 44, so if you are willing to wait for this one, you should have gone to the top row)
171
303
332
9010 (this is how much the "peakset:5:8p:1:" improves over the next, which has to do with the next one being not-as-good)
5973
122335 (note the number of digits! The biggg difference is chunk:8p:8p:0:4096 relative to chunk:8p:8p:0:4608)
- (nothing for the last row, it has no next)

Bypassing the not particularly-good 5973, then "peakset:5:8p:1:" saves 6393 per extra second (edit: over "chunk:8p:8p:0:4096"). The benefit of going 4608 -> 4096 is 19x as big.

My analysis is far from exhaustive, the rule of thumb seems to be that around 4k is very important as you'd expect. At a guess the frequency chart of best blocksizes for most input can be approximated with a bell curve centred at 3.5-4.5k (the actual best chart is probably skewed more towards smaller blocksizes with a longer tail to the right, 2-3k seems next-most important). You're suggesting trying things like 4096,4608 and 3584,4096,4608? If a good algorithm that could efficiently use just these sizes were made then perhaps they would perform well, however chunk and peakset cannot do it. Peakset could be 512,3584,4096,4608 for an effort of 25, unlikely to perform better than 768,1536,2304,3072,3840,4608 with an effort of 21 IMO.

Title: Re: Multithreading
Post by: Porcus on 2022-11-10 21:30:36

Quote from: cid42 on 2022-11-10 16:58:15

Is there a known example where variable blocksizes are supported but only within a narrower range than 16-65535?

You mean, an implementation that encodes variable block size? CUETools Flake does - but it is buggy: https://hydrogenaud.io/index.php/topic,123016
... and therefore about to be removed. Source available though.

Quote from: cid42 on 2022-11-10 16:58:15

You're suggesting trying things like 4096,4608 and 3584,4096,4608? If a good algorithm that could efficiently use just these sizes were made then perhaps they would perform well, however chunk and peakset cannot do it.

I was rather thinking: if you are looking to find those block sizes that will indeed often be "optimal" for size, then well, efficiency can come later. Say, your nineteen-hour (it is seconds, right?) peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608: Each block is a multiple 1 to 9 of 512. How does the histogram look?
(And they aren't independent over time, but I have no idea whether a time series plot - with ultra-fine resolution - would be a "moving cloud" or just scattered.)

As for efficiency:
If I understood the end of Reply 23 right, you use "-5" runs to determine block sizes and then -8p to encode? Reference FLAC does a bit of guesstimate-first-and-encode-the-winner, so that approach is quite a bit in the spirit. Half speed and smaller file isn't bad either.
(-7 is closer to -8p, if this is to be explored further.)

Title: Re: Multithreading
Post by: cid42 on 2022-11-11 11:04:18

Quote from: Porcus on 2022-11-10 21:30:36

Quote from: cid42 on 2022-11-10 16:58:15
Is there a known example where variable blocksizes are supported but only within a narrower range than 16-65535?
You mean, an implementation that encodes variable block size? CUETools Flake does - but it is buggy: https://hydrogenaud.io/index.php/topic,123016
... and therefore about to be removed. Source available though.

I meant is there an example of a decoder that supports variable blocksizes, but only works with some range that isn't 16:4608 or 16:65535. I guess we are talking hardware decoder as for a software decoder it would be a fixable bug.

Quote from: Porcus on 2022-11-10 21:30:36

As for efficiency:
If I understood the end of Reply 23 right, you use "-5" runs to determine block sizes and then -8p to encode? Reference FLAC does a bit of guesstimate-first-and-encode-the-winner, so that approach is quite a bit in the spirit. Half speed and smaller file isn't bad either.
(-7 is closer to -8p, if this is to be explored further.)

There's an analysis phase which can use different compression settings to the final encode, then there's an optional merge phase (WIP) which merges adjacent frames if the result is smaller, then an optional tweak phase which tweaks the boundary between adjacent frames (merge and tweak use the final encode settings). The analysis phase can only choose from the list of blocksizes given to it, merge and tweak are an imperfect way to grab some low hanging fruit by trying blocksizes not on the list. It can be thought of as the analysis phase roughly approximating the optimal permutation, and merge/tweak refining the approximation a little more. Merge and tweak are imperfect by nature and their performance is heavily influenced by implementation which is a WIP.

I've tried analysis settings of -0, -5, -8, -8p with output settings fixed at -8p for comparison, not exhaustive and not smart, just a spread to see if it made a difference. -5 -8 and -8p had positive results at quick to slow settings, -0 was not good at finding permutations for -8p to encode (not surprising, -0 only uses fixed predictors so it has very little in common with -8p).

Quote from: Porcus on 2022-11-10 21:30:36

Quote from: cid42 on 2022-11-10 16:58:15
You're suggesting trying things like 4096,4608 and 3584,4096,4608? If a good algorithm that could efficiently use just these sizes were made then perhaps they would perform well, however chunk and peakset cannot do it.
I was rather thinking: if you are looking to find those block sizes that will indeed often be "optimal" for size, then well, efficiency can come later. Say, your nineteen-hour (it is seconds, right?) peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608: Each block is a multiple 1 to 9 of 512. How does the histogram look?
(And they aren't independent over time, but I have no idea whether a time series plot - with ultra-fine resolution - would be a "moving cloud" or just scattered.)

It is seconds, an estimate using clock_t to measure CPU time (if you're familiar with /bin/time, it's similar to the user time it gives). The testing is done on an 8 core Zen 2 4700u in an Asus PN50, which is a mini PC using laptop components. All-core the clocks average around 2.4GHz as it throttles hard from thermals doing an 8 core AVX2 workload, when thermal or power limits are reached (especially in laptop parts) clockspeeds dip. I believe the 19 hour result means it would take 19 hours at 2.4GHz, however running single core it's more likely to run at 3.5GHz say which should reduce the wall time to 13 hours or so. Desktop components tend to not have such a dramatic difference in clockspeeds, higher power limits better coolers etc.

If you did a peakset encode with merge and tweak disabled (and analysis settings the same as output settings), the result is optimal for the list of blocksizes and compression setting used. Using ./flac -a you can build the histogram, attached are some examples (track 1 of NiN The Slip which is an ambient album introduction, SOAD toxicity which is noisy with some quiet sections, and sultans of swing which is more even rock).

Title: Re: Multithreading
Post by: Porcus on 2022-11-11 11:39:42

Surprising, those histograms - at least if you picked them random, rather than "see how much they vary"?
Although they are number of blocks, so for the rockers, there will be about the same number of samples in 4096 blocks as in 2048 blocks.

@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it:
* A "reasonably sane" suggestion for something you could either label "2" or "3", would be to compute "three Hann bumps": Over [0 ... 0.5], [0.25 ... 0.75] and [.5 ... 1]. Sum of them would be tukey(.5), and if that is what the encoder wants to select ... jolly good.
* Should the encoder want to select only #1 or only #3, that is likely a hint that the window should be split? Then the entire thing would be better off by "the troublesome half inheriting the other half's predictor", and although it could be that all predictors are bad for that half, well, it is likely not worse to give it its own.
* Should the encoder want to use the sum of #1 and #3 ("Hann-punching" out the middle), then who knows whether there is enough to save by splitting the frame. Worth testing.
* Should the encoder want to use #1 and #2, you have a tukey(two thirds) on [0 ... 0.75] and can likely split too.
* Should the encoder want to use #2 and #3 ... now it is worth asking, should the first quarter be encoded separately or should it have been merged with the previous block?
Disregarding the subset consideration about merging in "a quarter" to an already 4096 block.

Title: Re: Multithreading
Post by: ktf on 2022-11-11 11:53:21

Quote from: Porcus on 2022-11-11 11:39:42

@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it

I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.

Title: Re: Multithreading
Post by: cid42 on 2022-11-11 12:13:19

Picked the slip track 1 out of habit, realised it was a terrible example of the general case being mostly ambient so picked toxicity, realised the quiet portions were causing a second peak in the histogram as it's basically two signals interleaved so picked sultans of swing which has an even signal throughout. The sultan result is what I'd expect in general, that rough shape (possibly scaled in x and/or y) should apply to most "normal" input.

A cheap way to be able to choose the scaling factor would be a good way to adapt to input instead of the user providing a fixed blocksize list.

It also shows that some input may benefit from holes in the blocksizes chosen particularly in the upper range, ie 1,2,3,4,8 may perform only slightly worse than 1,2,3,4,5,6,7,8 for half the effort, but notably more efficient than 1,2,3,4[,5,6] for some input. You'd have holes to account for a signal not being even throughout, the majority of the track would be covered by 1,2,3,4 with the quieter/ambient parts covered by 8.

You could get really fancy and detect when a signal changes and adapt the set of blocksizes on the fly, but that is approaching proper signal analysis and it's likely that using the signal directly rather than ham-fisted frame-size-brute-force is the way to go when you get that deep.

Title: Re: Multithreading
Post by: Porcus on 2022-11-12 20:13:22

Quote from: ktf on 2022-11-11 11:53:21

Quote from: Porcus on 2022-11-11 11:39:42
@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it
I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.

Maybe not. It is a long shot.
The idea was around over in this compression improvement thread (https://hydrogenaud.io/index.php/topic,120158.0.html) 41 to 44 - especially your "a certain set of partial windows @ 44.1kHz with blocksize 4096 [...]" and TBeck's musings over windowing at https://hydrogenaud.io/index.php/topic,44229.msg402752.html#msg402752 . He looks at changing signal characteristics yes - and would, at least in that early version, do simple bisection if needed.

I was imagining that this would not only be a "cheap" way to guess it - but also easy to make sense out of test results.
If a "variable" build would for two consecutive 2048 frames let one of them use the same predictor as the current "constant blocksize" build would use for the entire 4096, you would get a picture of whether those signals are those that call for this treatment.

(All these considerations are before we start asking whether other block sizes should have other apodization functions. In the same 2006 thread linked to above, that was a big collaborative effort that ended up with Josh concluding on tukey(5e-1) as default - but I don't see block size being part of the tests.)

Title: Re: Multithreading
Post by: cid42 on 2022-11-12 22:36:20

Quote from: ktf on 2022-11-11 11:53:21

Quote from: Porcus on 2022-11-11 11:39:42
@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it
I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.

Analysing brute-forced encodes could be a way to reverse engineer a heuristic algorithm for variable blocksizes based on input. Beyond pathological examples it's probably hard to make a good algorithm that's widely applicable, but it may be an easier approach than (or augment) proper signal analysis.

Quote from: Porcus on 2022-11-12 20:13:22

(All these considerations are before we start asking whether other block sizes should have other apodization functions. In the same 2006 thread linked to above, that was a big collaborative effort that ended up with Josh concluding on tukey(5e-1) as default - but I don't see block size being part of the tests.)

There's always another dimension to add isn't there :P

Made some improvements to merge/tweak mode and done another round of testing. A few changes that need explaining:

Merge passes added with --merge threshold, where user passes a byte threshold. Multiple passes over the input is done until a pass fails to save at least that many bytes
Tweak phase modified to act like merge does, user passes a threshold instead of specifying the number of passes
Merge and tweak can be done using analysis settings or output settings, with --merge-when 0/1 --tweak-when 0/1 to specify before/after
A minimum and maximum blocksize can be set with --blocksize-limit-lower limit --blocksize-limit-upper limit, this is to fine-tune how tweak/merge passes act and usually doesn't do much. Some input doesn't require very large blocksizes or very small blocksizes, this would be a way to avoid testing them at all (instead of relying on tweak/merge thresholds failing it's basically a way to make the last passes of particularly merge more efficient, maybe)
Starred results are subset, everything else --lax

  Size          CPUTimeEst Settings
* 4326579415    887.49764  Fixed blocksize 4096
* 4312846890    1172.31728 mode(peakset);analysis_comp(5r3);analysis_apod((null));tweak(99999);merge(99999);blocksize_limit_lower(1152);blocksize_limit_upper(4608);analysis_blocksizes(1152,2304,4608)      
  4311188781    1384.49821 mode(peakset);analysis_comp(5r3);analysis_apod((null));tweak(99999);merge(99999);blocksize_limit_lower(1152);blocksize_limit_upper(8192);analysis_blocksizes(1152,2304,4608)      
* 4308148582    3497.63002 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));tweak(1024);merge(4096);blocksize_limit_lower(256);blocksize_limit_upper(4608);analysis_blocksizes(768,1536,2304,3072,3840,4608)      
  4305653857    2518.54257 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));tweak(10240);merge(10240);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(1152,2304,3456,4608)      
  4303811368    4699.85835 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));tweak(1024);merge(1024);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(768,1536,2304,3072,3840,4608)      
  4302718243   19406.89554 mode(peakset);analysis_comp(8p);analysis_apod(partial_tukey(2));tweak(1024);merge(1024);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)      
  4301991677   26677.72353 mode(peakset);analysis_comp(8p);analysis_apod(partial_tukey(2));tweak(64);merge(512);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)      
  4300096544  142492.31279 mode(peakset);analysis_comp(8p);analysis_apod((null));tweak(1);merge(1);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)

The big win is from doing tweak/merge with analysis settings instead of output settings. Using just partial_tukey(2) also gave a good speedup seemingly without much loss of accuracy. Just tried all 6 combinations of the apod used in -8 and picked what seemed good, I don't know enough about apod settings to make an informed decision beyond that.

edit: Round 2 testing all done with merge/tweak using analysis settings

Title: Re: Multithreading
Post by: ktf on 2022-11-13 19:48:38

Quote from: Porcus on 2022-11-12 20:13:22

I was imagining that this would not only be a "cheap" way to guess it - but also easy to make sense out of test results.

I think it isn't cheap enough though. Also, there might be other reasons a 'partial' window turns out to be the best, for example because part of a signal is cleaner. That wouldn't necessarily mean that it is a good block to split up, just that part of a block can be ignored on analysis.

Quote from: cid42 on 2022-11-12 22:36:20

Quote from: ktf on 2022-11-11 11:53:21
Quote from: Porcus on 2022-11-11 11:39:42
@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it
I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.
Analysing brute-forced encodes could be a way to reverse engineer a heuristic algorithm for variable blocksizes based on input. Beyond pathological examples it's probably hard to make a good algorithm that's widely applicable, but it may be an easier approach than (or augment) proper signal analysis.

Yes, I agree your work is very valuable. It gives insights on what FLACs algorithms tend to do when a variable blocksize is possible. Also, it gives a good indication of how much there is to gain. It seems it is about 0.5% (which is almost 0.8 percentagepoint on this material) which is quite a lot.

Title: Re: Multithreading
Post by: Porcus on 2022-11-14 19:16:04

The second line in the above table - the fastest among the variable block size - takes out more than half of the possible improvement, and at a thirtysomething percent time penalty - that isn't much, provided the first is standard -8p.

Quote from: cid42 on 2022-11-12 22:36:20

There's always another dimension to add isn't there :P

Oh sure. My point was - I think wrongly! - the notion that you get this improvement despite running tests "on the 4096 home ground", and that would "bias the results against you".

But then I tested (using reference FLAC, corpus as in signature). A bunch of apodization functions with different block sizes. Quite puzzling results, it seems that 4096 was a good choice no matter what.
Two tests:
* First, something that by gut feeling "could happen to turn out good": different block sizes 3072, 4096 and 4608 with only the tukey(P), but trying to keep the same number of samples untapered. That is, 4096 with tukey(.5) keeps the middle 2048; then 3072 would also do that, tapering first and last 512, that is, tukey(.333). Etc.
Trying that with a few values for P at 4096 and translating up and down, ...
Findings: 4096 still wins overall.
* fourteen different block sizes with just -5bxxx -A "<precisely one>" -7<the same>, that is also restricting to one function, but beefing up the -l and -r. Functions were hann (yes I know that's tukey(1)), hamming, welch - and two known bad ones: triangle and rectangle.
Findings:
-> Classical music: bigger is better. 11 of 14 classical music albums had a strictly monotone relationship, every function.
-> Other genres & all aggregated: 4096 wins hands down. With an exception for rectangle, where 4608 wins eighty percent of the albums. More is better with rectangle? No support for the notion that tapering helps because 4096 is "too big".
-> 2560 often comes out bad. Not in the average over files, where it is always between the neighbouring block sizes, but if there is one the stands out as worse than the neighbours in indvidual files ... it is 2560. 25 percent of the time, 2560 was worse than both 2304 and 2880. Other block sizes did not experience the same (bar a small handful of entries in a four thousand chances).
I have not the slightest idea why. The phenomenon is "mildly visible" in your graphs.
-> 4096 is surprisingly slow. Dunno, hardware cache size? By and large, larger block size is faster until 3456, then 4096 is about as slow as 1152.

Title: Re: Multithreading
Post by: cid42 on 2022-11-17 20:58:42

Added partial support for flac input, MD5 is disabled for non-16 bit input for now as it's a pain. 24 bit input passes the listen test as does flac-test-files but no other testing has been done (as an aside mpv 0.32.0 which uses libavcodec 58.54.100 from ffmpeg 4.2.7 doesn't play the original version of '55 - file 48-53 combined.flac', but it can play the individual tracks 48-53. This mpv/ffmpeg is a few years old so it may be fixed already)
Assuming flac input actually works properly flaccid can now encode everything libFLAC can, with the caveat that the decompressed input fits in RAM as it's currently stored in full for testing
TODO: Peakset currently requires seekable input if it were to be implemented as a generally usable frontend (ie all input needs to be available as it fully encodes/analyses the input then builds the optimal permutation in reverse then processes all input again with output settings, a proper frontend cannot rely on loading everything to RAM so input must be seekable). This is because peakset uses a single window the size of the input allowing it to be optimal. There's probably a way to stick with a single window without having to be seekable, but it seems horribly complicated to combine the multi-threaded processing phase with the analysis phase in tight-enough lockstep that occupancy doesn't suffer. By far the easier way is to chunk the input into multiple windows (not sliding windows), not for MT but to limit RAM. It wouldn't be fully optimal but it would be close (non-optimal split at window boundaries, a window could feasibly be 100+MiB on most modern systems so most input would functionally be optimal and everything else a rounding error away from optimal).

Now that flac input works it's easier to test wider. Attached is most of the lossless I have, mostly CDDA with a little 24 bit (LudAndSchlattsMusicalEmporium is mostly classical 24 bit 48KHz with an open license from here https://pmmusic.pro/downloads/ ) (blondie parallel lines is 24 bit 192 KHz). As before input is read to RAM before timing so flac input decode is irrelevant (other than being single-threaded so giving the chip a little time to recover thermals, should be negligible). Classical/punk/rock/metal/synth/electronic pretty well covered across decades, a lot of atmospheric in some of the OST's, many VA albums that might flesh things out even more. A few runs with quick settings to see which settings are more important, same with some slower settings and 1 even slower (but not off the deep end) for comparison. These are the best results, quick chunk settings were tried but none beat quick peakset:

Size        CPUTimeEst Settings
28487101750 6358.74381 Fixed 4096
28392114967 10040.00751 mode(peakset);analysis_comp(5r3);analysis_apod((null));output_comp(8p);output_apod((null));tweak_after(0);tweak(65536);tweak_early_exit(0);merge_after(0);merge(65536);blocksize_limit_lower(576);blocksize_limit_upper(8192);analysis_blocksizes(1152,2304,4608)
28383268498 23101.79561 mode(peakset);lax(0);analysis_comp(5);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(768);tweak_early_exit(0);merge_after(0);merge(0);blocksize_limit_lower(256);blocksize_limit_upper(4608);analysis_blocksizes(768,1536,2304,3072,3840,4608)
28347347740 31982.75614 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(768);tweak_early_exit(0);merge_after(0);merge(768);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(768,1536,2304,3072,3840,4608)
28340323346 111480.3915 mode(peakset);lax(1);analysis_comp(5);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(1);tweak_early_exit(0);merge_after(0);merge(1);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(768,1536,2304,3072,3840,4608)
28335653960 148385.92248 mode(peakset);lax(1);analysis_comp(8);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(1);tweak_early_exit(0);merge_after(0);merge(1);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)

Merge and particularly tweak are not optimal, it's not immediately obvious how best to make them more optimal. Ideally tweak would provide a strictly decreasing improvement for each additional pass, currently it's like a decreasing inverse sawtooth. It's not clear what new things to try:

Maybe a partial-brute-force mode that collects a sliding window of metrics from the signal and brute-forces the n best candidates from that, basically poor-mans signal analysis plus brute force to make up for the poor quality.
How about using the chosen settings from a strong smaller blocksize encode starting at x, to cheaply try larger blocksizes starting at x? ie find a model that fits the input well and use it to encode as much of the signal as it can efficiently do

Title: Re: Multithreading
Post by: Porcus on 2022-12-22 16:14:43

Folks,
here is a (Master's) thesis from 2007 on variable FLAC block size: https://www.collectionscanada.gc.ca/obj/thesescanada/vol2/002/MR31230.PDF
But the results aren't that impressive. .15, .13, .20 percentage points for the "cheap" algorithm that makes for a 5.5x slowdown - I guess that those expensive algorithms that make the slowdown factor skyrocket up to the thousands, are more interesting for comparison than for applicability.
And I would not be surprised if the subsequent improvements (and in particular the partial_tukey/punchout_tukey/subdivide_tukey) have taken out some of the possible gains.

Anyway, if any of those ideas are useful for anything - there they are.

Title: Re: Multithreading
Post by: doccolinni on 2022-12-22 17:27:44

Quote from: Porcus on 2022-12-22 16:14:43

Folks,
here is a (Master's) thesis from 2007 on variable FLAC block size: https://www.collectionscanada.gc.ca/obj/thesescanada/vol2/002/MR31230.PDF
But the results aren't that impressive. .15, .13, .20 percentage points for the "cheap" algorithm that makes for a 5.5x slowdown - I guess that those expensive algorithms that make the slowdown factor skyrocket up to the thousands, are more interesting for comparison than for applicability.
And I would not be surprised if the subsequent improvements (and in particular the partial_tukey/punchout_tukey/subdivide_tukey) have taken out some of the possible gains.

Anyway, if any of those ideas are useful for anything - there they are.

This very nice paragraph seems notable.

Spoiler (click to show/hide)

Title: Re: Multithreading
Post by: Porcus on 2022-12-22 20:56:10

Yes, most likely.
In reality, flac -8 is doing something in between 64 and 4096 samples at the time. Whereas other slower formats adapt and update the predictor vector every sample (was that right about WavPack and the animals?), FLAC keep that one and tweak the residual coding method - sometimes with baffling results on extreme material: https://hydrogenaud.io/index.php/topic,123025.msg1018251.html#msg1018251 . Maxing out the Rice partitioning reduces the file size by ten percent, to the point where where it beats the heaviest OptimFROG.
(But not xz ...)

For those who want to stay within subset, this was a case where you would cut the block size in half just to be able to change the Rice encoding parameter twice as often. But it is not something you will see that often.

Title: Re: Multithreading
Post by: cid42 on 2022-12-24 17:43:15

It looks like the extreme block size search in 7.1.2 corresponds to peakset with contiguous steps, but instead of stopping at some sensible upper blocksize they use the full range available (they say 65536 but it should not have included 65536, minor mistake or it wasn't excluded in 2007). It's little wonder that the extreme search starting at min_blocksize 512 took 28 hours, that is peakset with an effort of 8256 (or 8128 exluding 65536) on what must have been a pretty weak CPU.

Against all odds the corpus they used still exists, so I re-ran some of the tests from table 7.2 at 8p flac v1.42, with the extreme512 results below limited to 16384 instead of 65536 for an effort level of "only" 528 instead of >8k:

File FixedBlocking "Extreme512"
     8pb4096       8p peakset 512..16384 contiguous
 00  38.58%        38.35%
 03  58.34%        58.15%
 05  68.59%        68.35%

Might implement blocking method 2 from 6.1.2 to see how it compares (one day, time is short). It's a simple iterative process that tries bigger and bigger blocksizes with a fixed step until the next size up has worse efficiency than the current blocksize+step, at which point it outputs the current blocksize. So it's simple, greedy with a different greed metric than flaccid's greed mode, multithreads poorly (which isn't a knock against it in general). Probably has a smaller effort than current greed mode and close in efficiency.

Title: Re: Multithreading
Post by: Porcus on 2022-12-26 22:00:19

65536 was disallowed already at the first archive.org capture in 2001 (https://web.archive.org/web/20010210024619/http://flac.sourceforge.net/format.html) - all to annoy the cult of -r15,15 ;)

Title: Re: Multithreading
Post by: cid42 on 2022-12-29 21:57:21

Implemented method 2 from the paper, named gasc for greed ascending
Renamed greed mode to gset, at each step it encodes all sizes from a set and greedily picks the most efficient

All analysis and output settings at -8p with tweak and merge disabled. This isn't a maximal efficiency or bang for buck test but a comparison of modes, a mix of a few dozen songs skewed towards metal/punk/electronic:

Relative Size       CPUEst    Settings
100.00%  872823885  194.03077 Fixed 4096
 99.91%  872079622  910.36765 gset 1536..4608 
 99.89%  871852994 1832.12025 gset 768..4608
 99.73%  870451258  622.45694 gasc 1536..4608
 99.72%  870421404 2957.32347 lax gasc 512..65535
 99.69%  870155090 1287.84806 gasc 512..4608
 99.69%  870108957  965.25097 gasc 768..4608
 99.69%  870108468  746.15064 gasc 1152..4608
 99.68%  870072803 1415.20407 peakset 1536..4608
 99.64%  869659199 1504.23657 lax gasc 1024..65535
 99.58%  869119347 4689.57017 peakset 768..4608

gset has been trounced by gasc, gasc is faster and more efficient. Some unintuitive results from gasc, it seems that when too-small a blocksize_min is used the algorithm performs worse (compare gasc subset results 512, 768, 1152). gasc 1152..4608 is interesting, relatively cheaply it approaches peakset performance (the more natural comparison would be peakset 1152..4608 which was not tested, it'd sit roughly between the peakset results tested in size and time). When tweak/merge come into play gasc should benefit more than peakset as it has more ground to gain, when quick analysis settings come into play peakset should benefit more than gasc as it does more analysis.

Title: Re: Multithreading
Post by: Porcus on 2022-12-30 00:16:11

A counterintuitive result or two yes, but gasc 1536..4608 and gasc 1152..4608 look quite good.
I did some tests with 1.4.0, where -8pe took nearly 12x the time of -8p. Not saying that -8pe is any good (indeed I said something quite different about it) (https://hydrogenaud.io/index.php/topic,122949.msg1015508.html#msg1015508) but it is kinda a benchmark to what time that, at least, some have been willing to pay.

(You probably have mentioned it somewhere up in this thread, but ... you are using a "-p" on everything? Reason for asking: what if you do everything without -p - would you then beat -8p in shorter time?)

Title: Re: Multithreading
Post by: cid42 on 2022-12-30 08:41:21

Yes mostly used -8p as the output setting as an arbitrary lever to not pull, if too many levers are pulled at once a benchmark doesn't really show anything.

Done a few quick tests, variable blocksize peakset can easily beat fixed in size and time simultaneously. gasc can too but not as well, to beat size it needs to crank up the analysis settings to the point where the time it takes is not that competitive (gasc can scale time down much further than peakset can but not at competitive size):

Size     CPUEst
29904618 5.05966 Fixed -8pb4608
29902783 5.29877 Fixed -8pb4096
29883629 3.54501 gasc analysis 7 output 8 tweak 0 merge 0 list 1152,2304,3456,4608
29879435 4.70175 gasc analysis 8 output 8 tweak 0 merge 0 list 1152,2304,3456,4608
29874167 3.27683 peakset analysis 3 output 8 tweak 0 merge 0 list 1152,2304,3456,4608
29869549 3.79056 peakset analysis 3 output 8 tweak 999999 merge 0 list 1152,2304,3456,4608

There may be better settings to improve further.

Title: Re: Multithreading
Post by: Porcus on 2022-12-30 13:18:35

Quote from: cid42 on 2022-12-30 08:41:21

Yes mostly used -8p as the output setting as an arbitrary lever to not pull, if too many levers are pulled at once a benchmark doesn't really show anything.

Sure, except that you are too modest ;) . Improving over -8p on size and time simultaneously means you are at something which would have been used had it been implemented in the reference encoder (and known well enough).

Title: Re: Multithreading
Post by: ktf on 2022-12-30 14:48:08

I agree. Many have tried and failed at getting variable blocksizes to be both faster and smaller than fixed blocksizes. In fact, getting an encoder to output a variable blocksize stream that is consistently smaller is a feat already, even if the encoder is quite a bit slower.

Title: Re: Multithreading
Post by: cid42 on 2022-12-31 10:51:26

Thanks. All I'm doing is exploring some simple things to a conclusion because it's fun.

Quote from: ktf on 2022-12-30 14:48:08

I agree. Many have tried and failed at getting variable blocksizes to be both faster and smaller than fixed blocksizes.

peakset with analysis 3 output 8 blocklist 1152..4608 is faster always and smaller most of the time, but there's a subset of tracks that do better with fixed -8pb4096 (see round6, ~7.6% of everything I have was smaller with fixed albeit when it was better it was mostly marginal). I think it mainly boils down to variable not being of much benefit when the tracks are predictable (so mostly uses the top blocksize anyway, so the p in fixed -8p runs away with it).

One way to improve could be an adaptable strategy that detects when variable -8 isn't of much benefit and switching to fixed -8p for a stretch. A proper adaptable strategy would be hell to implement but something simple like this might be okay (lets be honest it would probably hurt at least as much as it helps so I've already talked myself out of trying it for now).

Another way to passively improve could be to introduce "fractional" output settings between -8 and -8p, by simply using say -8 X% of the time and -8p Y% of the time. If the game is to use some of the spare time available to a variable run to improve efficiency, this may be the best way to do it. It would at least make the input that prefers fixed to perform better with variable (if not prefer it). Another benefit would be in allowing the expected time to complete to be tuned much finer when doing variable encodes (relying directly on flac/mode settings is pretty coarse). This will be the next thing implemented unless something better comes along.

Quote from: ktf on 2022-12-30 14:48:08

In fact, getting an encoder to output a variable blocksize stream that is consistently smaller is a feat already, even if the encoder is quite a bit slower.

If the encoder is allowed to be slower as long as it nearly always produces smaller files then brute force has a massive advantage, the main wrinkle is having to save 3 bytes per frame to account for sample-based addressing. 4096 is near ideal on average but it shouldn't take much tweaking of a 4096 stream to get the vast majority of content marginally across the line.

Round 5 tl;dr:

Size      CPUEst    Settings
870739898	147.56386 gasc 1152..4608 analysis 8 output 8
870746851	120.52655 peakset 1152..4608 analysis 3 output 8
870942164	111.21946 gasc 1152..4608 analysis 7 output 8
872823885	185.98145 fixed -8pb4096
873203809	183.78757 fixed -8pb4608
874088872	 93.83656 gasc 1152..4608 analysis 6 output 8

Luckily one track in round 5 performed worse, otherwise the assumption would have incorrectly been that fixed was dead in the water. Round 6 tests wider (>1000 tracks) for a better view:

Size        CPUEst     Settings
30886326138	4473.30063 peakset 1152..4608 analysis 3 output 8
30946173364	7032.54908 fixed -8pb4096

Peakset was more space-efficient than fixed ~92.4% of the time, completing in ~65% of the time on average.

Title: Re: Multithreading
Post by: cid42 on 2023-01-01 13:06:53

Disregard the above round 5, there was a mistake in the gasc implementation that meant the output settings were never triggered (so the above round 5 is analysis x output x not analysis x output 8, no wonder it scaled so strongly with analysis settings). Round 6 is valid.

The fixed results show that gasc is more compelling:

Round5Fixed tl;dr
All variable output -8, all variable blocksizes 1152..4608
* 870166176 499.55594 peakset analysis 8
* 870231352 301.22267 peakset analysis 7
* 870333821 256.93233 peakset analysis 6
* 870500404 170.39666 peakset analysis 5
* 870663610 117.48744 gasc analysis 6
  870712531 129.71610 gasc analysis 7
  870739898 148.85595 gasc analysis 8
  870746851 120.34103 peakset analysis 3
* 870777535  94.49873 gasc analysis 5
* 870935582  78.17366 gasc analysis 3
  871363551 128.79010 peakset analysis 2
* 871431044  76.93183 gasc analysis 2
  871569362 108.65949 peakset analysis 0
* 871658798  70.50157 gasc analysis 0
  872823885 183.95303 fixed -8pb4096

* No stronger setting was faster

So gasc is stronger at quicker settings, peakset is stronger at slower settings but only one of the best-in-show peakset results fulfilled the task of being quicker than fixed -8p.

Implemented fractional output settings for gasc (not for peakset yet, that requires some heavy refactoring and I'm about to run out of free time for the next several weeks). Fractional output can be used whenever analysis settings are different to the standard output settings. Settings of --comp-output 8 --comp-outputalt 8p --outperc 60 use -8 60% of the time, -8p 40% of the time.

Repeating the above tests but with all gasc encodes using fractional output to try and take roughly the same amount of time as a fixed -8p encode gives these results:

Round 7 tl;dr out 8 outalt 8p
  870337444 180.93322 gasc analysis 5 outperc 27
  870369148 176.91559 gasc analysis 6 outperc 50
  870431237 177.77482 gasc analysis 3 outperc 17
  870464232 179.80446 gasc analysis 7 outperc 57
* 870500404 170.39666 peakset analysis 5
  870905130 175.96778 gasc analysis 2 outperc 17
  871104714 176.42821 gasc analysis 0 outperc 10
* 872823885 183.95303 fixed -8pb4096

* Results from round5fixed for comparison

So the top gasc+outperc results can beat peakset 5 at least for this small corpus. The bigger corpus is running but it'll take a while as gasc is not multi-threaded as it cannot easily be done (at least not purely, well-performing multi-threading can and will probably have to be done by chunking, a reasonably well-performing pure dual core implementation is possible but not worth the hassle).

Attempts could be made to smartly choose when to use stronger output settings with frac-output, but it's not clear how to go about that. The current implementation is indiscriminate deterministic and evenly spread and that's probably how it will stay unless there's an objective win to a different strategy (that's also easy to implement).

Title: Re: Multithreading
Post by: Porcus on 2023-01-01 15:15:00

Quote from: cid42 on 2022-12-31 10:51:26

One way to improve could be an adaptable strategy that detects when variable -8 isn't of much benefit and switching to fixed -8p for a stretch. A proper adaptable strategy would be hell to implement but something simple like this might be okay (lets be honest it would probably hurt at least as much as it helps so I've already talked myself out of trying it for now).

Another way to passively improve could be to introduce "fractional" output settings between -8 and -8p, by simply using say -8 X% of the time and -8p Y% of the time.

For output, the X must be updated by learning for this to make any sense? Just randomizing independently isn't really "useful": those who think -p is worth it will want -p to be applied, right?
If it is for analysis ... then how? Use -p on some block sizes and not on others?

Title: Re: Multithreading
Post by: cid42 on 2023-01-01 19:55:16

The percentage of output blocks that use X or Y output settings is defined by the user in the same way all the other settings are chosen. As long as both output settings are stronger than the analysis setting it makes sense, either way you're replacing whatever the analysis phase finds with a stronger setting. There are a number of reasons it might be a good idea to add this lever:

It allows a config to be scaled to a different fixed time target relatively easily, it boils things down to a single number that should be somewhat linear in time and space between two points
The way it scales is between two known good flac settings, an arbitrary combination of which should also be good. If you know that A takes 1.2 time and B takes 1.8 time, this single setting can allow you to tailor a time to encode output to anything between 1.2 and 1.8
It's much harder (if not impossible in most cases) to choose a singular set of flac settings that sits in time between two other sets of flac settings in a space-efficient way. You need to be an expert and even then you don't have arbitrary granularity and not all settings are made equal
It's relatively fine-grained even in the integer 1-100% way it's implemented. If you want something better than -8 but not as slow as -8p, and p is the most time-efficient way to improve efficiency, this can do that
Different configurations that take the same amount of time to complete are directly comparable in efficiency. Round 7 above did this (I did trial and error using a single track to find the outperc settings to use) so that all the gasc configs take roughly the same time to complete as fixed -8pb4096. This allowed gasc to beat peakset in that time frame which isn't possible without fractional-output
Works just as well with fixed blocksizes

Another arguably better way to implement fractional-output could be for the user to specify "roughly fixed -8p speed please" with the encoder having a concept of time and choosing output settings accordingly. That would be an easier interface as the same user setting would apply regardless of other settings used, however it would be complicated to implement for a number of reasons including building some concept of time into the encoder and probably losing determinism in the process.

If the game is to pull the optimal levers to maximise space-efficiency for a given time-efficiency, a lever that can scale a given config to better match a time is useful.

Title: Re: Multithreading
Post by: Porcus on 2023-01-01 21:31:17

First, what it can be good for, if someone is dependent on "be done in time T": fit it for post-processing later.
It is technically possibly to "apply -p later", but obviously only by successively reducing the predictor precision. So if someone writes a "post-processor" that takes a .flac file, reading each subframe and - subject to user's choice I guess - does one out of two:
I: starts with the predictor at the actual quantified precision q, then work itself downwards to see if that improves,
or
II: if quantified at q=15, start out from that and do the -p routine; if quantified at q<15, assume that it has been done already.

Then the initial run could do as much of this work as the time budget allows for. Then the initial run should be adapted to the choice between I and II:
I: start at 15, run down to q (which may be determined by the time budget!) but stop if it does not improve.
II: leave some at 15 and run the full -p on the others.

Now, why I don't like the idea at all: precisely the linearity in bytes saved per second spent. It goes against the principle of picking the low-hanging improvements first. Say if you want to achieve the speed of -8, you don't run -3 on most of the data and then spend the time saved on running -8ep on a tiny fraction of the data only to get a little bit more than -8 out of that.
So if -p saves B bytes per second and that is acceptable, then take those bytes. If it isn't, then find something between -8 and -8p that is cheaper. There is something such.

For an example from the table I posted at https://hydrogenaud.io/index.php/topic,122949.msg1015508.html#msg1015508: suppose you are willing to spend the time equal to using -8 on 80 percent of the data and -8p on 20 percent of it. That would take .8*833 + .2*3003 = 666+601=1267 seconds. The size would be .8*11969604531+.2*11961291433 = 11967941911, which is bigger than using -8 -A "subdivide_tukey(4)".

Title: Re: Multithreading
Post by: cid42 on 2023-01-01 23:27:01

Ideally the output settings used with frac-out do not surround a setting in time that has a better space-efficiency than the interpolation does at that time. If -8 -A "subdivide_tukey(4)" almost always does fit that bill, then frac-out should probably use that as one of its settings with -8 or -8p (or something even closer) as the other to hit some target within a narrower range.

How many settings are there that can be said to have a strong tendency to have the best space-efficiency for its time, ie there exists no surrounding pair of settings whose interpolated value at that point of time is more space-efficient? If these super settings exist do they tend to be strongly ordered, how much volatility is there in timing and in space-efficiency?

It would be nice if there were a set of super settings that could confidently be applied (either individually or adjacent with frac-out for more granularity) to "normal" output and be happy that it's close to optimal. Like presets but on steroids. I doubt the set if it exists has too many members, many settings search within a range which must introduce at least some volatility, which should limit how close super settings can get to each other.

How obvious is it which apod settings should be better than others? Is there an exhaustive list of apod functions that can be tested with representative input to collate a rule-of-thumb ordered set of super settings?

Title: Re: Multithreading
Post by: Porcus on 2023-01-02 00:33:53

If you put up a size/time diagram, there should be a "convex" relationship: given two settings A (faster) and B (compresses better), dash a line between them; if a third setting C is between them in size but slower than the line, throw it out. Same if it is between them in time but larger than the line. (Here I disregard the -l affecting decoding speed, FLAC decodes faster than anything.)
At https://hydrogenaud.io/index.php/topic,123025.msg1016761.html#msg1016761 I argue that certain settings should be avoided.

I'm not sure what you mean by "volatility" here, there is surely quite a lot of variation between signals - but either the decision has to be made before selecting how to encode, or one must learn from the signal during encoding. The "signal analysis" could be a first-round encoding - like you have employed here, using faster settings for analysis.
Some properties can be guessed from the "format"; although some CDDA signals do fare better with -8e than with -8p, that is getting very rare with 1.4.0, ktf & co nearly killed it. But not so for high resolution, where I more often see -8e beating -8p, e.g. my table at https://hydrogenaud.io/index.php/topic,123025.msg1018116.html#msg1018116 (but stacking up with apodization functions typically does even better).

As for the windowing:
Reference flac introduced a lot of possible apodization functions, and the tukey (cosine-tapering the beginning and end) turned out successful after a bit of testing (including here at HA). ktf crafted the partial_tukey and punchout_tukey a decade ago (they leave out portions of the signal) and the subdivide_tukey introduced in 1.4.0 does this in a way that recycles more calculations and covers more possibilities. Saying that subdivide_tukey(N+1) subsumes subdivide_tukey(N) isn't completely true - you have to upscale the tapering parameter by (N+1)/N. And still it will estimate before encoding and could be slightly wrong. But by and large, stepping up the number of those will spend progressively more time searching for the even smaller needles in the haystack. If you want something with time consumption between -A subdivide_tukey(4) and -A subdivide_tukey(5), you can give -A tukey(7e-1);subdivide_tukey(4);flattop or something like that.
Oh, or different taperings. https://hydrogenaud.io/index.php/topic,123025.msg1017245.html#msg1017245 .

Title: Re: Multithreading
Post by: cid42 on 2023-01-02 20:53:37

Round 8 tl;dr
1172 tracks, a few dozen dupes from VA compilations. All variable blocksizes 1152:4608

Size        CPUEst     TracksGreaterThanFixedCounterpart Settings
35086113631 7703.72447    5 (0.43%)                      gasc analysis 5 output 8 outputalt 8p outperc 27
35086680315 7429.37314    6 (0.51%)                      gasc analysis 5 output 8 subdivide_tukey(4) outputalt 8p outperc 35
35088009462 7582.09801   10 (0.85%)                      gasc analysis 6 output 8 outputalt 8p outperc 50
35089443839 7629.84433    7 (0.60%)                      gasc analysis 3 output 8 outputalt 8p outperc 17
35090596517 7737.50519   18 (1.54%)                      gasc analysis 7 output 8 outputalt 8p outperc 57
35092376807 7042.73055   67 (5.72%)                      peakset analysis 5 output 8
35096064696 5757.87535   63 (5.38%)                      gasc analysis 6 output 8 subdivide_tukey(4)
35096857576 6269.30259   64 (5.46%)                      gasc analysis 7 output 8 subdivide_tukey(4)
35099235041 4687.09807   68 (5.80%)                      gasc analysis 5 output 8 subdivide_tukey(4)
35104983452 4098.46001   94 (8.02%)                      gasc analysis 3 output 8 subdivide_tukey(4)
35165983516 7972.30849    0 (0.00%)                      fixed -8pb4096

Without subdivide_tukey(4) is slightly ahead but takes a bit longer, you may be right that subdivide_tukey(4) and other intermediate settings may be slightly better choices but this test is a wash. Should have set outperc slightly lower than 35 to get time targets closer, for that matter all gasc runs could do with reducing outperc slightly to better match fixed run but I was more concerned with making sure none of them exceeded fixed -8p time.

Only a handful of tracks from each of the best runs were bigger than their fixed -8pb4096 counterpart which is a good thing to note, it means that when using good variable settings targeting the same time as a fixed setting there is very little downside to using variable blocksize (at least for -8p and probably all "slower" settings that give the analysis phase enough time to benefit).

Title: Re: Multithreading
Post by: Porcus on 2023-01-02 23:14:01

The figures for "peakset analysis 5 output 8" are a size saving of 0.2 percent.
What does plain -8 yield of size on this corpus? -p doesn't save as much as 0.2? (That is rare in a big corpus.)
Yes I know what you say about pulling too many levers, but again consider the following perspective: If you have found an algorithm that - starting from -8 - triples the size savings of -p in less of a time cost, that is damn impressive - and for an algorithm that isn't even tweaking the apodization functions.
(No reason not to tweak the functions in the final stage, of course.)

If I have understood your notation, the number after "analysis" is the preset used for analysis, e.g. "analysis 6" means you use a -6 setting for the analysis stage?
It is then a bit weird that gasc analysis 6 output 8 subdivide_tukey(4) is better than "7", but I guess this is just one of those coincidences that happen. -6 has -5's LPC order.

... by the way, -8 -A subdivide_tukey(5) would also fall between -8 -A subdivide_tukey(4) and -8p, but cost so much more than (4) in my testing that you would rather use -8p. Gut feeling says that when a finer subdivision isn't much useful with 4096 blocks, then it is no better at smaller blocks - and that assumption could very well fail.

Oh, and: 1152:4608, does that mean the multiples of 576 or ...?

Title: Re: Multithreading
Post by: cid42 on 2023-01-03 00:18:40

You're right about the notation, by 1152:4608 I meant blocksizes 1152,2304,3456,4608, that's also what I mean whenever there's something like 1152..4608, contiguous multiples of min blocksize up to whatever. It is odd that gasc 6 beats 7 but it's not impossible, possibly for some reason gasc 7 has more instances of a smaller blocksize being chosen before a bigger more appropriate blocksize gets tested (hazard of early exit strategy). Or more likely they perform so similarly one gets an advantage over the other by random chance, in an even fight 7 should win but they're not working on the exact same set of frames as the frames might be staggered.

These are the tracks that the top 2 runs output bigger than the fixed run:

Heat OST - Various Artists [1995]/11 - Moby - New Dawn Fades.flac
LudAndSchlattsMusicalEmporium/PMM-Bach-Cello-Suite-No.-1-G-Major-MASTER_V1.flac
LudAndSchlattsMusicalEmporium/PMM-Romeo-and-Julliet-MASTER-V1.flac
LudAndSchlattsMusicalEmporium/t.flac
TOOL - Fear Inoculum (Deluxe) (2019)/03 - Litanie contre la Peur.flac
Tool-Lateralus-CD-FLAC-2001-SCORN/04-tool-mantra.flac

t.flac is a copy of bach erroneously left in the corpus from previous testing. Litanie and tool mantra are both simple ambient filler. Classical is classical and apparently moby is also simple no offense moby. So it's no surprise that those tracks perform worse, the variable without being lax was of limited benefit relative to missing out on p output.

And here's the corpus with fixed -8b4096, as always nothing unnecessary not even a seektable:

35188815331 2021.08537 fixed -8b4096

The benefit with p does look slim. There's a little high res but it's mostly CDDA, there's a lot of metal/punk/electronic but there's also a selection of classical and OST's.

Title: Re: Multithreading
Post by: cid42 on 2023-01-16 00:18:24

Made a tool (attached, should compile fine for windows) to dump some stats about a flac bitstream. Probably not going to develop it much further and the code is nothing to write home about. Works on my PC but YMMV, here's some example output:

STREAMINFO{
 blocksize min 4096 max 4096
 framesize min 14 max 13664
 samplerate 44100
 channels 2 bits_per_sample 16 total samples 6791829
 Blocking strategy: Fixed
}

Metadata bit stats (% including bitstream):
   304 (0.000184%) STREAMINFO

Frame header stats (% excluding metadata):
 23226 (0.014046%) bits spent on sync codes
  3318 (0.002007%) bits spent on frame reservations to maintain syncability
  1659 (0.001003%) bits spent on block strategy bit
  6652 (0.004023%) bits spent encoding blocksize
  6636 (0.004013%) bits spent encoding samplerate
  6636 (0.004013%) bits spent encoding channel assignment
  4977 (0.003010%) bits spent encoding sample size
 25520 (0.015433%) bits spent encoding current frame/sample index with UTF8
 13272 (0.008026%) bits spent encoding frame header crc8

Subframe header stats (% excluding metadata)
  3318 (0.002007%) bits spent on subframe reservations to maintain syncability
 19908 (0.012039%) bits spent encoding model type
  3318 (0.002007%) bits spent on wasted bits flag

Modelling stats (bit % excluding metadata) (excluding residual bits)
    18 (0.542495%) subframes used constant modelling
     0 (0.000000%) subframes used verbatim modelling
     6 (0.180832%) subframes used fixed modelling
  3294 (99.276673%) subframes used lpc modelling
   288 (0.000174%) bits spent on constant
     0 (0.000000%) bits spent on verbatim
    48 (0.000029%) bits spent on fixed
927478 (0.560881%) bits spent on LPC

Residual stats (% excluding metadata):
  3300 (0.001996%) bits spent on residual reservations to maintain syncability
  3300 (0.001996%) bits spent on residual type (4 or 5 bit rice parameter)
164275633 (99.343672%) bits spent on residual encoding

Frame footer stats (% excluding metadata):
 26544 (0.016052%) bits spent encoding frame footer crc16
  5913 (0.003576%) bits spent on frame padding for byte alignment

Combined stats (% excluding metadata)
927814 (0.561084%) total bits spent on modelling
164282233 (99.347663%) total bits spent on residual
150897 (0.091253%) total bits spent on overhead (frame_header+subframe_header+footer

Miscellaneous stats:
 27432 (100.000000%) of residual partitions stored rice-encoded
     0 (0.000000%) of residual partitions stored verbatim
  9936 (0.006009%) total bits spent on pure reservations to maintain syncability (not including the many reserved values in used elements or end-of-frame padding)

Used it to see how the structure of the bitstream changes as different compression settings are used. Tested with the album version of this song without the talking intro: https://www.youtube.com/watch?v=MODhTJwebz8

Percentages are just for the bitstream, no metadata blocks are included in the 100% not even streaminfo

Filesize  Frames  Overhead  Modelling  Residual  Settings
22131570   1659    0.085%    0.067%    99.848%  fixed -0b4096
20941266   1659    0.090%    0.324%    99.586%  fixed -3b4096
20705886   1659    0.091%    0.418%    99.491%  fixed -6b4096
20681000   1659    0.091%    0.579%    99.330%  fixed -8b4096
20670160   1659    0.091%    0.561%    99.348%  fixed -8pb4096
20649171   1721    0.120%    0.565%    99.315%  peakset all -8p, blocksizes 1152,2304,3456,4608 no merge no tweak
20641291   1721    0.126%    0.566%    99.308%  peakset all -8p, blocksizes 1152,2304,3456,4608, no merge with tweak set to maximum iterations
20614856    904    0.070%    0.305%    99.625%  peakset all -8p, blocksizes 1152,2304,3456,4608, non-subset with merge and tweak set to maximum iterations
20614731    958    0.074%    0.318%    99.608%  peakset all -8p, blocksizes 576,1152,1728,2304,2880,3456,4032,4608, non-subset with merge and tweak set to maximum iterations
20613363    921    0.071%    0.312%    99.616%  peakset all -8p, blocksizes 1152,2304,3456,4608,5760,6912,8064,9216, non-subset with merge and tweak set to maximum iterations

No surprise that as compression effort increases for fixed runs that the overhead and model proportions increase, stronger settings search for heavier models which take more space to define
The overhead of the weakest peakset jumping relative to fixed is mostly because variable frame indexing is by sample instead of frame number, adding approx 2 bytes per header
By adding tweak in the next run overhead increased slightly thanks to having to encode a few blocksizes with a literal instead of one of the dozen or so predefined common values
This track apparently prefers blocksizes well above 4k on average so lax allows a lot of overhead to be saved by reducing the number of frames. Less frames also means less models to store. Merge did a good job of clawing upwards in blocksize despite the max blocksize used in analysis being below what it probably should be for this input
Adding intermediate blocksizes without increasing the max blocksize used during analysis didn't help much, the analysis stage performed better but it ate some of tweak/merges lunch to do so
Finally changing the blocksize list to something better-fitting the input was beneficial, but there wasn't much to gain

Title: Re: Multithreading
Post by: cid42 on 2023-02-03 19:58:26

No major new features worth benchmarking but a lot of quality of life improvements. The code's a mess by design as it's all experimentation, this is the start of tidying by keeping what works and reworking how some things are done into what they probably should have been from the start:

Implement an output queue. Instead of encoding and dumping an output frame directly after analysis has found it, add it to the queue to be processed later
Batched output frame encoding is multithreaded (when analysis settings are different to output settings). This allows a user to introduce multithreading even if the analysis algorithm used is single-threaded
Move merge and tweak passes to act on the output queue instead of being intertwined with the analysis algorithm. Merge/tweak passes are triggered on queue flush
Separating merge/tweak entirely simplifies analysis implementations and allows tweak/merge to be used with modes they previusly couldn't (gasc couldn't use either but now can use both, chunk couldn't use merge and still can't as it hasn't been ported yet).
Another benefit is that the queue size can be tuned to reduce the time merge/tweak take to execute with the tradeoff of slightly lower efficiency (as boundaries are introduced that tweak/merge cannot cross). A pass is iterative and acts on all known frames, even if only a small part of the set is benefitting from iteration all frames are tested. A smaller queue localises the high-order passes.
Multiple output settings has been implemented for the output queue, so anything that implements the queue gets it for free
Implement fixed blocking strategy as its own mode instead of fudging it with chunk mode. Slightly less overhead and a few checks to validate the commandline. Technically the implementation is still a fudge just in a different way (to be able to reuse queue code instead of reimplementing a multithreaded queue)
peakset ported to use output queue so it catches up on some features. As well as allowing the merge/tweak code to be removed from peakset there is multithreading of output frame encoding which was single-threaded before so there should be some speedup

Not all deprecated code has been removed yet as chunk/gset still uses it, they will either be ported to use the queue or deprecated TODO. The code in merge.c/h and tweak.c/h is deprecated, replaced by functions queue_tweak/qtweak/queue_merge/qmerge in common.c. flist functions are deprecated, peakset still uses the flist struct but it probably shouldn't.

I'm predicting a queue size somewhere in the range 128-1024 is probably a good tradeoff to speed up tweak/merge with little efficiency loss. For settings that rely heavily on these it might be a nice overall speed bump (ie peakset subset with strong tweak). Might test this week.

Title: Re: Multithreading
Post by: ktf on 2023-02-04 12:16:35

I finally got some time to look into this code. It seems my compiles keep segfaulting at the very last frame (which is a different size than the others). Does this sound familiar?

edit: the segfault backtrace in gdb

Reading symbols from ./flaccid.exe...
(gdb) r
Starting program: /home/m_van/flac-cid42/src/flaccid/flaccid.exe --mode peakset --in test-input.flac --out test-output.flac
[New Thread 3516.0x4930]
test-input.flac Processed 1/7
Processed 3/7
Processed 7/7

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ff6e97aa008 in MD5 ()
(gdb) bt
#0  0x00007ff6e97aa008 in MD5 ()
#1  0x00007ff6e971b672 in peak_main ()
#2  0x00007ff6e97a73a8 in main ()

Title: Re: Multithreading
Post by: lithoformer on 2023-02-04 14:39:32

Quote from: ktf on 2023-02-04 12:16:35

I finally got some time to look into this code. It seems my compiles keep segfaulting at the very last frame (which is a different size than the others). Does this sound familiar?

edit: the segfault backtrace in gdb

Code: [Select]
Reading symbols from ./flaccid.exe...
(gdb) r
Starting program: /home/m_van/flac-cid42/src/flaccid/flaccid.exe --mode peakset --in test-input.flac --out test-output.flac
[New Thread 3516.0x4930]
test-input.flac Processed 1/7
Processed 3/7
Processed 7/7

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ff6e97aa008 in MD5 ()
(gdb) bt
#0  0x00007ff6e97aa008 in MD5 ()
#1  0x00007ff6e971b672 in peak_main ()
#2  0x00007ff6e97a73a8 in main ()

at the very last frame you'll likely have a different number of samples than 'buffer size' so you'll have a segfault

Title: Re: Multithreading
Post by: cid42 on 2023-02-04 19:44:57

Quote from: ktf on 2023-02-04 12:16:35

I finally got some time to look into this code. It seems my compiles keep segfaulting at the very last frame (which is a different size than the others). Does this sound familiar?

edit: the segfault backtrace in gdb

Code: [Select]
Reading symbols from ./flaccid.exe...
(gdb) r
Starting program: /home/m_van/flac-cid42/src/flaccid/flaccid.exe --mode peakset --in test-input.flac --out test-output.flac
[New Thread 3516.0x4930]
test-input.flac Processed 1/7
Processed 3/7
Processed 7/7

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ff6e97aa008 in MD5 ()
(gdb) bt
#0  0x00007ff6e97aa008 in MD5 ()
#1  0x00007ff6e971b672 in peak_main ()
#2  0x00007ff6e97a73a8 in main ()

Coincidentally commit 4b41a0fdb9068d802a624bd35b4419f73f0a7fb8 fixed a last frame bug in peakset introduced the previous commit when porting to the queue, but it looks like the problem is actually that I broke mbedtls support at some point and didn't notice. Peakset currently doesn't hash on the fly it does it after frame processing all at once, so MD5 in the trace is a giveaway.

The latest commit 138832bf232f42b92e3ea4c33f168930d45d4135 fixes the mbedtls path for me, to compile on Linux with gcc I need to add

-I./ mbedtls/platform_util.c mbedtls/md5.c

to the gcc command, YMMV.

flaccid should compile without warnings, if there's warnings there may be a problem but I've only tested with gcc on Linux. MSVC probably warns about some things, if it does let me know and I can try to eliminate them.

Title: Re: Multithreading
Post by: ktf on 2023-02-05 07:34:14

Quote from: cid42 on 2023-02-04 19:44:57

flaccid should compile without warnings, if there's warnings there may be a problem but I've only tested with gcc on Linux. MSVC probably warns about some things, if it does let me know and I can try to eliminate them.

I get a lot of warnings like these

chunk.c: In function 'chunk_main':
chunk.c:214:49: warning: implicit declaration of function 'MD5_Update' [-Wimplicit-function-declaration]
  214 |                                                 MD5_Update(&ctx, ((void*)input)+iloc, ilen);
      |                                                 ^~~~~~~~~~
chunk.c:214:49: warning: nested extern declaration of 'MD5_Update' [-Wnested-externs]
chunk.c:232:33: warning: implicit declaration of function 'MD5_Final' [-Wimplicit-function-declaration]
  232 |                                 MD5_Final(set->hash, &ctx);
      |                                 ^~~~~~~~~
chunk.c:232:33: warning: nested extern declaration of 'MD5_Final' [-Wnested-externs]

I couldn't figure out what I did wrong, so I just disabled MD5 calculation altogether. Even with that, I couldn't get gasc to work, so I did a comparison with peakset only. I compiled with

gcc -o flaccid -DFLAC__NO_DLL -fopenmp *.c -I../flac-cid42/include ../flac-cid42/src/libFLAC/.libs/libFLAC-static.a -logg -lmbedtls -lmbedcrypto -Wall -O3 -funroll-loops  -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline  -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -Ofast

Here is a nice graph

This is about 8.5 hours of music from very different genres. In fact, two tracks randomly selected from every album I used for my comparison here (http://audiograaf.nl/losslesstest/Lossless%20audio%20codec%20comparison%20-%20revision%205%20-%20cdda.html). Compared are flac -4, -5, -6, -7, -8, -8p (from left to right) with flaccid peakset output-comp 8, peakset (with defaults), peakset with a 9 blocksizes (all multiples of 512 up to 4608) and peakset with 18 blocksizes (all multiples of 256 up to 4608) Times are CPU times, so 'ignoring' multithreading. edit: for those interested, here is a breakdown per track (http://www.audiograaf.nl/misc_stuff/flaccid-very%20large%20set%20of%20tracks-per-track.pdf).

Very impressive.

Title: Re: Multithreading
Post by: cid42 on 2023-02-05 11:11:07

Ooh I made it to one of your PDF's 8)

Are those warnings after updating to the latest commit? I was getting warnings like that trying to build on Linux using the mbedtls path, fixed it after you were having issues.

Here's how I compile the mbedtls path on Linux:

gcc -oflaccid-mbed chunk.c common.c fixed.c flaccid.c gasc.c gset.c load.c merge.c peakset.c tweak.c -I./ mbedtls/platform_util.c mbedtls/md5.c -O3 -funroll-loops  -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline  -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -Wno-sign-compare -I../flac/include -lm -fopenmp /home/u20/Documents/mountain/runtime/flac/linbuild2/src/libFLAC/libFLAC.a -I../flac/include -lm -fopenmp -logg

The difference being that mbedtls is embedded instead of using a library, if you have issues on the latest commit this might help.

FWIW here's the compile command for OpenSSL which is what I normally use:

gcc -oflaccid chunk.c common.c fixed.c flaccid.c gasc.c gset.c load.c merge.c peakset.c tweak.c -DUSE_OPENSSL -lcrypto -O3 -funroll-loops  -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline  -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -Wno-sign-compare /home/u20/Documents/mountain/runtime/flac/linbuild2/src/libFLAC/libFLAC.a -I../flac/include -lm -fopenmp -logg

There were a few gasc bugs introduced a few commits ago from implementing the queue that are now fixed, the commits tested working on a few tracks but I was too eager and pushed before wider testing. Commit 2d1c66e36b26f750ba7a8c9cf6afc90099515aaa fixed gasc multithreading, commit 4b41a0fdb9068d802a624bd35b4419f73f0a7fb8 fixed a gasc bug in hashing the last frame. All known bugs have been fixed in the latest commit, if you're having issues with commit 138832bf232f42b92e3ea4c33f168930d45d4135 or later then it's something I'm unaware of. I'll try and be better at not pushing bugs from now on.

Title: Re: Multithreading
Post by: Porcus on 2023-02-05 11:36:22

Interesting indeed. Though I realize that I have too often overlooked the log scale, damn sometimes I am just stupid and even sticking to it ... but anyway:

* Bumping prediction order from 8 to 12 makes for quite a bit, and that is not a surprise. Not unlikely that could have been exploited further, which within subset is only possible for high resolution
* Optimizing blocksize - which obviously also makes for a better fitting predictor - can easily make for slightly more than half of that improvement

Which effect is largest?

So for testing purposes, I am curious what happens in the following comparison.
* Take the leftmost red (peakset output-comp 8, is that?) and limit it to be comparable to -6 (prediction order, apodization, max Rice parameter) It would still improve over -6; how much? Is it cheaper/better than going -l12?
* To get "everything else equal", one would have to consider the fact -r6 means half the partition size for half the block size. At least to compare predictor fit one might consider to lock it down to ... well -r9 would of course make everything equal? Not time taken, I mean - it is so -r6 will run 7 calculations rounds?

Wondering if the association between blocksize and the actual partitioning of that block shows anything surprising?
Some phenomena possibly at play:
1: bad-fitting predictor leads to larger residuals and to more block splitting and finer (in terms of absolute sample count) partitioning
2: "short noisy parts" surely lead to bad-fitting predictor, but there the more noise-alike the less point in improving it and so you might as well just use the "neighbouring frame's vector" and merge the blocks.

Title: Re: Multithreading
Post by: cid42 on 2023-02-05 15:21:40

Actually fixed gasc as of commit 5fb1861601759de021a8b25f9038f5034da890be. gasc/fixed/peakset passed flac-test-files and MD5 works for all bit depth input now. Latest commit adds a --no-md5 option which disables md5 for fixed/gasc/peakset.

Title: Re: Multithreading
Post by: cid42 on 2023-02-06 19:21:21

gset and chunk have both been ported to the new codebase (now everything has been ported with feature parity), gset is probably not as good as gasc but now that chunk can benefit just as much from merge/tweak it may be competitive with gasc. Chunk has been reworked to multithread within a chunk instead of multiple chunks simultaneously, heavily simplifying the implementation. I've also removed the ability to stride >2 as chunk is already low effort and this simplifies the code further, now every chunk subdivides by 2 not n. Every mode has been tested to the point where it passes flac-test-files and a few hundred other tracks. Hopefully an issue hasn't crept in, maybe edge-case tests beyond what flac-test-files catches wouldn't be a bad thing to implement instead of relying on giant benchmarks to weed out 1/1000 bugs.

If using tweak and trying to be time-efficient, --tweak 1 --queue 64 or something like it may be a better use of equivalent time than --tweak 1024 --queue 2048 (focusing the most time on the frames that show the most benefit).

Title: Re: Multithreading
Post by: ktf on 2023-02-06 20:53:56

Thanks! Compiles better indeed. Changes to the help text are an improvement :))

Would you consider adding a few examples as to what you would consider good choices? Much like `flac` has presets? I just tried a few things before running the benchmark, but perhaps you can do some suggestions? It is for example not really clear whether --analysis-apod and --output-apod are populated by default and are only to override what --analysis-comp and --output-comp set or not.

If you can do some suggestions I can run the benchmark again. There are quite a few suggestions in this thread already, but many might be outdated by now. It would be nice to show how much room for improvement FLAC really still has 8)

edit: would you like me to share a Windows compile here for others to test?

Title: Re: Multithreading
Post by: cid42 on 2023-02-06 23:08:04

Quote from: ktf on 2023-02-06 20:53:56

Thanks! Compiles better indeed. Changes to the help text are an improvement :))

Would you consider adding a few examples as to what you would consider good choices? Much like `flac` has presets? I just tried a few things before running the benchmark, but perhaps you can do some suggestions? It is for example not really clear whether --analysis-apod and --output-apod are populated by default and are only to override what --analysis-comp and --output-comp set or not.

The apod options if used override, they're NULL by default so whatever apod the flac preset uses is what's normally active.

Quote from: ktf on 2023-02-06 20:53:56

If you can do some suggestions I can run the benchmark again. There are quite a few suggestions in this thread already, but many might be outdated by now. It would be nice to show how much room for improvement FLAC really still has 8)

I can add some presets maybe by the end of the week. Until then here's some sprawling pointers, most from the thread should apply. It's been a while since experimenting and TBH it would probably take a lot of testing to answer properly.

Avoid flac presets 1 and 4 for cleaner comparisons, flaccid can't use -M which means preset 1 is really 0 and preset 4 is 3l8
Unlike every other mode, gasc doesn't use --blocksize-list and instead uses --blocksize-limit-lower to define the minimum blocksize. The interface for gasc may require a rethink
For subset encodes, even if tweak is enabled blocksize 4608 should probably be in the list for full coverage
Tweak passes are not very useful when the minimum blocksize in the list is very small. The distance to tweak is based on the smallest blocksize so as not to cover the same ground analysis did (particularly peakset in mind). If the minimum blocksize is below 576 it's probably not worthwhile using unless doing a slow encode
A smaller queue should speed tweak/merge up considerably, not tested. A queue size in the ballpark of 100 is probably the way to go for anything that's not meant to be very slow, very slow analysis should use a large queue just to avoid the minor efficiency loss
For lax encodes there is normally some benefit to the analysis blocksize list extending beyond 4608, but there's limited benefit (unless the input has a high frequency). Merge seems a more time-efficient way to attempt to encode large blocksizes in general
Sometimes merge gets in the way of tweak, often when merge is disabled tweak makes up the difference. Merge is far weaker of the two, which makes sense as tweak tries to shape to the signal whereas merge only benefits when the signal is simple enough that overhead is minimised by merging. Large blocksizes are nice but less useful than they seem
Chunk may be more suitable for quick lax encodes, it can quickly extend into large blocksizes which gasc cannot (then again gasc+merge+lax might compete). Probably best to use chunk with tweak, probably best that the minimum blocksize isn't too small so tweak can do some heavy lifting
gasc is probably more suitable for quick subset encodes but not extensively tested
4096 and other common blocksizes are common for a reason, any blocksize list for analysis should have good coverage at least of 2k-4k, probably 1k too. Below 1k you can probably get away with relying on tweak or nothing to extend into smaller blocksizes, unless going very slow
For very slow encodes, output encode settings are used much less often so you could potentially get away with using -e
Even for slow encodes I don't think peakset should go below a blocksize of 768, maybe 576 at a push. Ideally peakset gets the frames in the ballpark and tweak has room to refine, beyond that you're probably better off increasing flac settings
Don't bother with --outperc unless you want to tune to fix time for a specific comparison
Analysis and output settings shouldn't be too different, ideally analysis is a good predictor of output otherwise it could be detrimental
Even for very quick encodes I don't see the point of using very light analysis settings, very quick encodes have low analysis effort which already does the job
The way tweak/merge is defined it keeps iterating until a pass saves less than the specified number of bytes. Passes are done on a queue, so a smaller queue makes saving x bytes harder

With the above in mind these settings might be reasonable for subset (again I haven't tested queue depth or gasc much, nor chunk with decent tweak/merge, some of these are more guess than knowledge). Seems reasonable to pair stronger flac settings with stronger flaccid settings, it could be tweaked many ways. In probably ascending difficulty

Quick
--mode gasc --blocksize-limit-lower 1536 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
--mode chunk --blocksize-list 1152,2304,4608 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
--mode gasc --blocksize-limit-lower 1152 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
--mode chunk --blocksize-list 1152,2304,4608 --tweak 64 --queue 16 --merge 0 --analysis-comp 6 --output-comp 8
--mode gasc --blocksize-limit-lower 1152 --tweak 64 --queue 16 --merge 0 --analysis-comp 6 --output-comp 8
Mid
--mode peakset --blocksize-list 1152,2304,3456,4608 --analysis-comp 8 --output-comp 8p --queue 64 --tweak 1 --merge 0
--mode peakset --blocksize-list 768,1536,2304,3072,3840,4608 --analysis-comp 8 --output-comp 8p --queue 64 --tweak 1 --merge 0
Slow
--mode peakset --blocksize-list 768,1536,2304,3072,3840,4608 --analysis-comp 8p --output-comp 8p --queue 8192 --tweak 1 --merge 0
Probably not worth it
--mode peakset --blocksize-list 768,1536,2304,3072,3840,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0 (this may have better efficiency than the last two as tweak may perform better)
--mode peakset --blocksize-list 576,1152,1728,2304,2880,3456,4032,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0
--mode peakset --blocksize-list 576,1152,1728,2304,2880,3456,4032,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0

Probably far from optimal but it might be a good starting point. I'd consider the first peakset setting on the list as probably the most balanced.

Quote from: ktf on 2023-02-06 20:53:56

edit: would you like me to share a Windows compile here for others to test?

Sure, crack on sharing whatever you like. I can normally cross-compile but had trouble as libflac is involved.

Title: Re: Multithreading
Post by: ktf on 2023-02-07 12:41:44

Here's a Windows x64 binary for anyone wanting to give this a try.

Title: Re: Multithreading
Post by: cid42 on 2023-02-07 13:14:03

Thank you for posting a binary, a few people kicking the tyres should shake out any remaining bugs in handling.

Case in point I've just fixed subset handling, high sample rates were being limited more strictly than they should have been (they were limited as if they were <=48KHz, ie a blocksize limit of 4608 not 16384 and lpc order of 12 not 32). The above binary should be fine for <=48KHz, for subset >48KHz it may perform slightly worse than it could. Not fully tested but it's just a UI change, saw the binary and figured it should get pushed asap in case someone tries with high sample rate input.

Title: Re: Multithreading
Post by: bennetng on 2023-02-07 14:52:18

Quote from: cid42 on 2023-02-06 13:45:23

And that's with ~8KB of padding. The smallest I can make your example while still subset (if you consider variable blocksize to be subset) is this:

I think it is more appropriate to ask here. "What I consider" is unimportant. If variable blocksize is implemented in future Xiph releases, would it be subset according to the "official" standard? In other words, would it require --lax?

Title: Re: Multithreading
Post by: cid42 on 2023-02-07 15:08:23

Technically it's subset, but some players fail even if supposedly they support subset.

From earlier in the thread:

Quote from: ktf on 2022-11-10 08:40:16

So, the CPU time is 'single-core time' I guess? So when using 4 CPU cores wall time would be a quarter of that?

I think there are plenty people on this forum that would be willing to use this, creating a subset, fully spec-compliant FLAC file that might play in their hardware system. Half a percent is quite an improvement, even at the cost of having the encoder run near real-time on a single core. So once again: very impressive.

Sadly non-standard blocksizes and variable blocksize streams are not well supported features in many hardware decoders, see here (https://wiki.hydrogenaud.io/index.php?title=FLAC_decoder_testbench), but I think many people here won't mind.

Title: Re: Multithreading
Post by: bennetng on 2023-02-07 15:14:46

Thanks. Because the (buggy) CUETools.Flake encoder does not require --lax when using variable blocksize, so hopefully it will be within subset in future official releases.

Title: Re: Multithreading
Post by: Porcus on 2023-02-07 15:26:57

New CUETools has disabled the variable blocksize as a quick fix, so by now there is no maintained encoder supporting it.

Title: Re: Multithreading
Post by: bennetng on 2023-02-08 07:30:47

Quote from: cid42 on 2023-02-06 23:08:04

With the above in mind these settings might be reasonable for subset (again I haven't tested queue depth or gasc much, nor chunk with decent tweak/merge, some of these are more guess than knowledge). Seems reasonable to pair stronger flac settings with stronger flaccid settings, it could be tweaked many ways. In probably ascending difficulty

Quick
--mode gasc --blocksize-limit-lower 1536 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
Probably far from optimal but it might be a good starting point. I'd consider the first peakset setting on the list as probably the most balanced.

cmd

H:\>flaccid --in phantom.wav --out phantom.flac --mode gasc --blocksize-limit-lower 1536 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
phantom.wav     settings        mode(gasc);lax(0);analysis_comp(6);analysis_apod((null));output_comp(6);output_apod((null));tweak(0);merge(0);blocksize_limit_lower(1536);blocksize_limit_upper(4608)
effort  analysis(2.734);tweak(0.000);merge(0.000);output(0.000)  subtiming       analysis(0.00000);tweak(0.00000);merge(0.00000) size    518191375        cpu_time        62.12700

foo_bitcompare

Differences found in compared tracks; the tracks became identical after applying offset and truncating first/last samples.
Extra leading/trailing sections contained non-null samples.

Comparing:
"H:\phantom.flac"
"H:\phantom.wav"
Differences found: length mismatch - 1:40:31.333583 vs 1:40:31.333333, 265981811 vs 265981800 samples.
Compared 265981800 samples.
Discarded last 11 samples containing non-null values from the longer file.
Differences found within the compared range: 531100405 values, 0:00.000000 - 1:40:31.333311, peak: 1.631927 (+4.25 dBTP) at 0:40.642426, 1ch
Channel difference peaks: 1.631927 (+4.25 dBTP) 1.481720 (+3.42 dBTP)
File #1 peaks: 0.999969 (-0.00 dBTP) 1.000000 (0.00 dBTP)
File #2 peaks: 0.999969 (-0.00 dBTP) 1.000000 (0.00 dBTP)
Detected offset as -11 samples.

Comparing again with corrected offset...
Compared 265981800 samples, with offset of -11.
Discarded 11 leading samples containing non-null values from file #1.
No differences in decoded data found within the compared range.
Channel peaks: 0.999969 (-0.00 dBTP) 1.000000 (0.00 dBTP)

Title: Re: Multithreading
Post by: ktf on 2023-02-08 07:54:46

This is because flaccid doesn't support wav input yet, it treats the wav as raw and sees the wav header as samples. Only using FLAC input is recommended for now.

Title: Re: Multithreading
Post by: bennetng on 2023-02-08 08:22:11

I see. For comparison, the same file compressed using flac 1.4.2 -8 is 515831607 bytes.

Title: Re: Multithreading
Post by: Porcus on 2023-02-08 10:52:57

Ooops :-[ Off to the recycle bin you go, tonight's files ...

Anyway, for testing now: what is a suggested setting for "cheap improvement"?

Anyway, errors from compressing .wav are not that big, so here are tonight's sizes from the corpus in my signature, all after metaflac --remove-all --dont-use-padding. I used "partial_tukey(1)" because I was too lazy checking whether decimal points are an issue with this build, and because I intended to increase it later.
Left number: -8p using reference FLAC; middle: mode(gasc);lax(0);analysis_comp(5);analysis_apod(partial_tukey(1));output_comp(8p);output_apod(partial_tukey(1));tweak(0);merge(4096);blocksize_limit_lower(1024);blocksize_limit_upper(4608)
And right: plain "gasc"

3 983 858 119 vs 3 982 064 862 vs 3 978 851 946 for the classical music. The latter makes for bigger difference
3 979 488 237 vs 3 975 772 637 vs 3 973 080 798 for the heavier stuff. The first difference is bigger.
3 997 945 099 vs 3 988 216 396 vs 3 983 642 274 for the "none of the above" pop/jazz/etc.
Last line is a "what?!". Savings of nearly ten and then nearly five - and a quick glance reveals that it makes biggest difference for Kraftwerk's TdF soundtrack:
301 973 233 vs 299 405 596 vs 297 844 789.

Title: Re: Multithreading
Post by: cid42 on 2023-02-08 11:26:18

It'll be difficult to compete with fixed at quick settings as there's not a lot of time and the way a variable blocksize is chosen regardless of mode involves encoding the input multiple times. It might take a new algorithm that can more smartly shape the blocksizes chosen to the input to be competitive for quick encodes, possibly with little or no brute forcing. For now a variable encoding is best suited to compete with slower fixed encodes and to increase how well an input can viably be compressed beyond "slower".

It'll take some tweaking to find suitable presets, those were OTOH and while they should be ascending in difficulty I didn't consider how they compare to flac, the quick settings suggested should probably be disregarded as fixed probably beats them all. gasc is a greedy algorithm so can produce worse results than fixed.

I've found a bug in tweak that makes it perform worse than it should thanks to porting to the output queue, the bug doesn't invalidate the output. Apologies but any benchmarking should probably avoid tweak for a little while, hopefully I have time today to fix it. The port to output queue didn't go as smoothly as I thought, there may be other dragons lurking.

BTW all stat output except cpu_time is deprecated as a lot of it is inaccurate now and was only meant for testing.

Title: Re: Multithreading
Post by: Porcus on 2023-02-08 11:43:01

Sure worth pointing out that as of now there is not much hope that this will compete at speed and surely not at this stage - but it is still worth finding where the lowest-hanging fruit are. Say if merely checking 2048 & 4096 is sufficient to take out the biggest part of the possible improvements, that would likely be the cheapest lever to pull (borrowing a phrase from you) - and it would be worth pursuing for a reasonable preset.

(There must be some nice thesis idea for a clever jr researcher here: find a heuristic that takes as input your typical .flac file with a 4096 frame size, does a quick analysis on the predictor and the ... hm, rice partition order and parameter I guess? - to decide what frames to split. I've seen the IEEE publish far less interesting ideas than that.)

Title: Re: Multithreading
Post by: cid42 on 2023-02-09 14:46:43

Tweak should now be fixed, it mainly affected peakset (tweak did nothing on first flush and 2nd flush onwards had a negative impact) but there was also a general bug (2nd flush onwards was working with a less-optimal tweak distance but at least the impact was still positive). Haven't had a huge amount of time to test the fix, but pushing this is better than leaving it as it was.

gasc is now used by using --blocksize-list with a single blocksize as it should be, --blocksize-limit-lower should never have been used for this purpose.

Title: Re: Multithreading
Post by: cid42 on 2023-02-12 17:23:18

As of commit 0f421d4d6897aaa3829cca51831f5113d2a9f792:

Peakset now supports multiple windows of user-defined size (in millions of samples, with --peakset-window size), so peakset no longer requires access to the entire file at once. This paves the way for RAM to not scale with input size (in a future commit when the loader no longer reads the entire input before encoding begins, big TODO)
Add --no-seek option to disable output stream seeking, meaning the header is not updated after encoding. Requires --no-md5 to be set so that the user understands that hashing has to be disabled
Add output pipe support (with --out -). Either fully cached in RAM to allow header to be updated before writing, or used in tandem with --no-seek to write to pipe as soon as output frames are created
Fixed out-of-spec issue with blocksize_min in header when blocksize of partial last frame of variable encode is less than 16 samples. The partial last frame of variable encodes is still represented in the header but that's within spec (blocksize_min/max are bounds, nothing is mentioned about frame sizes but I assume the partial frame should always be present there even with fixed encodes as there is a tiny chance it's the biggest frame size)
Fixed potential issue with peakset and small queues, set->blocksize_min strikes again. set->blocksize_min has now been eliminated where possible to reflect its change of use
Fixed rare case where a variable encode looks like a fixed encode in the header. This probably occurs when input is very small, it can occur when the settings used are pathological or wildly inefficient, it is very unlikely to occur under normal circumstances
Fixed stat gathering and made it multithread-safe. Subtiming eliminated and everything gathered properly instead of using mode-specific shortcuts
Fixed subset with high samplerates still being restricted slightly too strictly
Fully convert modes to using simple_enc instead of manually defining edge cases. Fixed implementation dramatically simplified
Boilerplate sucked out of beginning and end of all mode implementations to simplify them
flanal utility to validate flac files is now in the repo and has been updated slightly to work with input created with --no-seek

It's getting there but still alpha for now. Once input handling is improved it'll be possible to implement an input pipe, once that's in we're close to all the major features being in place to consider it beta.

Title: Re: Multithreading
Post by: cid42 on 2023-02-14 17:51:40

Thinking about it there's no reason flaccid needs to reuse presets -0 to -8, those could match ./flac's fixed -0 to -8 until better variable encodes taking roughly the same time possibly replace them. The midrange of the previous suggestions could start at -9.

Quote from: Porcus on 2023-02-08 11:43:01

...
(There must be some nice thesis idea for a clever jr researcher here: find a heuristic that takes as input your typical .flac file with a 4096 frame size, does a quick analysis on the predictor and the ... hm, rice partition order and parameter I guess? - to decide what frames to split. I've seen the IEEE publish far less interesting ideas than that.)

Maybe adjacent frames that have similar models are ripe for merging, mostly this will not be subset. Maybe when adjacent modelling varies by a lot it's a sign that more blocksize nuance is needed in that area, the levers at the fixed encodes disposal were changed but maybe they weren't the ideal levers for the job. edit: This amounts to targeted tweak/merge phases just with an initial encode fed in by the input instead of computed fresh. If a targeted tweak/merge is viable it's not just viable for flac input but are candidates to replace the current tweak/merge passes. It seems possible that it's viable, kind of a halfway hack between being smart and being bruteforce.

Regardless of how the variable permutation is chosen, the model of the closest frame at a given point in the input could be used to encode the variable frames. You'd do it to narrow the search space where time is at a premium, the main benefit would be using a smaller max-lpc-order when you seem to be able to get away with it. You could apply it to flac-input, or again use this prior knowledge to speed up the tweak/merge test encodes.

Title: Re: Multithreading
Post by: cid42 on 2023-02-15 17:34:40

From the spec:

Quote

The partition order MUST be so that the block size is evenly divisible by the number of partitions. This means for example that for all odd block sizes, only partition order 0 is allowed. The partition order also MUST be so that the (block size >> partition order) is larger than the predictor order. This means for example that with a block size of 4096 and a predictor order of 4, partition order cannot be larger than 9

The first part means it may be wise to stick to blocksizes divisible by 2^n for n as large as is feasible. Sane analysis settings will do this anyway, but tweak currently chooses the tweak distance each pass with (min_blocksize_used_in_analysis/(tweak_pass+2)). This definitely introduces blocksizes that can only be stored with low order partitions fairly early, and it looks like it's a contributing factor to successive tweak pass benefit looking like a decreasing reverse sawtooth wave.

Ensuring tweak distance is a multiple of 16 (or 8/32/64) might be enough, something like ((((min_blocksize_used_in_analysis/(tweak_pass+2)))/16)*16). Results using that are inconclusive, it appears to help a tweak pass get a better result sooner but the restriction means a full tweak run (aka --tweak 1) exits earlier resulting in a slightly worse result (but much quicker). If that's true in general then it's still a win IMO, ideally for a general encode you do a few passes that grab most of the benefit and call it good.

     Amount saved in pass               Total saved by a certain point
pass    1    2    4    8   16   32   64     1      2      4      8     16     32     64
 1   3900 3900 3900 3900 3900 3900 4012
 2   1982 1982 1982 1982 1982 1982 1681
 3   1165 1165 1165 1165 1165 1376 1491
 4     69  146  803  803  803 1099 1662
 5   1189 1032  740  740  740  125  295
 6    126  150  527  527  527  863   40
 7    467  460  260  260  606  123   11  8898   8835   9377   9377   9723   9468   9192
 8    489  484  411  411   87    9
 9     34  314  304  304  486  820
10    140  58    77  397   63  152
11    431 358   315   62    7    9       9992  10049  10484  10551  10366  10458
12     95  88    87  176  550
13     25 234   140   17  74
14     50  25   104  428  11            10162  10396  10815  11172  11001
15    123  77    13   60
16      9 372   368    7                10294  10845  11196  11239
17    436  73    72
18     17  37    75
19     83  91    15
20     15  13   167                     10845  11059  11525
21     10        16
22      1         5                     10856         11546
23    181
24      3
25     10
26      8
27    100
28      3                               11161

Maybe law of small numbers, this is the only test.

Title: Re: Multithreading
Post by: Porcus on 2023-02-15 22:17:59

That may give quite different treatment when subset is imposed, limiting to partition order 8, i.e. 1/256th of the block. Sure there are exceptional signals (https://hydrogenaud.io/index.php/topic,123025.msg1018251.html#msg1018251) where constraining yourself to subset means you should do -b2048 or maybe even lower only to get a smaller partition (in samples) at -r8. And when your project is to tweak block size, it could very well be that those signals are many enough to care about. Assuming subset is a constraint.

(It would likely need to be that in the reference encoder. I have no idea whether there is any subset-compliant decoder - meaning it would handle partition size of 1 sample and block size 64 - that would choke on partition size of 32 samples and block size 4096. But anyway ...)

As for this:

Quote from: cid42 on 2023-02-14 17:51:40

Maybe adjacent frames that have similar models are ripe for merging, mostly this will not be subset.

Yes, that is merging. I was talking splitting - which for all that I know, may not even be on the table in what you are working on.
(Generally I don't even have any idea how much a "good" (like in -8 with a reasonable -q) predictor vector differs from an "optimal" of the same -q. I know I can run some encodes and parse -a output, but ... not happening with my clumsiness at coding.)

Title: Re: Multithreading
Post by: cid42 on 2023-02-15 23:36:19

Quote from: Porcus on 2023-02-15 22:17:59

That may give quite different treatment when subset is imposed, limiting to partition order 8, i.e. 1/256th of the block. Sure there are exceptional signals (https://hydrogenaud.io/index.php/topic,123025.msg1018251.html#msg1018251) where constraining yourself to subset means you should do -b2048 or maybe even lower only to get a smaller partition (in samples) at -r8. And when your project is to tweak block size, it could very well be that those signals are many enough to care about. Assuming subset is a constraint.

(It would likely need to be that in the reference encoder. I have no idea whether there is any subset-compliant decoder - meaning it would handle partition size of 1 sample and block size 64 - that would choke on partition size of 32 samples and block size 4096. But anyway ...)

Not sure I'm following, I think you misunderstand the test which is fair enough as it was communicated poorly (haven't talked about increasing the min partition order and the test is subset). The test merely limits the blocksizes tweak tries to a multiple of 1/2/4/8/16/32/64 (meaning max partition order is probably at least 0/1/2/3/4/5/6 respectively), that's the columns with rows successive tweak passes. By doing this it allows flac to use higher partition orders if it wants to more often, there's less oddball sizes tested which is good in theory because they are at a disadvantage only having access to lower orders.

Quote from: Porcus on 2023-02-15 22:17:59

As for this:
Quote from: cid42 on 2023-02-14 17:51:40
Maybe adjacent frames that have similar models are ripe for merging, mostly this will not be subset.
Yes, that is merging. I was talking splitting - which for all that I know, may not even be on the table in what you are working on.
(Generally I don't even have any idea how much a "good" (like in -8 with a reasonable -q) predictor vector differs from an "optimal" of the same -q. I know I can run some encodes and parse -a output, but ... not happening with my clumsiness at coding.)

I haven't thought about splitting as in the context of refining peakset analysis we know that splitting is not beneficial, if it was then peakset would have chosen smaller blocks (the exception being when the smallest blocksize available to peakset is used, that is something to consider but tweak does cover that). In an ideal world the other modes approach the accuracy of peakset for a fraction of the cost, so splitting shouldn't be beneficial enough to bother with with there either.

In the context of using a fixed encode to create a variable encode, I don't know how useful looking at a single block is. It's probably best not to split solo or tweak, but to take adjacent frames either side into account when determining the fate of the frame in between. After all frames aren't in a bubble and the frame boundaries were not chosen smartly.

Not working on anything as such, just keep poking things which is how I noticed the partition restriction in the spec. There's been some pretty big updates to the code like input is now read only when analysis needs it and a lot of sane refactoring, but you're looking at the research for new ideas. There's wav input to get working properly, seektable would be nice, and presets TODO. But today I went off on a tangent and am creating a benchmark to test different ways to encode a residual for fun, nothing that's going to help here.

edit: Split could potentially be its own pass before merge or tweak, that only tries to split frames that use the smallest blocksize available to analysis. It might capture a fair amount of what tweak currently does cheaply.

Title: Re: Multithreading
Post by: cid42 on 2023-02-16 15:10:36

As of commit 25e49fcd2122ff8001070eb3a3e5b1e9805653d5:

Initial preset implementation is in, options --preset and --preset-apod. Presets cannot be mixed with manual definitions, if you try it should error.

Presets 0..8 match ./flac so that people can encode fixed with settings they're familiar with, just multithreaded
Preset 9 is currently gasc starting at blocksize 1536 with analysis setting 3m and output setting 8. This is slightly more space-efficient than -8 in testing but it's not guaranteed
Preset 10 is currently peakset using 1152,2304,3456,4608 with analysis setting 3m and output setting 8
More variable presets will come and the two that exist might change
These options from ./flac work with all presets, appended to the preset number: e/l/m/p/q/r
-b from ./flac to pick blocksize also works, with the fixed presets 0..8 only
--preset-apod overrides the apod settings used by the preset, it doesn't append

Other notable changes:

Reworked input handling to now read only as much input as is immediately needed by analysis
--input-format forces input to be read as a particular format. Useful for piped input or when extension is different from typical. Valid formats are: flac wav cdda
flac input can now be piped by specifying '--in -' in tandem with '--input-format flac'
Raw input reworked to only be for CDDA, automatically chosen for files with .bin exension. Pipe works and requires '--input-format cdda'
Initial wav support implemented but it's basic. No pipe, 16 bit only, not very well tested
Hashing is now done by input, greatly simplifying the rest of the interface
Total samples of input no longer used during encode so it passes flac-test-files 45, now all flac-test-files pass
If total samples is available from input header (flac/wav), it's checked on completion
If MD5 is available from input header (flac) and we're hashing, it's checked on completion

Full wav support is a major missing feature, seektable support is another must-have IMO. Is there anything else major besides those that should be on the TODO?

Title: Re: Multithreading
Post by: cid42 on 2023-02-18 22:42:22

As of commit 0fd319c11503b4a31b355310ed73ca9a7366e551:

Seektable support added with --seektable val. -1 (default) to let program size the seektable based on input header (flac/wav), fallback to 100 seekpoints. 0 disables the seektable, n>0 tells the program to use that many seekpoints. The program reserves the right to override users choice if it detects that it's massive overkill
Optional metadata preservation from flac input files with --preserve-flac-metadata. Preserves the following meadata blocktypes: Application, Vorbis, Cuesheet, Picture, Undefined, aka everything that can be preserved
Added a --metacomp option to flanal allowing it to check if --preserve-flac-metadata losslessly preserved the metadata

Both seektable and metadata preservation has passed all tests including preserving the oddball metadata in flac-test-files, that doesn't mean they're free from bugs. If anyone has weird input they'd like to throw at it please do, I've tried to account for edge-cases but may have missed something.

There's also a separate branch in the repo that doesn't have these new features but does have a split pass as discussed a few posts above, enabled with --split. It doesn't seem very effective so unless it can be reworked to be useful it's not going to be mainlined, but maybe I'm not testing with suitable input.

Now that seektable and preserving metadata is in some brave soul can convert their entire collection if they wish. I recommend checking the result with flac -t before you delete the originals, this is still alpha you know ;)

edit: By preserving metadata I mean anything in the flac spec, not very familiar with metadata but there can also be some stuff after the bitstream? That's not currently kept.

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-25 20:19:11

Hi!
Just discovered this tool recently, and I'm trying to compile it under Linux.
The Windows build shared by @ktf works with wine, it's just a bit outdated.

@cid42 I'm not able to compile flaccid for Linux. I read the GitHub README and your message https://hydrogenaud.io/index.php/topic,123248.msg1022138.html#msg1022138 .

During the compilation, I get errors. The main one being:

./common.h:34:29: error: unknown type name 'FLAC__StaticEncoder'; did you mean 'FLAC__StreamEncoder'?

Tested with clang 15.0.7 and gcc 13.1.1, both giving same errors (see attached log files).

If I build the custom libFLAC (https://github.com/chocolate42/flac), I don't get any static libraries (.a) but rather libtool .la files.
Thus I have `flac/src/libFLAC/libFLAC-static.la` and `flac/src/libFLAC/libFLAC.la`but in the README in flaccid it says "PATH_TO_libFLAC-static.a". So I don't know if this is a problem.

I also read that in your message #68 you are using `-I../flac/include` in order to compile flaccid, while the only "include" folder I have is `flac/src/libFLAC/include`.

Any help is appreciated, I'm quite used to compile stuffs myself but I have to admit that I'm a bit stumped here.

Title: Re: Multithreading
Post by: cid42 on 2023-07-26 00:21:28

I'm in and out at the minute but luckily you've caught me in, unfortunately if this post doesn't solve anything it may be up to a week before a follow up. I haven't abandoned flaccid but haven't worked on it for a while, it was left afair in a working state but with an awkward UI and a few nice-to-haves missing like full wav support. The flac repo has since been used to test other libflac things, the static code was originally on the master branch but it's now in the static_encoder branch so you'll need to switch to that branch before building. That's why it can't find FLAC__StaticEncoder, the master branch is probably some old snapshot of the xiph repo with no static implementation.

On my PC the flac and flaccid repo's are in the same directory, and the build script is in the flaccid dir, so `-I../flac/include` includes flac's headers when building flaccid. Straddling repo's like this is isn't portable but it works, for now it saves having to learn however repo dependencies work. My entire build command for flaccid is this:

gcc flaccid.c gasc.c gset.c load.c chunk.c fixed.c common.c peakset.c seektable.c -oflaccid -I../flac/include '/home/u20/Documents/mountain/runtime/flac/src/libFLAC/.libs/libFLAC-static.a' -logg -lm -fopenmp -Wall -O3 -funroll-loops  -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline  -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -lcrypto -DUSE_OPENSSL

This builds but has not been tested. A few commits made months ago have now been pushed to the public repo so you can try to compile with the same code I just compiled with.

I do plan on eventually updating flaccid to the latest flac and implementing the few things it needed to make it less clunky in general, but how best to do that may need a rethink now that libflac can multithread itself. Let me know how you get on.

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-26 13:48:56

Thank you for your fast answer!

I should have checked for another branch than the main. The errors are now gone, and I managed to build flaccid for Linux.

The next step will be to build it with PGO (profile guided optimization) and statically. If I manage to do that (I should), I'll share static Linux builds for x86_64 and aarch64.
I'll also try to compile for Windows using mingw, but no promise here (and no PGO).

Quote from: cid42 on 2023-07-26 00:21:28

I do plan on eventually updating flaccid to the latest flac and implementing the few things it needed to make it less clunky in general, but how best to do that may need a rethink now that libflac can multithread itself. Let me know how you get on.

Updating to the latest flac would be great yes. If I'm not mistaken, the current fork of chocolate42 is somwhere at version 1.4.2, while up-to-date flac is 1.4.3 (+ some other commits). I don't think there's anything related to compression improvements in that new release.

Of course supporting completely WAV input is a plus, but imo not really needed (encoding to FLAC can be really fast, and in general my own audio collection is already stored in FLAC - I just compress them more when I have the time).

Also, currently someone is writing a PR (still a draft) to add multithreading support in `flac` (I did not really looked into it, so I don't know if it's just to encode simultaneously all the input audio files or something else). https://github.com/xiph/flac/pull/634

I discovered your good `flaccid` thanks to an issue I wrote, asking if variable block size encoding could be added in the Xiph's flac encoder. https://github.com/xiph/flac/issues/639

I quite like your utility, and we can really use whatever we want (no restriction on the number of bruteforced blocksizes, ...).
Thank you for making that!

Title: Re: Multithreading
Post by: Porcus on 2023-07-26 13:59:33

Quote from: TheBigBadBoy on 2023-07-26 13:48:56

Also, currently someone is writing a PR (still a draft) to add multithreading support in `flac`

This thread: https://hydrogenaud.io/index.php/topic,124437

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-27 23:29:55

Finally finished my build! :D

So, this is a build: Linux x86-64, generic (mtune), static, optimized (O3 flto fprofile-instr-gen), stripped, UPX'd (compressed with -9 --ultra-brute)
Compiled with sources from yesterday, using clang 15.0.7

I will definitely not build a Windows version, but aarch64 build will come when I need to use it (or when someone asks for it).

Command used for compiling (PGO gen):

 clang -static -flto -s -fprofile-instr-generate -fcoverage-mapping -mtune=generic flaccid.c gasc.c gset.c load.c chunk.c fixed.c common.c peakset.c seektable.c -oflaccid -I../flac/include ../libFLAC-static.a ../libogg.a ../libcrypto.a -lm ../libomp.a -Wall -O3 -funroll-loops -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -DUSE_OPENSSL

I compiled everything myself (static libs like OMP too, using `clang -flto -O3 -mtune=generic`).

Command used for profiling data:

./flaccid --in liberate.flac --lax --out liberate_VarSizeBruteForced.flac --peakset-window 48 --preserve-flac-metadata --queue 8192 --workers 16 --tweak 1 --merge 0 --analysis-apod 'subdivide_tukey(5);gauss(0.1);tukey(5);partial_tukey(5);punchout_tukey(5);welch;bartlett;bartlett_hann;blackman;blackman_harris_4term_92db;connes;flattop;hamming;hann;kaiser_bessel;nuttall;rectangle;triangle' --output-apod 'subdivide_tukey(5);gauss(0.1);tukey(5);partial_tukey(5);punchout_tukey(5);welch;bartlett;bartlett_hann;blackman;blackman_harris_4term_92db;connes;flattop;hamming;hann;kaiser_bessel;nuttall;rectangle;triangle' --analysis-comp mepl32r15 --output-comp mepl32r15 --mode peakset --blocksize-list 512,1024,1536,2048,2560,3072,3584,4096,4608

on a 5 seconds FLAC file (44.1kHz stereo s16)

I'm quite happy with the output FLAC file: compared to fixed 4096-blocksize with high compression settings using `flac`, `flaccid` managed to save 0.5% of filesize!
To even save more I could have increased the numbers given to all tukey functions and add more blocksize to bruteforce, but this was just a test on a single file in order to build with PGO.

Since I wanted a slow encode (took 3000 seconds - take into account it was profiling data), I based my command on this comment https://hydrogenaud.io/index.php/topic,123248.msg1022236.html#msg1022236. I must say that it's quite unclear for me what the values --merge and --tweak do (even reading that comment and the help message from flaccid).

Even if I set workers to 16, it never reached 200% CPU usage. Perhaps because it was profiling data?

Title: Re: Multithreading
Post by: Replica9000 on 2023-07-28 00:17:36

Quote from: TheBigBadBoy on 2023-07-27 23:29:55

Finally finished my build! :D

So, this is a build: Linux x86-64, generic (mtune), static, optimized (O3 flto fprofile-instr-gen), stripped, UPX'd (compressed with -9 --ultra-brute)
Compiled with sources from yesterday, using clang 15.0.7

I will definitely not build a Windows version, but aarch64 build will come when I need to use it (or when someone asks for it).

Command used for compiling (PGO gen):
Code: [Select]
 clang -static -flto -s -fprofile-instr-generate -fcoverage-mapping -mtune=generic flaccid.c gasc.c gset.c load.c chunk.c fixed.c common.c peakset.c seektable.c -oflaccid -I../flac/include ../libFLAC-static.a ../libogg.a ../libcrypto.a -lm ../libomp.a -Wall -O3 -funroll-loops -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -DUSE_OPENSSL
I compiled everything myself (static libs like OMP too, using `clang -flto -O3 -mtune=generic`).

Command used for profiling data:
Code: [Select]
./flaccid --in liberate.flac --lax --out liberate_VarSizeBruteForced.flac --peakset-window 48 --preserve-flac-metadata --queue 8192 --workers 16 --tweak 1 --merge 0 --analysis-apod 'subdivide_tukey(5);gauss(0.1);tukey(5);partial_tukey(5);punchout_tukey(5);welch;bartlett;bartlett_hann;blackman;blackman_harris_4term_92db;connes;flattop;hamming;hann;kaiser_bessel;nuttall;rectangle;triangle' --output-apod 'subdivide_tukey(5);gauss(0.1);tukey(5);partial_tukey(5);punchout_tukey(5);welch;bartlett;bartlett_hann;blackman;blackman_harris_4term_92db;connes;flattop;hamming;hann;kaiser_bessel;nuttall;rectangle;triangle' --analysis-comp mepl32r15 --output-comp mepl32r15 --mode peakset --blocksize-list 512,1024,1536,2048,2560,3072,3584,4096,4608
on a 5 seconds FLAC file (44.1kHz stereo s16)

I'm quite happy with the output FLAC file: compared to fixed 4096-blocksize with high compression settings using `flac`, `flaccid` managed to save 0.5% of filesize!
To even save more I could have increased the numbers given to all tukey functions and add more blocksize to bruteforce, but this was just a test on a single file in order to build with PGO.

Since I wanted a slow encode (took 3000 seconds - take into account it was profiling data), I based my command on this comment https://hydrogenaud.io/index.php/topic,123248.msg1022236.html#msg1022236. I must say that it's quite unclear for me what the values --merge and --tweak do (even reading that comment and the help message from flaccid).

Even if I set workers to 16, it never reached 200% CPU usage. Perhaps because it was profiling data?

I gave this a quick try. This is the syntax I used.

flaccid --workers 8 --preset 8p --in file.wav --out file.flac

The workers option seems to be ignored, and flaccid seg faults on large files.

Title: Re: Multithreading
Post by: ktf on 2023-07-28 06:35:03

Quote from: TheBigBadBoy on 2023-07-27 23:29:55

So, this is a build: Linux x86-64, generic (mtune), static, optimized (O3 flto fprofile-instr-gen), stripped, UPX'd (compressed with -9 --ultra-brute)
[...]
Command used for compiling (PGO gen):

From my experience, using PGO and/or LTO for compiling libFLAC shrinks the binary size and slows down encoding instead of speeding it up. I haven't seen a build where either (or both) improve speed.

Title: Re: Multithreading
Post by: cid42 on 2023-07-28 09:34:50

Thank you for a static Linux build, doubt I could have figured that out.

Quote from: TheBigBadBoy on 2023-07-27 23:29:55

--analysis-comp mepl32r15 --output-comp mepl32r15

You almost certainly want to prepend 8 to these options, the first character if it's '0' to '8' defines the compression level. Without defining 8 you're using libflac's default of 5.

Quote from: TheBigBadBoy on 2023-07-27 23:29:55

I must say that it's quite unclear for me what the values --merge and --tweak do (even reading that comment and the help message from flaccid).

A value of 0 disables them, otherwise it's the minimum number of bytes a pass over the queue needs to save to trigger another pass. Merge/tweak are iterative over the queue, the bigger the queue the easier it is for a pass to hit the byte target. Each successive pass tends to save less bytes than the last, a larger value means you're trying to target only the low hanging fruit for some cheap gains, a 1 means you want to minimise filesize at any cost.

Quote from: Replica9000 on 2023-07-28 00:17:36

The workers option seems to be ignored, and flaccid seg faults on large files.

How large and was it a seg fault or was an error message given? As far as I remember proper streamed read was implemented, it shouldn't be reading the entire file into memory before encoding or anything like that. If it was a large wav file then that might explain it, wav support currently is poor.

The last commit which I pushed after TheBigBadBoy first tried compiling the code fixed a regression in fixed mode where it only ran single-threaded. They may not have updated to include that commit before building the binary. Alternatively something else may have got lost in the shuffle of refactoring/tidying/minimising which were some of the last things I did in feb.

Title: Re: Multithreading
Post by: Replica9000 on 2023-07-28 19:15:35

Quote from: cid42 on 2023-07-28 09:34:50

Quote from: Replica9000 on 2023-07-28 00:17:36
The workers option seems to be ignored, and flaccid seg faults on large files.
How large and was it a seg fault or was an error message given? As far as I remember proper streamed read was implemented, it shouldn't be reading the entire file into memory before encoding or anything like that. If it was a large wav file then that might explain it, wav support currently is poor.

The last commit which I pushed after TheBigBadBoy first tried compiling the code fixed a regression in fixed mode where it only ran single-threaded. They may not have updated to include that commit before building the binary. Alternatively something else may have got lost in the shuffle of refactoring/tidying/minimising which were some of the last things I did in feb.

The WAV files are over 1GB. I also tried with the file already compressed to FLAC (629MB).

./flaccid --in NIN_The_Fragile.flac --out NIN_The_Fragile-fl.flac --workers 8 --preset 8
Segmentation fault

I end up with an incomplete 410MB file.

This is what I get at the end of strace

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7f46b207a000} ---
+++ killed by SIGSEGV +++

I was gonna attempt to build a copy myself, but not having much luck at the moment.

./common.h:34:29: error: unknown type name 'FLAC__StaticEncoder'; did you mean 'FLAC__StreamEncoder'?

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-28 21:01:54

Quote from: ktf on 2023-07-28 06:35:03

From my experience, using PGO and/or LTO for compiling libFLAC shrinks the binary size and slows down encoding instead of speeding it up. I haven't seen a build where either (or both) improve speed.

Mmmmh... I highly doubt that PGO will slow down the encoding.
For LTO I don't really know, but I'll only trust numbers ^^

Quote from: cid42 on 2023-07-28 09:34:50

Thank you for a static Linux build, doubt I could have figured that out.

No problem. If you want I can also share the 4 static libraries needed to compile flaccid statically.
Once you have them it's just plug and play.

Quote from: cid42 on 2023-07-28 09:34:50

You almost certainly want to prepend 8 to these options, the first character if it's '0' to '8' defines the compression level. Without defining 8 you're using libflac's default of 5.

In the flac manual, we can read

       -8, --compression-level-8
              Synonymous with -l 12 -b 4096 -m -r 6 -A subdivide_tukey(3)

I override them all with the string "mepl32r15" (and using apod-anal + out), so shouldn't change a thing even if I added '5' or '8' at the beginning (at least in `flac`, don't know about `flaccid`).

Thanks for explanations about tweak and merge.

Quote from: Replica9000 on 2023-07-28 19:15:35

I was gonna attempt to build a copy myself, but not having much luck at the moment.
Code: [Select]
./common.h:34:29: error: unknown type name 'FLAC__StaticEncoder'; did you mean 'FLAC__StreamEncoder'?

@Replica9000 You got the same error I had :))
You simply need the static_encoder branch (instead of the main): https://github.com/chocolate42/flac/tree/static_encoder

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-28 21:03:50

Oh forgot a thing:

Quote from: cid42 on 2023-07-28 09:34:50

The last commit which I pushed after TheBigBadBoy first tried compiling the code fixed a regression in fixed mode where it only ran single-threaded. They may not have updated to include that commit before building the binary. Alternatively something else may have got lost in the shuffle of refactoring/tidying/minimising which were some of the last things I did in feb.

I have included that commit before compiling ^^
I kept the sources, and `git pull` says everything (flac mod + flaccid) are up to date.

Title: Re: Multithreading
Post by: cid42 on 2023-07-28 21:14:49

@TheBigBadBoy: That's true, if you've covered everything a preset will do it shouldn't matter. flaccid uses the libflac api and just makes sure to call the preset first if present before any manual options.

I wonder how widely the static binaries you've made can actually be used, Linux or more specifically glibc tends to be a pain being forwards not backwards compatible even with static builds. It's why projects like Prime95 use ancient versions of CentOS to build their static Linux binaries, searching online that does seem to be the common solution.

@Replica9000: Thank you for details about the seg fault, it might help me be able to reproduce and fix.

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-28 21:26:35

@cid42 idk about the static libs, it was just long to read how to build them properly and wait for them to build...
Right now they should be usable on any x86-64 Linux (I think).
I first tried to search that online, but found absolutely nothing (and Arch Linux does not provide any static libs with pacman, while Termux does - for aarch64/Android).

I tested flaccid again (this time without profiling data), with a faster encode `--mode peakset --blocksize-list 576,1152,1728,2304,2880,3456,4032,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0 --workers 16`
Was quite fast, FLAC input was 4 min 36 sec 44.1kHz stereo s16, encoded in only 386 sec (6 min 26 sec).
I lookedcarefully this time, and **CPU usage never reached more than 100%** !
(Using 16 threads would result in 1600% CPU usage)

Anyway, if you need help to build stuffs or try some other commit don't hesitate to tag me.

I fully support this flaccid project. I like it, please continue ^^

Title: Re: Multithreading
Post by: cid42 on 2023-07-28 22:30:27

For some reason the static build is just not multithreaded, my normal build is multithreaded. Tried fixed/gasc/peakset modes (easy enough with presets 8/9/10) on a quad core with 4 workers, all pegged at 100% with the static build and 400% with the normal build.

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-28 22:33:57

Thanks for letting me know, I'll look at it tomorrow. The cause is surely OMP, but it is quite weird I managed to compile flaccid without any warning or errors related to OMP.

Title: Re: Multithreading
Post by: cid42 on 2023-07-30 18:50:30

Fixed the seg fault and pushed to repo, the problem was underallocating working memory when storing potential seekpoints. With an average blocksize of 4096 it would start using unallocated memory at ~71 minutes 44.1KHz input and quickly seg fault, when the code assumed enough space was reserved for ~90 minutes. Always use sizeof.

Until further notice avoid gasc mode (including preset 9), testing with a large file uncovered something causing corruption which may or may not be related to using a large file. Maybe there's something off about queue handling or something. If a fix isn't obvious gasc will probably soon be removed temporarily (or permanently if gasc can't be made more useful).

Now that libflac can multithread there's no point mirroring presets 0 to 8. I'm tempted to make flaccid's -0 the same as flac's -8p to set expectations, make presets >0 progressively stronger peakset settings, and maybe <0 progressively weaker variable encodes that may overlap in strength/speed with flac's presets.

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-30 20:49:28

Thanks for the update!

I managed to get OMP working, and now `flaccid` is using around 1550% CPU usage with 16 (resulting in 5x faster encoding compared to single threaded), good to see ^^ .

Same specs as before: Linux x86-64, generic (mtune), static, optimized (O3 flto fprofile-instr-gen), stripped, UPX'd (compressed with -9 --ultra-brute).
This build has already the latest update (segfault on large files *fixed*).

But this time, I have a Warning:

Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

in other words, some code (EDIT: not libFLAC) uses `dlopen` (aka dynamic library open), so this static build is not really static (as would be any static build with libFLAC code).

So idk if people using my build need the same library version as me (libc.so.6) or the same glibc version as me (2.37-3).
Anyway, anyone already have glibc installed, I just don't know if glibc is updated often/rarely :/

Title: Re: Multithreading
Post by: ktf on 2023-07-30 20:57:31

Quote from: TheBigBadBoy on 2023-07-30 20:49:28

But this time, I have a Warning:
Code: [Select]
Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
in other words, libFLAC's code uses `dlopen` (aka dynamic library open)

No, it doesn't. I assume that dlopen is somewhere else, maybe in the OpenMP code, the gcc libs or libc?

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-30 21:22:17

Oh yeah, my bad...
`grep` found matches with the text "dlopen" in my static OMP library (libomp.a).

And it also explains why I didn't have this warning when I first compiled (since OMP didn't work).

I have to admit that I don't know if there's a solution to "remove" this `dlopen()`, and if there is, how to find it.
I think I'll leave it at that (unless someone knows how).
Of course, if need be, I can rebuild it.

Title: Re: Multithreading
Post by: cid42 on 2023-07-30 21:50:04

dlopen issues sounds like libc. A common solution is to use a relatively ancient version of glibc from for example an ancient version of centos (it and likely future versions of glibc should be able to run it). Or maybe musl would be a good way to go now.

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-07-30 22:00:28

@cid42 I don't think so, because it's not the first time I compile statically an executable (on the other hand, it's the first time I use a static version of OMP).

You can see here that OMP is using `dlopen()`: https://github.com/llvm/llvm-project/blob/858a2865d3008b35f22597a411b2b4f7110aaa15/openmp/runtime/src/kmp_alloc.cpp#L1277 (one example among others)

Also, I only needed to compile 4 static libs in order to make a static build of flaccid: libcrypto.a, libFLAC-static.a, libogg.a, libomp.a.
The problem is really OMP, and even an "old" version from 2020 already has `dlopen()` calls. https://github.com/llvm-mirror/openmp

Title: Re: Multithreading
Post by: Replica9000 on 2023-07-31 11:06:43

The new flaccid binary @TheBigBadBoy provided seems to work well with the larger files. For the simple presets, even 9 and 9p seems to work (no corruption in my tests so far). But when I was getting into the complex settings, I found an error with merge when used in combination with queue set to anything above 256.

./flaccid --in NIN_The_Fragile.wav --out NIN_The_Fragile-fl.flac --workers 16 --mode peakset pr7l12m --queue 8192 --merge 1 --tweak 0
Processed 1/7
Processed 3/7
Processed 7/7
merge(1) saved 6721 bytes with 423 merges                               Processed 1/7                                                           
Processed 3/7                                                           
Processed 7/7
merge(1) saved 2604 bytes with 152 merges
Processed 1/7
Processed 3/7
Processed 7/7
merge(1) saved 2764 bytes with 160 merges
Processed 1/7
Processed 3/7
Processed 7/7
merge(1) saved 3256 bytes with 189 merges
Processed 1/7
Processed 3/7
Processed 7/7
Processed 1/7
Processed 3/7
Processed 7/7
merge(1) saved 5227 bytes with 275 merges
flaccid: common.c:128: void simple_enc_encode(simple_enc *, flac_settings *, input *, uint32_t, uint64_t, int, stats *): Assertion `samples' failed.
Aborted

Title: Re: Multithreading
Post by: cid42 on 2023-07-31 15:43:25

Thanks, that assertion implies that 0 blocksize frames are trying to be encoded, which used to be a definite no but now are possible as the queue is implemented as an array. If it is what I think then it occurs because analysis and output settings are different and it just never got caught in testing. Strange that queue size matters, strange that it doesn't fail at the first instance of a 0 blocksize encode. There might be something deeper going on.

Title: Re: Multithreading
Post by: Replica9000 on 2023-07-31 16:14:49

Even with the same input file and options used, it doesn't fail at the same point. Sometimes it fails early into the file, sometimes later. There was even a couple successful runs.

Title: Re: Multithreading
Post by: cid42 on 2023-07-31 16:29:53

Only with merge right? I think I've found a flaw in the logic.

edit: Another classic memory handling error, introduced at some point during the transition to an output queue and more commonly triggered by longer files.

Title: Re: Multithreading
Post by: Replica9000 on 2023-07-31 16:50:31

Only when using both merge and queue. Anytime I remove one of those two options, there's no errors.

Title: Re: Multithreading
Post by: Replica9000 on 2023-08-05 04:03:05

Built a static Linux 64-bit binary using libFLAC git-31ccd3df. Targeted for AVX2 capable CPUs. Also tried my hand at a static Win64 build.

Title: Re: Multithreading
Post by: cid42 on 2023-08-10 13:06:17

As of the latest commit:

Fixed the merge bug, caused by a race condition when multithreading. A variable was erroneously shared by threads, rarely a merge occurred but the counter didn't update, if a pass ended with the counter at zero when it should have been >0 then kaboom. Seems to have been triggered more often with larger files and larger queues because the extra rope gave it more chance to hang itself
Reviewed the rest of the multithreading code, there's not much of it as most of it was combined into the queue code with only some short snippets in gset/gasc/peakset. It all seems sound
Updated the (still convoluted and dense) help/readme to answer questions TheBigBadBoy and Replica9000 recently had

I'll leave updating to the latest libflac until libflac multithreading is merged or there's a major compression benefit. Right now flaccid is 70 commits behind but there's nothing too relevant AFAICT. Haven't revisited gasc yet.

Title: Re: Multithreading
Post by: Replica9000 on 2023-08-14 03:22:55

Static 64-bit binaries of flaccid git-ce72e568 using libFLAC git-31ccd3df for AVX2 capable CPUs.

Linux file = 696.20 KB
Windows file = 404.21 KB

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-08-14 22:33:49

Thanks for the update cid42!
And yes, updating the libFLAC version used by flaccid is not really usefull (for now), as there would be no real benefit (the main "things" of flaccid being multithreading and compression).

@Replica9000 Thanks for the builds. There are matches of "dlopen" (dynamic library open) in your Linux build, which was quite intended (like mine).
But I just built it statically on Termux for aarch64 CPU architecture, and surprisingly I got no warning about dlopen when compiling (no strict glibc version to have), and there is no match for dlopen in the archive.
So I guess there's a way to have the same behaviour for a x86-64 executable, but I need to dig in more.

Title: Re: Multithreading
Post by: Replica9000 on 2023-08-15 01:15:12

Quote from: TheBigBadBoy on 2023-08-14 22:33:49

@Replica9000 Thanks for the builds. There are matches of "dlopen" (dynamic library open) in your Linux build, which was quite intended (like mine).
But I just built it statically on Termux for aarch64 CPU architecture, and surprisingly I got no warning about dlopen when compiling (no strict glibc version to have), and there is no match for dlopen in the archive.
So I guess there's a way to have the same behaviour for a x86-64 executable, but I need to dig in more.

I only get one warning about dlopen when compiling:

warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

But it doesn't seem the static binary is calling for glibc

readlinkat(AT_FDCWD, "/proc/self/exe", "/home/replica/user-data/builds/f"..., 4096) = 74
openat(AT_FDCWD, "/sys/devices/system/cpu/possible", O_RDONLY|O_CLOEXEC) = 3

vs a dynamic binary

openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libgtk3-nocsd.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libogg.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libmvec.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libgomp.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/sys/devices/system/cpu/possible", O_RDONLY|O_CLOEXEC) = 3

So isn't dlopen just making internal calls to the libs statically built into the binary?

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-08-16 10:22:07

I don't really know.

It's the first time I deal with dlopen (used in the static OMP library).

At least the executable working properly :)

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-12-26 18:38:49

Quote from: cid42 on 2023-02-06 23:08:04

Probably not worth it
--mode peakset --blocksize-list 576,1152,1728,2304,2880,3456,4032,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0

@cid42 so I tried really really slow settings for fun on a 6 seconds file (I attached it).

Several times I get an "Error: Init failed" (this error is defined in common.c in flaccid) when using slow settings.
For example if I use this command:

flaccid --in 12_-_Napalm_Death_-_You_Suffer.flac --lax --out out.flac --preserve-flac-metadata --queue 8192 --workers 13 --tweak 1 --merge 0 --analysis-apod subdivide_tukey\(3\) --output-apod subdivide_tukey\(6\) --analysis-comp mepl32r15 --output-comp mepl32r15 --mode peakset --blocksize-list 256,512

Ideally I would use more blocksizes in the list (something like `seq -s, 256 256 5120`), but I shortened it to get the error faster (exact same output).
Here is the output:

(null)
Processed 1/3
Processed 3/3
Error: Init failed

With that same command, if I limit myself to fewer blocksizes (and bigger step) with `seq -s, 512 512 5120`, it works completely fine and output is created in 155 seconds (not too long if you want to test).

So I guessed flaccid could only work with blocksize >= 512, so I tried different settings (only removed `--queue` arg):

flaccid --in 12_-_Napalm_Death_-_You_Suffer.flac --lax --out out.flac --preserve-flac-metadata --workers 13 --tweak 1 --merge 0 --analysis-apod subdivide_tukey\(3\) --output-apod subdivide_tukey\(6\) --analysis-comp mepl32r15 --output-comp mepl32r15 --mode peakset --blocksize-list $(seq -s, 256 256 512)

Still has a blocksize of 256, and "Error: Init failed" but different output:

(null)
Processed 1/3
Processed 3/3
tweak(1) saved 20 bytes with 6 tweaks
# ~100 lines like this; then
Error: Init failed

Got this error after 73 seconds. Output FLAC file was almost done: 50ms shorter than input (only one frame is missing ?), MD5 not computed and 2 placeholders in the seektable.

You surely guessed it, I wrote this message in the hope that the error can be fixed.
I also have a question: why is it needed to have all the blocksizes (to bruteforce) multiples of the first one of the list ?
It is surely easier to implement and parallelize, but I find a waste of time when dealing with blocksize steps of 32, 64, 128 (since blocksize from 32 to ~1024 are completely useless to bruteforce).

Many thanks for your software and time.

Title: Re: Multithreading
Post by: Replica9000 on 2023-12-27 00:25:59

I'm seeing the same thing here. I did some limited testing. In combination with the --lax option, I can reproduce if the blocksize list has any combination that can equal 256. If the blocksize list has anything smaller or larger than 256, but doesn't equal 256, there's no error.

Title: Re: Multithreading
Post by: cid42 on 2023-12-27 13:27:58

Quote from: TheBigBadBoy on 2023-12-26 18:38:49

You surely guessed it, I wrote this message in the hope that the error can be fixed.

Thank you for making it easy by providing the exact file and settings to reproduce, the latest commit should now work. It looks to be the small (4 sample) partial frame at the end that fails, it looks like it might be failing because the strong lax settings provided can't technically create valid output with such a small frame. I've added a catch-all to the init_static_encoder function that detects when initialisation fails, and tries again with less aggressive settings (presumably ./flac does something similar or the different way it handles the last frame means it doesn't encounter the problem). Not the most elegant solution but it should catch edge cases including this. I can't guarantee that all edge cases are now working, or even that there isn't an underlying problem yet to resolve as this catch-all may just be masking it.

Quote from: TheBigBadBoy on 2023-12-26 18:38:49

I also have a question: why is it needed to have all the blocksizes (to bruteforce) multiples of the first one of the list ?

It means that regardless of the chosen blocks in a given chain, they always start and end on a discrete boundary that's a multiple of the smallest blocksize. Unconstrained bruteforce is not feasible and an intelligent algorithm that competes well has not been found, peakset is a reasonable compromise.

Title: Re: Multithreading
Post by: TheBigBadBoy on 2023-12-28 17:12:45

Thanks for the quick update!

It indeed fixes the issue.

I built a static executable once again. Same specs as before: Linux x86-64, generic, static, PGO LTO optimized (O3 flto fprofile-instr-gen), stripped, UPX'd (compressed with -9 --ultra-brute).

Title: Re: Multithreading
Post by: Kraeved on 2024-04-19 20:21:48

Quote from: Replica9000 on 2023-08-05 04:03:05

Built a static Linux 64-bit binary using libFLAC git-31ccd3df. Targeted for AVX2 capable CPUs. Also tried my hand at a static Win64 build.

Dear @Replica9000, if possible, consider building a Windows version without the need for AVX instructions.

(https://i4.imageban.ru/out/2024/02/25/ef921b2d6df14d10ceb8beb722f16200.png)

Title: Re: Multithreading
Post by: Replica9000 on 2024-04-20 00:55:20

Quote from: Kraeved on 2024-04-19 20:21:48

Quote from: Replica9000 on 2023-08-05 04:03:05
Built a static Linux 64-bit binary using libFLAC git-31ccd3df. Targeted for AVX2 capable CPUs. Also tried my hand at a static Win64 build.

Dear @Replica9000, if possible, consider building a Windows version without the need for AVX instructions.

(https://i4.imageban.ru/out/2024/02/25/ef921b2d6df14d10ceb8beb722f16200.png)

My test setup is Win 7 x64 in a VM, so I'm not sure how well these perform on real hardware. I included 32-bit and 64-bit, with an asm/noasm version of each. On my Linux PC, the no asm builds perform faster for 16-bit files, but the opposite seemed to be the case in the VM. Flaccid only supports 16-bit, but the couple 24-bit files I tried encoded fine.

Title: Re: Multithreading
Post by: cid42 on 2024-04-20 14:06:07

Flaccid only supports 16 bit wav/raw input as the wav formats are many and I never quite got the wav library working. But it should support arbitrary flac input (iirc), so a hack to encode any supportable wav would be pipe/convert to flac first. Not ideal but workable. Might revisit when flac gets a major update, for now just busy with other things.

Title: Re: Multithreading
Post by: Kraeved on 2024-04-20 23:36:30

@Replica9000, unfortunately, binaries from your archive crash upon launch on my end.
Are you sure they do not need SSE4 or something else that I might not have (https://hydrogenaud.io/index.php/topic,123248.msg1043262.html#msg1043262)?

(https://i1.imageban.ru/out/2024/04/21/8ff45e95cc793cd5244d0f85953fd637.png)

Title: Re: Multithreading
Post by: Replica9000 on 2024-04-21 02:25:02

@cid42 The files I used were indeed 24-bit FLAC. I read the github page a while back remembering something about 16-bit only. I forgot that it applied to only wav/raw pcm.

Quote from: cid42 on 2023-02-18 22:42:22

....some brave soul can convert their entire collection if they wish. I recommend checking the result with flac -t before you delete the originals, this is still alpha you know ;)

A while back I started recompressing my library using more aggressive settings, updating tags and embedded album art. About half way through I started using flaccid. I've used it on thousands of tracks and no issues. :)

@Kraeved I had used a generic CPU flag. I recompiled these binaries with the core2 flag.

Title: Re: Multithreading
Post by: Kraeved on 2024-04-21 03:29:05

Thank you for the opportunity to get acquainted with Flaccid.

* hyperfine (https://github.com/sharkdp/hyperfine).exe --warmup 3 --runs 3 "commands"
* flac.exe x64 20240309, flake.exe x86 from CUETools 2.2.5
* in.wav, 44100 Hz 16 bit stereo, 99 423 788 bytes

  Command                                                           Mean time [s]        
 ---------------------------------------------------------------- ---------------- 
  flac -7 -f in.wav -o out.flac                                     8.716 ± 0.047        
  flaccid_w64 --preset 7 --in in.wav --out out.flac                 9.208 ± 0.176        
  flake -7 -f in.wav -o out.flac                                   10.496 ± 0.182       
  flaccid_w64_noasm --preset 7 --in in.wav --out out.flac          11.274 ± 0.041       
  flaccid_w32 --preset 7 --in ind.wav --out out.flac               16.057 ± 0.038       
  flaccid_w32_noasm --preset 7 --in ind.wav --out out.flac         17.986 ± 0.184

Title: Re: Multithreading
Post by: Replica9000 on 2024-04-21 04:55:20

It looks like the binaries without asm optimizations are slower for you too. Maybe it's something to do with Windows or the CPU flag used.

On my system:

./flac -8p in.wav = 53.615s
./flac_noasm -8p in.wav = 44.648s
./flaccid --preset 8p --in in.wav --out out.wav = 53.606s
./flaccid_noasm --preset 8p --in in.wav --out out.wav = 44.261s

Edit: Binaries compiled with core2 CPU flag. Maybe no asm optimizations is only beneficial with AVX or better.

./flac_core2 -8p in.wav = 55.375s
./flac_core2_noasm -8p in.wav = 2m23.837s

Title: Re: Multithreading
Post by: Replica9000 on 2024-04-23 21:38:50

Builds for flaccid git-777a62f6, using libFlac git-31ccd3df.

zip file is for Windows
tgz file is for Linux.

Title: Re: Multithreading
Post by: The Chief on 2024-05-10 19:17:29

Quote

S:\_Test>"C:\Program Files\flac\flac.exe" -d "02 - Naked Love.flac" -c -s | "C:\Program Files\flac\flaccid_w64.exe" --preset 8ep --workers 8 --in - --input-format wav --out test.flac
Error: Currently piping not supported for wav input

Can it be implemented, please? I want to use flaccid in foobar converter with piping.

Title: Re: Multithreading
Post by: cid42 on 2024-05-11 11:37:58

Quote from: The Chief on 2024-05-10 19:17:29

Quote
S:\_Test>"C:\Program Files\flac\flac.exe" -d "02 - Naked Love.flac" -c -s | "C:\Program Files\flac\flaccid_w64.exe" --preset 8ep --workers 8 --in - --input-format wav --out test.flac
Error: Currently piping not supported for wav input
Can it be implemented, please? I want to use flaccid in foobar converter with piping.

Possibly. Here's the note I left for myself in the code:

	if(strcmp(path, "-")==0){
		//drwav doesn't seem to have convenient FILE* functions
		//so something might have to be done with raw memory
		_("Currently piping not supported for wav input");//TODO
	}

Wav support is flaky in general, even if piping gets added non-16 bit wav support needs work. I'll make some time in the next few days to give it another go, but no promises. Current me is only slightly smarter than past me, lets be honest probably slightly dumber.

If you're only dealing with CDDA then a workaround which should work with the current binary is to pipe the raw samples, --input-format cdda in flaccid, --force-raw-format from ./flac and whatever it is from foobar.

Alternatively, it's a hack but it works for all input ./flac supports, pipe from foobar to ./flac to ./flaccid, this way ./flac deals with the format mess for free and flaccid gets fed flac files which are fully supported. ./flac can use -0 which uses a rounding error of resources compared to flaccid's brute forcing.

(Could the flac pipe hack be built into flaccid to support wav input without drwav? libflac is already a dependency and it might be the path of least resistance)

Title: Re: Multithreading
Post by: The Chief on 2024-05-11 12:20:18

Quote from: cid42 on 2024-05-11 11:37:58

Alternatively, it's a hack but it works for all input ./flac supports, pipe from foobar to ./flac to ./flaccid, this way ./flac deals with the format mess for free and flaccid gets fed flac files which are fully supported. ./flac can use -0 which uses a rounding error of resources compared to flaccid's brute forcing.

Thanx a lot! It solved the problem.

Title: Re: Multithreading
Post by: Replica9000 on 2024-05-11 15:10:39

This works with Bash on Linux

flaccid --preset 8 --input-format cdda --in <(flac -c -s -d Foreward.flac) --out Foreward2.flac

This method doesn't preserve any tags though.

Text only | Text with Images

Powered by ElkArte 1.1.8 | Credits | SimplePortal 1.0.0 © 2008-2024