Multithreading

Re: Multithreading

Reply #50 – 2022-12-30 13:18:35

Quote from: cid42 on 2022-12-30 08:41:21

Yes mostly used -8p as the output setting as an arbitrary lever to not pull, if too many levers are pulled at once a benchmark doesn't really show anything.

Sure, except that you are too modest. Improving over -8p on size and time simultaneously means you are at something which would have been used had it been implemented in the reference encoder (and known well enough).

Re: Multithreading

Reply #51 – 2022-12-30 14:48:08

I agree. Many have tried and failed at getting variable blocksizes to be both faster and smaller than fixed blocksizes. In fact, getting an encoder to output a variable blocksize stream that is consistently smaller is a feat already, even if the encoder is quite a bit slower.

Re: Multithreading

Reply #52 – 2022-12-31 10:51:26

Thanks. All I'm doing is exploring some simple things to a conclusion because it's fun.

Quote from: ktf on 2022-12-30 14:48:08

I agree. Many have tried and failed at getting variable blocksizes to be both faster and smaller than fixed blocksizes.

peakset with analysis 3 output 8 blocklist 1152..4608 is faster always and smaller most of the time, but there's a subset of tracks that do better with fixed -8pb4096 (see round6, ~7.6% of everything I have was smaller with fixed albeit when it was better it was mostly marginal). I think it mainly boils down to variable not being of much benefit when the tracks are predictable (so mostly uses the top blocksize anyway, so the p in fixed -8p runs away with it).

One way to improve could be an adaptable strategy that detects when variable -8 isn't of much benefit and switches to fixed -8p for a stretch. A proper adaptable strategy would be hell to implement but something simple like this might be okay (let's be honest, it would probably hurt at least as much as it helps, so I've already talked myself out of trying it for now).

Another way to passively improve could be to introduce "fractional" output settings between -8 and -8p, by simply using say -8 X% of the time and -8p Y% of the time. If the game is to use some of the spare time available to a variable run to improve efficiency, this may be the best way to do it. It would at least make the input that prefers fixed perform better with variable (if not prefer it). Another benefit would be in allowing the expected time to complete to be tuned much more finely when doing variable encodes (relying directly on flac/mode settings is pretty coarse). This will be the next thing implemented unless something better comes along.

Quote from: ktf on 2022-12-30 14:48:08

In fact, getting an encoder to output a variable blocksize stream that is consistently smaller is a feat already, even if the encoder is quite a bit slower.

If the encoder is allowed to be slower as long as it nearly always produces smaller files then brute force has a massive advantage; the main wrinkle is having to save 3 bytes per frame to account for sample-based addressing. 4096 is near ideal on average, but it shouldn't take much tweaking of a 4096 stream to get the vast majority of content marginally across the line.

Round 5 tl;dr:

Code: [Select]

Size       CPUEst     Settings
870739898  147.56386  gasc 1152..4608 analysis 8 output 8
870746851  120.52655  peakset 1152..4608 analysis 3 output 8
870942164  111.21946  gasc 1152..4608 analysis 7 output 8
872823885  185.98145  fixed -8pb4096
873203809  183.78757  fixed -8pb4608
874088872   93.83656  gasc 1152..4608 analysis 6 output 8

Luckily one track in round 5 performed worse, otherwise the assumption would have incorrectly been that fixed was dead in the water. Round 6 tests wider (>1000 tracks) for a better view:

Code: [Select]

Size         CPUEst      Settings
30886326138  4473.30063  peakset 1152..4608 analysis 3 output 8
30946173364  7032.54908  fixed -8pb4096

Peakset was more space-efficient than fixed ~92.4% of the time, completing in ~65% of the time on average.

Re: Multithreading

Reply #53 – 2023-01-01 13:06:53

Disregard the above round 5, there was a mistake in the gasc implementation that meant the output settings were never triggered (so the above round 5 is analysis x output x not analysis x output 8, no wonder it scaled so strongly with analysis settings). Round 6 is valid.

The corrected round 5 results show that gasc is more compelling:

Code: [Select]

Round5Fixed tl;dr
All variable output -8, all variable blocksizes 1152..4608
* 870166176 499.55594 peakset analysis 8
* 870231352 301.22267 peakset analysis 7
* 870333821 256.93233 peakset analysis 6
* 870500404 170.39666 peakset analysis 5
* 870663610 117.48744 gasc analysis 6
  870712531 129.71610 gasc analysis 7
  870739898 148.85595 gasc analysis 8
  870746851 120.34103 peakset analysis 3
* 870777535  94.49873 gasc analysis 5
* 870935582  78.17366 gasc analysis 3
  871363551 128.79010 peakset analysis 2
* 871431044  76.93183 gasc analysis 2
  871569362 108.65949 peakset analysis 0
* 871658798  70.50157 gasc analysis 0
  872823885 183.95303 fixed -8pb4096

* No stronger setting was faster

So gasc is stronger at quicker settings, peakset is stronger at slower settings but only one of the best-in-show peakset results fulfilled the task of being quicker than fixed -8p.
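The starred rows in the Round5Fixed table can be reproduced mechanically. The following is a minimal sketch, assuming the rule is exactly "no row with a smaller size has a lower CPUEst"; the function name `mark_unbeaten` is made up for illustration and the figures are copied from the table above.

```python
# (size in bytes, CPUEst, settings) copied from the Round5Fixed table.
ROUND5_FIXED = [
    (870166176, 499.55594, "peakset analysis 8"),
    (870231352, 301.22267, "peakset analysis 7"),
    (870333821, 256.93233, "peakset analysis 6"),
    (870500404, 170.39666, "peakset analysis 5"),
    (870663610, 117.48744, "gasc analysis 6"),
    (870712531, 129.71610, "gasc analysis 7"),
    (870739898, 148.85595, "gasc analysis 8"),
    (870746851, 120.34103, "peakset analysis 3"),
    (870777535, 94.49873, "gasc analysis 5"),
    (870935582, 78.17366, "gasc analysis 3"),
    (871363551, 128.79010, "peakset analysis 2"),
    (871431044, 76.93183, "gasc analysis 2"),
    (871569362, 108.65949, "peakset analysis 0"),
    (871658798, 70.50157, "gasc analysis 0"),
    (872823885, 183.95303, "fixed -8pb4096"),
]


def mark_unbeaten(rows):
    """Return the settings for which no stronger (smaller) result was faster."""
    starred = []
    fastest_so_far = float("inf")
    # Walk from the smallest output to the largest, tracking the best time seen.
    for size, cpu, label in sorted(rows):
        if cpu < fastest_so_far:
            starred.append(label)
            fastest_so_far = cpu
    return starred


starred_settings = mark_unbeaten(ROUND5_FIXED)
```

With the table's figures this marks the same nine rows that carry a star above, and leaves fixed -8pb4096 unstarred.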

Implemented fractional output settings for gasc (not for peakset yet, that requires some heavy refactoring and I'm about to run out of free time for the next several weeks). Fractional output can be used whenever analysis settings are different to the standard output settings. Settings of --comp-output 8 --comp-outputalt 8p --outperc 60 use -8 60% of the time, -8p 40% of the time.

Repeating the above tests but with all gasc encodes using fractional output to try and take roughly the same amount of time as a fixed -8p encode gives these results:

Code: [Select]

Round 7 tl;dr out 8 outalt 8p
  870337444 180.93322 gasc analysis 5 outperc 27
  870369148 176.91559 gasc analysis 6 outperc 50
  870431237 177.77482 gasc analysis 3 outperc 17
  870464232 179.80446 gasc analysis 7 outperc 57
* 870500404 170.39666 peakset analysis 5
  870905130 175.96778 gasc analysis 2 outperc 17
  871104714 176.42821 gasc analysis 0 outperc 10
* 872823885 183.95303 fixed -8pb4096

* Results from round5fixed for comparison

So the top gasc+outperc results can beat peakset 5, at least for this small corpus. The bigger corpus is running but it'll take a while, as gasc is not multithreaded and cannot easily be made so (at least not purely: well-performing multithreading can, and probably will have to, be done by chunking; a reasonably well-performing pure dual-core implementation is possible but not worth the hassle).

Attempts could be made to smartly choose when to use stronger output settings with frac-output, but it's not clear how to go about that. The current implementation is indiscriminate, deterministic and evenly spread, and that's probably how it will stay unless there's an objective win to a different strategy (that's also easy to implement).
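One way an "indiscriminate, deterministic and evenly spread" choice could look is an error accumulator, as in the sketch below. This is an assumption about the general shape only, not flaccid's actual code; `pick_output_settings` and the labels "primary"/"alt" are hypothetical names standing in for --comp-output and --comp-outputalt.

```python
def pick_output_settings(frame_count, outperc):
    """Return one label per output frame: 'primary' or 'alt'.

    outperc is the integer percentage (0..100) of frames that should get the
    primary output setting. An accumulator spreads the choices evenly, and
    the result depends only on the arguments, so it is deterministic.
    """
    if not 0 <= outperc <= 100:
        raise ValueError("outperc must be in 0..100")
    choices = []
    accumulator = 0
    for _ in range(frame_count):
        accumulator += outperc
        if accumulator >= 100:
            accumulator -= 100
            choices.append("primary")
        else:
            choices.append("alt")
    return choices
```

Over any run of 100 frames this hands exactly outperc frames to the primary setting, without looking at the audio at all, which matches the "indiscriminate" description.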

Re: Multithreading

Reply #54 – 2023-01-01 15:15:00

Quote from: cid42 on 2022-12-31 10:51:26

One way to improve could be an adaptable strategy that detects when variable -8 isn't of much benefit and switches to fixed -8p for a stretch. A proper adaptable strategy would be hell to implement but something simple like this might be okay (let's be honest, it would probably hurt at least as much as it helps, so I've already talked myself out of trying it for now).

Another way to passively improve could be to introduce "fractional" output settings between -8 and -8p, by simply using say -8 X% of the time and -8p Y% of the time.

For output, the X must be updated by learning for this to make any sense? Just randomizing independently isn't really "useful": those who think -p is worth it will want -p to be applied, right?
If it is for analysis ... then how? Use -p on some block sizes and not on others?

Re: Multithreading

Reply #55 – 2023-01-01 19:55:16

The percentage of output blocks that use X or Y output settings is defined by the user in the same way all the other settings are chosen. As long as both output settings are stronger than the analysis setting it makes sense, either way you're replacing whatever the analysis phase finds with a stronger setting. There are a number of reasons it might be a good idea to add this lever:

It allows a config to be scaled to a different fixed time target relatively easily; it boils things down to a single number that should be somewhat linear in time and space between two points
The way it scales is between two known good flac settings, an arbitrary combination of which should also be good. If you know that A takes 1.2 time and B takes 1.8 time, this single setting lets you tailor the encode time to anything between 1.2 and 1.8
It's much harder (if not impossible in most cases) to choose a singular set of flac settings that sits in time between two other sets of flac settings in a space-efficient way. You need to be an expert and even then you don't have arbitrary granularity and not all settings are made equal
It's relatively fine-grained even in the integer 1-100% way it's implemented. If you want something better than -8 but not as slow as -8p, and p is the most time-efficient way to improve efficiency, this can do that
Different configurations that take the same amount of time to complete are directly comparable in efficiency. Round 7 above did this (I did trial and error using a single track to find the outperc settings to use) so that all the gasc configs take roughly the same time to complete as fixed -8pb4096. This allowed gasc to beat peakset in that time frame which isn't possible without fractional-output
Works just as well with fixed blocksizes
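If encode time really is roughly linear between the two settings, the percentage for a given time target follows directly. A small sketch under that assumption; `outperc_for_target` is a hypothetical helper, not a flaccid option, and in practice the timings would still need measuring on real input, as was done by trial and error for round 7.

```python
def outperc_for_target(time_fast, time_slow, target_time):
    """Percentage of frames to give the faster setting to hit target_time.

    Assumes blended time = share * time_fast + (1 - share) * time_slow,
    which is only an approximation of real encoder behaviour.
    """
    if not time_fast < time_slow:
        raise ValueError("time_fast must be smaller than time_slow")
    if not time_fast <= target_time <= time_slow:
        raise ValueError("target_time must lie between the two timings")
    share_fast = (time_slow - target_time) / (time_slow - time_fast)
    return round(100 * share_fast)
```

For the example above, A at 1.2 and B at 1.8 with a target of 1.5 comes out at half the frames on each setting.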

Another arguably better way to implement fractional-output could be for the user to specify "roughly fixed -8p speed please" with the encoder having a concept of time and choosing output settings accordingly. That would be an easier interface as the same user setting would apply regardless of other settings used, however it would be complicated to implement for a number of reasons including building some concept of time into the encoder and probably losing determinism in the process.

If the game is to pull the optimal levers to maximise space-efficiency for a given time-efficiency, a lever that can scale a given config to better match a time is useful.

Re: Multithreading

Reply #56 – 2023-01-01 21:31:17

First, what it can be good for, if someone is dependent on "be done in time T": fit it for post-processing later.
It is technically possible to "apply -p later", but obviously only by successively reducing the predictor precision. So suppose someone writes a "post-processor" that takes a .flac file, reads each subframe and, subject to the user's choice I guess, does one of two things:
I: starts with the predictor at the actual quantized precision q, then works its way downwards to see if that improves,
or
II: if quantized at q=15, starts out from that and does the -p routine; if quantized at q<15, assumes that it has been done already.

Then the initial run could do as much of this work as the time budget allows for. Then the initial run should be adapted to the choice between I and II:
I: start at 15, run down to q (which may be determined by the time budget!) but stop if it does not improve.
II: leave some at 15 and run the full -p on the others.
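Strategy I for the initial run can be sketched as a downward search with early exit. This is only an illustration of the control flow: `cost_at` is a hypothetical callback returning the encoded size of one subframe at a given coefficient precision, and nothing here is taken from libFLAC.

```python
def search_precision_downwards(cost_at, start=15, floor=1):
    """Try precisions start, start-1, ..., floor; stop at the first non-improvement.

    Returns (best_precision, best_cost). The floor plays the role of q in
    strategy I and could be raised when the time budget is tight.
    """
    best_precision = start
    best_cost = cost_at(start)
    for precision in range(start - 1, floor - 1, -1):
        cost = cost_at(precision)
        if cost >= best_cost:
            break  # no improvement: leave the rest for a later post-processor
        best_precision, best_cost = precision, cost
    return best_precision, best_cost
```

The early exit is what keeps it cheap, at the price of possibly missing a better precision further down.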

Now, why I don't like the idea at all: precisely the linearity in bytes saved per second spent. It goes against the principle of picking the low-hanging improvements first. Say if you want to achieve the speed of -8, you don't run -3 on most of the data and then spend the time saved on running -8ep on a tiny fraction of the data only to get a little bit more than -8 out of that.
So if -p saves B bytes per second and that is acceptable, then take those bytes. If it isn't, then find something between -8 and -8p that is cheaper. Such settings exist.

For an example from the table I posted at https://hydrogenaud.io/index.php/topic,122949.msg1015508.html#msg1015508: suppose you are willing to spend the time equal to using -8 on 80 percent of the data and -8p on 20 percent of it. That would take .8*833 + .2*3003 = 666+601=1267 seconds. The size would be .8*11969604531+.2*11961291433 = 11967941911, which is bigger than using -8 -A "subdivide_tukey(4)".
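The 80/20 arithmetic above can be checked with a few lines. The sketch simply blends the two measurements linearly, using the figures quoted in the post; `blend` is a name made up here.

```python
def blend(value_a, value_b, percent_a):
    """Linear blend of two measurements, with percent_a given in 0..100."""
    return (percent_a * value_a + (100 - percent_a) * value_b) / 100


# -8 on 80 percent of the data, -8p on the remaining 20 percent.
seconds = blend(833, 3003, 80)
size_bytes = blend(11969604531, 11961291433, 80)
```

Here `seconds` comes to 1267 and `size_bytes` to about 11967941911, matching the numbers in the post.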

Re: Multithreading

Reply #57 – 2023-01-01 23:27:01

Ideally the output settings used with frac-out do not surround a setting in time that has a better space-efficiency than the interpolation does at that time. If -8 -A "subdivide_tukey(4)" almost always does fit that bill, then frac-out should probably use that as one of its settings with -8 or -8p (or something even closer) as the other to hit some target within a narrower range.

How many settings are there that can be said to have a strong tendency to have the best space-efficiency for its time, ie there exists no surrounding pair of settings whose interpolated value at that point of time is more space-efficient? If these super settings exist do they tend to be strongly ordered, how much volatility is there in timing and in space-efficiency?

It would be nice if there were a set of super settings that could confidently be applied (either individually or adjacent with frac-out for more granularity) to "normal" output and be happy that it's close to optimal. Like presets but on steroids. I doubt the set if it exists has too many members, many settings search within a range which must introduce at least some volatility, which should limit how close super settings can get to each other.

How obvious is it which apod settings should be better than others? Is there an exhaustive list of apod functions that can be tested with representative input to collate a rule-of-thumb ordered set of super settings?

Re: Multithreading

Reply #58 – 2023-01-02 00:33:53

If you put up a size/time diagram, there should be a "convex" relationship: given two settings A (faster) and B (compresses better), dash a line between them; if a third setting C is between them in size but slower than the line, throw it out. Same if it is between them in time but larger than the line. (Here I disregard the -l affecting decoding speed, FLAC decodes faster than anything.)
At https://hydrogenaud.io/index.php/topic,123025.msg1016761.html#msg1016761 I argue that certain settings should be avoided.
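The "throw it out if it is above the line" rule amounts to keeping the lower convex frontier of the (time, size) points. A minimal sketch using the monotone chain method; the function names are made up, and the sample points in the usage note are illustrative, not measurements.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def efficient_frontier(points):
    """Keep the (time, size) points on the lower convex frontier.

    A point is dropped if it lies on or above the straight line between two
    other kept points, or if it is both slower and larger than a kept point.
    """
    hull = []
    for point in sorted(set(points)):
        # Lower hull: discard the previous point while it sits on or above
        # the line from the point before it to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], point) <= 0:
            hull.pop()
        hull.append(point)
    frontier = []
    for point in hull:
        # Past the smallest size, slower settings only make files larger.
        if not frontier or point[1] < frontier[-1][1]:
            frontier.append(point)
    return frontier
```

For example, with A = (1, 10) and B = (3, 4), a setting C = (2, 8) is discarded because the line from A to B reaches size 7 at time 2, while C = (2, 6) would be kept.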

I'm not sure what you mean by "volatility" here, there is surely quite a lot of variation between signals - but either the decision has to be made before selecting how to encode, or one must learn from the signal during encoding. The "signal analysis" could be a first-round encoding - like you have employed here, using faster settings for analysis.
Some properties can be guessed from the "format"; although some CDDA signals do fare better with -8e than with -8p, that is getting very rare with 1.4.0, ktf & co nearly killed it. But not so for high resolution, where I more often see -8e beating -8p, e.g. my table at https://hydrogenaud.io/index.php/topic,123025.msg1018116.html#msg1018116 (but stacking up with apodization functions typically does even better).

As for the windowing:
Reference flac introduced a lot of possible apodization functions, and the tukey (cosine-tapering the beginning and end) turned out successful after a bit of testing (including here at HA). ktf crafted the partial_tukey and punchout_tukey a decade ago (they leave out portions of the signal) and the subdivide_tukey introduced in 1.4.0 does this in a way that recycles more calculations and covers more possibilities. Saying that subdivide_tukey(N+1) subsumes subdivide_tukey(N) isn't completely true - you have to upscale the tapering parameter by (N+1)/N. And still it will estimate before encoding and could be slightly wrong. But by and large, stepping up the number of those will spend progressively more time searching for the even smaller needles in the haystack. If you want something with time consumption between -A subdivide_tukey(4) and -A subdivide_tukey(5), you can give -A tukey(7e-1);subdivide_tukey(4);flattop or something like that.
Oh, or different taperings. https://hydrogenaud.io/index.php/topic,123025.msg1017245.html#msg1017245 .
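For readers unfamiliar with the tapering being discussed, here is a textbook Tukey (tapered cosine) window. It is a sketch of the general shape only and is not claimed to match libFLAC's tukey(P) sample for sample; `tukey_window` and its `taper` parameter are names chosen for this example.

```python
import math


def tukey_window(length, taper):
    """Textbook Tukey window as a list of floats.

    taper is the fraction of the window that is cosine-tapered, split evenly
    between the start and the end: 0 gives a rectangle, 1 gives a Hann window.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if length == 1 or taper <= 0:
        return [1.0] * length
    edge = min(taper, 1.0) * (length - 1) / 2  # width of each tapered edge
    window = []
    for n in range(length):
        distance = min(n, length - 1 - n)  # distance to the nearest end
        if distance < edge:
            window.append(0.5 * (1 - math.cos(math.pi * distance / edge)))
        else:
            window.append(1.0)
    return window
```

A call such as `tukey_window(4096, 0.5)` leaves the middle half of a 4096-sample block untouched and fades the outer quarters to zero.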

Re: Multithreading

Reply #59 – 2023-01-02 20:53:37

Round 8 tl;dr
1172 tracks, a few dozen dupes from VA compilations. All variable blocksizes 1152:4608

Code: [Select]

Size        CPUEst     TracksGreaterThanFixedCounterpart Settings
35086113631 7703.72447    5 (0.43%)                      gasc analysis 5 output 8 outputalt 8p outperc 27
35086680315 7429.37314    6 (0.51%)                      gasc analysis 5 output 8 subdivide_tukey(4) outputalt 8p outperc 35
35088009462 7582.09801   10 (0.85%)                      gasc analysis 6 output 8 outputalt 8p outperc 50
35089443839 7629.84433    7 (0.60%)                      gasc analysis 3 output 8 outputalt 8p outperc 17
35090596517 7737.50519   18 (1.54%)                      gasc analysis 7 output 8 outputalt 8p outperc 57
35092376807 7042.73055   67 (5.72%)                      peakset analysis 5 output 8
35096064696 5757.87535   63 (5.38%)                      gasc analysis 6 output 8 subdivide_tukey(4)
35096857576 6269.30259   64 (5.46%)                      gasc analysis 7 output 8 subdivide_tukey(4)
35099235041 4687.09807   68 (5.80%)                      gasc analysis 5 output 8 subdivide_tukey(4)
35104983452 4098.46001   94 (8.02%)                      gasc analysis 3 output 8 subdivide_tukey(4)
35165983516 7972.30849    0 (0.00%)                      fixed -8pb4096

The run without subdivide_tukey(4) is slightly ahead but takes a bit longer; you may be right that subdivide_tukey(4) and other intermediate settings are slightly better choices, but this test is a wash. I should have set outperc slightly lower than 35 to get the time targets closer. For that matter, all gasc runs could do with reducing outperc slightly to better match the fixed run, but I was more concerned with making sure none of them exceeded fixed -8p time.

Only a handful of tracks from each of the best runs were bigger than their fixed -8pb4096 counterpart which is a good thing to note, it means that when using good variable settings targeting the same time as a fixed setting there is very little downside to using variable blocksize (at least for -8p and probably all "slower" settings that give the analysis phase enough time to benefit).

Re: Multithreading

Reply #60 – 2023-01-02 23:14:01

The figures for "peakset analysis 5 output 8" are a size saving of 0.2 percent.
What size does plain -8 yield on this corpus? -p doesn't save as much as 0.2? (That is rare in a big corpus.)
Yes I know what you say about pulling too many levers, but again consider the following perspective: If you have found an algorithm that - starting from -8 - triples the size savings of -p in less of a time cost, that is damn impressive - and for an algorithm that isn't even tweaking the apodization functions.
(No reason not to tweak the functions in the final stage, of course.)

If I have understood your notation, the number after "analysis" is the preset used for analysis, e.g. "analysis 6" means you use a -6 setting for the analysis stage?
It is then a bit weird that gasc analysis 6 output 8 subdivide_tukey(4) is better than "7", but I guess this is just one of those coincidences that happen. -6 has -5's LPC order.

... by the way, -8 -A subdivide_tukey(5) would also fall between -8 -A subdivide_tukey(4) and -8p, but cost so much more than (4) in my testing that you would rather use -8p. Gut feeling says that when a finer subdivision isn't much useful with 4096 blocks, then it is no better at smaller blocks - and that assumption could very well fail.

Oh, and: 1152:4608, does that mean the multiples of 576 or ...?

Re: Multithreading

Reply #61 – 2023-01-03 00:18:40

You're right about the notation, by 1152:4608 I meant blocksizes 1152,2304,3456,4608, that's also what I mean whenever there's something like 1152..4608, contiguous multiples of min blocksize up to whatever. It is odd that gasc 6 beats 7 but it's not impossible, possibly for some reason gasc 7 has more instances of a smaller blocksize being chosen before a bigger more appropriate blocksize gets tested (hazard of early exit strategy). Or more likely they perform so similarly one gets an advantage over the other by random chance, in an even fight 7 should win but they're not working on the exact same set of frames as the frames might be staggered.
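To make the shorthand concrete, this sketch expands "1152..4608" or "1152:4608" into the list of blocksizes it stands for. The notation is the poster's own convention in these tables; the parser below is hypothetical and is not a flaccid command-line syntax.

```python
def expand_blocksizes(spec):
    """Expand 'min..max' or 'min:max' into contiguous multiples of min."""
    separator = ".." if ".." in spec else ":"
    low_text, high_text = spec.split(separator)
    low, high = int(low_text), int(high_text)
    if low <= 0 or high < low or high % low != 0:
        raise ValueError("max must be a positive multiple of min")
    return list(range(low, high + 1, low))
```

So `expand_blocksizes("1152..4608")` gives the four sizes listed above, and "576..4608" would give the eight sizes used in one of the later peakset runs.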

These are the tracks that the top 2 runs output bigger than the fixed run:

Code: [Select]

Heat OST - Various Artists [1995]/11 - Moby - New Dawn Fades.flac
LudAndSchlattsMusicalEmporium/PMM-Bach-Cello-Suite-No.-1-G-Major-MASTER_V1.flac
LudAndSchlattsMusicalEmporium/PMM-Romeo-and-Julliet-MASTER-V1.flac
LudAndSchlattsMusicalEmporium/t.flac
TOOL - Fear Inoculum (Deluxe) (2019)/03 - Litanie contre la Peur.flac
Tool-Lateralus-CD-FLAC-2001-SCORN/04-tool-mantra.flac

t.flac is a copy of bach erroneously left in the corpus from previous testing. Litanie and tool mantra are both simple ambient filler. Classical is classical and apparently moby is also simple no offense moby. So it's no surprise that those tracks perform worse, the variable without being lax was of limited benefit relative to missing out on p output.

And here's the corpus with fixed -8b4096, as always nothing unnecessary not even a seektable:

Code: [Select]

35188815331 2021.08537 fixed -8b4096

The benefit with p does look slim. There's a little high res but it's mostly CDDA, there's a lot of metal/punk/electronic but there's also a selection of classical and OSTs.
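Putting numbers on "slim": the sketch below computes relative savings from the corpus totals quoted in this thread (fixed -8b4096 above, fixed -8pb4096 and "peakset analysis 5 output 8" from round 8). Only the helper name `saving_percent` is invented.

```python
def saving_percent(baseline_bytes, candidate_bytes):
    """Size saved by the candidate, as a percentage of the baseline."""
    return 100 * (baseline_bytes - candidate_bytes) / baseline_bytes


FIXED_8 = 35188815331     # fixed -8b4096
FIXED_8P = 35165983516    # fixed -8pb4096
PEAKSET_5 = 35092376807   # peakset analysis 5 output 8

p_over_plain = saving_percent(FIXED_8, FIXED_8P)         # what -p buys
peakset_over_p = saving_percent(FIXED_8P, PEAKSET_5)     # variable vs fixed -8p
peakset_over_plain = saving_percent(FIXED_8, PEAKSET_5)  # variable vs fixed -8
```

On this corpus -p saves roughly 0.065 percent over -8, while peakset 5 saves roughly 0.21 percent over -8p and roughly 0.27 percent over plain -8.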

Re: Multithreading

Reply #62 – 2023-01-16 00:18:24

Made a tool (attached, should compile fine for windows) to dump some stats about a flac bitstream. Probably not going to develop it much further and the code is nothing to write home about. Works on my PC but YMMV, here's some example output:

Code: [Select]

STREAMINFO{
 blocksize min 4096 max 4096
 framesize min 14 max 13664
 samplerate 44100
 channels 2 bits_per_sample 16 total samples 6791829
 Blocking strategy: Fixed
}

Metadata bit stats (% including bitstream):
   304 (0.000184%) STREAMINFO

Frame header stats (% excluding metadata):
 23226 (0.014046%) bits spent on sync codes
  3318 (0.002007%) bits spent on frame reservations to maintain syncability
  1659 (0.001003%) bits spent on block strategy bit
  6652 (0.004023%) bits spent encoding blocksize
  6636 (0.004013%) bits spent encoding samplerate
  6636 (0.004013%) bits spent encoding channel assignment
  4977 (0.003010%) bits spent encoding sample size
 25520 (0.015433%) bits spent encoding current frame/sample index with UTF8
 13272 (0.008026%) bits spent encoding frame header crc8

Subframe header stats (% excluding metadata)
  3318 (0.002007%) bits spent on subframe reservations to maintain syncability
 19908 (0.012039%) bits spent encoding model type
  3318 (0.002007%) bits spent on wasted bits flag

Modelling stats (bit % excluding metadata) (excluding residual bits)
    18 (0.542495%) subframes used constant modelling
     0 (0.000000%) subframes used verbatim modelling
     6 (0.180832%) subframes used fixed modelling
  3294 (99.276673%) subframes used lpc modelling
   288 (0.000174%) bits spent on constant
     0 (0.000000%) bits spent on verbatim
    48 (0.000029%) bits spent on fixed
927478 (0.560881%) bits spent on LPC

Residual stats (% excluding metadata):
  3300 (0.001996%) bits spent on residual reservations to maintain syncability
  3300 (0.001996%) bits spent on residual type (4 or 5 bit rice parameter)
164275633 (99.343672%) bits spent on residual encoding

Frame footer stats (% excluding metadata):
 26544 (0.016052%) bits spent encoding frame footer crc16
  5913 (0.003576%) bits spent on frame padding for byte alignment

Combined stats (% excluding metadata)
927814 (0.561084%) total bits spent on modelling
164282233 (99.347663%) total bits spent on residual
150897 (0.091253%) total bits spent on overhead (frame_header+subframe_header+footer

Miscellaneous stats:
 27432 (100.000000%) of residual partitions stored rice-encoded
     0 (0.000000%) of residual partitions stored verbatim
  9936 (0.006009%) total bits spent on pure reservations to maintain syncability (not including the many reserved values in used elements or end-of-frame padding)

Used it to see how the structure of the bitstream changes as different compression settings are used. Tested with the album version of this song without the talking intro: https://www.youtube.com/watch?v=MODhTJwebz8

Code: [Select]

Percentages are just for the bitstream, no metadata blocks are included in the 100% not even streaminfo

Filesize  Frames  Overhead  Modelling  Residual  Settings
22131570   1659    0.085%    0.067%    99.848%  fixed -0b4096
20941266   1659    0.090%    0.324%    99.586%  fixed -3b4096
20705886   1659    0.091%    0.418%    99.491%  fixed -6b4096
20681000   1659    0.091%    0.579%    99.330%  fixed -8b4096
20670160   1659    0.091%    0.561%    99.348%  fixed -8pb4096
20649171   1721    0.120%    0.565%    99.315%  peakset all -8p, blocksizes 1152,2304,3456,4608 no merge no tweak
20641291   1721    0.126%    0.566%    99.308%  peakset all -8p, blocksizes 1152,2304,3456,4608, no merge with tweak set to maximum iterations
20614856    904    0.070%    0.305%    99.625%  peakset all -8p, blocksizes 1152,2304,3456,4608, non-subset with merge and tweak set to maximum iterations
20614731    958    0.074%    0.318%    99.608%  peakset all -8p, blocksizes 576,1152,1728,2304,2880,3456,4032,4608, non-subset with merge and tweak set to maximum iterations
20613363    921    0.071%    0.312%    99.616%  peakset all -8p, blocksizes 1152,2304,3456,4608,5760,6912,8064,9216, non-subset with merge and tweak set to maximum iterations

No surprise that as compression effort increases for fixed runs that the overhead and model proportions increase, stronger settings search for heavier models which take more space to define
The overhead of the weakest peakset jumping relative to fixed is mostly because variable frame indexing is by sample instead of frame number, adding approx 2 bytes per header
By adding tweak in the next run overhead increased slightly thanks to having to encode a few blocksizes with a literal instead of one of the dozen or so predefined common values
This track apparently prefers blocksizes well above 4k on average, so lax allows a lot of overhead to be saved by reducing the number of frames. Fewer frames also means fewer models to store. Merge did a good job of clawing upwards in blocksize despite the max blocksize used in analysis being below what it probably should be for this input
Adding intermediate blocksizes without increasing the max blocksize used during analysis didn't help much; the analysis stage performed better but it ate some of tweak/merge's lunch to do so
Finally changing the blocksize list to something better-fitting the input was beneficial, but there wasn't much to gain
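The extra header cost of sample-based indexing comes from FLAC's UTF-8-style coded number, which grows with the value being stored. A short sketch of the length rule (1 byte below 2^7, then 2 to 7 bytes for values below 2^11, 2^16, 2^21, 2^26, 2^31 and 2^36); the function name is made up here.

```python
def coded_number_bytes(value):
    """Bytes used by FLAC's UTF-8-style coding of a frame or sample number."""
    if value < 0 or value >= 1 << 36:
        raise ValueError("value out of range for a FLAC coded number")
    if value < 1 << 7:
        return 1
    length = 2
    # An n-byte code carries 5 * n + 1 payload bits: 11, 16, 21, 26, 31, 36.
    while value >= 1 << (5 * length + 1):
        length += 1
    return length


# Fixed blocksize: frames 0..1658 of the example stream are numbered by frame.
fixed_index_bits = 8 * sum(coded_number_bytes(n) for n in range(1659))
```

For the 1659-frame example stream this gives 25520 bits, the same figure the stats dump reports for the frame/sample index. With sample numbering, the final 4096-sample-aligned position of that track (sample 6791168) needs 5 bytes instead of the 2 bytes used for frame number 1658.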

Re: Multithreading

Reply #63 – 2023-02-03 19:58:26

No major new features worth benchmarking but a lot of quality of life improvements. The code's a mess by design as it's all experimentation, this is the start of tidying by keeping what works and reworking how some things are done into what they probably should have been from the start:

Implement an output queue. Instead of encoding and dumping an output frame directly after analysis has found it, add it to the queue to be processed later
Batched output frame encoding is multithreaded (when analysis settings are different to output settings). This allows a user to introduce multithreading even if the analysis algorithm used is single-threaded
Move merge and tweak passes to act on the output queue instead of being intertwined with the analysis algorithm. Merge/tweak passes are triggered on queue flush
Separating merge/tweak entirely simplifies analysis implementations and allows tweak/merge to be used with modes they previously couldn't (gasc couldn't use either but now can use both, chunk couldn't use merge and still can't as it hasn't been ported yet).
Another benefit is that the queue size can be tuned to reduce the time merge/tweak take to execute with the tradeoff of slightly lower efficiency (as boundaries are introduced that tweak/merge cannot cross). A pass is iterative and acts on all known frames, even if only a small part of the set is benefitting from iteration all frames are tested. A smaller queue localises the high-order passes.
Multiple output settings have been implemented for the output queue, so anything that implements the queue gets it for free
Implement fixed blocking strategy as its own mode instead of fudging it with chunk mode. Slightly less overhead and a few checks to validate the commandline. Technically the implementation is still a fudge just in a different way (to be able to reuse queue code instead of reimplementing a multithreaded queue)
peakset ported to use output queue so it catches up on some features. As well as allowing the merge/tweak code to be removed from peakset there is multithreading of output frame encoding which was single-threaded before so there should be some speedup
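The general shape of such a queue can be sketched as follows. This is a structural illustration only, written in Python rather than flaccid's C, and every name in it (`OutputQueue`, `encode`, `write`, `passes`) is hypothetical: frames collect in a bounded queue, optional merge/tweak-style passes run over the whole batch on flush, and the batch is then encoded by a thread pool while output order is preserved.

```python
from concurrent.futures import ThreadPoolExecutor


class OutputQueue:
    """Collect analysed frames, then post-process and encode them in batches."""

    def __init__(self, capacity, encode, write, workers=4, passes=()):
        self.capacity = capacity  # frames held before an automatic flush
        self.encode = encode      # frame -> encoded bytes (output settings)
        self.write = write        # sink for encoded frames, called in order
        self.workers = workers
        self.passes = passes      # callables: list of frames -> list of frames
        self.pending = []

    def add(self, frame):
        self.pending.append(frame)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        frames, self.pending = self.pending, []
        # Merge/tweak-style passes see only this batch, so a smaller capacity
        # means cheaper passes but more boundaries they cannot cross.
        for batch_pass in self.passes:
            frames = batch_pass(frames)
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            encoded = list(pool.map(self.encode, frames))  # map keeps order
        for blob in encoded:
            self.write(blob)
```

An analysis loop would call `add` for each frame it settles on and `flush` once at end of input. Because the passes never look beyond the current batch, the capacity is the knob that trades pass time against efficiency.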

Not all deprecated code has been removed yet as chunk/gset still uses it, they will either be ported to use the queue or deprecated TODO. The code in merge.c/h and tweak.c/h is deprecated, replaced by functions queue_tweak/qtweak/queue_merge/qmerge in common.c. flist functions are deprecated, peakset still uses the flist struct but it probably shouldn't.

I'm predicting a queue size somewhere in the range 128-1024 is probably a good tradeoff to speed up tweak/merge with little efficiency loss. For settings that rely heavily on these it might be a nice overall speed bump (ie peakset subset with strong tweak). Might test this week.

Re: Multithreading

Reply #64 – 2023-02-04 12:16:35

I finally got some time to look into this code. It seems my compiles keep segfaulting at the very last frame (which is a different size than the others). Does this sound familiar?

edit: the segfault backtrace in gdb

Code: [Select]

Reading symbols from ./flaccid.exe...
(gdb) r
Starting program: /home/m_van/flac-cid42/src/flaccid/flaccid.exe --mode peakset --in test-input.flac --out test-output.flac
[New Thread 3516.0x4930]
test-input.flac Processed 1/7
Processed 3/7
Processed 7/7

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ff6e97aa008 in MD5 ()
(gdb) bt
#0  0x00007ff6e97aa008 in MD5 ()
#1  0x00007ff6e971b672 in peak_main ()
#2  0x00007ff6e97a73a8 in main ()

Re: Multithreading

Reply #65 – 2023-02-04 14:39:32

Quote from: ktf on 2023-02-04 12:16:35

I finally got some time to look into this code. It seems my compiles keep segfaulting at the very last frame (which is a different size than the others). Does this sound familiar?

at the very last frame you'll likely have a different number of samples than 'buffer size' so you'll have a segfault
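Whatever the cause turns out to be in this case, the short final frame is a classic source of out-of-bounds reads. A minimal sketch of clamping the last frame's length, using the sample count and blocksize from the stats dump earlier in the thread; `frame_lengths` is a made-up helper, not a flaccid function.

```python
def frame_lengths(total_samples, blocksize):
    """Yield the sample count of each frame in a fixed-blocksize stream."""
    position = 0
    while position < total_samples:
        # The final frame only covers whatever samples are left.
        length = min(blocksize, total_samples - position)
        yield length
        position += length


lengths = list(frame_lengths(6791829, 4096))
```

For that stream the list has 1659 entries and the last one is 661 samples, so any code that hashes or copies a full 4096-sample buffer for the final frame would read past the end of the input.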

Re: Multithreading

Reply #66 – 2023-02-04 19:44:57

Quote from: ktf on 2023-02-04 12:16:35

I finally got some time to look into this code. It seems my compiles keep segfaulting at the very last frame (which is a different size than the others). Does this sound familiar?

Coincidentally, commit 4b41a0fdb9068d802a624bd35b4419f73f0a7fb8 fixed a last-frame bug in peakset that was introduced in the previous commit when porting to the queue, but it looks like the actual problem is that I broke mbedtls support at some point and didn't notice. Peakset currently doesn't hash on the fly; it hashes everything at once after frame processing, so MD5 in the trace is a giveaway.

The latest commit 138832bf232f42b92e3ea4c33f168930d45d4135 fixes the mbedtls path for me. To compile on Linux with gcc I need to add

Code: [Select]

-I./ mbedtls/platform_util.c mbedtls/md5.c

to the gcc command, YMMV.

flaccid should compile without warnings; if there are warnings there may be a problem, but I've only tested with gcc on Linux. MSVC probably warns about some things; if it does, let me know and I can try to eliminate them.

Re: Multithreading

Reply #67 – 2023-02-05 07:34:14

Quote from: cid42 on 2023-02-04 19:44:57

flaccid should compile without warnings; if there are warnings there may be a problem, but I've only tested with gcc on Linux. MSVC probably warns about some things; if it does, let me know and I can try to eliminate them.

I get a lot of warnings like these

Code: [Select]

chunk.c: In function 'chunk_main':
chunk.c:214:49: warning: implicit declaration of function 'MD5_Update' [-Wimplicit-function-declaration]
  214 |                                                 MD5_Update(&ctx, ((void*)input)+iloc, ilen);
      |                                                 ^~~~~~~~~~
chunk.c:214:49: warning: nested extern declaration of 'MD5_Update' [-Wnested-externs]
chunk.c:232:33: warning: implicit declaration of function 'MD5_Final' [-Wimplicit-function-declaration]
  232 |                                 MD5_Final(set->hash, &ctx);
      |                                 ^~~~~~~~~
chunk.c:232:33: warning: nested extern declaration of 'MD5_Final' [-Wnested-externs]

I couldn't figure out what I did wrong, so I just disabled MD5 calculation altogether. Even with that, I couldn't get gasc to work, so I did a comparison with peakset only. I compiled with

Code: [Select]

gcc -o flaccid -DFLAC__NO_DLL -fopenmp *.c -I../flac-cid42/include ../flac-cid42/src/libFLAC/.libs/libFLAC-static.a -logg -lmbedtls -lmbedcrypto -Wall -O3 -funroll-loops  -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline  -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -Ofast

Here is a nice graph

This is about 8.5 hours of music from very different genres: two tracks randomly selected from every album I used for my comparison here. Compared are flac -4, -5, -6, -7, -8, -8p (from left to right) with flaccid peakset output-comp 8, peakset (with defaults), peakset with 9 blocksizes (all multiples of 512 up to 4608) and peakset with 18 blocksizes (all multiples of 256 up to 4608). Times are CPU times, so 'ignoring' multithreading. edit: for those interested, here is a breakdown per track.

Very impressive.

Re: Multithreading

Reply #68 – 2023-02-05 11:11:07

Ooh, I made it to one of your PDFs.

Are those warnings after updating to the latest commit? I was getting warnings like that when trying to build on Linux using the mbedtls path, and fixed it after you reported issues.

Here's how I compile the mbedtls path on Linux:

Code: [Select]

gcc -oflaccid-mbed chunk.c common.c fixed.c flaccid.c gasc.c gset.c load.c merge.c peakset.c tweak.c -I./ mbedtls/platform_util.c mbedtls/md5.c -O3 -funroll-loops  -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline  -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -Wno-sign-compare -I../flac/include -lm -fopenmp /home/u20/Documents/mountain/runtime/flac/linbuild2/src/libFLAC/libFLAC.a -I../flac/include -lm -fopenmp -logg

The difference is that mbedtls is embedded instead of linked as a library; if you have issues on the latest commit this might help.

FWIW here's the compile command for OpenSSL which is what I normally use:

Code: [Select]

gcc -oflaccid chunk.c common.c fixed.c flaccid.c gasc.c gset.c load.c merge.c peakset.c tweak.c -DUSE_OPENSSL -lcrypto -O3 -funroll-loops  -Wall -Wextra -Wstrict-prototypes -Wmissing-prototypes -Waggregate-return -Wcast-align -Wnested-externs -Wshadow -Wundef -Wmissing-declarations -Winline  -Wdeclaration-after-statement -fvisibility=hidden -fstack-protector-strong -Wno-sign-compare /home/u20/Documents/mountain/runtime/flac/linbuild2/src/libFLAC/libFLAC.a -I../flac/include -lm -fopenmp -logg

There were a few gasc bugs introduced a few commits ago from implementing the queue that are now fixed. The commits tested as working on a few tracks, but I was too eager and pushed before wider testing. Commit 2d1c66e36b26f750ba7a8c9cf6afc90099515aaa fixed gasc multithreading; commit 4b41a0fdb9068d802a624bd35b4419f73f0a7fb8 fixed a gasc bug in hashing the last frame. All known bugs have been fixed in the latest commit; if you're having issues with commit 138832bf232f42b92e3ea4c33f168930d45d4135 or later then it's something I'm unaware of. I'll try to be better at not pushing bugs from now on.

Re: Multithreading

Reply #69 – 2023-02-05 11:36:22

Interesting indeed. Though I realize that I have too often overlooked the log scale, damn sometimes I am just stupid and even sticking to it ... but anyway:

* Bumping prediction order from 8 to 12 makes for quite a bit, and that is not a surprise. Not unlikely that could have been exploited further, which within subset is only possible for high resolution
* Optimizing blocksize - which obviously also makes for a better fitting predictor - can easily make for slightly more than half of that improvement

Which effect is largest?

So for testing purposes, I am curious what happens in the following comparison.
* Take the leftmost red (is that peakset output-comp 8?) and limit it to be comparable to -6 (prediction order, apodization, max Rice parameter). It would still improve over -6; by how much? Is it cheaper/better than going -l12?
* To get "everything else equal", one would have to consider the fact that -r6 means half the partition size for half the block size. At least to compare predictor fit, one might consider locking it down to ... well, -r9 would of course make everything equal? Not in time taken, I mean; is it so that -r6 will run 7 calculation rounds?
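
The arithmetic behind the -r point can be spelled out in a few lines. This is a sketch, not libFLAC code: it assumes partition order k splits a block into 2^k equal partitions (ignoring that the first partition loses the predictor's warm-up samples), and that a search up to order 6 evaluates every order from 0 to 6. Both helper names are made up for illustration.

```c
#include <assert.h>

/* Samples per Rice partition for a given blocksize and partition order.
 * A partition order of k splits the block into 2^k partitions, so the
 * blocksize must be divisible by 2^k. */
static unsigned partition_samples(unsigned blocksize, unsigned order)
{
    assert(order < 32 && (blocksize & ((1u << order) - 1u)) == 0);
    return blocksize >> order;
}

/* Number of partition orders evaluated when the search runs from order 0
 * up to and including max_order (an assumption about the search). */
static unsigned orders_tried(unsigned max_order)
{
    return max_order + 1;
}
```

At order 6 a 4096-sample block has 64-sample partitions and a 2048-sample block has 32-sample partitions, which is the "half the partition size for half the block size" effect. A 4608-sample block at order 9 has 9-sample partitions.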

Wondering if the association between blocksize and the actual partitioning of that block shows anything surprising?
Some phenomena possibly at play:
1: bad-fitting predictor leads to larger residuals and to more block splitting and finer (in terms of absolute sample count) partitioning
2: "short noisy parts" surely lead to bad-fitting predictor, but there the more noise-alike the less point in improving it and so you might as well just use the "neighbouring frame's vector" and merge the blocks.

Re: Multithreading

Reply #70 – 2023-02-05 15:21:40

Actually fixed gasc as of commit 5fb1861601759de021a8b25f9038f5034da890be. gasc/fixed/peakset passed flac-test-files and MD5 works for all bit depth input now. Latest commit adds a --no-md5 option which disables md5 for fixed/gasc/peakset.

Re: Multithreading

Reply #71 – 2023-02-06 19:21:21

gset and chunk have both been ported to the new codebase (now everything has been ported with feature parity). gset is probably not as good as gasc, but now that chunk can benefit just as much from merge/tweak it may be competitive with gasc. Chunk has been reworked to multithread within a chunk instead of processing multiple chunks simultaneously, heavily simplifying the implementation. I've also removed the ability to stride >2, as chunk is already low effort and this simplifies the code further; now every chunk subdivides by 2, not n. Every mode has been tested to the point where it passes flac-test-files and a few hundred other tracks. Hopefully an issue hasn't crept in; edge-case tests beyond what flac-test-files catches wouldn't be a bad thing to implement, instead of relying on giant benchmarks to weed out 1/1000 bugs.
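
For readers unfamiliar with the idea, subdividing by 2 can be pictured as the recursion below. It is a toy sketch under my own assumptions, not the flaccid implementation: `best_cost`, `frame_cost_fn` and `toy_cost` are hypothetical, and the cost model is invented purely so the example runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost callback: encoded size in bytes of one frame covering
 * samples [start, start + len). A real encoder would run the codec here. */
typedef size_t (*frame_cost_fn)(size_t start, size_t len);

/* Return the cheaper of encoding [start, start + len) as one frame or
 * splitting it in half and recursing, never going below min_len. */
static size_t best_cost(frame_cost_fn cost, size_t start, size_t len, size_t min_len)
{
    assert(cost != NULL && len > 0 && min_len > 0);
    size_t whole = cost(start, len);
    if (len / 2 < min_len || len % 2 != 0)
        return whole;
    size_t half = len / 2;
    size_t split = best_cost(cost, start, half, min_len)
                 + best_cost(cost, start + half, half, min_len);
    return split < whole ? split : whole;
}

/* Toy cost model for illustration only: 16 bytes of frame overhead,
 * 1 byte per sample, plus a penalty when a frame straddles sample 2304
 * (standing in for a change in the signal at that point). */
static size_t toy_cost(size_t start, size_t len)
{
    size_t penalty = (start < 2304 && start + len > 2304) ? 1000 : 0;
    return 16 + len + penalty;
}
```

With the toy model, a 4608-sample chunk starting at 0 is cheaper as two 2304-sample frames, while a chunk that does not straddle the change stays whole because splitting only adds overhead.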

If using tweak and trying to be time-efficient, --tweak 1 --queue 64 or something like it may be a better use of equivalent time than --tweak 1024 --queue 2048 (focusing the most time on the frames that show the most benefit).

Re: Multithreading

Reply #72 – 2023-02-06 20:53:56

Thanks! Compiles better indeed. Changes to the help text are an improvement.

Would you consider adding a few examples of what you would consider good choices? Much like `flac` has presets? I just tried a few things before running the benchmark, but perhaps you can make some suggestions? It is for example not really clear whether --analysis-apod and --output-apod are populated by default and only override what --analysis-comp and --output-comp set, or not.

If you can make some suggestions I can run the benchmark again. There are quite a few suggestions in this thread already, but many might be outdated by now. It would be nice to show how much room for improvement FLAC really still has.

edit: would you like me to share a Windows compile here for others to test?

Re: Multithreading

Reply #73 – 2023-02-06 23:08:04

Quote from: ktf on 2023-02-06 20:53:56

Thanks! Compiles better indeed. Changes to the help text are an improvement

Would you consider adding a few examples as to what you would consider good choices? Much like `flac` has presets? I just tried a few things before running the benchmark, but perhaps you can do some suggestions? It is for example not really clear whether --analysis-apod and --output-apod are populated by default and are only to override what --analysis-comp and --output-comp set or not.

The apod options override if used; they're NULL by default, so whatever apod the flac preset uses is what's normally active.
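
In code terms that is a plain NULL-means-default fallback. A minimal sketch, assuming a hypothetical option struct (the struct and field names are illustrative, not taken from flaccid):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stage options; the field name is an assumption. */
struct stage_opts {
    const char *apod; /* NULL unless the user passed an apod override */
};

/* Return the apodization string to use: the override if one was given,
 * otherwise whatever the chosen flac preset would have used. */
static const char *effective_apod(const struct stage_opts *o, const char *preset_apod)
{
    assert(o != NULL);
    return o->apod != NULL ? o->apod : preset_apod;
}
```

So leaving --analysis-apod and --output-apod unset changes nothing relative to the chosen compression preset.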

Quote from: ktf on 2023-02-06 20:53:56

If you can do some suggestions I can run the benchmark again. There are quite a few suggestions in this thread already, but many might be outdated by now. It would be nice to show how much room for improvement FLAC really still has

I can add some presets, maybe by the end of the week. Until then here are some sprawling pointers; most of those from the thread should still apply. It's been a while since I experimented, and TBH it would probably take a lot of testing to answer properly.

* Avoid flac presets 1 and 4 for cleaner comparisons: flaccid can't use -M, which means preset 1 is really 0 and preset 4 is really 3 with -l8.
* Unlike every other mode, gasc doesn't use --blocksize-list and instead uses --blocksize-limit-lower to define the minimum blocksize. The interface for gasc may need a rethink.
* For subset encodes, blocksize 4608 should probably be in the list for full coverage, even if tweak is enabled.
* Tweak passes are not very useful when the minimum blocksize in the list is very small. The distance to tweak is based on the smallest blocksize so as not to cover the same ground analysis did (with peakset particularly in mind). If the minimum blocksize is below 576, tweak is probably not worthwhile unless doing a slow encode.
* A smaller queue should speed tweak/merge up considerably (not tested). A queue size in the ballpark of 100 is probably the way to go for anything that's not meant to be very slow; very slow analysis should use a large queue just to avoid the minor efficiency loss.
* For lax encodes there is normally some benefit to the analysis blocksize list extending beyond 4608, but it is limited (unless the input has a high frequency). Merge seems a more time-efficient way to attempt to encode large blocksizes in general.
* Sometimes merge gets in the way of tweak; often when merge is disabled tweak makes up the difference. Merge is by far the weaker of the two, which makes sense as tweak tries to shape to the signal whereas merge only benefits when the signal is simple enough that overhead is minimised by merging. Large blocksizes are nice but less useful than they seem.
* Chunk may be more suitable for quick lax encodes, as it can quickly extend into large blocksizes, which gasc cannot (then again gasc+merge+lax might compete). It is probably best to use chunk with tweak, and probably best that the minimum blocksize isn't too small so tweak can do some heavy lifting.
* gasc is probably more suitable for quick subset encodes, but this is not extensively tested.
* 4096 and other common blocksizes are common for a reason. Any blocksize list for analysis should have good coverage of at least 2k-4k, and probably 1k too. Below 1k you can probably get away with relying on tweak or nothing to extend into smaller blocksizes, unless going very slow.
* For very slow encodes, output encode settings are used much less often, so you could potentially get away with using -e.
* Even for slow encodes I don't think peakset should go below a blocksize of 768, maybe 576 at a push. Ideally peakset gets the frames in the ballpark and tweak has room to refine; beyond that you're probably better off increasing flac settings.
* Don't bother with --outperc unless you want to tune encode time for a specific comparison.
* Analysis and output settings shouldn't be too different. Ideally analysis is a good predictor of output, otherwise it could be detrimental.
* Even for very quick encodes I don't see the point of using very light analysis settings; very quick encodes have low analysis effort, which already does the job.
* The way tweak/merge is defined, it keeps iterating until a pass saves less than the specified number of bytes. Passes are done on a queue, so a smaller queue makes saving x bytes harder.
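
The stopping rule in the last pointer can be sketched as the loop below. It is an illustration only: `run_passes`, `pass_fn` and `toy_pass` are hypothetical names, and whether the savings of the final below-threshold pass are kept is my assumption, not something stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pass callback: runs one tweak or merge pass over the queued
 * frames and returns how many bytes that pass saved. */
typedef size_t (*pass_fn)(void *state);

/* Repeat passes until one saves fewer than `threshold` bytes.
 * Returns total bytes saved; *passes_out receives the number of passes run. */
static size_t run_passes(pass_fn pass, void *state, size_t threshold, size_t *passes_out)
{
    size_t total = 0, passes = 0, saved;
    assert(pass != NULL);
    do {
        saved = pass(state);
        total += saved;
        passes++;
    } while (saved >= threshold && saved > 0); /* saved > 0 guards threshold 0 */
    if (passes_out)
        *passes_out = passes;
    return total;
}

/* Toy pass for illustration: each pass saves half of what is left to save. */
static size_t toy_pass(void *state)
{
    size_t *left = state;
    size_t saved = *left / 2;
    *left -= saved;
    return saved;
}
```

Because the threshold applies to the bytes saved across one queue, the same threshold is a stricter bar for a small queue than for a large one, which is why something like --tweak 1 pairs naturally with a small --queue.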

With the above in mind, these settings might be reasonable for subset (again, I haven't tested queue depth or gasc much, nor chunk with decent tweak/merge, so some of these are more guess than knowledge). It seems reasonable to pair stronger flac settings with stronger flaccid settings, and it could be tweaked many ways. In probably ascending order of difficulty:

Quick
--mode gasc --blocksize-limit-lower 1536 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
--mode chunk --blocksize-list 1152,2304,4608 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
--mode gasc --blocksize-limit-lower 1152 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
--mode chunk --blocksize-list 1152,2304,4608 --tweak 64 --queue 16 --merge 0 --analysis-comp 6 --output-comp 8
--mode gasc --blocksize-limit-lower 1152 --tweak 64 --queue 16 --merge 0 --analysis-comp 6 --output-comp 8
Mid
--mode peakset --blocksize-list 1152,2304,3456,4608 --analysis-comp 8 --output-comp 8p --queue 64 --tweak 1 --merge 0
--mode peakset --blocksize-list 768,1536,2304,3072,3840,4608 --analysis-comp 8 --output-comp 8p --queue 64 --tweak 1 --merge 0
Slow
--mode peakset --blocksize-list 768,1536,2304,3072,3840,4608 --analysis-comp 8p --output-comp 8p --queue 8192 --tweak 1 --merge 0
Probably not worth it
--mode peakset --blocksize-list 768,1536,2304,3072,3840,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0 (this may have better efficiency than the last two as tweak may perform better)
--mode peakset --blocksize-list 576,1152,1728,2304,2880,3456,4032,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0
--mode peakset --blocksize-list 576,1152,1728,2304,2880,3456,4032,4608 --analysis-comp 8p --output-comp 8ep --queue 8192 --tweak 1 --merge 0

Probably far from optimal but it might be a good starting point. I'd consider the first peakset setting on the list as probably the most balanced.

Quote from: ktf on 2023-02-06 20:53:56

edit: would you like me to share a Windows compile here for others to test?

Sure, crack on sharing whatever you like. I can normally cross-compile but had trouble as libflac is involved.

Re: Multithreading

Reply #74 – 2023-02-07 12:41:44

Here's a Windows x64 binary for anyone wanting to give this a try.

Notice