Topic: Multithreading (Read 9232 times)

Re: Multithreading

Reply #75
Thank you for posting a binary, a few people kicking the tyres should shake out any remaining bugs in handling.

Case in point: I've just fixed subset handling. High sample rates were being limited more strictly than they should have been (they were limited as if they were <=48 kHz, i.e. a blocksize limit of 4608 instead of 16384 and an LPC order of 12 instead of 32). The above binary should be fine for <=48 kHz; for subset >48 kHz it may perform slightly worse than it could. The fix is not fully tested but it's just a UI change. I saw the binary and figured it should get pushed asap in case someone tries it with high sample rate input.
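For anyone wanting to sanity-check their own settings, the two limits mentioned above can be written down as a small Python sketch. The function name `subset_limits` is made up for illustration and is not taken from flaccid; the numbers are the ones from this post.

```python
def subset_limits(sample_rate_hz):
    """Return the sample-rate-dependent subset caps quoted in the post.

    Hypothetical helper for illustration only, not flaccid's code.
    """
    if sample_rate_hz <= 48000:
        return {"max_blocksize": 4608, "max_lpc_order": 12}
    return {"max_blocksize": 16384, "max_lpc_order": 32}


for rate in (44100, 48000, 96000):
    print(rate, subset_limits(rate))
```

With the bug described above, a 96000 Hz input would have been given the first set of limits instead of the second.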

Re: Multithreading

Reply #76
And that's with ~8KB of padding.  The smallest I can make your example while still subset (if you consider variable blocksize to be subset) is this:
I think it is more appropriate to ask here. "What I consider" is unimportant. If variable blocksize is implemented in future Xiph releases, would it be subset according to the "official" standard? In other words, would it require --lax?

Re: Multithreading

Reply #77
Technically it's subset, but some players fail even if supposedly they support subset.

From earlier in the thread:
So, the CPU time is 'single-core time' I guess? So when using 4 CPU cores wall time would be a quarter of that?

I think there are plenty of people on this forum that would be willing to use this, creating a subset, fully spec-compliant FLAC file that might play in their hardware system. Half a percent is quite an improvement, even at the cost of having the encoder run near real-time on a single core. So once again: very impressive.

Sadly non-standard blocksizes and variable blocksize streams are not well supported features in many hardware decoders, see here, but I think many people here won't mind.

Re: Multithreading

Reply #78
Thanks. I asked because the (buggy) CUETools.Flake encoder does not require --lax when using variable blocksize, so hopefully it will be within subset in future official releases.

Re: Multithreading

Reply #79
The new CUETools has disabled variable blocksize as a quick fix, so right now there is no maintained encoder supporting it.

Re: Multithreading

Reply #80
With the above in mind, these settings might be reasonable for subset (again, I haven't tested queue depth or gasc much, nor chunk with decent tweak/merge; some of these are more guess than knowledge). It seems reasonable to pair stronger flac settings with stronger flaccid settings, though it could be tweaked many ways. In probably ascending difficulty:

  • Quick
  • --mode gasc --blocksize-limit-lower 1536 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
Probably far from optimal but it might be a good starting point. I'd consider the first peakset setting on the list as probably the most balanced.

cmd
Code:
H:\>flaccid --in phantom.wav --out phantom.flac --mode gasc --blocksize-limit-lower 1536 --tweak 0 --merge 0 --analysis-comp 6 --output-comp 6
phantom.wav     settings        mode(gasc);lax(0);analysis_comp(6);analysis_apod((null));output_comp(6);output_apod((null));tweak(0);merge(0);blocksize_limit_lower(1536);blocksize_limit_upper(4608)
effort  analysis(2.734);tweak(0.000);merge(0.000);output(0.000)  subtiming       analysis(0.00000);tweak(0.00000);merge(0.00000) size    518191375        cpu_time        62.12700

foo_bitcompare
Code:
Differences found in compared tracks; the tracks became identical after applying offset and truncating first/last samples.
Extra leading/trailing sections contained non-null samples.

Comparing:
"H:\phantom.flac"
"H:\phantom.wav"
Differences found: length mismatch - 1:40:31.333583 vs 1:40:31.333333, 265981811 vs 265981800 samples.
Compared 265981800 samples.
Discarded last 11 samples containing non-null values from the longer file.
Differences found within the compared range: 531100405 values, 0:00.000000 - 1:40:31.333311, peak: 1.631927 (+4.25 dBTP) at 0:40.642426, 1ch
Channel difference peaks: 1.631927 (+4.25 dBTP) 1.481720 (+3.42 dBTP)
File #1 peaks: 0.999969 (-0.00 dBTP) 1.000000 (0.00 dBTP)
File #2 peaks: 0.999969 (-0.00 dBTP) 1.000000 (0.00 dBTP)
Detected offset as -11 samples.

Comparing again with corrected offset...
Compared 265981800 samples, with offset of -11.
Discarded 11 leading samples containing non-null values from file #1.
No differences in decoded data found within the compared range.
Channel peaks: 0.999969 (-0.00 dBTP) 1.000000 (0.00 dBTP)

Re: Multithreading

Reply #81
This is because flaccid doesn't support wav input yet: it treats the wav as raw and sees the wav header as samples. For now only FLAC input is recommended.
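The 11 extra samples in the foo_bitcompare log fit that explanation. A quick check in Python, assuming a canonical 44-byte PCM wav header and 16-bit stereo audio (both are assumptions, the post doesn't state the format of phantom.wav):

```python
# Assumptions, not stated in the thread: canonical PCM WAV header, 16-bit stereo.
WAV_HEADER_BYTES = 44
CHANNELS = 2
BYTES_PER_SAMPLE = 2


def header_as_samples(header_bytes=WAV_HEADER_BYTES,
                      channels=CHANNELS,
                      bytes_per_sample=BYTES_PER_SAMPLE):
    """Number of inter-channel samples a raw reader would see in the header."""
    return header_bytes // (channels * bytes_per_sample)


extra = header_as_samples()
print(extra)              # 11
print(265981800 + extra)  # 265981811, the sample count of the encoded file
```

That matches both the length mismatch and the offset of -11 samples that the comparator detected.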

Re: Multithreading

Reply #82
I see. For comparison, the same file compressed using flac 1.4.2 -8 is 515831607 bytes.

Re: Multithreading

Reply #83
Ooops  :-[   Off to the recycle bin you go, tonight's files ...

Anyway, for testing now: what is a suggested setting for "cheap improvement"?


Anyway, errors from compressing .wav are not that big, so here are tonight's sizes from the corpus in my signature, all after metaflac --remove-all --dont-use-padding. I used "partial_tukey(1)" because I was too lazy to check whether decimal points are an issue with this build, and because I intended to increase it later.
Left number: -8p using reference FLAC. Middle: mode(gasc);lax(0);analysis_comp(5);analysis_apod(partial_tukey(1));output_comp(8p);output_apod(partial_tukey(1));tweak(0);merge(4096);blocksize_limit_lower(1024);blocksize_limit_upper(4608)
Right: plain "gasc".

3 983 858 119 vs 3 982 064 862 vs 3 978 851 946 for the classical music. The second step makes the bigger difference.
3 979 488 237 vs 3 975 772 637 vs 3 973 080 798 for the heavier stuff. The first step makes the bigger difference.
3 997 945 099 vs 3 988 216 396 vs 3 983 642 274 for the "none of the above" pop/jazz/etc.
The last line is a "what?!". Savings of nearly ten and then nearly five million bytes, and a quick glance reveals that it makes the biggest difference for Kraftwerk's TdF soundtrack:
301 973 233 vs 299 405 596 vs 297 844 789.
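To put those numbers side by side, here is a small Python sketch that computes the bytes saved at each step and the relative saving. The sizes are copied from this post; the helper name `saving` and the corpus labels are only for illustration.

```python
# Sizes in bytes copied from the post:
# (reference flac -8p, gasc with the settings listed above, plain gasc)
CORPUS = {
    "classical": (3983858119, 3982064862, 3978851946),
    "heavier": (3979488237, 3975772637, 3973080798),
    "pop/jazz/etc": (3997945099, 3988216396, 3983642274),
    "Kraftwerk TdF": (301973233, 299405596, 297844789),
}


def saving(before, after):
    """Return (bytes saved, percent saved relative to `before`)."""
    diff = before - after
    return diff, 100.0 * diff / before


for name, (ref, tuned, plain) in CORPUS.items():
    step1, pct1 = saving(ref, tuned)
    step2, pct2 = saving(tuned, plain)
    print(f"{name:14s} {step1:>9d} ({pct1:.3f}%) then {step2:>9d} ({pct2:.3f}%)")
```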

Re: Multithreading

Reply #84
It'll be difficult to compete with fixed at quick settings as there's not a lot of time and the way a variable blocksize is chosen regardless of mode involves encoding the input multiple times. It might take a new algorithm that can more smartly shape the blocksizes chosen to the input to be competitive for quick encodes, possibly with little or no brute forcing. For now a variable encoding is best suited to compete with slower fixed encodes and to increase how well an input can viably be compressed beyond "slower".

It'll take some tweaking to find suitable presets. Those were off the top of my head, and while they should be ascending in difficulty I didn't consider how they compare to flac. The quick settings suggested should probably be disregarded as fixed probably beats them all. gasc is a greedy algorithm so it can produce worse results than fixed.

I've found a bug in tweak that makes it perform worse than it should thanks to porting to the output queue, the bug doesn't invalidate the output. Apologies but any benchmarking should probably avoid tweak for a little while, hopefully I have time today to fix it. The port to output queue didn't go as smoothly as I thought, there may be other dragons lurking.

BTW all stat output except cpu_time is deprecated as a lot of it is inaccurate now and was only meant for testing.

 

Re: Multithreading

Reply #85
Sure worth pointing out that as of now there is not much hope that this will compete at speed and surely not at this stage - but it is still worth finding where the lowest-hanging fruit are. Say if merely checking 2048 & 4096 is sufficient to take out the biggest part of the possible improvements, that would likely be the cheapest lever to pull (borrowing a phrase from you) - and it would be worth pursuing for a reasonable preset.

(There must be some nice thesis idea for a clever jr researcher here: find a heuristic that takes as input your typical .flac file with a 4096 frame size, does a quick analysis on the predictor and the ... hm, rice partition order and parameter I guess? - to decide what frames to split. I've seen the IEEE publish far less interesting ideas than that.)

Re: Multithreading

Reply #86
Tweak should now be fixed, it mainly affected peakset (tweak did nothing on first flush and 2nd flush onwards had a negative impact) but there was also a general bug (2nd flush onwards was working with a less-optimal tweak distance but at least the impact was still positive). Haven't had a huge amount of time to test the fix, but pushing this is better than leaving it as it was.

gasc is now selected by passing --blocksize-list with a single blocksize, as it should be; --blocksize-limit-lower should never have been used for this purpose.

Re: Multithreading

Reply #87
As of commit 0f421d4d6897aaa3829cca51831f5113d2a9f792:
  • Peakset now supports multiple windows of user-defined size (in millions of samples, with --peakset-window size), so peakset no longer requires access to the entire file at once. This paves the way for RAM to not scale with input size (in a future commit when the loader no longer reads the entire input before encoding begins, big TODO)
  • Add --no-seek option to disable output stream seeking, meaning the header is not updated after encoding. Requires --no-md5 to be set so that the user understands that hashing has to be disabled
  • Add output pipe support (with --out -). Either fully cached in RAM to allow header to be updated before writing, or used in tandem with --no-seek to write to pipe as soon as output frames are created
  • Fixed out-of-spec issue with blocksize_min in header when blocksize of partial last frame of variable encode is less than 16 samples. The partial last frame of variable encodes is still represented in the header but that's within spec (blocksize_min/max are bounds, nothing is mentioned about frame sizes but I assume the partial frame should always be present there even with fixed encodes as there is a tiny chance it's the biggest frame size)
  • Fixed potential issue with peakset and small queues, set->blocksize_min strikes again. set->blocksize_min has now been eliminated where possible to reflect its change of use
  • Fixed rare case where a variable encode looks like a fixed encode in the header. This probably occurs when input is very small; it can also occur when the settings used are pathological or wildly inefficient, but it is very unlikely to occur under normal circumstances
  • Fixed stat gathering and made it multithread-safe. Subtiming eliminated and everything gathered properly instead of using mode-specific shortcuts
  • Fixed subset with high samplerates still being restricted slightly too strictly
  • Fully convert modes to using simple_enc instead of manually defining edge cases. Fixed implementation dramatically simplified
  • Boilerplate sucked out of beginning and end of all mode implementations to simplify them
  • flanal utility to validate flac files is now in the repo and has been updated slightly to work with input created with --no-seek
It's getting there but still alpha for now. Once input handling is improved it'll be possible to implement an input pipe, once that's in we're close to all the major features being in place to consider it beta.

Re: Multithreading

Reply #88
Thinking about it there's no reason flaccid needs to reuse presets -0 to -8, those could match ./flac's fixed -0 to -8 until better variable encodes taking roughly the same time possibly replace them. The midrange of the previous suggestions could start at -9.

...
(There must be some nice thesis idea for a clever jr researcher here: find a heuristic that takes as input your typical .flac file with a 4096 frame size, does a quick analysis on the predictor and the ... hm, rice partition order and parameter I guess? - to decide what frames to split. I've seen the IEEE publish far less interesting ideas than that.)
Maybe adjacent frames that have similar models are ripe for merging, though mostly this will not be subset. Maybe when adjacent modelling varies by a lot it's a sign that more blocksize nuance is needed in that area; the levers at the fixed encode's disposal were changed but maybe they weren't the ideal levers for the job. edit: This amounts to targeted tweak/merge phases, just with an initial encode fed in by the input instead of computed fresh. If a targeted tweak/merge is viable, it's not just viable for flac input but is also a candidate to replace the current tweak/merge passes. It seems possible that it's viable, kind of a halfway hack between being smart and being bruteforce.

Regardless of how the variable permutation is chosen, the model of the closest frame at a given point in the input could be used to encode the variable frames. You'd do it to narrow the search space where time is at a premium, the main benefit would be using a smaller max-lpc-order when you seem to be able to get away with it. You could apply it to flac-input, or again use this prior knowledge to speed up the tweak/merge test encodes.

Re: Multithreading

Reply #89
From the spec:
Quote
The partition order MUST be so that the block size is evenly divisible by the number of partitions. This means for example that for all odd block sizes, only partition order 0 is allowed. The partition order also MUST be so that the (block size >> partition order) is larger than the predictor order. This means for example that with a block size of 4096 and a predictor order of 4, partition order cannot be larger than 9

The first part means it may be wise to stick to blocksizes divisible by 2^n for n as large as is feasible. Sane analysis settings will do this anyway, but tweak currently chooses the tweak distance each pass with (min_blocksize_used_in_analysis/(tweak_pass+2)). This definitely introduces blocksizes that can only be stored with low order partitions fairly early, and it looks like it's a contributing factor to successive tweak pass benefit looking like a decreasing reverse sawtooth wave.
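The two rules from the spec quote can be checked with a few lines of Python. This is a sketch of the constraint only, not flaccid or libFLAC code; `max_partition_order` is a made-up name, and the cap of 15 assumes the 4-bit partition order field of the FLAC format.

```python
def max_partition_order(blocksize, predictor_order, cap=15):
    """Highest partition order allowed by the two quoted rules.

    Rule 1: blocksize must be evenly divisible by 2**order.
    Rule 2: (blocksize >> order) must be larger than predictor_order.
    `cap` defaults to 15, assuming the 4-bit partition order field.
    """
    order = 0
    while order < cap:
        candidate = order + 1
        if blocksize % (1 << candidate) != 0:
            break  # rule 1 fails, and it fails for all higher orders too
        if (blocksize >> candidate) <= predictor_order:
            break  # rule 2 fails, partitions would be too short
        order = candidate
    return order


print(max_partition_order(4096, 4))  # 9, the example from the spec quote
print(max_partition_order(4095, 4))  # 0, odd block sizes only allow order 0
```

A blocksize like 4378 (even but not divisible by 4) tops out at order 1, which is the sort of oddball size a tweak distance that isn't a multiple of 16 can produce.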

Ensuring tweak distance is a multiple of 16 (or 8/32/64) might be enough, something like ((((min_blocksize_used_in_analysis/(tweak_pass+2)))/16)*16). Results using that are inconclusive, it appears to help a tweak pass get a better result sooner but the restriction means a full tweak run (aka --tweak 1) exits earlier resulting in a slightly worse result (but much quicker). If that's true in general then it's still a win IMO, ideally for a general encode you do a few passes that grab most of the benefit and call it good.

Code:
      Amount saved in pass                        Total saved by a certain point
pass     1     2     4     8    16    32    64        1      2      4      8     16     32     64
   1  3900  3900  3900  3900  3900  3900  4012
   2  1982  1982  1982  1982  1982  1982  1681
   3  1165  1165  1165  1165  1165  1376  1491
   4    69   146   803   803   803  1099  1662
   5  1189  1032   740   740   740   125   295
   6   126   150   527   527   527   863    40
   7   467   460   260   260   606   123    11     8898   8835   9377   9377   9723   9468   9192
   8   489   484   411   411    87     9
   9    34   314   304   304   486   820
  10   140    58    77   397    63   152
  11   431   358   315    62     7     9           9992  10049  10484  10551  10366  10458
  12    95    88    87   176   550
  13    25   234   140    17    74
  14    50    25   104   428    11                10162  10396  10815  11172  11001
  15   123    77    13    60
  16     9   372   368     7                      10294  10845  11196  11239
  17   436    73    72
  18    17    37    75
  19    83    91    15
  20    15    13   167                            10845  11059  11525
  21    10            16
  22     1             5                            10856         11546
  23   181
  24     3
  25    10
  26     8
  27   100
  28     3                                        11161

Maybe law of small numbers, this is the only test.
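For reference, the rounded tweak distance formula from this post, ((min_blocksize_used_in_analysis/(tweak_pass+2))/16)*16, can be sketched in Python as below. The function name, the 0-based pass numbering and the example minimum blocksize of 1152 are assumptions for illustration; this is not the flaccid source.

```python
def tweak_distance(min_blocksize, tweak_pass, multiple=1):
    """Tweak distance for a pass, rounded down to a multiple.

    multiple=1 reproduces the unrestricted formula
    min_blocksize / (tweak_pass + 2) with integer division.
    Assumption: tweak_pass counts from 0.
    """
    raw = min_blocksize // (tweak_pass + 2)
    return (raw // multiple) * multiple


# Example with an assumed smallest analysis blocksize of 1152.
for tweak_pass in range(6):
    print(tweak_pass,
          tweak_distance(1152, tweak_pass),
          tweak_distance(1152, tweak_pass, 16))
```

The unrestricted distances quickly stop being multiples of 16 (230 on the fourth pass in this example), while the restricted ones stay on sizes that can still use higher partition orders.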

Re: Multithreading

Reply #90
That may give quite different treatment when subset is imposed, limiting to partition order 8, i.e. 1/256th of the block. Sure there are exceptional signals where constraining yourself to subset means you should do -b2048 or maybe even lower only to get a smaller partition (in samples) at -r8. And when your project is to tweak block size, it could very well be that those signals are many enough to care about. Assuming subset is a constraint.

(It would likely need to be that in the reference encoder. I have no idea whether there is any subset-compliant decoder - meaning it would handle partition size of 1 sample and block size 64 - that would choke on partition size of 32 samples and block size 4096. But anyway ...)


As for this:
Maybe adjacent frames that have similar models are ripe for merging, mostly this will not be subset.
Yes, that is merging. I was talking splitting - which for all that I know, may not even be on the table in what you are working on.
(Generally I don't even have any idea how much a "good" (like in -8 with a reasonable -q) predictor vector differs from an "optimal" of the same -q. I know I can run some encodes and parse -a output, but ... not happening with my clumsiness at coding.)

Re: Multithreading

Reply #91
That may give quite different treatment when subset is imposed, limiting to partition order 8, i.e. 1/256th of the block. Sure there are exceptional signals where constraining yourself to subset means you should do -b2048 or maybe even lower only to get a smaller partition (in samples) at -r8. And when your project is to tweak block size, it could very well be that those signals are many enough to care about. Assuming subset is a constraint.

(It would likely need to be that in the reference encoder. I have no idea whether there is any subset-compliant decoder - meaning it would handle partition size of 1 sample and block size 64 - that would choke on partition size of 32 samples and block size 4096. But anyway ...)
Not sure I'm following. I think you misunderstand the test, which is fair enough as it was communicated poorly (I haven't talked about increasing the min partition order, and the test is subset). The test merely limits the blocksizes tweak tries to a multiple of 1/2/4/8/16/32/64 (meaning the max partition order is probably at least 0/1/2/3/4/5/6 respectively). Those are the columns; the rows are successive tweak passes. Doing this allows flac to use higher partition orders more often if it wants to. Fewer oddball sizes are tested, which is good in theory because they are at a disadvantage, only having access to lower orders.

As for this:
Maybe adjacent frames that have similar models are ripe for merging, mostly this will not be subset.
Yes, that is merging. I was talking splitting - which for all that I know, may not even be on the table in what you are working on.
(Generally I don't even have any idea how much a "good" (like in -8 with a reasonable -q) predictor vector differs from an "optimal" of the same -q. I know I can run some encodes and parse -a output, but ... not happening with my clumsiness at coding.)
I haven't thought about splitting, as in the context of refining peakset analysis we know that splitting is not beneficial: if it was, peakset would have chosen smaller blocks (the exception being when the smallest blocksize available to peakset is used; that is something to consider, but tweak does cover that). In an ideal world the other modes approach the accuracy of peakset for a fraction of the cost, so splitting shouldn't be beneficial enough to bother with there either.

In the context of using a fixed encode to create a variable encode, I don't know how useful looking at a single block is. It's probably best not to split solo or tweak, but to take adjacent frames either side into account when determining the fate of the frame in between. After all frames aren't in a bubble and the frame boundaries were not chosen smartly.

Not working on anything as such, just keep poking things which is how I noticed the partition restriction in the spec. There's been some pretty big updates to the code like input is now read only when analysis needs it and a lot of sane refactoring, but you're looking at the research for new ideas. There's wav input to get working properly, seektable would be nice, and presets TODO. But today I went off on a tangent and am creating a benchmark to test different ways to encode a residual for fun, nothing that's going to help here.

edit: Split could potentially be its own pass before merge or tweak, that only tries to split frames that use the smallest blocksize available to analysis. It might capture a fair amount of what tweak currently does cheaply.

Re: Multithreading

Reply #92
As of commit 25e49fcd2122ff8001070eb3a3e5b1e9805653d5:

Initial preset implementation is in, options --preset and --preset-apod. Presets cannot be mixed with manual definitions, if you try it should error.
  • Presets 0..8 match ./flac so that people can encode fixed with settings they're familiar with, just multithreaded
  • Preset 9 is currently gasc starting at blocksize 1536 with analysis setting 3m and output setting 8. This is slightly more space-efficient than -8 in testing but it's not guaranteed
  • Preset 10 is currently peakset using 1152,2304,3456,4608 with analysis setting 3m and output setting 8
  • More variable presets will come and the two that exist might change
  • These options from ./flac work with all presets, appended to the preset number: e/l/m/p/q/r
  • -b from ./flac to pick blocksize also works, with the fixed presets 0..8 only
  • --preset-apod overrides the apod settings used by the preset, it doesn't append

Other notable changes:
  • Reworked input handling to now read only as much input as is immediately needed by analysis
  • --input-format forces input to be read as a particular format. Useful for piped input or when extension is different from typical. Valid formats are: flac wav cdda
  • flac input can now be piped by specifying '--in -' in tandem with '--input-format flac'
  • Raw input reworked to only be for CDDA, automatically chosen for files with .bin extension. Pipe works and requires '--input-format cdda'
  • Initial wav support implemented but it's basic. No pipe, 16 bit only, not very well tested
  • Hashing is now done by input, greatly simplifying the rest of the interface
  • Total samples of input no longer used during encode so it passes flac-test-files 45, now all flac-test-files pass
  • If total samples is available from input header (flac/wav), it's checked on completion
  • If MD5 is available from input header (flac) and we're hashing, it's checked on completion

Full wav support is a major missing feature, seektable support is another must-have IMO. Is there anything else major besides those that should be on the TODO?

Re: Multithreading

Reply #93
As of commit 0fd319c11503b4a31b355310ed73ca9a7366e551:
  • Seektable support added with --seektable val. -1 (default) to let the program size the seektable based on input header (flac/wav), with a fallback to 100 seekpoints. 0 disables the seektable, n>0 tells the program to use that many seekpoints. The program reserves the right to override the user's choice if it detects that it's massive overkill
  • Optional metadata preservation from flac input files with --preserve-flac-metadata. Preserves the following metadata blocktypes: Application, Vorbis, Cuesheet, Picture, Undefined, aka everything that can be preserved
  • Added a --metacomp option to flanal allowing it to check if --preserve-flac-metadata losslessly preserved the metadata
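The --seektable value handling described in the first bullet above could look roughly like the Python sketch below. Only the -1/0/n semantics and the fallback of 100 come from the post. How flaccid actually sizes the table from the header isn't stated, so the one-point-per-10-seconds rule here is an assumption (borrowed from the reference encoder's default of -S 10s), the override for "massive overkill" is not modelled, and all names are hypothetical.

```python
FALLBACK_SEEKPOINTS = 100  # fallback count mentioned in the post


def resolve_seekpoint_count(val, total_samples=None, sample_rate=None,
                            seconds_per_point=10):
    """Map a --seektable value to a seekpoint count.

    val == 0 disables the table, val > 0 is an explicit count,
    val == -1 sizes it from the input header when that is available.
    The 10 second spacing is an assumption, not flaccid's rule.
    """
    if val == 0:
        return 0
    if val > 0:
        return val
    if total_samples and sample_rate:
        duration = total_samples / sample_rate
        return max(1, int(duration // seconds_per_point))
    return FALLBACK_SEEKPOINTS


def seekpoint_targets(total_samples, count):
    """Evenly spaced target sample numbers for `count` seekpoints."""
    return [total_samples * i // count for i in range(count)]


print(resolve_seekpoint_count(-1))                    # 100
print(resolve_seekpoint_count(-1, 265981800, 44100))  # 603
print(seekpoint_targets(100, 4))                      # [0, 25, 50, 75]
```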

Both seektable and metadata preservation have passed all tests, including preserving the oddball metadata in flac-test-files, but that doesn't mean they're free from bugs. If anyone has weird input they'd like to throw at it please do. I've tried to account for edge cases but may have missed something.

There's also a separate branch in the repo that doesn't have these new features but does have a split pass as discussed a few posts above, enabled with --split. It doesn't seem very effective so unless it can be reworked to be useful it's not going to be mainlined, but maybe I'm not testing with suitable input.

Now that seektable and metadata preservation are in, some brave soul can convert their entire collection if they wish. I recommend checking the result with flac -t before you delete the originals, this is still alpha you know ;)
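If you're converting a whole collection, the flac -t check is easy to script. A minimal Python sketch, assuming the reference flac binary is on the PATH; the function names are made up, and -t (test) and --silent are the only flac options used:

```python
import subprocess
import sys


def flac_test_command(path, flac_exe="flac"):
    """Build the command line for a decode test of one file."""
    return [flac_exe, "-t", "--silent", path]


def passes_flac_test(path, flac_exe="flac"):
    """Return True if `flac -t` exits with status 0 for this file."""
    result = subprocess.run(flac_test_command(path, flac_exe),
                            capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Usage: python check_flac.py file1.flac file2.flac ...
    for name in sys.argv[1:]:
        print("ok  " if passes_flac_test(name) else "FAIL", name)
```

Only delete an original once its re-encoded file reports ok.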

edit: By preserving metadata I mean anything in the flac spec, not very familiar with metadata but there can also be some stuff after the bitstream? That's not currently kept.