Topic: Multithreading

Re: Multithreading

Reply #25
Very impressive. However, I've found out in the past that focussing optimization on one particular sample can give skewed results.

Could you perhaps compare flac -8p and one of the settings you find promising on some more material? I'm curious whether the gain we see here (~0.4% - ~0.5%) holds on other kinds of music.
Music: sounds arranged such that they construct feelings.

Re: Multithreading

Reply #26
For reference, how much time do -8p and -8pb4608 take?
But anyway, the size saving from the 1.7-second run to the tenfold time is around 0.225 percent. Which of course isn't miraculous, but quite impressive compared to the 0.09 percent space saved by going -8 to -8p, on this particular file set.

(For the 96/24 version of the same album, -8 to -8p makes for only 0.066 percent. Still more than a "-e" saves.)

Re: Multithreading

Reply #27
Implemented tweak passes for chunk mode. Here are some results from different albums, all subset; there are some more in the pipe. Not as good a spread as perhaps there should be, but my internet is slow so it's just what's available.
  • Chunk with a single blocksize does a fixed blocksize, so chunk:8p:8p:0:4096 and chunk:8p:8p:0:4608 are equivalent to ./flac's -8p and -8pb4608
  • peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608 would be slower than realtime on a single core. It puts the 4 sliders tested (blocksize choice, analysis settings, mode choice, tweak passes) in their upper ranges; not fully off the deep end, but a lot of effort (~90, varying give or take 10 with tweak effort)
  • There's no change in placement except for 4096 vs 4608; it's not a massive corpus, but the results seem relatively stable so far
  • Input is fully read into the process's RAM before timing to minimise the SSD as a factor; output is still written so ./flac -t can validate

Code: [Select]
Size      CPUTimeEst Album                            Settings(mode:analysis_comp:output_comp:tweak_passes:blocksizes)
317603594 5158.64833 ChemicalBrothers-NoGeography     peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
317685039 3434.47596 ChemicalBrothers-NoGeography     peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
317705060 1412.46291 ChemicalBrothers-NoGeography     peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
317739305 1138.52770 ChemicalBrothers-NoGeography     peakset:8:8p:4:768,1536,2304,3072,3840,4608
317980994  390.43533 ChemicalBrothers-NoGeography     peakset:8:8p:1:1152,2304,3456,4608
318038269  271.71686 ChemicalBrothers-NoGeography     peakset:5:8p:1:1152,2304,3456,4608
318256373  239.11328 ChemicalBrothers-NoGeography     chunk:8:8p:1:576,1152,2304,4608
319200096   69.77557 ChemicalBrothers-NoGeography     chunk:8p:8p:0:4096
319300930   68.78749 ChemicalBrothers-NoGeography     chunk:8p:8p:0:4608
523589456 8715.10619 DaftPunk-MusiqueVol1             peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
523835812 5692.51132 DaftPunk-MusiqueVol1             peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
523870247 2466.42877 DaftPunk-MusiqueVol1             peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
523993599 2058.40626 DaftPunk-MusiqueVol1             peakset:8:8p:4:768,1536,2304,3072,3840,4608
524629652  685.96927 DaftPunk-MusiqueVol1             peakset:8:8p:1:1152,2304,3456,4608
524731195  495.35884 DaftPunk-MusiqueVol1             peakset:5:8p:1:1152,2304,3456,4608
525063374  449.26300 DaftPunk-MusiqueVol1             chunk:8:8p:1:576,1152,2304,4608
528212165  112.73555 DaftPunk-MusiqueVol1             chunk:8p:8p:0:4096
528655833  111.28555 DaftPunk-MusiqueVol1             chunk:8p:8p:0:4608
257706353 4486.87945 Eels-DaisiesOfTheGalaxy          peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
257762825 3000.10531 Eels-DaisiesOfTheGalaxy          peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
257778499 1211.72235 Eels-DaisiesOfTheGalaxy          peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
257818027  964.54024 Eels-DaisiesOfTheGalaxy          peakset:8:8p:4:768,1536,2304,3072,3840,4608
257990037  339.80432 Eels-DaisiesOfTheGalaxy          peakset:8:8p:1:1152,2304,3456,4608
258020585  233.79144 Eels-DaisiesOfTheGalaxy          peakset:5:8p:1:1152,2304,3456,4608
258224956  208.56897 Eels-DaisiesOfTheGalaxy          chunk:8:8p:1:576,1152,2304,4608
259022020   63.11173 Eels-DaisiesOfTheGalaxy          chunk:8p:8p:0:4096
259110152   62.28333 Eels-DaisiesOfTheGalaxy          chunk:8p:8p:0:4608
477271623 7616.88662 GunsNRoses-ChineseDemocracy      peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
477348516 5138.17785 GunsNRoses-ChineseDemocracy      peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
477369279 1957.24402 GunsNRoses-ChineseDemocracy      peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
477424204 1502.32550 GunsNRoses-ChineseDemocracy      peakset:8:8p:4:768,1536,2304,3072,3840,4608
477658393  533.66029 GunsNRoses-ChineseDemocracy      peakset:8:8p:1:1152,2304,3456,4608
477703519  356.51980 GunsNRoses-ChineseDemocracy      peakset:5:8p:1:1152,2304,3456,4608
478051214  330.46694 GunsNRoses-ChineseDemocracy      chunk:8:8p:1:576,1152,2304,4608
479170287  110.70671 GunsNRoses-ChineseDemocracy      chunk:8p:8p:0:4096
479278372  108.72984 GunsNRoses-ChineseDemocracy      chunk:8p:8p:0:4608
122521118 2479.95507 MikeOldfield-TubularBellsPartOne peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
122548191 1683.96844 MikeOldfield-TubularBellsPartOne peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
122560169  643.47294 MikeOldfield-TubularBellsPartOne peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
122578013  484.59866 MikeOldfield-TubularBellsPartOne peakset:8:8p:4:768,1536,2304,3072,3840,4608
122648202  173.11262 MikeOldfield-TubularBellsPartOne peakset:8:8p:1:1152,2304,3456,4608
122670414  112.27985 MikeOldfield-TubularBellsPartOne peakset:5:8p:1:1152,2304,3456,4608
122774533  107.24483 MikeOldfield-TubularBellsPartOne chunk:8:8p:1:576,1152,2304,4608
123012084   36.47129 MikeOldfield-TubularBellsPartOne chunk:8p:8p:0:4608
123016735   37.33374 MikeOldfield-TubularBellsPartOne chunk:8p:8p:0:4096
250230777 4426.60597 NineInchNails-TheSlip            peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
250300683 3001.93029 NineInchNails-TheSlip            peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
250327830 1176.29593 NineInchNails-TheSlip            peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
250363924  925.24971 NineInchNails-TheSlip            peakset:8:8p:4:768,1536,2304,3072,3840,4608
250553480  327.97218 NineInchNails-TheSlip            peakset:8:8p:1:1152,2304,3456,4608
250607320  215.46922 NineInchNails-TheSlip            peakset:5:8p:1:1152,2304,3456,4608
250756438  206.58991 NineInchNails-TheSlip            chunk:8:8p:1:576,1152,2304,4608
251739815   63.01859 NineInchNails-TheSlip            chunk:8p:8p:0:4096
251832015   62.26452 NineInchNails-TheSlip            chunk:8p:8p:0:4608
423291336 6309.92615 prodigy-FatOfTheLand             peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
423387560 4166.49125 prodigy-FatOfTheLand             peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
423407230 1727.42195 prodigy-FatOfTheLand             peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
423470736 1394.59504 prodigy-FatOfTheLand             peakset:8:8p:4:768,1536,2304,3072,3840,4608
423746793  475.69761 prodigy-FatOfTheLand             peakset:8:8p:1:1152,2304,3456,4608
423804552  329.79729 prodigy-FatOfTheLand             peakset:5:8p:1:1152,2304,3456,4608
424132847  287.47611 prodigy-FatOfTheLand             chunk:8:8p:1:576,1152,2304,4608
425201004   84.63871 prodigy-FatOfTheLand             chunk:8p:8p:0:4096
425296934   83.59144 prodigy-FatOfTheLand             chunk:8p:8p:0:4608

There's probably a lot of efficiency left on the table by sticking to subset. The histogram of best blocksizes is likely a bell curve with 4096/4608 near the peak, meaning subset likely cuts off too soon to have good coverage of the curve. Finding ways to extend into --lax without blowing out peakset's encode time is a TODO.

edit: Tests done on a different PC with 8 cores, so timings are not comparable with previous posts.

Re: Multithreading

Reply #28
Second part of album stats:
Code: [Select]
Size      CPUTimeEst Album                                Settings(mode:analysis_comp:output_comp:tweak_passes:blocksizes)
486754940 7401.92191 PunkAndNewWaveUltimateCollection-CD1 peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
486854649 4919.80083 PunkAndNewWaveUltimateCollection-CD1 peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
486876075 1984.64279 PunkAndNewWaveUltimateCollection-CD1 peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
486950076 1567.01817 PunkAndNewWaveUltimateCollection-CD1 peakset:8:8p:4:768,1536,2304,3072,3840,4608
487249654  543.47563 PunkAndNewWaveUltimateCollection-CD1 peakset:8:8p:1:1152,2304,3456,4608
487297154  373.30103 PunkAndNewWaveUltimateCollection-CD1 peakset:5:8p:1:1152,2304,3456,4608
487662645  334.13117 PunkAndNewWaveUltimateCollection-CD1 chunk:8:8p:1:576,1152,2304,4608
489080538  101.41232 PunkAndNewWaveUltimateCollection-CD1 chunk:8p:8p:0:4096
489290781  100.16961 PunkAndNewWaveUltimateCollection-CD1 chunk:8p:8p:0:4608
371061128 5677.33136 Radiohead-OKComputer                 peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
371129145 3793.16594 Radiohead-OKComputer                 peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
371146775 1487.96806 Radiohead-OKComputer                 peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
371195155 1163.05440 Radiohead-OKComputer                 peakset:8:8p:4:768,1536,2304,3072,3840,4608
371382589  409.29770 Radiohead-OKComputer                 peakset:8:8p:1:1152,2304,3456,4608
371416496  280.44687 Radiohead-OKComputer                 peakset:5:8p:1:1152,2304,3456,4608
371661749  254.26818 Radiohead-OKComputer                 chunk:8:8p:1:576,1152,2304,4608
372533424   80.60868 Radiohead-OKComputer                 chunk:8p:8p:0:4096
372625690   79.15295 Radiohead-OKComputer                 chunk:8p:8p:0:4608
329603351 4875.99008 RATM-EvilEmpire                      peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
329653372 3276.83557 RATM-EvilEmpire                      peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
329667407 1275.60028 RATM-EvilEmpire                      peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
329709473  989.10326 RATM-EvilEmpire                      peakset:8:8p:4:768,1536,2304,3072,3840,4608
329861813  348.56314 RATM-EvilEmpire                      peakset:8:8p:1:1152,2304,3456,4608
329879280  230.40419 RATM-EvilEmpire                      peakset:5:8p:1:1152,2304,3456,4608
330093058  216.85232 RATM-EvilEmpire                      chunk:8:8p:1:576,1152,2304,4608
330830099   68.85256 RATM-EvilEmpire                      chunk:8p:8p:0:4096
330915000   68.60566 RATM-EvilEmpire                      chunk:8p:8p:0:4608
332628659 4991.72321 SOAD-Toxicity                        peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
332701247 3291.49541 SOAD-Toxicity                        peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
332720306 1352.51852 SOAD-Toxicity                        peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
332779597 1084.18631 SOAD-Toxicity                        peakset:8:8p:4:768,1536,2304,3072,3840,4608
332989575  371.13667 SOAD-Toxicity                        peakset:8:8p:1:1152,2304,3456,4608
333014456  261.12648 SOAD-Toxicity                        peakset:5:8p:1:1152,2304,3456,4608
333259634  225.99539 SOAD-Toxicity                        chunk:8:8p:1:576,1152,2304,4608
334130781   66.84236 SOAD-Toxicity                        chunk:8p:8p:0:4096
334223800   66.30961 SOAD-Toxicity                        chunk:8p:8p:0:4608
412422849 7760.62776 ThinLizzy-TheCollection              peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
412505798 5182.30982 ThinLizzy-TheCollection              peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
412527983 2049.92532 ThinLizzy-TheCollection              peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
412595671 1600.67376 ThinLizzy-TheCollection              peakset:8:8p:4:768,1536,2304,3072,3840,4608
412872643  552.71229 ThinLizzy-TheCollection              peakset:8:8p:1:1152,2304,3456,4608
412912756  386.65311 ThinLizzy-TheCollection              peakset:5:8p:1:1152,2304,3456,4608
413370741  330.46464 ThinLizzy-TheCollection              chunk:8:8p:1:576,1152,2304,4608
414442451  109.35235 ThinLizzy-TheCollection              chunk:8p:8p:0:4096
414571508  108.20086 ThinLizzy-TheCollection              chunk:8p:8p:0:4608

Totals of all 12 albums:
Code: [Select]
Size       CPUTimeEst  Settings(mode:analysis_comp:output_comp:tweak_passes:blocksizes)
4304685184 69901.6021  peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608
4305712837 46581.26799 peakset:8p:8p:4:576,1152,1728,2304,2880,3456,4032,4608
4305956860 18745.70384 peakset:8:8p:4:576,1152,1728,2304,2880,3456,4032,4608
4306617780 14872.27901 peakset:8:8p:4:768,1536,2304,3072,3840,4608
4309563825  5151.83705 peakset:8:8p:1:1152,2304,3456,4608
4310095996  3546.86498 peakset:5:8p:1:1152,2304,3456,4608
4313307562  3190.43474 chunk:8:8p:1:576,1152,2304,4608
4326579415  968.38887  chunk:8p:8p:0:4096
4328113099  955.85215  chunk:8p:8p:0:4608

Re: Multithreading

Reply #29
So, the CPU time is 'single-core time' I guess? So when using 4 CPU cores wall time would be a quarter of that?

I think there are plenty of people on this forum who would be willing to use this, creating a subset, fully spec-compliant FLAC file that might play in their hardware system. Half a percent is quite an improvement, even at the cost of having the encoder run near real-time on a single core. So once again: very impressive.

Sadly, non-standard blocksizes and variable-blocksize streams are not well-supported features in many hardware decoders (see here), but I think many people here won't mind.
Music: sounds arranged such that they construct feelings.

Re: Multithreading

Reply #30
The "peakset:5:8p:1:1152,2304,3456,4608" looks particularly interesting. But 4096 seems so good for CDDA that I might wonder what sizes you would obtain by just encoding higher multiples of 512.

To get an idea of benefit/cost (in that order), I took the totals-of-all-12-albums table and computed the file size difference per extra second taken, each relative to the next row. To pick the lower-hanging fruit first, you would want the following sequence to be increasing - it isn't far off:
44 (that means, the top "peakset:8p:8p:10:" row saves 44 bytes per extra second taken over the second row)
9 (less than the 44, so if you are willing to wait for this one, you should have gone to the top row)
171
303
332
9010 (this is how much the "peakset:5:8p:1:" improves over the next, which has to do with the next one being not-as-good)
5973
122335 (note the number of digits! The big difference is chunk:8p:8p:0:4096 relative to chunk:8p:8p:0:4608)
- (nothing for the last row, it has no next)

Bypassing the not-particularly-good 5973, "peakset:5:8p:1:" saves 6393 bytes per extra second (edit: over "chunk:8p:8p:0:4096"). The benefit of going 4608 -> 4096 is 19x as big.
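
If anyone wants to check the arithmetic, here's a small C sketch with the sizes and CPU times hardcoded from the totals table above; each row's saving is divided by the extra CPU time it costs over the next (faster) row:
Code: [Select]
#include <stdio.h>

int main(void) {
    /* rows of the 12-album totals table, slowest/smallest first */
    long long size[9] = {
        4304685184LL, 4305712837LL, 4305956860LL, 4306617780LL, 4309563825LL,
        4310095996LL, 4313307562LL, 4326579415LL, 4328113099LL};
    double cpu[9] = {
        69901.6021, 46581.26799, 18745.70384, 14872.27901, 5151.83705,
        3546.86498, 3190.43474, 968.38887, 955.85215};
    for (int i = 0; i + 1 < 9; i++) {
        long long saved = size[i + 1] - size[i]; /* bytes saved vs the next row */
        double extra = cpu[i] - cpu[i + 1];      /* extra CPU seconds it costs */
        printf("row %d vs %d: %8lld bytes / %9.1f s = %7.0f bytes per second\n",
               i + 1, i + 2, saved, extra, (double)saved / extra);
    }
    return 0;
}
Running it reproduces the 44, 9, 171, 303, 332, 9010, 5973 and 122335 figures above.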

Re: Multithreading

Reply #31
So, the CPU time is 'single-core time' I guess? So when using 4 CPU cores wall time would be a quarter of that?

I think there are plenty of people on this forum who would be willing to use this, creating a subset, fully spec-compliant FLAC file that might play in their hardware system. Half a percent is quite an improvement, even at the cost of having the encoder run near real-time on a single core. So once again: very impressive.

Sadly, non-standard blocksizes and variable-blocksize streams are not well-supported features in many hardware decoders (see here), but I think many people here won't mind.
It should roughly translate to that, yes. Greed mode's timing seems inflated, but I think that's because greed has lower occupancy, which clock_t ends up measuring in a roundabout way; not necessarily a bad thing to measure, IMO. Chunk/peakset have good occupancy, and their numbers seem to make sense, give or take. All-core workloads run at lower clocks on a lot of hardware, especially low-power parts like the laptop part this was tested with, which might skew results a bit; also, as you go for quicker encode settings, IO and hashing take proportionally more time. Short of doing something like counting cycles by tapping into CPU performance registers, I don't think there's an accurate way for a single benchmark to truly scale to a different number of cores.

I'm not against one day whipping this code into a usable frontend; right now it's just a testbed to scratch a benchmark itch. Is there a known example where variable blocksizes are supported but only within a narrower range than 16-65535? Currently tweak is somewhat limited, merge unlimited. Range limiters will be added soon anyway, as extra levers to limit the excessive less-useful computation in tweak/merge; I'm curious if there's some informal line in the sand between subset and "go nuts".

A preliminary test of merge passes indicates merging should happen before tweaking, and that merging alone is probably not all that should be done to expand into large blocksizes. Instead of doing 512:4608 or 576:4608 at the top end, it's likely 768:6144 is the way to go, maybe 1024:8192 or 1152:9216 or more on some input. It might be fine to rely on tweak passes to shape the frames into lower sizes where beneficial, if it means that the O(block_count^2) effort scaling of the peakset analysis phase doesn't get too out of control.

The "peakset:5:8p:1:1152,2304,3456,4608" looks particularly interesting. But 4096 seems so good for CDDA that I might wonder what sizes you would obtain by just encoding higher multiples of 512.

To get an idea of benefit/cost (in that order), I took the totals-of-all-12-albums table and computed the file size difference per extra second taken, each relative to the next row. To pick the lower-hanging fruit first, you would want the following sequence to be increasing - it isn't far off:
44 (that means, the top "peakset:8p:8p:10:" row saves 44 bytes per extra second taken over the second row)
9 (less than the 44, so if you are willing to wait for this one, you should have gone to the top row)
171
303
332
9010 (this is how much the "peakset:5:8p:1:" improves over the next, which has to do with the next one being not-as-good)
5973
122335 (note the number of digits! The big difference is chunk:8p:8p:0:4096 relative to chunk:8p:8p:0:4608)
- (nothing for the last row, it has no next)

Bypassing the not-particularly-good 5973, "peakset:5:8p:1:" saves 6393 bytes per extra second (edit: over "chunk:8p:8p:0:4096"). The benefit of going 4608 -> 4096 is 19x as big.

My analysis is far from exhaustive; the rule of thumb seems to be that around 4k is very important, as you'd expect. At a guess, the frequency chart of best blocksizes for most input can be approximated with a bell curve centred at 3.5-4.5k (the actual best chart is probably skewed more towards smaller blocksizes, with a longer tail to the right; 2-3k seems next-most important). You're suggesting trying things like 4096,4608 and 3584,4096,4608? If a good algorithm that could efficiently use just these sizes were made then perhaps they would perform well; however, chunk and peakset cannot do it. Peakset could do 512,3584,4096,4608 for an effort of 25, but that's unlikely to perform better than 768,1536,2304,3072,3840,4608 with an effort of 21, IMO.
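
As an aside, here's how those effort figures seem to fall out (my reading of the numbers in this thread, not flaccid's actual accounting): peakset encodes a candidate frame of every listed blocksize at every step-aligned position, so a blocksize of k*step costs roughly k full passes over the input, and the total effort is the sum of blocksize/step:
Code: [Select]
#include <stdio.h>

/* assumed effort model: sum of blocksize/step over the analysis set */
static int effort(const int *sizes, int n, int step) {
    int e = 0;
    for (int i = 0; i < n; i++)
        e += sizes[i] / step;
    return e;
}

int main(void) {
    int sparse[] = {512, 3584, 4096, 4608};             /* 1+7+8+9     = 25 */
    int dense[]  = {768, 1536, 2304, 3072, 3840, 4608}; /* 1+2+3+4+5+6 = 21 */
    printf("sparse: %d, dense: %d\n", effort(sparse, 4, 512), effort(dense, 6, 768));
    return 0;
}
The same sum reproduces the figures quoted later in the thread for the thesis's exhaustive searches (512..65536 contiguous gives 1+2+...+128 = 8256; capping at 16384 gives 528).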

Re: Multithreading

Reply #32
Is there a known example where variable blocksizes are supported but only within a narrower range than 16-65535?
You mean, an implementation that encodes variable block size? CUETools Flake does - but it is buggy: https://hydrogenaud.io/index.php/topic,123016
... and therefore about to be removed. Source available though.


You're suggesting trying things like 4096,4608 and 3584,4096,4608? If a good algorithm that could efficiently use just these sizes were made then perhaps they would perform well; however, chunk and peakset cannot do it.
I was rather thinking: if you are looking to find those block sizes that will indeed often be "optimal" for size, then well, efficiency can come later. Say, your nineteen-hour (it is seconds, right?) peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608: each block is a multiple (1 to 9) of 512. How does the histogram look?
(And they aren't independent over time, but I have no idea whether a time series plot - with ultra-fine resolution - would be a "moving cloud" or just scattered.)


As for efficiency:
If I understood the end of Reply 23 right, you use "-5" runs to determine block sizes and then -8p to encode? Reference FLAC does a bit of guesstimate-first-and-encode-the-winner, so that approach is quite in the same spirit. Half the speed and a smaller file isn't bad either.
(-7 is closer to -8p, if this is to be explored further.)

Re: Multithreading

Reply #33
Is there a known example where variable blocksizes are supported but only within a narrower range than 16-65535?
You mean, an implementation that encodes variable block size? CUETools Flake does - but it is buggy: https://hydrogenaud.io/index.php/topic,123016
... and therefore about to be removed. Source available though.
I meant: is there an example of a decoder that supports variable blocksizes but only works with some range that isn't 16:4608 or 16:65535? I guess we are talking hardware decoders, as for a software decoder it would be a fixable bug.

As for efficiency:
If I understood the end of Reply 23 right, you use "-5" runs to determine block sizes and then -8p to encode? Reference FLAC does a bit of guesstimate-first-and-encode-the-winner, so that approach is quite in the same spirit. Half the speed and a smaller file isn't bad either.
(-7 is closer to -8p, if this is to be explored further.)
There's an analysis phase which can use different compression settings to the final encode; then an optional merge phase (WIP) which merges adjacent frames if the result is smaller; then an optional tweak phase which tweaks the boundary between adjacent frames (sketched below; merge and tweak use the final encode settings). The analysis phase can only choose from the list of blocksizes given to it; merge and tweak are an imperfect way to grab some low-hanging fruit by trying blocksizes not on the list. It can be thought of as the analysis phase roughly approximating the optimal permutation, with merge/tweak refining the approximation a little more. Merge and tweak are imperfect by nature, and their performance is heavily influenced by implementation, which is a WIP.
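
To make the tweak idea concrete, here is a minimal sketch of one boundary-nudging pass; it's my own illustration rather than flaccid's code, and the hypothetical encoded_size() stands in for running the encoder over a sample range:
Code: [Select]
#include <stddef.h>

/* hypothetical: encoded bytes for samples [start, start+len) */
extern size_t encoded_size(size_t start, size_t len);

/* frames[i] = length of frame i; nudge each shared boundary by +/-delta
   and keep the cheapest split; returns total bytes saved this pass */
size_t tweak_pass(size_t *frames, size_t nframes, size_t delta) {
    size_t saved = 0, start = 0;
    for (size_t i = 0; i + 1 < nframes; i++) {
        size_t a = frames[i], total = a + frames[i + 1];
        size_t old_cost = encoded_size(start, a) + encoded_size(start + a, total - a);
        size_t best_cost = old_cost, best_a = a;
        for (int dir = 0; dir < 2; dir++) {
            if (dir == 0 && a <= delta) continue;  /* avoid underflow */
            size_t na = (dir == 0) ? a - delta : a + delta;
            if (na >= total) continue;             /* both frames must stay non-empty */
            /* a real implementation would also respect blocksize limits here */
            size_t cost = encoded_size(start, na) + encoded_size(start + na, total - na);
            if (cost < best_cost) { best_cost = cost; best_a = na; }
        }
        saved += old_cost - best_cost;
        frames[i] = best_a;
        frames[i + 1] = total - best_a;
        start += frames[i];
    }
    return saved;
}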

I've tried analysis settings of -0, -5, -8 and -8p with output settings fixed at -8p for comparison; not exhaustive and not smart, just a spread to see if it made a difference. -5, -8 and -8p had positive results at quick-to-slow settings; -0 was not good at finding permutations for -8p to encode (not surprising: -0 only uses fixed predictors, so it has very little in common with -8p).

You're suggesting trying things like 4096,4608 and 3584,4096,4608? If a good algorithm that could efficiently use just these sizes were made then perhaps they would perform well, however chunk and peakset cannot do it.
I was rather thinking: if you are looking to find those block sizes that will indeed often be "optimal" for size, then well, efficiency can come later. Say, your nineteen-hour (it is seconds, right?) peakset:8p:8p:10:512,1024,1536,2048,2560,3072,3584,4096,4608: Each block is a multiple 1 to 9 of 512. How does the histogram look?
(And they aren't independent over time, but I have no idea whether a time series plot - with ultra-fine resolution - would be a "moving cloud" or just scattered.)
It is seconds: an estimate using clock_t to measure CPU time (if you're familiar with /bin/time, it's similar to the user time it gives). The testing is done on an 8-core Zen 2 4700U in an Asus PN50, which is a mini PC using laptop components. All-core, the clocks average around 2.4 GHz as it throttles hard from thermals doing an 8-core AVX2 workload; when thermal or power limits are reached (especially in laptop parts), clockspeeds dip. I believe the 19-hour result means it would take 19 hours at 2.4 GHz; running single-core it's more likely to run at, say, 3.5 GHz, which should reduce the wall time to 13 hours or so. Desktop components tend not to have such a dramatic difference in clockspeeds: higher power limits, better coolers, etc.
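
For reference, that style of measurement is roughly the following; on POSIX systems clock() accumulates CPU time across all of the process's threads, which is why it resembles the user figure from /bin/time rather than wall time:
Code: [Select]
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t begin = clock();
    /* ... multithreaded encode would run here ... */
    clock_t end = clock();
    /* CPU time summed over all threads, not elapsed wall time */
    printf("CPUTimeEst: %.5f s\n", (double)(end - begin) / CLOCKS_PER_SEC);
    return 0;
}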

If you did a peakset encode with merge and tweak disabled (and analysis settings the same as output settings), the result is optimal for the list of blocksizes and compression setting used. Using ./flac -a you can build the histogram; attached are some examples (track 1 of NIN's The Slip, which is an ambient album introduction; SOAD's Toxicity, which is noisy with some quiet sections; and Sultans of Swing, which is more even rock).

Re: Multithreading

Reply #34
Surprising, those histograms - at least if you picked them at random, rather than to "see how much they vary"?
Although they count blocks, so for the rockers, there will be about the same number of samples in 4096 blocks as in 2048 blocks.

@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it:
* A "reasonably sane" suggestion for something you could either label "2" or "3", would be to compute "three Hann bumps": Over [0 ... 0.5], [0.25 ... 0.75] and [.5 ... 1]. Sum of them would be tukey(.5), and if that is what the encoder wants to select ... jolly good.
* Should the encoder want to select only #1 or only #3, that is likely a hint that the window should be split? Then the entire thing would be better off by "the troublesome half inheriting the other half's predictor", and although it could be that all predictors are bad for that half, well, it is likely not worse to give it its own.
* Should the encoder want to use the sum of #1 and #3 ("Hann-punching" out the middle), then who knows whether there is enough to save by splitting the frame. Worth testing.
* Should the encoder want to use #1 and #2, you have a tukey(two thirds) on [0 ... 0.75] and can likely split too.
* Should the encoder want to use #2 and #3 ... now it is worth asking, should the first quarter be encoded separately or should it have been merged with the previous block?
Disregarding the subset consideration about merging in "a quarter" to an already 4096 block.
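
For what it's worth, the three-bump identity does check out in the continuous idealization; a quick numerical sketch (my own, sampling the continuous window formulas and ignoring discrete endpoint conventions):
Code: [Select]
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Hann bump supported on [a,b], zero elsewhere */
static double hann_bump(double x, double a, double b) {
    if (x < a || x > b) return 0.0;
    double s = sin(M_PI * (x - a) / (b - a));
    return s * s;
}

/* tukey(0.5) on [0,1]: quarter-length tapers at each end */
static double tukey_half(double x) {
    if (x < 0.25) { double s = sin(2.0 * M_PI * x); return s * s; }
    if (x > 0.75) { double s = sin(2.0 * M_PI * (1.0 - x)); return s * s; }
    return 1.0;
}

int main(void) {
    double maxdiff = 0.0;
    for (int i = 0; i <= 4096; i++) {
        double x = i / 4096.0;
        double sum = hann_bump(x, 0.0, 0.5) + hann_bump(x, 0.25, 0.75)
                   + hann_bump(x, 0.5, 1.0);
        double d = fabs(sum - tukey_half(x));
        if (d > maxdiff) maxdiff = d;
    }
    printf("max |three bumps - tukey(0.5)| = %g\n", maxdiff); /* ~1e-16 */
    return 0;
}
The middle flattens because Hann windows at 50% overlap sum to exactly 1, leaving only the quarter-length tapers at the ends.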

Re: Multithreading

Reply #35
@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it
I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.
Music: sounds arranged such that they construct feelings.

Re: Multithreading

Reply #36
Picked The Slip track 1 out of habit, then realised it was a terrible example of the general case, being mostly ambient, so picked Toxicity; realised the quiet portions were causing a second peak in the histogram (it's basically two signals interleaved), so picked Sultans of Swing, which has an even signal throughout. The Sultans result is what I'd expect in general: that rough shape (possibly scaled in x and/or y) should apply to most "normal" input.

A cheap way to be able to choose the scaling factor would be a good way to adapt to input instead of the user providing a fixed blocksize list.

It also shows that some input may benefit from holes in the blocksizes chosen, particularly in the upper range; i.e. 1,2,3,4,8 may perform only slightly worse than 1,2,3,4,5,6,7,8 for half the effort, but notably more efficiently than 1,2,3,4[,5,6] on some input. The holes would account for a signal not being even throughout: the majority of the track would be covered by 1,2,3,4, with the quieter/ambient parts covered by 8.

You could get really fancy and detect when a signal changes and adapt the set of blocksizes on the fly, but that is approaching proper signal analysis and it's likely that using the signal directly rather than ham-fisted frame-size-brute-force is the way to go when you get that deep.

Re: Multithreading

Reply #37
@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it
I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.

Maybe not. It is a long shot.
The idea came up over in this compression improvement thread, replies 41 to 44 - especially your "a certain set of partial windows @ 44.1kHz with blocksize 4096 [...]" and TBeck's musings over windowing at https://hydrogenaud.io/index.php/topic,44229.msg402752.html#msg402752 . He looks at changing signal characteristics, yes - and would, at least in that early version, do simple bisection if needed.

I was imagining that this would not only be a "cheap" way to guess it - but also easy to make sense out of test results.
If a "variable" build would for two consecutive 2048 frames let one of them use the same predictor as the current "constant blocksize" build would use for the entire 4096, you would get a picture of whether those signals are those that call for this treatment.

(All these considerations are before we start asking whether other block sizes should have other apodization functions. In the same 2006 thread linked to above, that was a big collaborative effort that ended up with Josh concluding on tukey(5e-1) as default - but I don't see block size being part of the tests.)

Re: Multithreading

Reply #38
@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it
I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.
Analysing brute-forced encodes could be a way to reverse engineer a heuristic algorithm for variable blocksizes based on input. Beyond pathological examples it's probably hard to make a good algorithm that's widely applicable, but it may be an easier approach than (or augment) proper signal analysis.

(All these considerations are before we start asking whether other block sizes should have other apodization functions. In the same 2006 thread linked to above, that was a big collaborative effort that ended up with Josh concluding on tukey(5e-1) as default - but I don't see block size being part of the tests.)
There's always another dimension to add, isn't there :P

Made some improvements to merge/tweak mode and did another round of testing. A few changes that need explaining:
  • Merge passes added with --merge threshold, where the user passes a byte threshold. Multiple passes over the input are done until a pass fails to save at least that many bytes (see the sketch after this list)
  • Tweak phase modified to act like merge does: the user passes a threshold instead of specifying the number of passes
  • Merge and tweak can be done using analysis settings or output settings, with --merge-when 0/1 --tweak-when 0/1 to specify before/after
  • A minimum and maximum blocksize can be set with --blocksize-limit-lower limit --blocksize-limit-upper limit; this is to fine-tune how tweak/merge passes act and usually doesn't do much. Some input doesn't require very large or very small blocksizes, and this would be a way to avoid testing them at all (instead of relying on tweak/merge thresholds failing, it's basically a way to make the last passes, particularly of merge, more efficient, maybe)
  • Starred results are subset, everything else --lax
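
In other words, the threshold semantics boil down to this loop (a minimal sketch under my reading of the description; merge_pass() is a hypothetical stand-in for one pass that joins adjacent frames whenever the joined encode is smaller, returning the bytes it saved; tweak works the same way):
Code: [Select]
#include <stddef.h>

/* hypothetical: one merge pass over the frame list, returns bytes saved */
extern size_t merge_pass(size_t *frames, size_t *nframes);

void run_merge_passes(size_t *frames, size_t *nframes, size_t threshold) {
    /* repeat until a whole pass fails to save at least `threshold` bytes */
    while (merge_pass(frames, nframes) >= threshold)
        ;
}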

Code: [Select]
  Size          CPUTimeEst Settings
* 4326579415    887.49764  Fixed blocksize 4096
* 4312846890    1172.31728 mode(peakset);analysis_comp(5r3);analysis_apod((null));tweak(99999);merge(99999);blocksize_limit_lower(1152);blocksize_limit_upper(4608);analysis_blocksizes(1152,2304,4608)     
  4311188781    1384.49821 mode(peakset);analysis_comp(5r3);analysis_apod((null));tweak(99999);merge(99999);blocksize_limit_lower(1152);blocksize_limit_upper(8192);analysis_blocksizes(1152,2304,4608)     
* 4308148582    3497.63002 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));tweak(1024);merge(4096);blocksize_limit_lower(256);blocksize_limit_upper(4608);analysis_blocksizes(768,1536,2304,3072,3840,4608)     
  4305653857    2518.54257 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));tweak(10240);merge(10240);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(1152,2304,3456,4608)     
  4303811368    4699.85835 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));tweak(1024);merge(1024);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(768,1536,2304,3072,3840,4608)     
  4302718243   19406.89554 mode(peakset);analysis_comp(8p);analysis_apod(partial_tukey(2));tweak(1024);merge(1024);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)     
  4301991677   26677.72353 mode(peakset);analysis_comp(8p);analysis_apod(partial_tukey(2));tweak(64);merge(512);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)     
  4300096544  142492.31279 mode(peakset);analysis_comp(8p);analysis_apod((null));tweak(1);merge(1);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)     

The big win is from doing tweak/merge with analysis settings instead of output settings. Using just partial_tukey(2) also gave a good speedup, seemingly without much loss of accuracy. I just tried all 6 combinations of the apod functions used in -8 and picked what seemed good; I don't know enough about apod settings to make an informed decision beyond that.

edit: Round 2 testing all done with merge/tweak using analysis settings

Re: Multithreading

Reply #39
I was imagining that this would not only be a "cheap" way to guess it - but also easy to make sense out of test results.
I think it isn't cheap enough though. Also, there might be other reasons a 'partial' window turns out to be the best, for example because part of a signal is cleaner. That wouldn't necessarily mean that it is a good block to split up, just that part of a block can be ignored on analysis.

@ktf : since I totally suck at reading code, I'm thinking-aloud stupid questions about the subdivide_tukey or something similar to it
I really don't think those are usable for analysing whether a block should be split up. I think a completely new algorithm is needed to determine where to split up blocks.
Analysing brute-forced encodes could be a way to reverse engineer a heuristic algorithm for variable blocksizes based on input. Beyond pathological examples it's probably hard to make a good algorithm that's widely applicable, but it may be an easier approach than (or augment) proper signal analysis.
Yes, I agree your work is very valuable. It gives insights into what FLAC's algorithms tend to do when a variable blocksize is possible. Also, it gives a good indication of how much there is to gain. It seems it is about 0.5% (which is almost 0.8 of a percentage point on this material), which is quite a lot.
Music: sounds arranged such that they construct feelings.

 

Re: Multithreading

Reply #40
The second line in the above table - the fastest among the variable blocksize runs - takes out more than half of the possible improvement, and at a thirtysomething-percent time penalty; that isn't much, provided the first line is standard -8p.

There's always another dimension to add, isn't there :P
Oh sure. My point was - I think wrongly! - the notion that you get this improvement despite running tests "on the 4096 home ground", and that would "bias the results against you".

But then I tested a bunch of apodization functions with different block sizes (using reference FLAC, corpus as in signature). Quite puzzling results; it seems that 4096 was a good choice no matter what.
Two tests:
* First, something that by gut feeling "could happen to turn out good": different block sizes 3072, 4096 and 4608 with only the tukey(P), but trying to keep the same number of samples untapered. That is, 4096 with tukey(.5) keeps the middle 2048; then 3072 would also do that, tapering first and last 512, that is, tukey(.333). Etc.
Trying that with a few values for P at 4096 and translating up and down, ...
Findings: 4096 still wins overall.
* fourteen different block sizes with just -5b<xxx> -A "<precisely one>" and -7b<xxx> -A <the same> - that is, also restricting to one function, but beefing up the -l and -r. Functions were hann (yes I know that's tukey(1)), hamming, welch - and two known bad ones: triangle and rectangle.
Findings:
-> Classical music: bigger is better. 11 of 14 classical music albums had a strictly monotone relationship, every function.
-> Other genres & all aggregated: 4096 wins hands down. With an exception for rectangle, where 4608 wins eighty percent of the albums. More is better with rectangle? No support for the notion that tapering helps because 4096 is "too big".
-> 2560 often comes out bad. Not in the average over files, where it is always between the neighbouring block sizes, but if there is one that stands out as worse than its neighbours in individual files ... it is 2560. 25 percent of the time, 2560 was worse than both 2304 and 2880. Other block sizes did not experience the same (bar a small handful of entries in four thousand chances).
I have not the slightest idea why. The phenomenon is "mildly visible" in your graphs.
-> 4096 is surprisingly slow. Dunno, hardware cache size? By and large, larger block size is faster until 3456, then 4096 is about as slow as 1152.

Re: Multithreading

Reply #41
  • Added partial support for flac input; MD5 is disabled for non-16-bit input for now as it's a pain. 24-bit input passes the listen test, as does flac-test-files, but no other testing has been done. (As an aside, mpv 0.32.0, which uses libavcodec 58.54.100 from ffmpeg 4.2.7, doesn't play the original version of '55 - file 48-53 combined.flac', but it can play the individual tracks 48-53. This mpv/ffmpeg is a few years old so it may be fixed already)
  • Assuming flac input actually works properly, flaccid can now encode everything libFLAC can, with the caveat that the decompressed input must fit in RAM, as it's currently stored in full for testing
  • TODO: Peakset currently requires seekable input if it were to be implemented as a generally usable frontend (i.e. all input needs to be available, as it fully encodes/analyses the input, builds the optimal permutation in reverse (sketched after this list), then processes all input again with output settings; a proper frontend cannot rely on loading everything into RAM, so input must be seekable). This is because peakset uses a single window the size of the input, allowing it to be optimal. There's probably a way to stick with a single window without having to be seekable, but it seems horribly complicated to combine the multi-threaded processing phase with the analysis phase in tight-enough lockstep that occupancy doesn't suffer. By far the easier way is to chunk the input into multiple windows (not sliding windows), not for MT but to limit RAM. It wouldn't be fully optimal, but it would be close (non-optimal splits at window boundaries; a window could feasibly be 100+ MiB on most modern systems, so most input would functionally be optimal and everything else a rounding error away from optimal)
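
To illustrate what "builds the optimal permutation in reverse" might look like, here's a sketch of my reading of it (not flaccid's code; cost() is a hypothetical lookup of the stored encoded size of a candidate frame from the analysis phase):
Code: [Select]
#include <limits.h>
#include <stdlib.h>

/* hypothetical: stored encoded size of the analysis-phase candidate
   frame at sample position pos with length len */
extern unsigned long cost(size_t pos, size_t len);

/* n = total samples (assumed divisible by step for brevity),
   sizes[] = analysis blocksizes, all multiples of step */
void peakset_permutation(size_t n, size_t step,
                         const size_t *sizes, size_t nsizes) {
    size_t npos = n / step;
    unsigned long *best = malloc((npos + 1) * sizeof *best); /* cheapest tiling of first p steps */
    size_t *from = malloc((npos + 1) * sizeof *from);        /* best predecessor position */
    best[0] = 0;
    for (size_t p = 1; p <= npos; p++) {
        best[p] = ULONG_MAX;
        for (size_t s = 0; s < nsizes; s++) {
            size_t k = sizes[s] / step;
            if (k > p || best[p - k] == ULONG_MAX) continue;
            unsigned long c = best[p - k] + cost((p - k) * step, sizes[s]);
            if (c < best[p]) { best[p] = c; from[p] = p - k; }
        }
    }
    /* walking from[] backwards from npos recovers the optimal frame
       permutation; those frames are then re-encoded with output settings */
    free(from);
    free(best);
}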
Now that flac input works, it's easier to test wider. Attached is most of the lossless I have, mostly CDDA with a little 24-bit (LudAndSchlattsMusicalEmporium is mostly classical 24-bit 48 kHz with an open license from here https://pmmusic.pro/downloads/ ; Blondie's Parallel Lines is 24-bit 192 kHz). As before, input is read to RAM before timing, so flac input decode is irrelevant (other than being single-threaded and so giving the chip a little time to recover thermals; should be negligible). Classical/punk/rock/metal/synth/electronic are pretty well covered across decades, there's a lot of atmospheric in some of the OSTs, and many VA albums flesh things out even more. A few runs with quick settings to see which settings are more important, the same with some slower settings, and one even slower (but not off the deep end) for comparison. These are the best results; quick chunk settings were tried, but none beat quick peakset:

Code: [Select]
Size        CPUTimeEst Settings
28487101750 6358.74381 Fixed 4096
28392114967 10040.00751 mode(peakset);analysis_comp(5r3);analysis_apod((null));output_comp(8p);output_apod((null));tweak_after(0);tweak(65536);tweak_early_exit(0);merge_after(0);merge(65536);blocksize_limit_lower(576);blocksize_limit_upper(8192);analysis_blocksizes(1152,2304,4608)
28383268498 23101.79561 mode(peakset);lax(0);analysis_comp(5);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(768);tweak_early_exit(0);merge_after(0);merge(0);blocksize_limit_lower(256);blocksize_limit_upper(4608);analysis_blocksizes(768,1536,2304,3072,3840,4608)
28347347740 31982.75614 mode(peakset);analysis_comp(5);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(768);tweak_early_exit(0);merge_after(0);merge(768);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(768,1536,2304,3072,3840,4608)
28340323346 111480.3915 mode(peakset);lax(1);analysis_comp(5);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(1);tweak_early_exit(0);merge_after(0);merge(1);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(768,1536,2304,3072,3840,4608)
28335653960 148385.92248 mode(peakset);lax(1);analysis_comp(8);analysis_apod(partial_tukey(2));output_comp(8p);output_apod((null));tweak_after(0);tweak(1);tweak_early_exit(0);merge_after(0);merge(1);blocksize_limit_lower(256);blocksize_limit_upper(65535);analysis_blocksizes(512,1024,1536,2048,2560,3072,3584,4096,4608)

Merge and particularly tweak are not optimal, and it's not immediately obvious how best to improve them. Ideally, tweak would provide a strictly decreasing improvement for each additional pass; currently it's more like a decreasing inverse sawtooth. It's not clear what new things to try:
  • Maybe a partial-brute-force mode that collects a sliding window of metrics from the signal and brute-forces the n best candidates from that, basically poor-mans signal analysis plus brute force to make up for the poor quality.
  • How about using the chosen settings from a strong smaller-blocksize encode starting at x to cheaply try larger blocksizes starting at x? I.e. find a model that fits the input well and use it to encode as much of the signal as it efficiently can

Re: Multithreading

Reply #42
Folks,
here is a (Master's) thesis from 2007 on variable FLAC block size: https://www.collectionscanada.gc.ca/obj/thesescanada/vol2/002/MR31230.PDF
But the results aren't that impressive: 0.15, 0.13, 0.20 percentage points for the "cheap" algorithm, which makes for a 5.5x slowdown - I guess that those expensive algorithms that make the slowdown factor skyrocket up to the thousands are more interesting for comparison than for applicability.
And I would not be surprised if the subsequent improvements (and in particular the partial_tukey/punchout_tukey/subdivide_tukey) have taken out some of the possible gains.

Anyway, if any of those ideas are useful for anything - there they are.

Re: Multithreading

Reply #43
Folks,
here is a (Master's) thesis from 2007 on variable FLAC block size: https://www.collectionscanada.gc.ca/obj/thesescanada/vol2/002/MR31230.PDF
But the results aren't that impressive: 0.15, 0.13, 0.20 percentage points for the "cheap" algorithm, which makes for a 5.5x slowdown - I guess that those expensive algorithms that make the slowdown factor skyrocket up to the thousands are more interesting for comparison than for applicability.
And I would not be surprised if the subsequent improvements (and in particular the partial_tukey/punchout_tukey/subdivide_tukey) have taken out some of the possible gains.

Anyway, if any of those ideas are useful for anything - there they are.

This very nice paragraph seems notable.

Spoiler (click to show/hide)

Re: Multithreading

Reply #44
Yes, most likely.
In reality, flac -8 is doing something in between 64 and 4096 samples at a time. Whereas other slower formats adapt and update the predictor vector every sample (was that right about WavPack and the animals?), FLAC keeps that one and tweaks the residual coding method - sometimes with baffling results on extreme material: https://hydrogenaud.io/index.php/topic,123025.msg1018251.html#msg1018251 . Maxing out the Rice partitioning reduces the file size by ten percent, to the point where it beats the heaviest OptimFROG.
(But not xz ...)

For those who want to stay within subset, this was a case where you would cut the block size in half just to be able to change the Rice encoding parameter twice as often. But it is not something you will see that often.

Re: Multithreading

Reply #45
It looks like the extreme block size search in 7.1.2 corresponds to peakset with contiguous steps, but instead of stopping at some sensible upper blocksize they use the full range available (they say 65536, but it should not have included 65536; a minor mistake, or it wasn't excluded in 2007). It's little wonder that the extreme search starting at min_blocksize 512 took 28 hours: that is peakset with an effort of 8256 (or 8128 excluding 65536) on what must have been a pretty weak CPU.

Against all odds, the corpus they used still exists, so I re-ran some of the tests from table 7.2 at -8p with flac v1.4.2, with the extreme512 results below limited to 16384 instead of 65536, for an effort level of "only" 528 instead of >8k:

Code: [Select]
File FixedBlocking "Extreme512"
     8pb4096       8p peakset 512..16384 contiguous
 00  38.58%        38.35%
 03  58.34%        58.15%
 05  68.59%        68.35%

Might implement blocking method 2 from 6.1.2 to see how it compares (one day, time is short). It's a simple iterative process that tries bigger and bigger blocksizes with a fixed step until the next size up has worse efficiency than the current blocksize+step, at which point it outputs the current blocksize. So it's simple, greedy with a different greed metric than flaccid's greed mode, and it multithreads poorly (which isn't a knock against it in general). It probably has a smaller effort than the current greed mode while being close in efficiency.
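
From that description, method 2 might look roughly like this (a hedged sketch of my reading, with a hypothetical encoded_size() in place of the real encoder; the paper's exact efficiency metric may differ):
Code: [Select]
#include <stddef.h>

/* hypothetical: encoded bytes for samples [pos, pos+len) */
extern size_t encoded_size(size_t pos, size_t len);

/* grow the candidate frame by `step` while the bigger encode is more
   efficient (fewer bytes per sample); emit once growth stops paying off */
size_t next_frame_len(size_t pos, size_t step, size_t max_len, size_t remaining) {
    size_t len = step < remaining ? step : remaining;
    double eff = (double)encoded_size(pos, len) / (double)len;
    while (len + step <= max_len && len + step <= remaining) {
        double eff_next = (double)encoded_size(pos, len + step) / (double)(len + step);
        if (eff_next >= eff) break;   /* next size up is worse: stop here */
        len += step;
        eff = eff_next;
    }
    return len;
}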


Re: Multithreading

Reply #47
  • Implemented method 2 from the paper, named gasc for greed ascending
  • Renamed greed mode to gset; at each step it encodes all sizes from a set and greedily picks the most efficient

All analysis and output settings at -8p, with tweak and merge disabled. This isn't a maximal-efficiency or bang-for-buck test but a comparison of modes, on a mix of a few dozen songs skewed towards metal/punk/electronic:
Code: [Select]
Relative Size       CPUEst    Settings
100.00%  872823885  194.03077 Fixed 4096
 99.91%  872079622  910.36765 gset 1536..4608
 99.89%  871852994 1832.12025 gset 768..4608
 99.73%  870451258  622.45694 gasc 1536..4608
 99.72%  870421404 2957.32347 lax gasc 512..65535
 99.69%  870155090 1287.84806 gasc 512..4608
 99.69%  870108957  965.25097 gasc 768..4608
 99.69%  870108468  746.15064 gasc 1152..4608
 99.68%  870072803 1415.20407 peakset 1536..4608
 99.64%  869659199 1504.23657 lax gasc 1024..65535
 99.58%  869119347 4689.57017 peakset 768..4608
gset has been trounced by gasc: gasc is faster and more efficient. There are some unintuitive results from gasc; it seems that when too small a blocksize_min is used, the algorithm performs worse (compare the gasc subset results for 512, 768 and 1152). gasc 1152..4608 is interesting: relatively cheaply, it approaches peakset performance (the more natural comparison would be peakset 1152..4608, which was not tested; it'd sit roughly between the tested peakset results in size and time). When tweak/merge come into play, gasc should benefit more than peakset as it has more ground to gain; when quick analysis settings come into play, peakset should benefit more than gasc as it does more analysis.

Re: Multithreading

Reply #48
A counterintuitive result or two yes, but gasc 1536..4608 and gasc 1152..4608 look quite good.
I did some tests with 1.4.0, where -8pe took nearly 12x the time of -8p. Not saying that -8pe is any good (indeed I said something quite different about it), but it is kind of a benchmark for what time, at least, some have been willing to pay.

(You probably have mentioned it somewhere up in this thread, but ... you are using a "-p" on everything? Reason for asking: what if you do everything without -p - would you then beat -8p in shorter time?)

Re: Multithreading

Reply #49
Yes, I mostly used -8p as the output setting, as an arbitrary lever to not pull; if too many levers are pulled at once, a benchmark doesn't really show anything.

Done a few quick tests: variable-blocksize peakset can easily beat fixed in size and time simultaneously. gasc can too, but not as well; to beat on size it needs to crank up the analysis settings to the point where the time it takes is not that competitive (gasc can scale time down much further than peakset can, but not at a competitive size):
Code: [Select]
Size     CPUEst
29904618 5.05966 Fixed -8pb4608
29902783 5.29877 Fixed -8pb4096
29883629 3.54501 gasc analysis 7 output 8 tweak 0 merge 0 list 1152,2304,3456,4608
29879435 4.70175 gasc analysis 8 output 8 tweak 0 merge 0 list 1152,2304,3456,4608
29874167 3.27683 peakset analysis 3 output 8 tweak 0 merge 0 list 1152,2304,3456,4608
29869549 3.79056 peakset analysis 3 output 8 tweak 999999 merge 0 list 1152,2304,3456,4608
There may be better settings to improve further.