I've been worried that most people who complain about the loudness war are using technically incorrect metrics. More specifically, what is being lost in the loudness war is not loudness but dynamics. But everybody keeps using ReplayGain numbers and PCM waveform plots, which have no direct relation to dynamics: they only properly measure loudness and peaks, not the range between them. ReplayGain uses the 95th percentile of loudness; Audacity fills the waveform plot up to the peak of the signal without taking the rest of the samples into account.

I think that having a better way to estimate dynamic range would be useful in a few situations. At the very least, it would provide more accurate information on the victims of the loudness war, and it would be more objective than subjective evaluations. Listeners who are looking for music of a specific dynamic range (either very high for system testing, or very low for background music) could use the information.

The "canonical" way to estimate dynamic range, as I understand it (and note that I haven't consulted any audio books on this so I'm not particularly knowledgeable - correct me if I'm wrong), is to compute the RMS peak-to-average ratio. I see three problems with this. The biggest problem is that the measurement is very dependent on the RMS block length. Much heavily-compressed music will still show a lot of dynamics with a common block length like 50ms, even though these dynamics may not be audible due to masking. Conversely, using a very long block length (10 seconds or more) will correctly identify compressed music as having extremely little dynamic range, but it ignores shorter-term dynamics that may still be audible. Second, most peak-to-average measurements do not use any form of loudness equalization, and while that's obviously not a well-solved problem, one should at least try to take it into account IMHO. Finally, if the loudness of the signal is not normally distributed, then large changes in the distribution below the 50th percentile could compromise the accuracy of any single-number measurement.
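
To make the block-length problem concrete, here is a minimal sketch in Python. It assumes one particular reading of "RMS peak-to-average ratio" (the loudest block's RMS level minus the RMS level of the whole signal, in dB), non-overlapping blocks, and samples as floats in [-1, 1]. The function names are made up for illustration and are not from any existing tool.

```python
import math

def rms_db(samples):
    """RMS level in dB relative to a full-scale value of 1.0."""
    mean_square = sum(x * x for x in samples) / len(samples)
    # Floor the energy so digital silence does not produce log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))

def peak_to_average_db(samples, sample_rate, block_seconds):
    """Loudest block RMS minus whole-signal RMS (assumed definition)."""
    n = max(1, int(round(sample_rate * block_seconds)))
    if len(samples) < n:
        raise ValueError("signal is shorter than one block")
    # Non-overlapping blocks; a trailing partial block is dropped.
    block_levels = [rms_db(samples[i:i + n])
                    for i in range(0, len(samples) - n + 1, n)]
    return max(block_levels) - rms_db(samples)
```

For a test signal that alternates between full scale and -20 dB every half second, this gives roughly 3 dB with 0.1s blocks and roughly 0 dB with 10s blocks, so the same signal looks dynamic or flat depending purely on the block length.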

I think I've come up with a scheme to solve all of these problems, although I haven't worked out all the kinks yet. Basically, I'm using three different block lengths at the same time to compute three different loudness estimates: one for short-term transients, one for time scales on the order of one beat, and one for long-term loudness changes. Right now the block lengths are 0.1s, 1s and 10s respectively, although they could change. For each time scale, I highpass filter at 10 Hz and apply the ITU-R BS.1770 loudness filters (the pre-filter shelf and RLB) to give a rudimentary loudness equalization. The numbers honestly don't change much when I disable those filters, though. Finally, I generate a histogram of RMS energy for each block length and measure the range between the 90th and 10th percentiles. These three numbers describe the dynamic range of the signal over all important time scales.
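
A rough sketch of the percentile part of this scheme in Python follows. It is not my actual implementation: it leaves out the 10 Hz highpass and the BS.1770 filters entirely, and it assumes non-overlapping blocks, mono samples as floats in [-1, 1], and linear interpolation between percentile ranks. All function names are hypothetical.

```python
import math

def block_rms_db(samples, sample_rate, block_seconds):
    """RMS level in dB (re full scale 1.0) of consecutive blocks."""
    n = max(1, int(round(sample_rate * block_seconds)))
    levels = []
    # Non-overlapping blocks; a trailing partial block is dropped.
    for start in range(0, len(samples) - n + 1, n):
        block = samples[start:start + n]
        mean_square = sum(x * x for x in block) / n
        # Floor the energy so digital silence does not produce log10(0).
        levels.append(10.0 * math.log10(max(mean_square, 1e-12)))
    return levels

def percentile(values, pct):
    """Percentile with linear interpolation, pct in [0, 100]."""
    if not values:
        raise ValueError("no blocks to measure")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def dynamic_range_db(samples, sample_rate, block_seconds):
    """Spread between the 90th and 10th percentile block levels."""
    levels = block_rms_db(samples, sample_rate, block_seconds)
    return percentile(levels, 90.0) - percentile(levels, 10.0)

def multi_scale_dynamic_range(samples, sample_rate,
                              scales=(0.1, 1.0, 10.0)):
    """One dynamic range figure per block length, keyed by seconds."""
    return {s: dynamic_range_db(samples, sample_rate, s) for s in scales}
```

As a sanity check, a signal that alternates between full scale and -20 dB every half second comes out at about 20 dB on the 0.1s scale and about 0 dB on the 1s and 10s scales, since every longer block contains the same mix of loud and soft.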

And if nothing else, the plots obtained through this analysis look a lot prettier and more informative than Audacity plots....

Here are some examples. For each track, the first of the two plots is loudness vs. time, and the second is the cumulative distribution of loudness - but note that the x-value of 140 corresponds to 0 dB (I haven't fixed the x-axis yet). The white plot is 10s, the red plot is 1s and the grey plot is 0.1s.

John Mayer, "Waiting On The World To Change", ReplayGain -7.96dB. Dynamic range estimated at 7.4dB (10s), 11.15dB (1s), 21.85dB (0.1s).

Pierre Boulez and the Chicago Symphony Orchestra, "Ionisation" composed by Varèse. ReplayGain +9.04dB. Dynamic range estimated at 43.7dB (10s), 50.49dB (1s), 54.97dB (0.1s).

Merzbow, "I Lead You Towards Glorious Times". Note that because of the filtering and the intense mastering, the loudness plot goes above 0 dB. ReplayGain -20.64dB. Dynamic range estimated at 0.78dB (10s), 1.15dB (1s), 2.05dB (0.1s).

Note how in the Merzbow track, the signal is compressed at such a small time scale that there is not much difference between the three measurements, while in the pop music (John Mayer), the extensive dynamics at short time scales go away at long time scales. The difference between the 10s and 0.1s measurements reflects the loudness coherence. Music which has very fast loud/soft transitions, but the same rough loudness level across the entire piece, will show much more dynamic range at the 0.1s scale than at the 10s scale.
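
For reference, that difference can be worked out directly from the three sets of numbers quoted above. The snippet below only does the subtraction; scale_spread_db is a hypothetical name for it, not an established measure.

```python
# Dynamic range estimates quoted above, in dB, as (10s, 1s, 0.1s).
estimates = {
    "John Mayer": (7.4, 11.15, 21.85),
    "Varese / Boulez": (43.7, 50.49, 54.97),
    "Merzbow": (0.78, 1.15, 2.05),
}

def scale_spread_db(dr_10s, dr_1s, dr_01s):
    """Difference between the 0.1s and 10s dynamic range figures."""
    return round(dr_01s - dr_10s, 2)

spreads = {name: scale_spread_db(*vals) for name, vals in estimates.items()}

for name, spread in spreads.items():
    print(f"{name}: {spread} dB")
```

This gives 14.45 dB for John Mayer, 11.27 dB for the Varèse and 1.27 dB for Merzbow. The Varèse spread is in the same ballpark as John Mayer's but sits on top of a far larger 10s figure, so the spread probably has to be read alongside the absolute numbers rather than on its own.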

What do all of you think? Is this something worth pursuing further? Alternatively, is this just a fishing expedition, and should I just do a straight-up implementation of a more common dynamic range measurement?