Re: Objective difference measurements to predict listening test results?
Reply #52 – 2016-09-26 09:43:31
> Serge, for df to be more generic than what it is now, it needs to know the concept of artifact and not only the concept of distortion. And then, it needs to know that some artifacts are more subtle and others are more annoying, even if the distortion produced is not as strong. Phase decorrelation or stereo imaging can have a big impact on analysis but don't translate to big artifacts. I think you wrote that you take some of these into account when calculating, but then you also need to weight the frequency bands. That's what I meant when I said that the smaller the differences, the more dispersed your graphic gets, because those small differences are rated differently by the listeners. On the other hand, big differences are considered bad, no matter what.

Your proposal is understandable. This approach already has a history and has received extensive research to date. The resulting products are well known: PEAQ, PESQ, EAQUAL, OPERA, POLQA and others. I don't think I can add anything valuable to this area of research, and to be honest I think the goal of building an objective metric that can predict the results of all listening tests is hardly achievable, because for that purpose you need reliable models of human hearing, cognition and comprehension. Sorry, this task is too complicated, at least for a personal research project.

I'd like to look into a less complicated sub-task: to research the psychoacoustic potential of the pure Df parameter, and to find the special cases where pure Df can be used for prediction of quality scores. In any case, whether you develop a more sophisticated, psychoacoustically strengthened Df, or use pure Df, you will need some verification procedure. It will definitely be based on comparison with the results of some listening tests; the metric is intended exactly for this, isn't it? For example, in Rec. ITU-R BS.1387-1 such verification was performed using 84 test samples from different listening tests. It would be nice to have access to those databases.
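The core of such a verification procedure is checking how well objective scores track subjective grades across the test samples, usually via a correlation coefficient. A minimal sketch of that step, with hypothetical per-sample Df values (in dB, closer to 0 = closer to the reference) and subjective difference grades (assumed -4..0 scale); the numbers are invented for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between objective scores and subjective grades."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: Df closer to 0 dB should mean a smaller audible
# difference, i.e. a subjective grade closer to 0.
df_scores = [-40.1, -32.5, -25.0, -18.3, -12.7]
subj_grades = [-0.3, -0.8, -1.5, -2.6, -3.4]

r = pearson(df_scores, subj_grades)
print(f"correlation: {r:.3f}")  # strongly negative here: lower Df, worse grade
```

A strong correlation on one listening test is of course not proof; the questions below are about how many such tests, and how strict a threshold, would be convincing.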
In the meantime I try to use publicly available results of listening tests. And my questions are: How many listening tests would be enough to be sure that any such metric works? How many mismatches with the true quality scores are tolerable? What is the maximum allowable error level? Is it necessary to treat every mismatch as a flaw of the metric and try to tune the latter? I think the verification procedure is key. It is necessary both for the simple and for an enhanced Df metric (or any other method for objective measurement of perceived audio quality). So, what test/procedure/experiment would be sufficient in practice to prove that any such audio metric really works?
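One concrete way to make "how many mismatches are tolerable" operational is to count rank inversions: pairs of samples where the metric orders quality opposite to the listening test. A sketch of that count, with hypothetical scores (higher = better on both scales); both the data and the mismatch criterion are illustrative assumptions, not an established procedure:

```python
from itertools import combinations

def rank_mismatches(metric_scores, subjective_scores):
    """Count sample pairs ordered oppositely by the metric and by listeners.

    Pairs tied on either scale are skipped, since neither ordering
    contradicts the other there.
    """
    mismatches = 0
    comparable = 0
    for (m1, s1), (m2, s2) in combinations(zip(metric_scores, subjective_scores), 2):
        dm, ds = m1 - m2, s1 - s2
        if dm == 0 or ds == 0:
            continue
        comparable += 1
        if (dm > 0) != (ds > 0):  # metric and listeners disagree on this pair
            mismatches += 1
    return mismatches, comparable

# Hypothetical per-sample scores from one listening test.
metric = [4.1, 3.2, 3.9, 2.0, 1.5]
listeners = [4.3, 3.5, 3.0, 2.2, 1.8]

bad, total = rank_mismatches(metric, listeners)
print(f"{bad} of {total} comparable pairs inverted")  # 1 of 10 here
```

A tolerance could then be stated as "no more than X% inverted pairs across N independent listening tests", which turns the vague "does it work?" into a testable criterion, though the choice of X and N is exactly the open question above.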