Double-blind ABX testing looks, on paper, like such an obvious way of settling much if not all of the rancour surrounding recorded music enjoyment. So why are these arguments still raging? Is ABX testing flawed when applied to audio?
If we applied the same testing protocol to the differences between, let's say, two cameras' image quality, we would need to ensure we were comparing like with like, even between models with dissimilar pixel counts, by normalising exposure & making sure the prints were of identical size & scale. We would end up with sets of three prints that test subjects view side by side. Those testers can then concentrate on qualitative aspects such as image noise, sharpness & colour fidelity, & be asked whether print A or B was the same as X rather than scoring particular attributes. I suggest that applying ABX-style comparisons to visual phenomena is uncontroversial, but audio is far from that. Comparing perceived audio & visual quality is useful in that it highlights the differences, such as audio being sequential whereas prints are static. Comparing audio with video is the more obvious pairing, but it misses the point I'm making.
So what is so difficult about comparing audio? We presumably cannot play music through two devices simultaneously & get anything meaningful, so it has to be played sequentially, which relies on memory. Presumably we need to know how long memory remains reliable, both for the gaps between sequences & for the length of those sequences. This aspect of memory is absolutely crucial: just how reliable is our auditory memory? If asked what differences we hear, what are we testing: our memory or our hearing skills? Are they one & the same, for instance?
Certain precautions are obvious, such as normalising volume levels & minimising the expectation bias that can skew results when samples can be seen (or heard, such as a fan whirring away). Most audio equipment doesn't emit a smell, but if it did, that would just be another cue to be eliminated. My view is that anyone arguing against blind tests in a formal environment has something to lose if the outcome doesn't follow their expectations. Blind testing done informally, such as at home, matters less, but it does depend on how honest people are willing to be with themselves. These variables are fairly easy to control, though.
So what else is needed apart from the equipment to compare, test subjects & an unchanging, comfortable listening area? We could stabilise the mains, temperature & so on. We'd need someone familiar with statistics & testing protocols. If we were to compare just two amplifiers (that measure virtually the same), what information is the test trying to determine? We could keep things simple & ask whether subjects can tell them apart rather than which one they prefer (a sketch of that structure is below). We would also want to maximise the validity of the test & minimise both false positives & false negatives. We would need to think through how long the tests should last, who to invite, whether they need to be taught any listening skills, & whether they need to be familiar with some or all of the music well in advance. We would also need to determine very carefully whether it is valid to rely on short-term memory alone or whether to include long-term memory (I'm thinking big here; I want to determine once & for all whether audio DB ABX testing is useful or not!)
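To make the "can they tell them apart, not which they prefer" point concrete, here is a minimal sketch of how one ABX session could be structured. The function names & structure are my own assumptions, not any standard rig; switching & level matching are assumed to be handled elsewhere.

```python
# A minimal sketch of one ABX session: per trial, X is secretly either
# amplifier A or B, & the listener only has to say which one it was:
# identification, not preference.
import random

def run_abx_session(n_trials, present, ask_listener):
    """present(label) plays the device labelled 'A' or 'B';
    ask_listener() returns the listener's answer, 'A' or 'B'."""
    correct = 0
    for _ in range(n_trials):
        x = random.choice(['A', 'B'])    # the hidden assignment for this trial
        for label in ('A', 'B', x):      # listener hears A, then B, then the unknown X
            present(label)
        if ask_listener() == x:          # score only whether X was correctly identified
            correct += 1
    return correct
```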
By its very nature, ABX testing done properly can more or less eliminate false positives through tried & tested statistical analysis, but how the hell do we avoid false negatives? We need to know whether subjects say they can't distinguish between presentations because they genuinely can't tell, because they don't want to tell, or because the differences simply don't exist, period. If, for instance, we told participants only to say they could hear differences between A, B & X when they were 100% certain, we might immediately skew the results with false negatives. What I'm suggesting is that the chances of false positives & false negatives should be balanced, otherwise a sceptical participant can just answer "no difference" to every test sequence. Those who don't believe there can possibly be any differences are going to have a far easier time than those who believe the opposite! So how do we avoid false negatives? Does anyone else think we need to?
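To show what that imbalance looks like in numbers, here is a rough sketch of the binomial statistics behind a single session. The trial count, alpha & the 70% "real" listener are made-up values of mine, purely for illustration.

```python
# Under the null hypothesis the listener is guessing, so each trial is a
# coin flip with p = 0.5.
from scipy.stats import binom

n_trials = 16      # hypothetical number of trials in one session
alpha = 0.05       # acceptable false-positive rate

# False-positive side: smallest score that lets us reject "just guessing".
threshold = 1
while binom.sf(threshold - 1, n_trials, 0.5) > alpha:
    threshold += 1
print(f"Need at least {threshold}/{n_trials} correct to reject guessing at alpha={alpha}")

# False-negative side: suppose a listener genuinely hears a difference & picks X
# correctly 70% of the time. How often does this test still report "no difference"?
p_real = 0.70
power = binom.sf(threshold - 1, n_trials, p_real)
print(f"Power at p={p_real}: {power:.2f} (false-negative rate roughly {1 - power:.2f})")
```

With these made-up numbers the pass mark comes out at 12/16, & the power for that 70% listener lands well under a half, i.e. the test is far kinder to the sceptic than to the believer, which is exactly the skew I'm worried about.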
To recap, I have two concerns: the reliability of short-term memory & the elimination of false negatives. Some will say DB ABX testing is the best we've got! I'm asking whether that's good enough. It may be too grandiose, but I'm getting bored with those saying that all competently designed amps MUST sound the same on the one hand, & those who believe that their ears are vastly better than any measurement on the other. If ABX testing is flawed, why the hell would anyone who isn't an out-&-out objectivist not believe their hearing?