

Does ABX testing work? (long)

busb

Mine's a pint of beer please
Double-blind ABX testing looks, on paper, to be such an obvious way of settling much if not all of the rancour surrounding recorded music enjoyment. So why are these arguments still raging? Is ABX testing flawed when applied to audio?
If we applied the same testing protocol to the differences between, let’s say, two cameras’ image quality, we would need to ensure we were comparing like with like – even when comparing models with dissimilar pixel counts – by normalising exposure & making sure the prints were of identical size & scale. We would end up with sets of 3 prints that test subjects view side by side. Those subjects can then concentrate on qualitative aspects such as image noise, sharpness, colour fidelity etc, then be asked whether print A or B was the same as X rather than scoring for particular attributes. I suggest applying ABX-style comparisons to visual phenomena is uncontroversial, but audio is far from that. Comparing perceived audio & visual quality is useful in that it highlights differences such as audio being sequential whereas prints are static. Comparing audio with video would be the more obvious parallel, but that misses the point I’m making.

So what is so difficult about comparing audio? We presumably cannot play music through two devices simultaneously & get anything meaningful, so it has to be played sequentially, which relies on memory. Presumably we need to know how long memory is reliable for, regarding both the gaps between sequences & the length of those sequences. This aspect of memory is absolutely crucial: just how reliable is our auditory memory? If asked what differences we hear, what are we testing – our memory or our hearing skills? Are they one & the same, for instance?

Certain precautions are obvious, such as normalising volume levels & minimising the expectation bias that can skew results when samples can be seen (or heard, such as a fan whirring away). Most audio equipment doesn’t emit a smell, but if it did, that would just be another cue to be eliminated. My view is that anyone arguing against blind tests in a formal environment has something to lose if the outcome doesn’t follow their expectations. Informal blind testing, such as when done at home, is less critical but does depend on how honest people are willing to be with themselves. These variables are fairly easy to control though.
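
To make the volume-normalising step concrete, here's a minimal sketch in Python (my own illustration: the voltages & the 0.1 dB tolerance are assumed figures, not anything prescribed). Measure each device's output with the same test tone & trim the gain until the difference sits inside the tolerance.

import math

def db_difference(v_a: float, v_b: float) -> float:
    """Level difference in dB between two measured RMS output voltages."""
    return 20.0 * math.log10(v_a / v_b)

def is_level_matched(v_a: float, v_b: float, tolerance_db: float = 0.1) -> bool:
    """True if the two outputs sit within the chosen tolerance."""
    return abs(db_difference(v_a, v_b)) <= tolerance_db

# Illustrative numbers only: amp A measures 2.000 V RMS & amp B measures
# 2.035 V RMS at the speaker terminals with the same 1 kHz tone.
print(round(db_difference(2.035, 2.000), 3))   # ~0.151 dB apart
print(is_level_matched(2.035, 2.000))          # False, so B needs trimming down

Matching at the speaker terminals (or at the listening position with a meter) is the point, since even a fraction of a dB can be enough to make the louder device sound 'better'.
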
So what else is needed, apart from the equipment to compare, test subjects & an unchanging, comfortable listening area? We could stabilise the mains, temperature etc. We’d need someone familiar with statistics & testing protocols. If we were to compare just two amplifiers (that measure virtually the same), what information is the test trying to determine? We could keep things simple & ask whether subjects can tell them apart rather than which one they prefer. We would also want to think about maximising validity & try to minimise both false positives & false negatives. We would also need to think through how long the tests would last, who to invite, whether they need to be taught any listening skills, & whether they need to be familiar with some or all of the music well in advance. We would also need to determine very carefully the validity of relying on just short-term memory, or whether to include long-term memory (I’m thinking big here – I want to determine once & for all if audio DB ABX testing is useful or not!)
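
On the statistics side, here's a minimal sketch of the usual pass/fail criterion (my own illustration, not a fixed protocol): if a listener is purely guessing, each ABX trial is a coin flip, so the number of correct answers needed to claim a real difference at a chosen significance level drops straight out of the binomial distribution.

from math import comb

def p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def required_correct(trials: int, alpha: float = 0.05) -> int:
    """Smallest score that rejects 'just guessing' at significance level alpha."""
    for k in range(trials + 1):
        if p_value(k, trials) <= alpha:
            return k
    return trials + 1  # not reachable for any sensible alpha

for n in (10, 16, 20):
    k = required_correct(n)
    print(f"{n} trials: need {k} correct (p = {p_value(k, n):.3f})")
# 10 trials: need 9 correct (p = 0.011)
# 16 trials: need 12 correct (p = 0.038)
# 20 trials: need 15 correct (p = 0.021)

With so few trials, chance alone sets quite a high bar, which is why the number of trials matters as much as the switching method.
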

By its very nature, ABX testing when done properly can more or less eliminate false positives through tried & tested statistical analysis, but how the hell do we avoid false negatives? We need to know whether subjects say they can’t distinguish between samples because they genuinely can’t tell, because they don’t want to tell, or because the differences just don’t exist, period. If, for instance, we told the participants to only say they could hear differences between A, B & X when they were 100% certain, we might immediately skew the results with false negatives. What I’m suggesting is that the risks of false positives & false negatives should be balanced, otherwise a sceptical participant can just say 'no difference' to each test sequence. Those people who don’t believe there can possibly be any differences are going to have a far easier time than those who believe the opposite! So how do we avoid false negatives? Does anyone else think we need to?
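
To put rough numbers on the false-negative worry (again my own illustration, with made-up hit rates): assume a listener who genuinely hears the difference on some fraction of trials & work out how often they would still fail the 12-out-of-16 pass mark from the sketch above. That failure rate is the false negative (Type II error) rate; one minus it is the test's statistical power.

from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) when X is binomial(n, p): n trials, each heard correctly with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

trials, needed = 16, 12   # the 16-trial, 12-correct criterion from the sketch above
for p_hear in (0.6, 0.7, 0.8, 0.9):   # assumed 'true' hit rates, purely illustrative
    power = prob_at_least(needed, trials, p_hear)
    print(f"true hit rate {p_hear:.0%}: power {power:.2f}, "
          f"false-negative rate {1 - power:.2f}")
# true hit rate 60%: power 0.17, false-negative rate 0.83
# true hit rate 70%: power 0.45, false-negative rate 0.55
# true hit rate 80%: power 0.80, false-negative rate 0.20
# true hit rate 90%: power 0.98, false-negative rate 0.02

A genuine-but-marginal listener can easily fail such a test, which is exactly the false-negative problem: more trials, or a less cautious reporting instruction, shift that balance.
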

To recap, I have two concerns: the validity of short-term memory & eliminating false negatives. Some will say DB ABX testing is the best we’ve got! I’m asking if that’s good enough. It may be too grandiose, but I’m getting bored with those on one hand saying that all competently designed amps MUST sound the same, & those on the other who believe that their ears are vastly better than any measurement. If ABX testing is flawed, why the hell would anyone who isn’t an out & out objectivist not believe their hearing?
 
Hi busb,
A very thoughtful and detailed question, from someone who is used to this type of evaluation maybe?
As a consumer and enthusiast for hi-fi, I would probably ignore the results and here's why.
Unlike other physical phenomena, the requirement of hi-fi equipment is very listener dependent; there is a strong psychological element, which we probably will never fully understand.
A music lover's only requirement is: does it push the right buttons to increase my enjoyment of music?
I know I cannot say for sure what characteristic of sound gives me that extra frisson of excitement when I am listening to my favourite music.
Obviously manufacturers have to have some method of consumer evaluation; ABX testing may help.
Simon
 
I predict another long rambling thread with much circular argument and no resolution of value.....:(
 
I predict another long rambling thread with much circular argument and no resolution of value.....:(

Then it's very easy - don't participate. I don't expect every thread to be written with my interests in mind, so I skip many of them. Simples.
 
I think the only real benefit of ABX testing against subjective criteria may be in proving humans are crap at ABX testing.
 
The last sentence by the OP is the key.

All forms of testing are flawed in some way - but some are far more flawed than others.

The memory issue is a real one, amply demonstrated by the recent ABX scale test here, but it's way more reliable than the usual alternative because it at least eliminates some of the major problems. If the memory effect is considered a significant factor, as in the test mentioned, then switch to instantaneous A/B and the problem goes away.
Between them, those options have it pretty well covered.

You ask why we should not believe our hearing, but you answer the question yourself IMO.
Most listening isn't level matched, isn't blind and allows more time to elapse between switching, so immediately you have three proven and understood potential sources of error.

Eliminating these things, or at least minimising them in the case of memory inadequacy, should not be controversial since we know they exist.

Now someone will come along and say, well sorry but I can trust my ears and how dare you tell me I can't. To which there are two answers, three if you entertain the possibility of superhuman ability.

Firstly, people generally can trust their ears when making a general observation such as 'I like the sound of this system and dislike another'. You are listening to the overall sound and it either pleases you or it doesn't. In this sense the reaction is much like listening to a live concert: you'll like some and dislike others. That's because we are dealing with sonic effects on a far larger scale than might be the case when comparing electronics which, according to the specification, should show no/negligible difference. In these cases it becomes necessary to be more rigorous in approach if you aren't to be misled. I've seen it happen enough times now, and have even caused it to happen just to demonstrate that you really cannot just trust the ears under such compromised conditions.

The usefulness of these things is down to the individual. If you approach them with an open mind they can be a very useful tool. In fact they have no use whatsoever if you don't approach them in this way. On the larger scale for research or reviewing purposes it is clearly beneficial to use listeners who aren't predisposed to the view they won't hear differences under certain circumstances - you want listeners who are fully expecting to hear A sound different to B or C.
 

Hi busb,
A very thoughtful and detailed question, from someone who is used to this type of evaluation maybe?
As a consumer and enthusiast for hi-fi, I would probably ignore the results and here's why.
Unlike other physical phenomena, the requirement of hi-fi equipment is very listener dependent; there is a strong psychological element, which we probably will never fully understand.
A music lover's only requirement is: does it push the right buttons to increase my enjoyment of music?
I know I cannot say for sure what characteristic of sound gives me that extra frisson of excitement when I am listening to my favourite music.
Obviously manufacturers have to have some method of consumer evaluation; ABX testing may help.
Simon

The only time I've participated in a blind ABX test was with two other readers of WHF. We eventually found out that we were comparing three different music servers feeding the same DAC etc. We were hard pushed to hear any differences; we kept saying that we weren't sure when repeating parts of the test. Otherwise, I have no other experience.

My experience is with repair & recalibration of some of the test equipment used for audio testing, such as HP8903B audio analysers, audio spectrum analysers, etc., so I am familiar with THD+N measurements, gain compression, swept frequency response etc. This has fuelled my fascination with why these measurements don't seem to tally with what my hearing tells me. Rather than argue about the validity of distortion measurements across a mostly resistive load, I'm asking how valid ABX testing is.

There will be some who have such entrenched views that even the second coming wouldn't convince them, but most people are a little more flexible. There's a big gap between the measurement fascists & the audio foo purchasers!
 
The last sentence by the OP is the key.

All forms of testing are flawed in some way - but some are far more flawed than others.

The memory issue is a real one, amply demonstrated by the recent ABX scale test here, but it's way more reliable than the usual alternative because it at least eliminates some of the major problems. If the memory effect is considered a significant factor, as in the test mentioned, then switch to instantaneous A/B and the problem goes away.
Between them, those options have it pretty well covered.

You ask why we should not believe our hearing, but you answer the question yourself IMO.
Most listening isn't level matched, isn't blind and allows more time to elapse between switching, so immediately you have three proven and understood potential sources of error.

Eliminating these things, or at least minimising them in the case of memory inadequacy, should not be controversial since we know they exist.

Now someone will come along and say, well sorry but I can trust my ears and how dare you tell me I can't. To which there are two answers, three if you entertain the possibility of superhuman ability.

Firstly, people generally can trust their ears when making a general observation such as 'I like the sound of this system and dislike another'. You are listening to the overall sound and it either pleases you or it doesn't. In this sense the reaction is much like listening to a live concert: you'll like some and dislike others. That's because we are dealing with sonic effects on a far larger scale than might be the case when comparing electronics which, according to the specification, should show no/negligible difference. In these cases it becomes necessary to be more rigorous in approach if you aren't to be misled. I've seen it happen enough times now, and have even caused it to happen just to demonstrate that you really cannot just trust the ears under such compromised conditions.

The usefulness of these things is down to the individual. If you approach them with an open mind they can be a very useful tool. In fact they have no use whatsoever if you don't approach them in this way. On the larger scale for research or reviewing purposes it is clearly beneficial to use listeners who aren't predisposed to the view they won't hear differences under certain circumstances - you want listeners who are fully expecting to hear A sound different to B or C.

My own views have changed in that I now acknowledge my hearing can play tricks! If my brain didn't play these tricks, hearing just wouldn't work much of the time. How could we possibly understand a conversation in a noisy room without our brain's ability to filter out unwanted speech? To me, the whole concept that our hearing is an unchanging transducer is plain absurd. Another realisation is that my enjoyment is independent of reproduction quality. How many people don't like a song they hear on a cheap radio but change their mind when they hear it on a decent system? How many people never listen to poorly recorded music out of choice? I dislike sloppy recording quality but it doesn't stop me listening to particular artists. The music is the key; hearing it on a great system only ices the cake! Perhaps professional musicians just don't need expensive HiFi - they don't notice incremental improvements because the difference between 99 & 100 faeries dancing on the head of a pin misses the whole purpose of music anyway! My last point illustrates the vanishingly small improvements some chase, which is little more than obsessive behaviour.

Let's assume that DB ABX testing does indeed work. If we then apply such tests to things like cable directionality & people can't tell the difference, it's one less aspect to spend vast amounts of money on, for all but the obsessives who will never be converted to reality. ABX testing needs to be convincingly free of basic flaws such as false negatives though.
 
I may be missing the point, but why would anyone want to ABX test?

Last month I demoed 2 TTs. The arm and cart were the same, so as close to a level matched test as is reasonably possible. I had half an hour or so with each, with 10 mins between to swap. I used most of the same LPs for the two. The two were clearly different, and I went away a happy but poorer man.

I tried them both in plain sight, and had no particular preconceptions or prejudice. The dealer was careful not to say anything to me about the character of either product before I listened. I think it was a fair test. But what would be the use of then putting one of the two behind a curtain to try to identify it? It wouldn't change my opinion of the two products, rather it seems to be a balance between egotism (let me prove just how good my ears are...) and something that could be optimistically termed 'academic interest' and less kindly 'social misfit who lives in the cupboard under the stair obsessing to the 'n'th degree about something that is ultimately irrelevant'. (Apologies to any of said misfits who may be reading - please don't take my tone too seriously!)

I suppose it might have uses if I'd gone into the demo with a preconceived idea of what I'd plump for - but in that case there's more at stake than just sound quality!
 
I think the only real benefit of ABX testing against subjective criteria may be in proving humans are crap at ABX testing.

That's what I'm trying to find out. Is instantly switched ABX testing all it's cracked up to be? I've gone into HiFi shop demos & left totally baffled! Are short-term memory comparisons helpful or just confusing?

Several years ago, I bought a new CDP. I was a little underwhelmed by it & thought it sounded more or less the same as the one it replaced. I forgot about the expense & carried on listening to it. I'd get around to playing particular pieces of music, then realised that I was hearing certain aspects I'd not noticed before, so I dug out more stuff to find the same thing. Was long-term memory kicking in here? ABX testing may need to account for more than just the short-term listening that often causes confusion.
 
My own views have changed in that I now acknowledge my hearing can play tricks! If my brain didn't play these tricks, hearing just wouldn't work much of the time. How could we possibly understand a conversation in a noisy room without our brain's ability to filter out unwanted speech? To me, the whole concept that our hearing is an unchanging transducer is plain absurd. Another realisation is that my enjoyment is independent of reproduction quality. How many people don't like a song they hear on a cheap radio but change their mind when they hear it on a decent system? How many people never listen to poorly recorded music out of choice? I dislike sloppy recording quality but it doesn't stop me listening to particular artists. The music is the key; hearing it on a great system only ices the cake! Perhaps professional musicians just don't need expensive HiFi - they don't notice incremental improvements because the difference between 99 & 100 faeries dancing on the head of a pin misses the whole purpose of music anyway! My last point illustrates the vanishingly small improvements some chase, which is little more than obsessive behaviour.

Let's assume that DB ABX testing does indeed work. If we then apply such tests to things like cable directionality & people can't tell the difference, it's one less aspect to spend vast amounts of money on, for all but the obsessives who will never be converted to reality. ABX testing needs to be convincingly free of basic flaws such as false negatives though.

Abso-bloody-lutely!!!

Chris
 
My own views have changed in that I now acknowledge my hearing can play tricks! If my brain didn't play these tricks, hearing just wouldn't work much of the time. How could we possibly understand a conversation in a noisy room without our brain's ability to filter out unwanted speech? To me, the whole concept that our hearing is an unchanging transducer is plain absurd. Another realisation is that my enjoyment is independent of reproduction quality. How many people don't like a song they hear on a cheap radio but change their mind when they hear it on a decent system? How many people never listen to poorly recorded music out of choice? I dislike sloppy recording quality but it doesn't stop me listening to particular artists. The music is the key; hearing it on a great system only ices the cake! Perhaps professional musicians just don't need expensive HiFi - they don't notice incremental improvements because the difference between 99 & 100 faeries dancing on the head of a pin misses the whole purpose of music anyway! My last point illustrates the vanishingly small improvements some chase, which is little more than obsessive behaviour.

Well done with those realisations. There are more.
 
I shall make it easy for all of you....

We are all different.

We listen out for different details and or process and prioritise different 'aspects' of sound differently.

What might be true for you, may not be true for me.

There are no universal truisms regarding audio testing and sensitivity that are not gross over-generalisations (this one included).

There is no single correct answer to 'which is best?' whenever it relies on any subjective judgement by more than one person.

We are easily fooled by our senses.

There is a poor correlation between perceived sound quality and measured performance because they are not the same thing, one is an individual's emotional response and the other is a series of electrical measurements.

....it really is that simple.
 
That's what I'm trying to find out. Is instantly switched ABX testing all it's cracked up to be? I've gone into HiFi shop demos & left totally baffled! Are short-term memory comparisons helpful or just confusing?

Several years ago, I bought a new CDP. I was a little underwhelmed by it & thought it sounded more or less the same as the one it replaced. I forgot about the expense & carried on listening to it. I'd get around to playing particular pieces of music, then realised that I was hearing certain aspects I'd not noticed before, so I dug out more stuff to find the same thing. Was long-term memory kicking in here? ABX testing may need to account for more than just the short-term listening that often causes confusion.

Surely familiarisation is the more likely explanation? Similar to the "burn in" phenomenon.

Chris
 
Surely familiarisation is the more likely explanation? Similar to the "burn in" phenomenon.

Chris

No, I don't think so. It is completely normal to notice new things after prolonged exposure to something, whether that be sound, vision, taste or any other sense.

It's a large part of how we learn... when you get over the initial exposure to an experience you can then focus more easily on the details.

I work in the field of art and design, and recently some other designers needed to copy some of my work.

When it came back they had got the gist of it but it just didn't look "right"; some smaller details were different. When I pointed out what they were missing they were then able to see it too....
 
I shall make it easy for all of you....

We are all different.

We listen out for different details and or process and prioritise different 'aspects' of sound differently.

What might be true for you, may not be true for me.

There are no universal truisms regarding audio testing and sensitivity that are not gross over-generalisations (this one included).

There is no single correct answer to 'which is best?' whenever it relies on any subjective judgement by more than one person.

We are easily fooled by our senses.

There is a poor correlation between perceived sound quality and measured performance because they are not the same thing, one is an individual's emotional response and the other is a series of electrical measurements.

....it really is that simple.

Is it?! "We are easily fooled by our senses". Indeed. ABX testing isn't about pitting one person's opinions against someone else's but an individual's ability to confirm what they think they can hear. If, (& I'm saying it's a big if) ABX testing works & it proves that perceived differences are merely imagined, I am not going to spend money where it ain't necessary! Whether or not other people want to continue to spend vast amounts on kettle leads is up to them.
 
It sounds pretty complicated to me, but you do need to separate the process of looking for and verifying the presence of real differences from the entirely subjective response of individual preference.
ABX and AB have little or nothing to do with the latter and everything to do with the former.

The idea that we are all different and focus on different things is no reason to use comparative testing that is fundamentally flawed.
 

