I know this is an area in which you have particular expertise, so I'm curious as to the bolded bit in the above.
My understanding of the methodology, in general terms, is this:
1. Play an extract of music on Device A.
2. Play the same extract of music on Device B.
3. Play the same extract of music on either Device A or Device B without attendees being aware which is in use. Ask attendees to choose whether it is A or B that is playing.
4. Repeat step 3 until the desired level of statistical significance is achieved (sketched below).
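As an aside, the "statistical significance" in step 4 usually reduces to a simple exact binomial calculation on the number of correct calls. A minimal sketch of that arithmetic (my own illustration, not anyone's published protocol; it assumes each trial is an independent 50/50 guess under the null):

```python
# Given n blind trials and k correct identifications, how unlikely is a
# score at least that good from pure guessing?
from math import comb

def p_value(k, n):
    """One-sided exact binomial p-value for k or more correct out of n,
    under the null hypothesis of guessing (p = 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(round(p_value(12, 16), 3))  # 0.038 - 12/16 clears the usual 0.05 bar
print(round(p_value(11, 16), 3))  # 0.105 - 11/16 does not
```

So, conventionally, 12 correct out of 16 would count as a positive result, and anything less would be declared null.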
It seems to me that aural memory is engaged in a) remembering the sound of A, of B, and of A vs B, along with what differences (if any) were detected; and b) recalling all of that while listening during the repetitions of step 3. Plus, there may be time taken to disconnect A and reconnect B, even if only scant seconds. We have been told, numerous times, that aural memory is only reliable for scant seconds. And if aural memory is unreliable, this is a mechanism by which confusion can set in between the original A and B over time (eg during step 4 above, depending on the number of repetitions).
It is these misgivings that lead me to ask advocates of blind testing whether they have ever run a control test for their methodology (ie, to show whether it can reliably detect known audible differences, and how sensitive it is - how gross the differences have to be in order to be reliably detected). I've yet to hear that this has been done, still less to any statistically significant extent.
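That sensitivity question can at least be framed numerically. A minimal sketch (hypothetical numbers throughout - the 70% per-trial accuracy and the 16-trial test are assumptions for illustration): model a listener who genuinely hears a known difference and calls it correctly on 70% of trials, then ask how often the standard test would actually flag them as significant:

```python
from math import comb

def p_value(k, n):
    """One-sided exact binomial p-value under guessing (p = 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def power(p_correct, n, alpha=0.05):
    """Probability that a listener with per-trial accuracy p_correct
    reaches significance in an n-trial blind test."""
    # the smallest number of correct calls that clears the threshold
    k_crit = next(k for k in range(n + 1) if p_value(k, n) <= alpha)
    # the chance this listener scores k_crit or better
    return sum(comb(n, k) * p_correct ** k * (1 - p_correct) ** (n - k)
               for k in range(k_crit, n + 1))

print(round(power(0.7, 16), 2))  # ~0.45: the difference is missed over half the time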
Some background - I worked in medical imaging, mainly testing whether adding something into MRI scanning increased, and/or made more accurate, the information a Radiologist gains from the images. For this discussion we have to ignore the hardest question (how do you know that the lump on the image is really there?) and just deal with getting a bunch of Radiologists to look at the images and give you unbiased evaluations of what they see. The first thing is to remove all evidence of the patient's sex (some lesions are more common in one sex than the other); second, to remove evidence of age (same reason); then the hospital where the image was obtained (hospitals tend to see more of one disease type) and the country (Radiologists know the population bias for different diseases - eg primary liver cancer is more common in Japan, while liver metastases are rarer there). Just taking out those cues reduced the accuracy of top-level Radiologists from 90+% to 60+%. That matters because it means that in the study the only thing leading to a diagnosis is what's in the images, rather than all those possible biases, which will be at different levels across the Radiologists.
To try to get to the point: very early on we did simple studies - 3 or more Radiologists in a room (on separate workstations) looking at images and writing down number of lesions, location of lesions, size of lesions, etc, plus a diagnosis. Took a couple of weekends; give the data to the statisticians; home and hosed. Except not. First up we learned about dominant participant bias, where over time the less senior Radiologists started to fall in with the most senior guy, especially after breaks where they could talk together. All future studies were run one Radiologist at a time.
The big surprise, though, was visual memory (at last getting to the audio comparison). Although visual memory is thought to last longer than audio memory, it is still fairly short (just ask a detective looking for witnesses). We learned that some, perhaps many, experienced Radiologists can accurately recall images from a case they saw a couple of months (and hundreds of patients) ago and use that information to inform decisions about an image in front of them. Bad news for us, as we then had to increase the gap between a Radiologist seeing different images from the same patient from a couple of weeks to 3 months plus. The point for audio is that I'd bet 50p that 'golden ears' is at least partly a manifestation of extended audio memory.
How did we deal with that? The last, most complex and costly study we designed specifically asked the Radiologist not to make a diagnosis (at least not until the very end) but to describe what they saw in the image, in detail. That description was then compared with reality (this was a liver disease study in which the liver was to be removed from the patient and examined in detail by a Pathologist - the 'truth' against which the Radiologist's assessment was compared). The study took 18 months of weekends and worked at a level we couldn't have dreamed of.
So, if I were designing a study (no, I won't!) to compare two things that are going to be very similar, using people with a huge interest in what is being assessed, some of whom may have enhanced memory/discrimination skills, I would not get them to do something they do all the time, where the biases could have maximum effect.
I'd suggest a test that doesn't rely on listening to the music in the normal way, which gives a better chance of achieving some discrimination:
1. Get agreement on the music to be played - duration probably doesn't matter.
2. Get some agreed experts to listen to the music on one of the systems (no need for this to be blind) and to describe what they hear as exemplary, most noticeable, and whatever else (within reason). It could be the snare drum sitting just left of centre, being able to hear the type of drumstick (not my area of expertise), a violin note decaying realistically - anything.
3. Get your victims to do the test (but not together, to avoid dominant participant bias), and have them simply score the items identified in step 2, obviously blind to which version of the system is in use. Repeat randomly, with the systems changing and also with the same system being used two or three times in succession (see the sketch after this list). These people need to be different from those in step 2.
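To make step 3 concrete, here is a minimal sketch of the randomisation and scoring (the attribute names, trial count and 1-10 scale are all hypothetical, just to show the shape of a session):

```python
import random

# Items identified by the experts in step 2 (invented examples)
ATTRIBUTES = ["snare just left of centre",
              "drumstick type audible",
              "violin decay realistic"]

def make_schedule(n_trials=20, seed=None):
    """Pick the system independently at random on each trial, so the same
    system can come up two or three times in succession."""
    rng = random.Random(seed)
    return [rng.choice(["A", "B"]) for _ in range(n_trials)]

def run_session(schedule, get_score):
    """Collect per-attribute scores; the listener never learns 'system'."""
    results = []
    for trial, system in enumerate(schedule, start=1):
        # in a real rig, play the extract on `system` here
        scores = {attr: get_score(attr) for attr in ATTRIBUTES}
        results.append({"trial": trial, "system": system, **scores})
    return results

if __name__ == "__main__":
    schedule = make_schedule(seed=1)
    # stand-in scorer so the sketch runs; a real session asks the listener
    demo = run_session(schedule, get_score=lambda attr: random.randint(1, 10))
    for row in demo[:3]:
        print(row)
```

Choosing the system independently at random on every trial produces those same-system runs for free, and it means a listener who assumes strict alternation gains nothing by guessing.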
Never going to happen and will take months to do if someone is daft enough to try, but I think this would remove/reduce the problem with the standard methodology described by Steve in post #1781. Clearly I had the advantage of 8-figure budgets and willing participants prepared to give up time to have their knowledge and ability tested in evil ways (I loved working with Radiologists because they were so keen to learn and improve in what is a difficult speciality).
Not sure if that helps or hinders, but I hope it answered the question!