Double blind tests

A difference can be claimed when the result is statistically significant, i.e. significantly away from chance (in an A/B comparison, 50%)

This is statistically correct. On the page pointed to, the first test has 79% correct answers (over 80 trials), the second 62% (over 126 trials), and the third 49% (over 84 trials).

Assuming they applied the right statistical test (my O-level stats is over 20 years old, so I'll not pass judgement), that the experiment itself wasn't flawed, and that they got their sums right, it means that on the first test there is a 0.05% chance the result could have been achieved without a real difference, on the second test a 0.5% chance, and on the last no difference was shown.
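
For anyone who wants to check that sort of figure, here is a rough sketch of the usual sum: the one-sided binomial probability of scoring at least that well by pure guessing. The counts below are my own back-calculation from the quoted percentages, so they may not match how the trials on that page were actually structured.

```python
from math import comb

def binomial_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided probability of scoring `correct` or better by guessing alone."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(correct, trials + 1))

# Hypothetical counts back-figured from the quoted percentages (79% of 80,
# 62% of 126, 49% of 84); the trials on the page may have been structured differently.
for correct, trials in [(63, 80), (78, 126), (41, 84)]:
    print(f"{correct}/{trials}: p = {binomial_p_value(correct, trials):.2g}")
```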

These were tests between different power amps, and it shows (accepting the assumptions) that a difference could be heard.

However, one should always, always ask how much one believes what one reads on the net ;)
 
Hi,

I have spent ages searching the web and have spoken with several people in the hi-fi industry in the past couple of days and not one could point me to an example of a conclusive double blind ABX test.

Could you please DEFINE "conclusive ABX test"?

What is conclusive? Do you accept "difference proven beyond reasonable doubt" for a small sample size with a significance of .2? Or do you insist on .05 (which is good enough for medical trials, but requires very large sample sizes)?

Have you considered the "Type B statistical error" for the tests you reference?

For information, a "Type A error" is defined as incorrectly rejecting the null hypothesis (that is, "no difference") when it is in fact true. This is typically expressed as significance. A significance of .2 means that if 100 tests were conducted, only 20 of those 100 would be likely to show a Type A error, and so on.

A "Type B statistical error" is defined as incorrectly accepting the null hypothesis (that is, "no difference") when it is in fact false and a difference exists. The problem is that by reducing the likelihood of Type A errors we increase the likelihood of Type B errors.

So, if we apply a significance of .05 to a small-sample trial, our likelihood of a Type B error can reach well past .5, i.e. a 50/50 chance of not finding a difference that in fact exists.

The basic problem is bad and inappropriate statistics, which impose a significance level such that, for a given sample size, the likelihood of identifying differences becomes severely compromised. If we roughly balance the risk of Type A and Type B errors, we find that for small differences (say, as in interconnects) we need sample sizes in the region of 100 in order to have any reasonable statistical power. Smaller sample sizes basically force us to accept that the sample is too small to apply significance levels that give reasonable avoidance of Type A errors; should we still choose them (like the often-quoted .05 significance), we simply push the likelihood of Type B errors towards near certainty.
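
To put rough numbers on that trade-off, here is a small sketch assuming a simple yes/no trial where guessing gives 50% correct and a "real but small" difference lets the listener answer correctly 60% of the time (the 60% figure is purely my own assumption for illustration):

```python
from math import comb

def upper_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n: int, alpha: float = 0.05) -> int:
    """Smallest number of correct answers that rejects "no difference" at alpha."""
    k = n
    while k > 0 and upper_tail(n, 0.5, k - 1) <= alpha:
        k -= 1
    return k

def type_b_error(n: int, true_p: float, alpha: float = 0.05) -> float:
    """Chance of missing a real difference when the listener truly scores true_p."""
    return 1.0 - upper_tail(n, true_p, critical_value(n, alpha))

for n in (16, 50, 100, 200):
    print(f"n={n}: need {critical_value(n)}/{n} correct, "
          f"Type B error ~ {type_b_error(n, true_p=0.6):.2f}")
```

On those assumed numbers, a 16-trial test run at .05 misses the real difference most of the time; the Type B risk only falls to modest levels somewhere between 100 and 200 trials.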

Note, none of this has much to do with what was heard - rather, it is about statistics. Or, as I always tend to say, I do not trust statistics that I have not faked myself.

Ciao T
 
Yes, yes yes.........

Firstly, is this a random process with a normal distribution?

If you asked 100 people to eat cabbage or a Mars bar and asked them to tell you which was the Mars bar....guess what, you'll get 100% right because the test is so easy. The outcome is no longer random with a normal distribution; it has become biased.

And the reverse can happen.

Now if you take a superb audio signal, push it through a crap D/A system, use a crap switch system to do ABX, and add a bit of pressure (TEST, TEST), you start to get totally random results around zero. Why? Because there is no difference, or because your test is so crude?

When I see the ABX page I see poor digitisation systems confusing whether amp A is better than amp B. Sorry, if you must (and it seems some must) then it has to be done the hard way. Or you could try single blind to make it a bit easier (but no chatting with the audience, of course).

Jeez, isn't it easier to err..... listen.....

Yep, you could be deluded or suffering a placebo effect, but please don't try and tell us that ABX is 'scientific' when you layer good music signals with crap 'intermediates'.

It is scientific claptrap, and layering the null hypothesis on top of this crude test to give it 5% or 1% Type A or B outcomes just shows me you have a calculator with 8 digits while using a 12" ruler to measure µm sizes. Yep, my answer is 5.01347245 +/- 25.
 
I suggest anyone who believes ABX testing is inadequate for testing human hearing writes an article explaining their reasoning and makes it available for peer-review. It could be a significant new discovery.

-- Ian
 
Can ABX be used to test long-term emotional response, Ian?

By the way, I can't believe this place is still holding the f**king debate. Lost in space.

As an aside, my speakers sound damned good with a crocodile clip used to connect them to some bare telephone cable. Back off to a more rewarding cul-de-sac in cyberspace, I think...
 
Can ABX be used to test long-term emotional response, Ian?

That's what you get from music, not from tiny differences in sound, real or imagined.

As an aside, my speakers sound damned good with a crocodile clip used to connect them to some bare telephone cable.

You make my point exactly. Keep them like that, they won't sound any better with anything else, as I'm sure you know.

-- Ian
 
I suggest anyone who believes ABX testing is inadequate for testing human hearing writes an article explaining their reasoning and makes it available for peer-review. It could be a significant new discovery.

-- Ian

No need, it has been done before many times. Just read any standard text on the problems with doing ABX tests on human responses. Inadequate methodology is well known about, as is using crap digitisation. God, if people ain't happy with 44.1kHz/16-bit, what hope is there for using a £5 board in a PC? Do the test properly first, then analyse it to hell.
 
No need, it has been done before many times. Just read any standard text on the problems with doing ABX tests on human responses. Inadequate methodology is well known about, as is using crap digitisation. God, if people ain't happy with 44.1kHz/16-bit, what hope is there for using a £5 board in a PC? Do the test properly first, then analyse it to hell.

George, really, do it. Publish a paper about it. You need to get your findings into the scientific community if you're going to get anywhere with this. I wish you luck with it, you clearly have the beginnings of a compelling argument.

-- Ian
 
I have been listening to music between posts - and it sounds great to me - it might be muppet music to you, I don't know, I don't care. I hope yours all sounds good to you and that you are listening to it a lot. That's the most important thing. Ian and I agree on that - I think?
 
Hi,

George, really, do it. Publish a paper about it. You need to get your findings into the scientific community if you're going to get anywhere with this.

Funny, quite extensive criticism of the statistical analysis of small-scale/small-sample-size ABX tests was published in the '80s in the JAES, presented at AES conferences, etc. All a matter of record.

None of this has stopped the proponents of small-scale ABX testing from continuing to use statistics that are severely flawed. Now, if there is substantial scientific evidence, in print and peer reviewed, that you use inappropriate methods, and you continue to utilise the selfsame methodology, what do you call this?

Now, if you were actually interested in establishing the truth, you would adjust your methodology to ensure that you use reliable statistics. Of course, if instead you merely wish to make suppositions that support a given position (e.g. the position "everything sounds the same"), and the poor statistical methods tend fairly reliably to generate suitable "evidence" (however flawed), you'd sure as hell keep going exactly the way you have.

I personally posit that double blind testing is applicable to audio, just as to any other field. And statistical analysis is applicable to any suitably "blinded" process to determine the likelihood of the observed data being random or non-random.

In order to produce results (with any kind of experiment) that have statistical power (that is, can claim any relevance outside the precise setting of the experiment), one needs to follow certain strictures of both the scientific method and of statistical analysis.

Anyone failing to follow said strictures who nevertheless trumpets around his or her flawed reasoning from insufficient data is by definition not a scientist but a charlatan. Thus one should look at the appropriateness of the statistical method for the subject at hand and then consider whether one deals with scientists or charlatans.

A hint: the use of a .05 significance level with small sample sizes (below a few hundred), combined with the insistence that any "not different" result can be supported with reasonable confidence, is a strong indication of a lack of rigour and appropriateness in the statistical analysis.
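
As a concrete illustration of how little a small-sample "not different" verdict actually pins down, here is a quick sketch of the confidence interval left over after a non-significant small trial. The 10-out-of-16 score is invented, and the Wilson approximation is just one common way of computing such an interval:

```python
from math import sqrt

def wilson_interval(correct: int, trials: int, z: float = 1.96):
    """Approximate 95% confidence interval for the true detection probability."""
    p = correct / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# An invented non-significant outcome: 10 correct answers out of 16 trials.
low, high = wilson_interval(10, 16)
print(f"true detection rate could be anywhere from {low:.2f} to {high:.2f}")
```

On that invented score the interval stretches from below chance to above 0.8, i.e. the data are compatible both with no difference and with a rather easily audible one.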

Ciao T
 
Well said T, though you don't mention the difficulties of actually designing the experiment so that the data collected isn't itself flawed - or can't be said to be by those who don't wish to accept the results, whichever way they fall. Or that even if the null hypothesis is rejected, such a result only tells you that a difference could be heard between two specific products, not which is better.
 
Funny, quite extensive criticism of the statistical analysis of small-scale/small-sample-size ABX tests was published in the '80s in the JAES, presented at AES conferences, etc. All a matter of record.

But that's entirely different from a wholesale rejection of ABX as a suitable test for human hearing, which is what George and others propose. His logic is: "I can hear a difference, therefore an ABX test which doesn't support this result is necessarily flawed". I have no problems with debates about sample sizes, methodologies, significance levels, etc, since at least in such debates all sides are starting from the same basic set of reasonable - rather than religious - assumptions.

-- Ian
 
But that's entirely different from a wholesale rejection of ABX as a suitable test for human hearing, which is what George and others propose. His logic is: "I can hear a difference, therefore an ABX test which doesn't support this result is necessarily flawed". I have no problems with debates about sample sizes, methodologies, significance levels, etc, since at least in such debates all sides are starting from the same basic set of reasonable - rather than religious - assumptions.

-- Ian

Ian, if we can just be a little bit dispassionate about the whole topic for a moment, it isn't the double blind ABX test methodology per se that is flawed, because it works perfectly well in the scientific community by and large as a benchmark test.

In my humble opinion, it is not only the listeners used on the panel and their hearing acuity that are the problem; it is also the subjects under test, both of which are heavily susceptible to subjective bias and other factors. To expand on that, in real terms there isn't a huge chasm of "difference", in absolute terms, between the exotic setup and the lesser setup under test in the Matrix example, or in the ABX results page I linked to earlier.

I also contend that a listening panel selected for the test is tainted by knowing that they are participating in a "test" which, they subconsciously know, will have a dramatic impact on themselves and the hi-fi world if published - whatever the result. If you took 20 people off a railway platform who had never heard a hi-fi before and had no interest in the outcome of the test, then I contend you would get a different set of results in the analysis, assuming of course the subjects under test held sufficient differences to be identifiable in the first place. Even a margin of 5% beyond the average result would mean this held some validity.

It is that wholly subjective element that drives the entire hobby which is one part of the issue here; the second is that if you take the bottom-of-the-range CD player, for example, and the top-of-the-range CD player, there isn't a huge chasm of easily identifiable differences to put to the test anyway.
 
Hi,

Well said T, though you don't mention the difficulties of actually designing the experiment so that the data collected isn't itself flawed - or can't be said to be by those who don't wish to accept the results, whichever way they fall. Or that even if the null hypothesis is rejected, such a result only tells you that a difference could be heard between two specific products, not which is better.

I did not feel either one needed specific mention; both fall under "scientific method".

Of course, a well-designed experiment would be verified by first ensuring that all readily acknowledged audible differences are reliably detected; failure to do so inherently invalidates the experiment, at least with respect to any applicability outside its precise settings.

And yes, despite claims from the ABX charlatans that failure to reject the null hypothesis constitutes "proof" of their contention, it constitutes no such thing - though large sample sizes and solid statistics would lend significant statistical power to such a contention if they were present.

Currently I am aware of only one recent example of solid double blind testing and that is in the area of perceptual coding (Dolby Digital, MP3/MP4). There is quite a bit of solid research there, also about limits of audibility, which add new fuel to old debates. However, unsurprisingly, the ABX Charlatans also completely ignore this research.

Ciao T
 
Hi,

it is also the subjects under test, both of which are heavily susceptible to subjective bias and other factors.

Hear hear.

Yes indeed. The problem is that the test may be double blind towards the actual identity of the DUT; it is, however, not blind as to the nature of the test. If the subject has strong opinions/convictions about what is being tested, his or her perception provides a strong randomising factor.

Whoever is convinced that no difference exists will perceive none where one exists, and equally whoever is convinced that a difference exists will perceive one even where none is present. Both will "score" pretty much at random, IF the presentation has an equal number of "different" and "same" trials.

BTW, if the number of trials is unequal, then the "score" is weighted towards the more frequent presentation in the above scenario.
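
A quick sketch of that weighting, for an assumed same/different presentation (the trial mix and the listener's bias are made-up numbers):

```python
def expected_score(p_different: float, p_say_different: float) -> float:
    """Expected fraction of correct answers for a listener who answers
    "different" with a fixed probability, regardless of what was played."""
    p_same = 1.0 - p_different
    return p_different * p_say_different + p_same * (1.0 - p_say_different)

# Equal mix of "same" and "different" trials: any fixed conviction scores 50%.
print(expected_score(0.5, 1.0), expected_score(0.5, 0.0))
# 70% "different" trials: always hearing a difference now scores 70%.
print(expected_score(0.7, 1.0), expected_score(0.7, 0.0))
```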

Ciao T
 
Currently I am aware of only one recent example of solid double blind testing and that is in the area of perceptual coding (Dolby Digital, MP3/MP4). There is quite a bit of solid research there, also about limits of audibility.
Ciao T
Hi.
I have indeed noticed how most audiophile editors seem to completely ignore proper research about DD, MP3, etc.

BTW, I am interested to know what bit of research on limits of audibility you keep referring to. Do you have the reference?
 
Yes indeed. The problem is that the test may be double blind towards the actual identity of the DUT; it is, however, not blind as to the nature of the test. If the subject has strong opinions/convictions about what is being tested, his or her perception provides a strong randomising factor.

Eliminating or minimising expectancy effects is a pretty elementary part of designing any rigorous ABX testing protocol, I would think. I'm not sure you're describing an in-principle problem. If people don't know what is being tested/changed, but are just listening to music and asked to comment on what they hear, and those under test are not just or even predominantly those who describe themselves as audiophiles, expectancy can be minimised. There is nothing unique about audio testing in this regard; any good ABX test in any field needs to be sensitive to expectancy effects.

-- Ian
 
Hi,

BTW I am interested to know what bit of research on limits of audibility you keep refering to. Do you have the reference?

There is nothing specific. If you go through the research papers on perceptual coding you will find many items that indirectly relate to audibility limits. It is quite amusing how selective the ear/brain system is as to what it notices and what not.

Sorry, no handy references.

Ciao T
 
Hi,

Eliminating or minimising expectancy effects is a pretty elementary part of designing any rigorous ABX testing protocol I would think.

Well, if Tom Noisaine tests "golden-eared audiophiles" in a major publication by clearly telling them that they will listen to two different systems, one of which uses "junk" electronics and a "poor" setup and another that uses expensive "audiophile" valve amplifiers, has he eliminated or minimised expectancy?

If he then applies a .05 significance level to a single test subject, has he used appropriate statistical methods?

If he then trumpets in his article, ex cathedra, that because his subjects individually failed under those conditions to show any significant identification (I should add that taking the data from ALL test subjects together in fact showed a clear tendency towards identification, even at a .05 level of significance), clearly all differences are imaginary, has he used appropriate analysis to come to his conclusions?
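
To illustrate how pooled data can show a clear tendency even when every individual listener "fails", here is a small sketch with invented numbers - ten listeners, ten trials each, seven correct apiece (not the actual data from that article):

```python
from math import comb

def p_value(correct: int, trials: int) -> float:
    """One-sided binomial probability of doing at least this well by guessing."""
    return sum(comb(trials, k) * 0.5**trials for k in range(correct, trials + 1))

# Each listener alone: 7 out of 10 is nowhere near significance at .05.
print(round(p_value(7, 10), 3))
# All trials pooled: 70 out of 100 is significant far beyond the .05 level.
print(round(p_value(70, 100), 6))
```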

Yet it is tests of this type of methodology and approach which make up, as far as I can tell, ALL of the published tests of the ABX "crowd". Not one test attempted to minimise expectations, to use appropriate statistical methods, or to use realistic analytical methods to give, for example, a confidence interval instead of their draconian "not different" notion.

I'm not sure you're describing an in-principle problem.

I agree. I describe an "in operation" problem, but one that is so common as to have, in effect, become an in-principle one.

When I talked about audio blind testing I forgot ONE really shining example of testing. I hope Markus will not begrudge me some extensive "fair use" citation from God is in the Nuances - Markus Sauer - Stereophile, January 2000:

Markus Sauer said:
Expert testimony

Jürgen Ackermann is a 37-year-old psychologist living in Frankfurt, Germany. He has long been interested in music and its reproduction, building amplifiers and speakers for himself as well as for some friends. His current home system includes a home-brew tube preamp, a home-brew single-ended triode power amp (the power in question being all of 2W from a single 2A3 per channel), and modified Klipschorns. This system is seriously loud when required, those sound bursts from Flim and the BB's Tricycle coming across as positively threatening---yet it whispers with a clarity and conviction most minimonitors fall short of. His amp is remarkable in that there is none of the hum that is generally unavoidable with direct-heated triodes. He has designed an indirect heating that relies on very precise balancing of voltages, and has made it work beautifully.

As part of his doctoral thesis, Ackermann researched the experience of music reproduction in the home. He conducted an experiment, setting up three systems in a room of the Frankfurt Hochschule für Musik und Darstellende Kunst (Music and Performing Arts University). The first system consisted of an analog record player, ca $4800, and a tube pre- and triode power-amp combination worth ca $4500 (hereinafter called the analog system). The second system substituted a respected CD player, ca $2400, which has been well reviewed worldwide, including in the pages of Stereophile, but retained the tube amps. The third system kept the CD player but was powered by a transistor pre-/power combination worth ca $11,000 (hereinafter called the digital system).

The components had been selected as being reasonably representative of their kind. The loudspeakers were held constant and had been selected for their ability to sound equally good driven by tubes or transistors. If anything, the system favored the expensive transistor combo, which had been selected because it was one of the best-selling combinations in its price range, and also because comparative listening tests against some other transistor amps had revealed this combo to sound particularly good in the test configuration. All three systems were played at exactly the same loudness level.

Ackermann found 53 people from all walks of life willing to participate in his experiment: hi-fi enthusiasts, musicians, and "normal" people with no special relation to music or its reproduction. The selection of participants was not truly stochastic, but the sample was large enough to give meaningful results.

Participants were seated in a room before a pair of loudspeakers. The part of the room behind the speakers was partitioned off with dense cloth so that the participants could not know what went on behind this curtain. Indeed, they had no idea what was going on or what, if anything, was changed between trials, except that they were going to be interviewed on their reactions to several pieces of music. Ackermann made the system changeovers without once interacting with the participants.

The participants were received and instructed by a student who was paid for her time. This student, who had no knowledge of things hi-fi, was instructed to sit behind the participants so she could not influence the participants even subconsciously. The student first gave the participants a questionnaire that asked for their musical likes and dislikes. A second questionnaire asked how the participants normally listened to music, and a third questionnaire tried to establish the emotional base level at which each participant entered into the experiment. These and all other questionnaires were standard forms developed for musico-sociological research, and had been pretested to be meaningful and easily understandable by the participants.

Then the participants were played a standardized set of three musical pieces. These were tracks from Larry Conklin's Dolphin Grace (light jazz), Sally Barker's This Rhythm Is Mine (pop), and Italian Violin Musik, 1600-1750 on Edition Open Window (baroque classical music). The tracks had been selected after a preparatory experiment showed that they gave meaningful results. None was offensive to the participants; strong individual likes or dislikes could not influence the experiment's outcome.

After the first run-through, participants were given three more questionnaires: one asking for their emotional balance (the same questionnaire as before the music began), one asking how the participants had experienced the musical tracks, and one asking for their opinions on these tracks.

Then the participants were played the same tracks on a different system, and again had to fill out the three questionnaires; and so on with the third system. The sequence of the three systems was randomized so that familiarity effects, or fatigue, could not influence the overall outcome.

After the third trial, the participants were asked to fill out, besides the three standard questionnaires, a final questionnaire asking whether they had a music system at home, what it consisted of, and how expensive the components were.

Finally, the participants were asked by the student which of the three still-unidentified systems they would buy. The student also took notes of the participants' behavior during the tests: Did they react to the music by moving their feet? Did they sit through the presentation, or did they talk or stand up while the music was playing? and so on.

The tests were not exhaustive, in the sense that further questions might have shed even more light on the subjects' response to the three systems. But, as each test took about two hours, it was felt that this was the maximum time that people without any interest in the outcome of the experiment would be willing to be subjected to the rigors of being under very close scrutiny (13 multi-page questionnaires to fill out---what a chore).

Care was taken to keep exterior factors constant. The listening room was not darkened, because it was felt that listening in a dark room would be too far outside everyday experience for most participants. It is well known that lighting conditions have an effect on people's mood (or why do you turn out the lights when you want to share a little intimacy with your partner?). To keep lighting conditions constant, the experiments were restricted to a time slot between around 10am and 2pm, which meant that only two or, at a pinch, three persons a day could be interviewed. The time of day at which each interview was conducted was noted; it will be interesting to see if there is a correlation between time of day and the results.

Giving the complete results of Jürgen Ackermann's experiment would be way beyond the scope of an article such as this one; besides, Ackermann has not yet completed his statistical analysis. But there are already some results that seem interesting enough to warrant a preliminary report.

Let's start with the emotional states of the participants. The participants began with a base tension level of 3.26; with the digital system this dropped to 2.35, and with the analog system to 1.75. Nervousness was raised from a base level of 1.8 to 2.2 by the digital system, but fell to 1.1 with the analog system. The need for relaxation fell from a base level of 2.6 to 1.9 with the analog system, but rose to 2.9 with the digital system. The ability to concentrate remained constant with the analog system at 4.3, but fell to 3.6 with the digital system. Relaxedness stayed constant with the digital system at 4.0, but rose to 4.6 with the analog system. This shows that the analog system worked toward a feeling of serenity in the participant, whereas the digital system heightened tension and stress.

Equally interesting was the response to the question of whether the participants liked the music they were played. With the analog system, 43 out of the 53 participants said they liked the Larry Conklin piece, 46 the baroque music, and 38 the Sally Barker piece. The music was heard as interesting, emotionally appealing, and engaging. Via the digital system, the levels fell to 31, 33, and 33, respectively. The same music was now more often experienced as boring. Food for thought.

The questionnaire asking for the listeners' experience of the music gave just as interesting results. Thirty participants sang along with the music under their breaths when it was played via the analog system, and only 19 with the digital system. Forty-seven participants said they had let themselves be carried along by the analog system, 19 with the digital system. When questioned whether the music had influenced their movements (tapping their feet, etc.), the numbers were 30 and 25. Forty-six participants had been inspired to think about the music by the analog system, 34 by the digital system. Forty-seven participants said the music had improved their sense of well-being via the analog system, 31 via the digital.

Conversely, no participant said that the analog system had impaired their sense of well-being, but 16 participants said so of the digital system! This must be one of the most astonishing, and irritating, results of Ackermann's experiment. How can it be that we spend a lot of money on something that makes us feel worse?!

The results of the "intermediate" CD/tube system were consistently between those of the digital and analog systems.

At the end of the test, the participants were asked which of the systems they would buy. Those listeners who had some experience of things hi-fi preferred the digital system, which they thought sounded better. Those participants without such experience preferred the analog system's sound. The conclusion Ackermann drew from this is that the sound of modern hi-fi is the result of a learning process. When told that a certain sound is what they should aim for, often enough people will accept this concept of sound as their internal reference.

Another inference that may be drawn from this question is that there was no correlation between what the participants experienced as good sound and which system made them feel good. In other words, the perceived quality of sound had no influence on whether the participants liked the music and its emotional impact on the listeners. One participant, a musician, even responded that he could hear absolutely no difference in sound between the presentations, yet his emotional response was very different on the three trials, and showed complete conformity with the rest of the participants.

Some of you will have noticed that there was one long-term test subject: the student who accompanied the participants during their time in the listening room. The poor girl had to listen to the above-mentioned pieces 159 times! At the end of the experiment, she asked Ackermann what the systems were. She said she couldn't stand the sound of one of the systems anymore, feeling physically attacked by its sound. By now, it won't surprise you that the system in question turned out to be the CD/solid-state one.

Ciao T
 
