Double blind tests

Discussion in 'Hi-Fi and General Audio' started by kmac, Jun 27, 2007.

  1.

    Uncle Ants In Recordeo Speramus

    Joined:
    Dec 5, 2003
    Messages:
    1,928
    Likes Received:
    0
    Location:
    East Midlands
    This is correct statistically. On the page pointed to, the first test had 79% correct (80 trials), the second 62% (126 trials), and the third 49% (84 trials).

    Assuming they applied the right statistical test (my O-level stats is over 20 years old, so I'll not pass judgement), and assuming the experiment itself wasn't flawed (and that they got their sums right), it means that on the first test there is a 0.05% chance that the result could have been achieved without a real difference, on the second test a 0.5% chance, and on the last no significant difference was found.
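
    For illustration, here is how the second of those figures can be checked with a one-tailed binomial test against pure guessing (a sketch in Python with scipy; the count of 78 correct is my rounding of 62% of 126, and the page may well have used a different test):

    ```python
    from scipy.stats import binom

    # 62% of 126 trials is roughly 78 correct (rounded - an assumption).
    # One-tailed binomial test against pure guessing (p = 0.5):
    p = binom.sf(78 - 1, 126, 0.5)  # P(78 or more correct by chance)
    print(f"p = {p:.3%}")  # about 0.5%, in line with the quoted figure
    ```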

    These were tests between different power amps, and they show (accepting the assumptions) that a difference could be heard.

    However, one should always, always ask how much one believes what one reads on the net ;)
     
    Uncle Ants, Jul 11, 2007
  2.

    3DSonics away working hard on "it"

    Joined:
    Sep 28, 2004
    Messages:
    1,469
    Likes Received:
    0
    Location:
    Planet Dirt, somewhere on it
    Hi,

    Could you please DEFINE "conclusive ABX test"?

    What is conclusive? Do you accept "difference proven beyond reasonable doubt" for a small sample size with a significance of .2? Do you insist on .05 (which is good enough for medical trials but requires very large sample sizes)?

    Have you considered the "Type B statistical error" for the tests you reference?

    For information, a "Type A error" (more commonly called a Type I error) is defined as incorrectly rejecting the null hypothesis (that is, "no difference") when it is in fact true. This is typically expressed as the significance level: a significance of .2 means that if 100 tests were conducted where no real difference exists, around 20 of the 100 would be expected to show a Type A error, and so on.

    A "Type B" statistical error (more commonly, a Type II error) is defined as incorrectly accepting the null hypothesis (that is, "no difference") when it is in fact false and a difference exists. The problem is that as we reduce the likelihood of Type A errors, we increase the likelihood of Type B errors.

    So, if we apply a significance of .05 to a small-sample trial, our likelihood of a Type B error can reach well past .5 - worse than even odds of missing a difference that in fact exists.

    The basic problem is bad and inappropriate statistics: imposing a significance level such that, for the given sample size, the likelihood of identifying differences becomes severely compromised. If we roughly balance the risks of Type A and Type B errors, we find that for small differences (say, as with interconnects) we need sample sizes in the region of 100 in order to have any reasonable statistical power. Smaller sample sizes basically force us to accept that the sample is too small for significance levels that give us reasonable avoidance of Type A errors; should we still choose them (like the often-quoted .05 significance), we simply push the likelihood of Type B errors towards near certainty, as the sketch below shows.
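
    To make the trade-off concrete, here is a rough sketch (Python with scipy; the 60% "true" hit rate for a listener who genuinely hears a small difference is purely an illustrative assumption) of how the Type B error behaves for a small versus a large trial:

    ```python
    from scipy.stats import binom

    def abx_power(n, p_true, alpha=0.05):
        """Type B error and power of a one-sided binomial ABX test.

        n:      number of trials
        p_true: assumed true probability of a correct answer
                (0.5 would mean no audible difference at all)
        alpha:  significance level, i.e. the Type A error rate
        """
        # Smallest number correct that reaches significance under the null:
        k = next(k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) <= alpha)
        beta = binom.cdf(k - 1, n, p_true)  # chance of missing a real difference
        return k, beta, 1 - beta

    # A listener who genuinely hears a small difference (60% correct):
    print(abx_power(16, 0.60))   # Type B error around 0.8 - an almost certain miss
    print(abx_power(100, 0.60))  # with 100 trials it falls to roughly 0.4
    ```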

    Note, none of this has much to do with what was heard; it is about the statistics. Or, as I always tend to say, I do not trust statistics that I have not faked myself.

    Ciao T
     
    3DSonics, Jul 11, 2007
  3.

    George Sallit

    Joined:
    Jun 5, 2005
    Messages:
    64
    Likes Received:
    0
    Yes, yes yes.........

    Firstly, is this a random process with a normal distribution?

    If you asked 100 people to eat cabbage or a Mars bar and tell you which was the Mars bar... guess what, you'll get 100% right because the test is so easy. The outcome is no longer random with a normal distribution; it has become biased.

    And the reverse can happen.

    Now if you take a superb audio signal, push it through a crap D/A system, use a crap switch system to do ABX, and add a bit of pressure (TEST, TEST), you start to get totally random results around zero. Why? Because there is no difference, or because your test is so crude?

    When I see the ABX page I see poor digitisation systems confusing whether amp A is better than amp B. Sorry, if you must (and it seems some must), then it has to be done the hard way. Or you could try single blind to make it a bit easier (but no chatting with the audience, of course).

    Jeez, isn't it easier to, err... listen?

    Yep, you could be deluded or suffering a placebo effect, but please don't try to tell us that ABX is 'scientific' when you layer good music signals with crap 'intermediates'.

    It is scientific claptrap, and layering the null hypothesis on top of this crude test to give it 5% or 1% Type A or B outcomes shows me you have a calculator with 8 digits while using a 12" ruler to measure µm sizes. Yep, my answer is 5.01347245 +/- 25.
     
    George Sallit, Jul 11, 2007
  4.

    sideshowbob Trisha

    Joined:
    Jun 20, 2003
    Messages:
    3,092
    Likes Received:
    0
    Location:
    London
    I suggest anyone who believes ABX testing is inadequate for testing human hearing writes an article explaining their reasoning and makes it available for peer-review. It could be a significant new discovery.

    -- Ian
     
    sideshowbob, Jul 11, 2007
  5.

    Stereo Mic

    Joined:
    Aug 30, 2005
    Messages:
    2,309
    Likes Received:
    0
    Can ABX be used to test long-term emotional response, Ian?

    By the way, I can't believe this place is still holding the F**king debate. Lost in space.

    As an aside, my speakers sound damned good with crocodile clips connecting them to some bare telephone cable. Back off to a more rewarding cul-de-sac in cyberspace, I think...
     
    Stereo Mic, Jul 11, 2007
  6.

    sideshowbob Trisha

    Joined:
    Jun 20, 2003
    Messages:
    3,092
    Likes Received:
    0
    Location:
    London
    That's what you get from music, not from tiny differences in sound, real or imagined.

    You make my point exactly. Keep them like that, they won't sound any better with anything else, as I'm sure you know.

    -- Ian
     
    sideshowbob, Jul 11, 2007
  7.

    George Sallit

    Joined:
    Jun 5, 2005
    Messages:
    64
    Likes Received:
    0
    No need, it has been done before many times. Just read any standard text on the problems with doing ABX tests on human responses; inadequate methodology is well known. As is using crap digitisation. God, if people ain't happy with 44.1kHz/16-bit, what hope is there for a £5 board in a PC? Do the test properly first, then analyse it to hell.
     
    George Sallit, Jul 11, 2007
  8.

    sideshowbob Trisha

    Joined:
    Jun 20, 2003
    Messages:
    3,092
    Likes Received:
    0
    Location:
    London
    George, really, do it. Publish a paper about it. You need to get your findings into the scientific community if you're going to get anywhere with this. I wish you luck with it, you clearly have the beginnings of a compelling argument.

    -- Ian
     
    sideshowbob, Jul 11, 2007
  9.

    ADPully

    Joined:
    May 20, 2007
    Messages:
    265
    Likes Received:
    0
    Location:
    Oxford
    I have been listening to music between posts, and it sounds great to me. It might be muppet music to you, I don't know, I don't care. I hope yours all sounds good to you and that you are listening to it a lot. That's the most important thing. Ian and I agree on that - I think?
     
    ADPully, Jul 11, 2007
  10.

    ADPully

    Joined:
    May 20, 2007
    Messages:
    265
    Likes Received:
    0
    Location:
    Oxford
    The music: Waterbearer - Sally Oldfield
     
    ADPully, Jul 11, 2007
  11.

    3DSonics away working hard on "it"

    Joined:
    Sep 28, 2004
    Messages:
    1,469
    Likes Received:
    0
    Location:
    Planet Dirt, somewhere on it
    Hi,

    Funny, quite extensive criticism of the statistical analysis of small-scale/small-sample ABX tests was published in the '80s in the JAES, presented at AES conferences, etc. All a matter of record.

    None of this has stopped the proponents of small-scale ABX testing from continuing to use statistics that are severely flawed. Now, if there is substantial scientific evidence, in print and peer-reviewed, that you use inappropriate methods, and you continue to utilise the selfsame methodology, what do you call this?

    Now, if you were actually interested in establishing the truth, you would adjust your methodology to ensure that you use reliable statistics. Of course, if instead you merely wish to make suppositions that support a given position (e.g. the position that "everything sounds the same"), and the poor statistical methods tend fairly reliably to generate suitable "evidence" (however flawed), you'd sure as hell keep going exactly the way you have.

    I personally posit that double-blind testing is applicable to audio, just as to any other field, and that statistical analysis is applicable to any suitably "blinded" process to determine the likelihood of the observed data being random or non-random.

    In order to produce results (with any kind of experiment) that have statistical power (that is, that can claim any relevance outside the precise setting of the experiment), one needs to follow certain strictures of both the scientific method and of statistical analysis.

    Anyone failing to follow said strictures who nevertheless trumpets around his or her flawed reasoning from insufficient data is by definition not a scientist but a charlatan. Thus one should look at the appropriateness of the statistical method for the subject at hand, and then consider whether one is dealing with scientists or charlatans.

    A hint: the use of a .05 significance level with small sample sizes (below a few hundred), combined with the insistence that any "not different" result can be supported with reasonable confidence, is a strong indication of a lack of rigour and appropriateness in the statistical analysis.
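
    To illustrate the point (a sketch in Python with scipy; the 10-of-16 score is an invented example): a small-sample result that fails to reach .05 significance is still entirely compatible with a real difference, as the confidence interval shows.

    ```python
    from scipy.stats import binomtest

    # Hypothetical small ABX run: 10 correct out of 16 trials.
    result = binomtest(10, n=16, p=0.5, alternative='greater')
    print(result.pvalue)  # about 0.23 - "not significant" at the .05 level

    # Yet the 95% confidence interval for the true proportion correct
    # runs from roughly 0.35 to 0.85, so "no difference" is not demonstrated.
    print(binomtest(10, n=16, p=0.5).proportion_ci(confidence_level=0.95))
    ```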

    Ciao T
     
    3DSonics, Jul 12, 2007
  12.

    Uncle Ants In Recordeo Speramus

    Joined:
    Dec 5, 2003
    Messages:
    1,928
    Likes Received:
    0
    Location:
    East Midlands
    Well said T, though you don't mention the difficulties in actually designing the experiment so that the data collected isn't itself flawed, or cannot be claimed to be flawed by those who don't wish to accept the results, whichever way they fall. Or that even if the null hypothesis is rejected, such a result only tells you that a difference could be heard between two specific products, not which is better.
     
    Uncle Ants, Jul 12, 2007
  13.

    sideshowbob Trisha

    Joined:
    Jun 20, 2003
    Messages:
    3,092
    Likes Received:
    0
    Location:
    London
    But that's entirely different from a wholesale rejection of ABX as a suitable test for human hearing, which is what George and others propose. His logic is: "I can hear a difference, therefore an ABX test which doesn't support this result is necessarily flawed". I have no problems with debates about sample sizes, methodologies, significance levels, etc, since at least in such debates all sides are starting from the same basic set of reasonable - rather than religious - assumptions.

    -- Ian
     
    sideshowbob, Jul 12, 2007
  14.

    Effem Cable manufacturer

    Joined:
    Jan 27, 2005
    Messages:
    269
    Likes Received:
    0
    Location:
    Sunny Cornwall
    Ian, if we can just be a little dispassionate about the whole topic for a moment: it isn't the double-blind ABX test methodology per se that is flawed, because by and large it works perfectly well in the scientific community as a benchmark test.

    In my humble opinion, it is not only the listeners on the panel and their hearing acuity that are the problem; it is also the subjects under test, both of which are heavily susceptible to subjective bias and other factors. To expand on that: in absolute terms there isn't a huge chasm of "difference" between the exotic setup and the lesser setup under test in the Matrix example, or on the ABX results page I linked to earlier.

    I also contend that a listening panel selected for the test is tainted by knowing they are participating in a "test" which, they subconsciously realise, will have a dramatic impact on themselves and the hi-fi world if published, whatever the result. If you took 20 people off a railway platform who had never heard a hi-fi before and had no interest in the outcome, then I contend you would get a different set of results, assuming of course the subjects under test held sufficient differences to be identifiable in the first place. Even a margin of 5% beyond the average result would mean this held some validity.

    It is that wholly subjective element driving the entire hobby that is part of the issue here; the second part is that if you take, for example, the bottom-of-the-range CD player and the top-of-the-range CD player, there isn't a huge chasm of easily identifiable differences to put to the test anyway.
     
    Effem, Jul 12, 2007
  15.

    3DSonics away working hard on "it"

    Joined:
    Sep 28, 2004
    Messages:
    1,469
    Likes Received:
    0
    Location:
    Planet Dirt, somewhere on it
    Hi,

    I did not feel either one needed specific mention; both fall under "scientific method".

    Of course, a well-designed experiment would first be verified by ensuring that all readily acknowledged audible differences are reliably detected; failure to do so inherently invalidates the experiment, at least with respect to any applicability outside its precise settings.

    And yes, despite claims from the ABX charlatans that failure to reject the null hypothesis constitutes "proof" of their contention, it constitutes nothing of the sort - though large sample sizes and solid statistics would give significant statistical power to any such contention, if present.

    Currently I am aware of only one recent example of solid double-blind testing, and that is in the area of perceptual coding (Dolby Digital, MP3/MP4). There is quite a bit of solid research there, including on the limits of audibility, which adds new fuel to old debates. Unsurprisingly, however, the ABX charlatans completely ignore this research too.

    Ciao T
     
    3DSonics, Jul 12, 2007
  16.

    3DSonics away working hard on "it"

    Joined:
    Sep 28, 2004
    Messages:
    1,469
    Likes Received:
    0
    Location:
    Planet Dirt, somewhere on it
    Hi,

    Hear hear.

    Yes indeed. The problem is that while the test may be double blind as to the actual identity of the DUT, it is not blind as to the nature of the test. If the subject has strong opinions or convictions about what is being tested, his or her perception provides a strong randomising factor.

    Whoever is convinced that no difference exists will perceive none where one exists, and equally, whoever is convinced that a difference exists will perceive one even where none is present. Both will "score" pretty randomly, IF the presentation has an equal number of "different" and "same" trials.

    BTW, if the numbers of trials are unequal, then in the above scenario the "score" is weighted towards the more frequent presentation, as the toy simulation below shows.
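
    A simulation of that weighting effect (a sketch in Python; the 70/30 split and trial counts are invented purely for illustration):

    ```python
    import random

    random.seed(42)

    def conviction_score(n_trials, p_same, answer="same"):
        """Hit rate of a listener who always answers from conviction,
        ignoring what is actually presented."""
        hits = sum(
            (random.random() < p_same) == (answer == "same")
            for _ in range(n_trials)
        )
        return hits / n_trials

    # Equal split of "same" and "different" trials: conviction scores ~50%.
    print(conviction_score(10_000, 0.5, answer="same"))
    # 70% "same" trials: the "no difference" believer now scores ~70%
    # without hearing anything at all.
    print(conviction_score(10_000, 0.7, answer="same"))
    ```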

    Ciao T
     
    3DSonics, Jul 12, 2007
  17.

    wolfgang

    Joined:
    Jun 19, 2003
    Messages:
    814
    Likes Received:
    0
    Location:
    Scotland
    Hi.
    I have indeed noticed how most audiophile editors seem to completely ignore proper research on DD, MP3, etc.

    BTW, I am interested to know which research on the limits of audibility you keep referring to. Do you have a reference?
     
    wolfgang, Jul 12, 2007
  18.

    sideshowbob Trisha

    Joined:
    Jun 20, 2003
    Messages:
    3,092
    Likes Received:
    0
    Location:
    London
    Eliminating or minimising expectancy effects is a pretty elementary part of designing any rigorous ABX testing protocol I would think. I'm not sure you're describing an in-principle problem. If people don't know what is being tested/changed, but are just listening to music and asked to comment on what they hear, and those under test are not just or even predominantly those who describe themselves as audiophiles, expectancy can be minimised. There is nothing unique about audio testing in this regard, any good ABX test in any field needs to be sensitive to expectancy effects.

    -- Ian
     
    sideshowbob, Jul 12, 2007
  19.

    3DSonics away working hard on "it"

    Joined:
    Sep 28, 2004
    Messages:
    1,469
    Likes Received:
    0
    Location:
    Planet Dirt, somewhere on it
    Hi,

    There is nothing specific. If you go through the research papers on perceptual coding you will find many items that indirectly relate to audibility limits. It is quite amusing how selective the ear/brain system is in what it notices and what it does not.

    Sorry, no handy references.

    Ciao T
     
    3DSonics, Jul 12, 2007
  20.

    3DSonics away working hard on "it"

    Joined:
    Sep 28, 2004
    Messages:
    1,469
    Likes Received:
    0
    Location:
    Planet Dirt, somewhere on it
    Hi,

    Well, if Tom Nousaine tests "golden-eared audiophiles" for a major publication, clearly telling them that they will listen to two different systems, one of which uses "junk" electronics and a "poor" setup while the other uses expensive "audiophile" valve amplifiers, has he eliminated or minimised expectancy?

    If he then applies a .05 significance level to a single test subject, has he used appropriate statistical methods?

    If he then trumpets in his article, ex cathedra, that because his subjects individually, under those conditions, failed to show any significant identification, all differences are clearly imaginary (I should add that pooling the data from ALL test subjects in fact showed a clear tendency towards identification, even at a .05 significance level), has he used appropriate analysis to come to his conclusions?

    Yet tests of this type of methodology and approach make up, as far as I can tell, ALL of the published tests of the ABX "crowd". Not one test attempted to minimise expectations, to use appropriate statistical methods, or to use realistic analytical methods that would give, for example, a confidence interval instead of their draconian "not different" verdict.

    I agree: I describe an "in operation" problem, but one so common as to have in effect become one of principle.

    When I talked about audio blind testing I forgot ONE really shining example of testing. I hope Markus will not begrudge me some extensive "fair use" citation from "God is in the Nuances" (Markus Sauer, Stereophile, January 2000):

    Ciao T
     
    3DSonics, Jul 12, 2007