The way I interpret it is that under the blind conditions used, the test was able to show that there were differences between the components, but unfortunately it was not good enough to let the testers work out which components they actually preferred.
On longer exposure it became obvious that they didn't prefer each other's amps. That's what I have been saying all along. IME, really getting a feel for whether you like a component enough to want to live with it takes quite a long time, certainly longer than a simple blind test.
Whether this could be resolved by better experimental design, e.g. longer test periods, I am not sure.