Testing is getting more and more problematic, and at every AV conference there are multiple papers about how to do proper testing (not that I think all of them make sense).
I have objections to all of the AV-Comparatives tests, and to the AV-Test ones as well, but those are less 'documented', so it's harder to tell where the deficiencies lie.
The usual points about static testing are:
a) the tests are carried out long after the real infections took place, so from today's point of view they're kind of useless
b) the tests are carried out without any context or state information. Such information matters: if an e-mail carries a file named "document.doc .exe", that alone is enough to block the execution (see the sketch after this list)
c) the tests exercise only the signature engines; they don't test the other generic protection layers the products may have
d) the tests know nothing about the relationships between samples; if you detect the dropper, you don't have to detect the dropped binary
e) the tests are too binary-centric and include only a small amount of script/PDF/Flash malware, although these are among the main vectors for getting through to your computer
f) there is little or no info on how the testbeds are created. All these 99.1%-style scores are complete nonsense from my point of view; the overlap between products' detections is not as great as the Clementi/Marx tests suggest
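To make point b) concrete, here is a minimal sketch of the kind of context check I mean: a filename that hides an executable extension behind a document-like one is suspicious on its own, before any signature is consulted. The extension lists and function name below are purely illustrative, not taken from any real product.

```python
# Hypothetical context heuristic: flag attachment names that push a real
# executable extension out of sight behind a decoy document extension,
# e.g. "document.doc          .exe" or "invoice.pdf.exe".
EXECUTABLE_EXTS = {".exe", ".scr", ".com", ".pif", ".bat", ".cmd", ".js", ".vbs"}
DECOY_EXTS = {".doc", ".docx", ".pdf", ".jpg", ".txt", ".xls"}

def looks_like_disguised_executable(filename: str) -> bool:
    """Return True if the name ends in an executable extension preceded
    (possibly after padding spaces) by a document-like decoy extension."""
    name = filename.lower().rstrip()
    for exe_ext in EXECUTABLE_EXTS:
        if not name.endswith(exe_ext):
            continue
        # Strip the real extension and any whitespace padding used to hide it.
        stem = name[: -len(exe_ext)].rstrip()
        if any(stem.endswith(decoy) for decoy in DECOY_EXTS):
            return True
    return False

if __name__ == "__main__":
    for sample in ["document.doc          .exe", "report.pdf", "photo.jpg.scr"]:
        print(sample, "->", looks_like_disguised_executable(sample))
```

A static scan of the extracted binary never sees this signal; only a product running in the real delivery context does, which is exactly what these tests throw away.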
This is not an excuse, it's an explanation of what you really should take away from the static tests. Yep, it's nice to be in the top places, but the world doesn't end if you're not there.
Regarding the pro-active test: this is the most flawed test of them all. It does _NOT_ test the product's ability to protect you from unknown malware; it tests the ability of the signature engines to detect the samples AV-Comparatives collected in the test's timeframe. For example, what if the engine authors already had the samples and had written detections for them, and AV-Comparatives only added them later? We're back to the 'testbed construction' problem.