It requires fewer samples to get statistical significance. Field tests are good because you get all the uncontrolled variables that you may not be aware of.
Sorry, people use the excuse of uncontrolled variables without really understanding the system. Variables can be controlled, but it takes a lot of work to characterize, measure and control those variables. This is one of the things that Gear Skeptic does a reasonable job at.
I like to do 4 tests – condition 1, condition 2, condition 2, condition 1. The difference between the two condition 1 tests, and between the two condition 2 tests, will give you an idea of how repeatable the test is. If the difference between condition 1 and condition 2 is much greater than that, then maybe you've actually discovered something. Doing 1, 2, 2, 1 will cancel out any linear errors, such as drift as the canister empties.
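To make the drift-cancelling claim concrete, here is a minimal sketch in Python. It assumes a purely linear drift per run and no random noise, which is an idealization. The function name `apparent_difference` and its parameters are made up for illustration.

```python
def apparent_difference(order, effect, drift_per_run, baseline=100.0):
    """Return mean(B) - mean(A) for a run order, given a linear drift.

    Assumes each run's reading is baseline + drift_per_run * run_index,
    plus `effect` when the condition is "B". No noise is modelled.
    """
    readings = {"A": [], "B": []}
    for run_index, condition in enumerate(order):
        value = baseline + drift_per_run * run_index
        if condition == "B":
            value += effect
        readings[condition].append(value)
    mean_a = sum(readings["A"]) / len(readings["A"])
    mean_b = sum(readings["B"]) / len(readings["B"])
    return mean_b - mean_a


# 1, 2, 2, 1 order: both conditions share the same average run position,
# so the linear drift drops out of the difference.
abba = apparent_difference(["A", "B", "B", "A"], effect=2.0, drift_per_run=0.5)

# 1, 2, 1, 2 order: condition B runs later on average, so the drift
# leaks into the difference.
abab = apparent_difference(["A", "B", "A", "B"], effect=2.0, drift_per_run=0.5)

print(abba, abab)
```

Under these assumptions the 1, 2, 2, 1 order recovers the true effect, while the alternating order is off by one drift step. The sketch says nothing about random run-to-run variation or non-linear drift.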
Well, not quite. First, you need replicates (N much greater than 2). Second, you need randomization. Then you need analysis. One of the main areas for improvement in Gear Skeptic's analysis is the reliance on single or very limited sample sizes. I understand the reluctance, as replication substantially increases the amount of testing. When I validate a system, I normally do 6 replicates of each test. In large tests, I also randomize. So, in your example, I would be doing at least 12 tests, not 4, and the results would only be valid if I could show a statistically significant difference. My 2 cents.
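As a rough sketch of what replicate, randomize, then analyze could look like in code: the Python below shuffles 6 replicates of each of two conditions into a random run order, then compares the two sets of results with an exact permutation test. The permutation test is my own choice for the illustration, since the post does not say which test is used. The function names are hypothetical.

```python
import itertools
import random
import statistics


def randomized_run_order(conditions, replicates, seed=None):
    """Return a shuffled run list with `replicates` runs of each condition."""
    rng = random.Random(seed)
    order = [c for c in conditions for _ in range(replicates)]
    rng.shuffle(order)
    return order


def permutation_p_value(sample_a, sample_b):
    """Two-sided exact permutation test on the difference of means.

    Enumerates every way of splitting the pooled readings into groups of
    the original sizes and returns the fraction of splits whose mean
    difference is at least as large as the observed one.
    """
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    n_b = len(pooled) - n_a
    observed = abs(statistics.mean(sample_a) - statistics.mean(sample_b))
    total = sum(pooled)
    extreme = 0
    n_splits = 0
    for idx in itertools.combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance so floating-point ties count as "as extreme".
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / n_splits


# 12 runs total: 6 replicates of each condition in random order.
run_order = randomized_run_order(["condition 1", "condition 2"], replicates=6)
print(run_order)
```

After running the tests in that order, the six readings per condition would go into `permutation_p_value`. With 6 replicates each there are 924 possible splits, so the smallest achievable two-sided p-value is 2/924. The significance threshold is left to the tester.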
