Yeah, I agree. A sample size of three with tight measurements is likely a fair initial test.
A sample size of 5 is what I used for 90% of my testing. Toss out the high and low as likely my error in preparing things, then average the remaining 3 to get a single data point. Median, mode, and stddev were usually pretty much ignored, since they really only report on the test itself (known to be rather sloppy). All of this used a large measuring cup as the volume measurement for water and a simple kitchen scale with an accuracy of 1 gram for canister fuel weights (all initially calibrated at work at Cornell’s ChemEng lab). Nowhere near your scale’s accuracy, though.
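For anyone who wants to repeat the "drop the high and low, average the rest" step, here is a minimal sketch in Python. The function name and the fuel weights are made up for illustration; they are not my actual test data.

```python
def trimmed_average(burns):
    """Drop the single highest and single lowest reading, then average the rest."""
    if len(burns) < 3:
        raise ValueError("need at least 3 readings to drop the high and the low")
    ordered = sorted(burns)
    kept = ordered[1:-1]  # everything except the lowest and the highest
    return sum(kept) / len(kept)


# Hypothetical fuel use in grams for five burns (illustrative numbers only).
sample = [11, 12, 12, 13, 16]
print(trimmed_average(sample))
```

With five readings this leaves three values, so one botched setup on either end does not drag the data point around.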
In the field, I brought my scale once and measured all burns for a couple of weeks. There was a discrepancy between lab testing and field results, on the MINUS side. It turned out I was out in summer, while all my water during lab testing had been ice water! Nice to know, I guess. This validated the lab tests in my mind, though some conditions can change this: a 0°F environment, rain, wind, etc.
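A rough sketch of why the starting water temperature matters, assuming the only thing that changes is the heat needed to bring the water to a boil. The 500 g of water and the 20 °C "summer" temperature are assumptions for illustration, not measurements from my trips, and stove efficiency, wind, and altitude are ignored.

```python
SPECIFIC_HEAT_WATER = 4.186  # joules per gram per degree C (approximate)


def heat_to_boil_joules(water_grams, start_temp_c, boil_temp_c=100.0):
    """Ideal heat needed to raise water from start_temp_c to boil_temp_c."""
    return water_grams * SPECIFIC_HEAT_WATER * (boil_temp_c - start_temp_c)


# Assumed values: 500 g of water, ice water at 0 C vs. summer water at 20 C.
ice_water = heat_to_boil_joules(500, 0.0)
summer_water = heat_to_boil_joules(500, 20.0)

# Fraction of the ice-water heat that the warmer water needs.
print(summer_water / ice_water)
```

Under those assumptions the warmer water needs noticeably less heat, which points the same direction as the field numbers coming in under the lab numbers.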




