- Groups in experiments
- Sample size
- Power analysis
- Choosing the appropriate power calculation
- Power calculator for t-tests
- Parameters in a power analysis for t-tests
- Representing groups in the EDA
- Calculating sample size in the EDA
- Indicating sample size in an EDA diagram
Every experiment should include at least one control or comparator group. Controls can be negative controls, for example untreated animals, or animals receiving a placebo or sham treatment, which would generally be more appropriate. Positive controls are sometimes included and used to check that an expected effect can be detected in the experimental settings.
The choice of control or comparator group depends on the objective of the experiment; sometimes, a separate control with no treatment may not be needed. For example, if the aim of the experiment is to compare a treatment administered by different methods (e.g. intraperitoneal administration or oral gavage), a third group with no treatment is unnecessary.
Many experiments also include a group receiving an intervention that is being tested. For example, in a study comparing the effect of a drug vs a control substance on blood glucose levels, one group would receive the drug/intervention and the other would receive the control.
The sample size relates to the number of experimental units per group, which may differ from the number of animals if the experimental unit is not the individual animal. If the experimental unit comprises multiple animals (for example, a cage or a litter, see the experimental unit section for additional examples), the sample size is less than the number of animals per treatment group.
In experiments which are designed to test a formal hypothesis using inferential statistics and the generation of a p-value, the number of experimental units per group should be determined using an appropriate method such as a power analysis; basing sample sizes solely on historical precedent should be avoided as this can lead to serious over- or under-estimation of animals required. It has been shown that if an original study is ‘just’ statistically significant, a replication study has at least a 50% chance of failure if the same sample size is used as in the original study.
Some types of experiments are not intended to test a formal hypothesis; these include, for example, preliminary experiments designed to test for adverse effects or assess technical issues, or experiments based on success or failure of a desired goal such as the production of a transgenic line. In such cases power calculations are not appropriate and sample sizes can be estimated based on experience, depending on the goal of the experiment. Data collected in these experiments can sometimes be used to compute the sample size needed in follow up studies designed and powered to test some of the hypotheses generated.
It is usually recommended to use a balanced design, in which all experimental groups have equal size, as this maximises sensitivity, for example in studies which involve only two groups, or several groups where all pairwise comparisons are made. However, on occasion, for example in experiments involving planned comparisons of several treatment groups back to a common control group, sensitivity can be increased by putting more animals in the control group (see Bate and Karp, 2014 for further reading).
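The benefit of a larger control group can be illustrated with a small search. This is a minimal sketch, not the method of Bate and Karp: it assumes equal variability in all groups, a fixed total number of experimental units, and that only the treatment-versus-control comparisons matter. Under those assumptions the variance of each comparison is proportional to 1/n_control + 1/n_treatment. The function name and the example numbers are hypothetical.

```python
def best_allocation(total_units, n_treatments):
    """Return (n_control, n_treatment) minimising 1/n_control + 1/n_treatment.

    Assumes every treatment group has the same size and is compared
    only with the common control group.
    """
    best = None
    for n_treatment in range(1, total_units // n_treatments + 1):
        n_control = total_units - n_treatments * n_treatment
        if n_control < 1:
            continue  # no units left for the control group
        variance_factor = 1 / n_control + 1 / n_treatment
        if best is None or variance_factor < best[0]:
            best = (variance_factor, n_control, n_treatment)
    return best[1], best[2]


# 60 experimental units, 4 treatment groups and 1 common control group
print(best_allocation(60, 4))  # (20, 10)
```

With these example numbers a balanced design would use 12 units in each of the five groups, whereas the search puts 20 units in the control group and 10 in each treatment group. With a single treatment group the same search returns the balanced design.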
In a hypothesis-testing experiment, samples are taken from a population of animals. If a difference between the treatment groups is observed, researchers have to determine whether that difference is due to a sampling effect or a real treatment effect. A statistical test is used to manage the sampling issue and help make an informed decision by the calculation of a p-value. The p-value is the chance of obtaining results as extreme as, or even more extreme than, those observed if the null hypothesis is true. The smaller the p-value, the more unlikely it would be to have obtained the observed data if the null hypothesis is true and there is no treatment effect. A threshold α is set, and by convention a p-value below the threshold is deemed statistically significant i.e. such a result is sufficiently unlikely that one can conclude that the null hypothesis is, in fact, not true. The table below describes the possible outcomes when using a statistical test to assess whether to accept or reject a null hypothesis.
| | No biologically relevant effect | Biologically relevant effect |
|---|---|---|
| Statistically significant: p < threshold (α), H0 unlikely to be true | False positive: Type 1 error (α) | Correct acceptance of H1 |
| Statistically not significant: p > threshold (α), H0 likely to be true | Correct rejection of H1 | False negative: Type 2 error (β) |
A power calculation is an approach to assess the risk of making a false negative call. The power (1-β) is the probability that the experiment will correctly lead to the rejection of a false null hypothesis, thus the power is the probability of achieving statistically significant results when in reality there is a biologically relevant effect.
The significance threshold (α) is the probability of obtaining a significant result by chance (a false positive) when the null hypothesis is true. When set at 0.05, it means that the risk of obtaining a false positive is 1 in 20, or 5%.
The smaller the sample size, the lower the statistical power; there is little value in running an experiment with low power. For example, with a power of 10% the probability of obtaining a false negative result is 90%. In other words, it is very difficult to detect a ‘true’ effect when too few animals are used.
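The link between sample size and power can be sketched numerically. The example below uses a normal approximation to a two-sided, two-group comparison, which is an assumption made to keep the code dependency-free. It overstates the power of a real t-test when groups are very small, so the numbers are indicative only. The function name and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, effect, sd, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic if the true difference is `effect`
    expected_z = effect / (sd * sqrt(2 / n_per_group))
    # Probability of falling in either rejection region
    return z.cdf(expected_z - z_crit) + z.cdf(-expected_z - z_crit)


# Effect size equal to one SD, increasing group sizes
for n in (4, 8, 16, 32):
    print(n, round(approx_power(n, effect=1.0, sd=1.0), 2))
```

Under this approximation the power is roughly 0.3 with 4 units per group and roughly 0.8 with 16 units per group.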
In addition, the lower the power, the lower the probability that an observed effect that reaches statistical significance actually reflects a true effect; small sample sizes can lead to unusual and unreliable results (false positives). Finally, even when an underpowered study discovers a true effect, it is likely that the magnitude of the effect is exaggerated (see Button et al, 2013 for further reading).
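The first point can be made concrete with the positive predictive value (PPV), the probability that a statistically significant result reflects a true effect. The sketch below uses the standard formula PPV = (power × R) / (power × R + α), where R is the pre-study odds that a tested effect is real. The value of R used here is purely illustrative, and the function name is hypothetical.

```python
def positive_predictive_value(power, alpha, prior_odds):
    """Probability that a significant result reflects a true effect.

    prior_odds is the assumed ratio of true effects to null effects
    among the hypotheses being tested (an illustrative input).
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


# Assumed prior odds of 1 true effect for every 4 null effects
for power in (0.8, 0.5, 0.2):
    print(power, round(positive_predictive_value(power, 0.05, 0.25), 2))
```

With these assumed odds, a significant result from a study with 80% power reflects a true effect about 80% of the time, compared with about 50% for a study with 20% power.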
Under-powered in vivo experiments waste time and resources, lead to unnecessary animal suffering and result in erroneous biological conclusions. In over-powered experiments (where the sample size is too large), the statistical test becomes oversensitive and an effect too small to have any biological relevance may be statistically significant. Statistical significance should not be confused with biological significance.
For the conclusion of the study to be scientifically valid, the sample size needs to be chosen correctly so that biological relevance and statistical significance complement each other. A target power between 80% and 95% is deemed acceptable, depending on the risk of obtaining a false negative result the experimenter is willing to take.
Sample sizes can be estimated based on a power analysis that is specific to the statistical test used to analyse the data. Other appropriate approaches to sample size planning include Bayesian and other frequentist methods; these are not discussed here.
While power calculations are a valuable tool in the planning stage of an experiment, it is not appropriate to use them after the experiment has been conducted to aid the interpretation of the results. When power calculations are carried out post-experiment, based on the observed effect size (rather than a pre-defined effect size of biological interest), we must assume that the effect size in the experiment is identical to the true effect size in the population. This assumption is likely to be false, especially if the sample size is small. In this setting, the observed significance level (p-value) is directly related to the observed power, so high (non-significant) p-values necessarily correspond to low observed power. It would therefore be erroneous to conclude that low observed power provides weak evidence that the null hypothesis is true, since high p-values provide evidence to the contrary. Computing the observed power after obtaining the p-value adds no information and cannot change the interpretation of the p-value.
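The one-to-one link between the p-value and the observed power can be sketched as follows. The example assumes a two-sided test whose statistic is approximately normal (a z-test) and treats the observed effect as if it were the true effect. For a t-test the relationship also depends on the degrees of freedom, but it has the same shape. The function name is hypothetical.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post-hoc 'observed power' implied by a two-sided p-value (z-test)."""
    z = NormalDist()
    z_observed = z.inv_cdf(1 - p_value / 2)  # statistic that gives this p-value
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(z_observed - z_crit) + z.cdf(-z_observed - z_crit)


for p in (0.01, 0.05, 0.2, 0.5):
    print(p, round(observed_power(p), 2))
```

The observed power is fully determined by the p-value: a result that is only just significant (p = 0.05) always gives an observed power of about 50%, and larger p-values always give lower observed power.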
When determining a group size it is important to consider the type of experiment. For example, a factorial design with many factor combinations will require fewer animals per group (everything else being equal) than a standard comparison between two or three treatment groups. The decision tree below can be used to help decide the type of power calculation appropriate for a particular experiment.
Calculations for paired and unpaired t-tests can be done within the EDA, or using the power calculator below. More comprehensive power analysis software is available from several sources, including Russ Lenth’s power and sample size or G*Power. However, these tools should not be used without a thorough understanding of the parameters requested in the sample size computation, and it would be preferable to seek statistical help in the first instance.
In the power calculation tool below, enter the parameters for the experiment. Use the tabs at the top of the power calculation tool to choose either a paired or an unpaired t-test, depending on your experimental design. For help deciding which power calculator is appropriate for your study use the decision tree above.
In the power calculation tool below, fill out all fields except N per group and click Calculate. The number of experimental units per group will be displayed in the field N per group. The power calculator uses R 3.5.2 and the function power.t.test (part of R's stats package). For more information about the parameters in a power calculation for t-tests see the section below.
Sample size calculation for a t-test is based on a mathematical relationship between the following parameters: effect size, variability, significance level, power and sample size; these are described below.
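The relationship between these parameters can be sketched with the usual normal approximation for an unpaired, two-group comparison: n per group ≈ 2 × ((z_alpha + z_power) × SD / effect size)². This is an illustration only and not the calculation performed by the power calculator. The R function power.t.test works with the t distribution and typically returns a slightly larger number for small groups. The function and argument names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect, sd, alpha=0.05, power=0.8, two_sided=True):
    """Approximate number of experimental units per group (unpaired design)."""
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)  # quantile for the significance level
    z_power = z.inv_cdf(power)     # quantile for the target power
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return ceil(n)                 # round up to whole experimental units


print(n_per_group(effect=1.0, sd=1.0))             # 16
print(n_per_group(effect=0.5, sd=1.0))             # 63
print(n_per_group(effect=1.0, sd=1.0, power=0.9))  # 22
```

Only the ratio of effect size to SD enters the formula, so halving the effect size (or doubling the variability) roughly quadruples the sample size.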
The effect size is the minimum difference between two groups under study, which would be of interest biologically and would be worth taking forward into further work or clinical trials. It is based on the primary outcome measure.
Researchers should always have an idea of what effect size would be of biological importance prior to carrying out an experiment. This is not based on prior knowledge of the magnitude of the treatment effect but on a difference that the investigator wants the experiment to be able to detect. In other words, the effect size is the minimum effect that is considered to be important, not one that has been estimated or observed from experimental data in the past. Careful consideration of the effect size allows the experiment to be powered to detect only meaningful effects and not generate statistically significant results that are not biologically relevant.
Cohen’s d is a standardised effect size; it represents the difference between treatment and control means, calibrated in units of variability. It is computed as: Cohen's d = |m1 - m2| / average SD
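As a worked example of the formula above, with made-up group means and SD (the function and argument names are hypothetical):

```python
def cohens_d(mean_treatment, mean_control, average_sd):
    """Standardised effect size: absolute mean difference in SD units."""
    return abs(mean_treatment - mean_control) / average_sd


# Illustrative values: means of 12 and 9 units, average SD of 2 units
print(cohens_d(12.0, 9.0, 2.0))  # 1.5
```

A difference of 3 units with an average SD of 2 units corresponds to a Cohen's d of 1.5, whichever group has the larger mean.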
If no information is available to estimate the variability, or it is not possible to estimate the size of a biologically significant effect, a standardised effect size can be used instead of a biologically relevant one. Cohen’s d can be interpreted in terms of the percentage of overlap of the outcome measures in the treatment group with those of the control group.
The original guidelines proposed by Cohen in the field of social sciences were that small, medium and large effects are represented by d = 0.2, 0.5 and 0.8, respectively. However, in work with laboratory animals, it is generally accepted that these might more realistically be set at:
Cohen’s d = 0.5: small effect, Cohen’s d = 1.0: medium effect, Cohen’s d = 1.5: large effect (see Wahlsten 2011 for more information).
The sample size is related to the amount of variability between the experimental units. The larger the variability, the more animals will be required to achieve reliable results (all other things being equal). You should also consider whether the treatment comparison is made:
- Between animals, where animals are assigned to different treatment groups. In this case the power analysis for the unpaired t-test should be considered.
- Within animals (i.e. each animal is used as its own control). In this case the power analysis tool for the paired t-test may be more appropriate.
Calculating the average SD (standard deviation) used in the power calculation for an unpaired t-test
There are several methods to estimate variability depending on the information available. These are listed below in order from what we believe will be the most accurate to the least accurate.
1. The most accurate estimate of variability for future studies is from data collected from a preliminary experiment carried out under identical conditions to the planned experiment, e.g. a previous experiment in the same lab, testing the same treatment, under similar conditions, on animals with the same characteristics. Such experiments can sometimes be carried out to test for adverse effects or assess technical issues. Depending on the number of animals used, they can be used to estimate SD. As a rule of thumb, fewer than 10 animals in total is unlikely to provide an accurate estimate of SD (see http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_confidence_interval_of_a_stand.htm for further information).
As shown in the spreadsheet below, with two groups or more, the variability can be derived from the mean square of the residuals in an ANOVA table. The SD is calculated as the square root of this number. Alternatively, if there are only two groups the SD can be calculated as the square root of the pooled variance in a t-test table.
Download Excel spreadsheet with examples of SD calculations: Estimating average SD (unpaired t-test).xlsx
Please note that using a covariate or a blocking factor may reduce the variability, hence allowing the same power to be achieved with a reduced sample size. Software such as InVivoStat can be used to calculate the variability from datasets with blocking factors or covariates (using the single measure parametric analysis module). The power calculation module in InVivoStat can also be used to run the power analysis directly from a dataset.
2. If there are no data from a preliminary experiment conducted under identical conditions and the cost/benefit assessment cannot justify using extra animals for a preliminary experiment to estimate the SD, consider previous experiments conducted under similar conditions in the same lab (i.e. same animal characteristics and methods) but possibly testing other treatments. As different treatments may induce different levels of variability, it is probably best to only consider the SD of the control group (assuming the variability is expected to be the same across all groups). This can be calculated in Excel using the function STDEV().
3. If none of the above is available, i.e. no experiment using the same type of animals in the same settings has been carried out in your lab before, it may be possible to estimate the variability from an experiment reported in the literature, but lab-to-lab differences could make this approach unreliable. If the ANOVA table is reported, it may provide an estimate of the underlying variability of the results you will obtain; if not, the SD of the control groups can be used instead. Please note that error bars reported in the literature are not necessarily SD.
If SEM (standard error of the mean) are reported, SD can be calculated using the formula: SD = SEM × √n
If 95% CI (confidence intervals) are reported, SD can be calculated using the formula: SD = √n × (upper limit - lower limit) / 3.92
4. It may be the case that you have access to a historical database of the control group data from many previous experiments, e.g. toxicity studies, routinely run over the years on animals with the same characteristics. Care should be taken with such information, though, as you may not have control over the information available. For example, animals may be from different batches or different suppliers, or may have been housed under different husbandry regimes over time, and this may influence the underlying variability. It may be wise to consult a statistician before using such a database as a source of information. However, such databases do offer a large amount of information, as usually many animals will be included in them, and hence they may provide a useful estimate of the between-animal SD used in a between-animal test such as an unpaired t-test.
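The SD calculations mentioned in points 1 and 3 above can be sketched in a few lines. The pooled SD is the square root of the pooled (residual) variance, which for two groups matches the pooled variance of a t-test and for more groups matches the residual mean square of a one-way ANOVA. The conversions from SEM and 95% CI follow the formulas given above. The function names and data are hypothetical, and blocking factors or covariates are not handled.

```python
from math import sqrt
from statistics import variance


def pooled_sd(*groups):
    """Square root of the pooled (residual) variance across groups."""
    residual_ss = sum((len(g) - 1) * variance(g) for g in groups)
    residual_df = sum(len(g) - 1 for g in groups)
    return sqrt(residual_ss / residual_df)


def sd_from_sem(sem, n):
    """SD = SEM x sqrt(n)."""
    return sem * sqrt(n)


def sd_from_ci95(lower, upper, n):
    """SD = sqrt(n) x (upper limit - lower limit) / 3.92."""
    return sqrt(n) * (upper - lower) / 3.92


# Made-up pilot data for a control and a treated group
control = [4.1, 5.0, 4.6, 5.3, 4.8]
treated = [5.9, 6.4, 5.2, 6.8, 6.1]
print(round(pooled_sd(control, treated), 3))
print(sd_from_sem(sem=0.5, n=16))  # 2.0
```

The conversion from a 95% CI relies on the multiplier 1.96 (hence 3.92 for the full width of the interval), so it is approximate when the reported interval was based on a small group and the t distribution.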
Calculating the SD of the differences used in the power calculation for a paired t-test
To estimate the variability in studies where animals are used as their own control, preliminary data are needed that were collected under identical conditions to the planned experiment, e.g. in a previous experiment in the same lab, testing the same treatment on animals with the same characteristics.
As shown in the spreadsheet below, the difference between the two responses is first calculated for each animal, in the original units of measurement and always subtracting in the same order. The SD of these differences (across all animals) is then used as the estimate of the SD within the group of animals (within-animal SD).
Download Excel spreadsheet with examples of SD calculations: Estimating SD of the difference (paired t-test).xlsx
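The same calculation can be sketched in code. This example uses signed differences (treated minus control, in the same order for every animal), which is how a paired t-test treats the data. The function name and the measurements are hypothetical.

```python
from statistics import stdev


def sd_of_differences(control_values, treated_values):
    """SD of the per-animal differences (treated minus control)."""
    if len(control_values) != len(treated_values):
        raise ValueError("each animal needs one value per condition")
    differences = [t - c for c, t in zip(control_values, treated_values)]
    return stdev(differences)


# Made-up measurements from four animals, each measured under both conditions
control = [10, 12, 11, 13]
treated = [12, 15, 12, 17]
print(round(sd_of_differences(control, treated), 3))
```

The resulting value is the variability to enter in a power calculation for a paired t-test, alongside the mean difference of biological interest.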
Cohen’s d is a standardised effect size which is expressed in units of variability. For that reason, when Cohen’s d effect sizes are used in the t-test power calculator above (see standard effect sizes above), the variability has to be set to 1.
The significance level or threshold (α) is the probability of obtaining a significant result by chance (a false positive) when the null hypothesis is true (i.e. there are no real, biologically relevant differences between the groups). It is usually set at 0.05 which means that the risk of obtaining a false positive is 5%; however it may be appropriate on occasion to use a different value.
The power (1-β) is the probability that the experiment will correctly lead to the rejection of a false null hypothesis (i.e. detect that there is a difference when there is one). A target power between 80% and 95% is deemed acceptable, depending on the risk of obtaining a false negative result the experimenter is willing to take.
If H1 is directional (one-sided), the experiment can be powered and analysed with a one-sided test. This is very rare in biology, and it means that researchers have to accept the null hypothesis even if the results show a strong effect in the opposite direction to that set in the alternative hypothesis.
Two-sided tests with a non-directional H1 are much more common and allow researchers to detect the effect of a treatment regardless of its direction.
N refers to the number of experimental units needed per group, i.e. the sample size.
A group node represents a group of animals or experimental units in an experiment. It can be the initial pool of animals, or a treatment or control group which has been allocated to go through a specific intervention or measurement.
In the EDA diagram, groups are usually produced via a randomisation, and the allocation node is used to indicate how animals were allocated to groups.
The group nodes are used to define the distinct groups of animals or experimental units in the experiment. They can be labelled to indicate what will happen to that group or can be simply labelled group 1, group 2 or group 3 etc. Every experiment should include at least one control or comparator group, and many also include a group that receives an intervention (e.g. the 'test' group). Distinct groups must have distinct labels, as the EDA will use the labels when assessing the diagram. If the same group is indicated multiple times on a diagram, it should have the exact same label each time it appears.
The properties of the group node detail the characteristics of that group, including its role in the experiment (e.g. 'test' or 'control/comparator') and details about the sample size, including justification for the sample size.
The power calculator above is also available within the EDA application. It can be found by clicking TOOLS in the top menu and selecting 'Power Calculation'.
In the EDA power calculator the effect size should be expressed as the absolute difference in means (|m1 – m2|, where m1 and m2 represent the mean in treatment and control groups and the difference is expressed as a positive number). It should have a practical meaning, e.g. a 3 second change, and not be given as a percentage.
Cohen’s d is a standardised effect size which is expressed in units of variability. For that reason, when Cohen’s d effect sizes are used in the power calculator (see standard effect sizes above), the variability has to be set to 1.
See above for more information about parameters in a t-test power calculation.
Fill out all fields in the power calculator except N per group and click Calculate. The number of experimental units per group will be displayed in the field N per group. The power calculator uses R 3.5.2 and the function power.t.test (part of R's stats package).
In the EDA diagram, the sample size should be indicated in the properties of the group node. The planned number of experimental units relates to the sample size determined in the planning stages before the experiment is conducted. The sample size needs to be adjusted for potential loss of data or animals, if this is expected. A justification for the sample size, i.e. how it was determined, should also be provided. This could be the details of the power calculation used to determine the sample size (e.g. details of the type of power calculator, effect size, variability, significance level and power).
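One common way to adjust for expected losses is to divide the calculated sample size by the proportion of experimental units expected to complete the study, then round up. This is a sketch of that convention and not a rule prescribed by the EDA. The function name and the attrition rates are hypothetical.

```python
from math import ceil


def adjust_for_attrition(n_per_group, expected_loss):
    """Inflate the sample size so that n_per_group units are expected to remain.

    expected_loss is the anticipated proportion of units lost (0 to <1).
    """
    if not 0 <= expected_loss < 1:
        raise ValueError("expected_loss must be at least 0 and below 1")
    return ceil(n_per_group / (1 - expected_loss))


# 16 units needed per group, 10% expected to be lost
print(adjust_for_attrition(16, 0.10))  # 18
```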
Once the experiment has been conducted, if the actual sample size differs from the planned sample size (for example, because the attrition rate is higher or lower than anticipated), the actual number of experimental units can then be indicated.
References and further reading
BATE, S. T. 2018. How to decide on your sample size when the power calculation is not straightforward. NC3Rs blog.
BATE, S. T. & CLARK, R. A. 2014. The Design and Statistical Analysis of Animal Experiments, Cambridge University Press.
BATE, S. & KARP, N. A. 2014. A common control group - optimising the experiment design to maximise sensitivity. PLoS One, 9, e114872. doi: 10.1371/journal.pone.0114872
BUTTON, K. S., IOANNIDIS, J. P., MOKRYSZ, C., NOSEK, B. A., FLINT, J., ROBINSON, E. S. & MUNAFO, M. R. 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci, 14, 365-76. doi: 10.1038/nrn3475
COHEN, J. 1992. A power primer. Psychol Bull, 112, 155-9. doi: 10.1037/0033-2909.112.1.155
DELL, R. B., HOLLERAN, S. & RAMAKRISHNAN, R. 2002. Sample size determination. ILAR J, 43, 207-13. doi: 10.1093/ilar.43.4.207
FAUL, F., ERDFELDER, E., LANG, A. G. & BUCHNER, A. 2007. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods, 39, 175-91. doi: 10.3758/bf03193146
FESTING, M. F. & ALTMAN, D. G. 2002. Guidelines for the design and statistical analysis of experiments using laboratory animals. ILAR J, 43, 244-58. doi: 10.1093/ilar.43.4.244
FESTING, M. F. W., OVEREND, P., GAINES DAS, R., CORTINA BORJA, M. & BERDOY, M. 2002. The design of animal experiments: reducing the use of animals in research through better experimental design, London UK, Royal Society of Medicine.
FESTING, M. F., http://www.3rs-reduction.co.uk/html/6__power_and_sample_size.html. [Accessed 15-01-2015]
FITTS, D. A. 2011. Ethics and animal numbers: informal analyses, uncertain sample sizes, inefficient replications, and type I errors. J Am Assoc Lab Anim Sci, 50, 445-53.
FRY, D. 2014. Chapter 8 - Experimental Design: Reduction and Refinement in Studies Using Animals. In: TURNER, K. B. V. (ed.) Laboratory Animal Welfare. Boston: Academic Press.
HOENIG, J. M. & HEISEY, D. M. 2001. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19-24. doi: 10.1198/000313001300339897
HUBRECHT, R. & KIRKWOOD, J. 2010. The UFAW handbook on the care and management of laboratory and other research animals, Oxford, Wiley-Blackwell.
LENTH, R. V. 2001. Some practical guidelines for effective sample size determination. Am Stat, 55, 187-93. doi: 10.1198/000313001317098149
MEAD, R. 1988. The design of experiments : statistical principles for practical applications, Cambridge [England]; New York, Cambridge University Press.
WAHLSTEN, D. 2011. Chapter 5 - Sample Size. In: WAHLSTEN, D. (ed.) Mouse Behavioral Testing. London: Academic Press.
ZAKZANIS, K. K. 2001. Statistics to tell the truth, the whole truth, and nothing but the truth: formulae, illustrative numerical examples, and heuristic interpretation of effect size analyses for neuropsychological researchers. Arch Clin Neuropsychol, 16, 653-67. doi: 10.1093/arclin/16.7.653