At the time of writing, it is about thirty years since Altman and Bland (1983) and Bland and Altman (1986) proposed their method for comparing two methods of measurement. The development and popularity of this method have been entertainingly described by Bland and Altman (2012). In this article, among other things, they write: "It is a great pity that limits of agreement are often stated without their confidence intervals". This is one of several topics we shall explain in the following material, which is intended to assist those who use our software to analyse their data.
When comparing two methods of measurement, it is usual to take pairs of measurements, with one measurement in each pair taken by each of the two methods. This is particularly likely if the quantity to be measured is unlikely to have a truly constant value, such as heart rate (in contrast to a quantity such as height, at least in the short term). If several pairs of measurements are taken from each subject, this data structure has to be taken into account in the analysis by applying the appropriate methods.
Otherwise,
How much each of these statistics is underestimated depends on the degree of between-subjects variability. If variation between subjects is ignored (usually an incorrect thing to do), the analysis amounts to a "pooled data" approach. Unless the between-subjects variation is known, the errors that result from the "pooled data" approach cannot be known. Recent articles by Bland and Altman (2007) and Myles (2007) emphasize how between-subjects variation leads to underestimation of the limits of agreement. However, this is only one effect of between-subjects variation. In addition to narrower limits of agreement, there is also substantial underestimation of the 95% confidence intervals of these limits, as emphasized by Hamilton and Lewis (2010).
If we performed a standard Bland-Altman analysis on the first four subjects of the example data, we would obtain the following graph, where the red lines represent the limits of agreement (LoA) and the dark red lines the 95% confidence intervals (CI) for the LoA (range -2.15 to 2.09):
However, if we take into account that these data are actually ten measurements in each of just four subjects, we obtain the following graph, in which the 95% CI in particular are much wider (range -6.68 to 6.63):
This is an extreme case, but it demonstrates that the between-subjects variance cannot be precisely estimated from a small number of subjects. Bland and Altman (2012) state that it is not hard to calculate the confidence intervals. However, these calculations are not trivial, and at present there is no dedicated software, apart from the SAS macro by Zou (2013).
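To make the difference between the two analyses concrete, the following Python sketch computes the LoA once from the pooled differences and once from a one-way analysis of variance of the differences by subject, in the spirit of the "true value varies" approach of Bland and Altman (2007). It is a minimal illustration under stated assumptions, not the code used by our software: the function names and the input layout (a dictionary mapping a subject ID to that subject's list of differences) are hypothetical, the bias is taken as the grand mean of all differences, and the confidence intervals of the LoA are not computed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_loa(diffs_by_subject):
    """LoA that ignore the subject structure (the "pooled data" approach)."""
    all_d = [d for ds in diffs_by_subject.values() for d in ds]
    bias = mean(all_d)
    sd = sqrt(variance(all_d))
    return bias - 1.96 * sd, bias + 1.96 * sd


def repeated_loa(diffs_by_subject):
    """LoA from a one-way ANOVA of the differences by subject."""
    groups = list(diffs_by_subject.values())
    n = len(groups)                    # number of subjects
    m = [len(g) for g in groups]       # measurements per subject
    total = sum(m)
    grand = sum(sum(g) for g in groups) / total
    ss_within = sum((d - mean(g)) ** 2 for g in groups for d in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ms_within = ss_within / (total - n)
    ms_between = ss_between / (n - 1)
    # Divisor for unequal group sizes; equals the common size when balanced.
    m0 = (total ** 2 - sum(k ** 2 for k in m)) / ((n - 1) * total)
    # Assumption: a negative variance estimate is truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / m0)
    sd_total = sqrt(var_between + ms_within)
    return grand - 1.96 * sd_total, grand + 1.96 * sd_total


# Hypothetical differences (method A minus method B) for four subjects.
example = {
    "s1": [1.0, 1.2, 0.8, 1.1],
    "s2": [-1.0, -0.9, -1.2, -0.8],
    "s3": [0.3, 0.1, 0.2, 0.4],
    "s4": [-0.4, -0.2, -0.5, -0.3],
}
print("pooled  :", pooled_loa(example))
print("repeated:", repeated_loa(example))
```

With data like these, where most of the variation lies between subjects, the ANOVA-based limits are wider than the pooled limits.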
The variance of the differences should not depend on the mean value. This may be visually inspected by observing the (studentized) residuals versus mean plot:
In the above plot, the distribution of the residuals (in the Y direction) does not seem to depend on the mean values (X direction). If heteroscedasticity is visible, however, a transformation of the data may be used; see Altman and Bland (1983).
Spearman's rank correlation coefficient (ρ, between the ranks of the differences and the means) is given with its 95% confidence interval. If there is an important correlation, the analysis remains valid but is less useful, because the variance of the differences at a specific mean value will be smaller than calculated.
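As a rough illustration of this check, the sketch below computes Spearman's ρ between the means and the differences and attaches an approximate 95% confidence interval. The interval uses the Fisher z-transformation with standard error 1/sqrt(n - 3), which is an assumption made here for simplicity; it treats the pairs as independent and is not necessarily the procedure used by our software. All function names are hypothetical.

```python
from math import atanh, sqrt, tanh


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_with_ci(means, diffs, z_crit=1.96):
    """Spearman's rho with an approximate CI (needs n > 3 and |rho| < 1)."""
    rho = pearson(ranks(means), ranks(diffs))
    z = atanh(rho)                      # Fisher z-transformation
    se = 1 / sqrt(len(means) - 3)       # assumed large-sample standard error
    return rho, tanh(z - z_crit * se), tanh(z + z_crit * se)


# Hypothetical means and differences of six measurement pairs.
rho, low, high = spearman_with_ci(
    [1, 2, 3, 4, 5, 6], [0.1, -0.2, 0.3, 0.0, 0.2, -0.1]
)
print(rho, low, high)
```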
Furthermore, the variance of the differences should not vary between individuals. This may be visually inspected by observing the residuals versus subject ID plot:
In the above plot, the distribution of the residuals does not seem to be dependent on the subject numbers. See Bland and Altman (2007) for further discussion.
Q-Q plots may be visually inspected to assess the normality of the distributions underlying the differences:
The quantiles of the differences are plotted versus the quantiles of the normal distribution. If these lie close to the line of identity, the distribution of the differences may be approximated by the normal distribution. Outliers may be detected if there are a few points far from the line of identity.
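The points of such a plot can be sketched as follows. The differences are standardized so that the line of identity is the reference; the plotting positions (i + 0.5)/n are one common convention and an assumption here, as is the function name.

```python
from statistics import NormalDist, mean, stdev


def qq_points(diffs):
    """Pairs of (theoretical normal quantile, standardized sample quantile)."""
    n = len(diffs)
    m, s = mean(diffs), stdev(diffs)
    sample = sorted((d - m) / s for d in diffs)
    standard_normal = NormalDist()
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n-1.
    theory = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theory, sample))


# Hypothetical differences; points far from y = x would suggest outliers.
for theoretical, observed in qq_points([0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.05]):
    print(f"{theoretical:7.3f} {observed:7.3f}")
```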
A nonparametric analysis is performed as well. The bias is then defined as the 50th percentile (the median) of the differences, and the LoA as the 2.5th and 97.5th percentiles; their 95% confidence intervals are estimated with a bootstrap procedure. Usually, the LoA, and especially their CI, will not be as accurate as those obtained from the parametric analysis. However, discrepancies between the parametric and nonparametric approaches may indicate problems with the data, such as outliers.
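The percentile definitions and a bootstrap can be sketched as below. This is only an illustration: the percentile interpolation rule, the choice to resample whole subjects (so that repeated measurements stay together), the percentile-type bootstrap interval, and all names are assumptions, and the details of the bootstrap in our software may differ.

```python
import random


def percentile(sorted_values, p):
    """Percentile p (0-100) by linear interpolation between order statistics."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def nonparametric_loa(diffs):
    """Return (lower LoA, bias, upper LoA) as the 2.5, 50 and 97.5 percentiles."""
    s = sorted(diffs)
    return percentile(s, 2.5), percentile(s, 50), percentile(s, 97.5)


def bootstrap_ci(diffs_by_subject, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for each statistic, resampling subjects."""
    rng = random.Random(seed)
    subjects = list(diffs_by_subject)
    draws = [[], [], []]
    for _ in range(n_boot):
        resampled = []
        for subject in rng.choices(subjects, k=len(subjects)):
            resampled.extend(diffs_by_subject[subject])
        for store, value in zip(draws, nonparametric_loa(resampled)):
            store.append(value)
    return [(percentile(sorted(d), 2.5), percentile(sorted(d), 97.5)) for d in draws]


# Hypothetical differences for three subjects.
example = {"s1": [1.0, 1.2, 0.8], "s2": [-1.0, -0.9, -1.2], "s3": [0.3, 0.1, 0.2]}
all_diffs = [d for ds in example.values() for d in ds]
print(nonparametric_loa(all_diffs))
print(bootstrap_ci(example, n_boot=500))
```

With few subjects the resampled percentiles take only a handful of distinct values, which is one reason these intervals tend to be less accurate than the parametric ones.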
The "True value varies" method should always be used (see the manuscript). The "True value constant" method does give some additional information on repeatability (see below). These analyses are automatically carried out and the results are shown below the data form. Analyses performed here can only be used to compare methods, e.g., to demonstrate the differences when the "Pooled data" method is applied.
The methods described above estimate both within-subject and between-subjects variability, which is not possible when the data contain just one measurement for each subject. Rather than letting the methods fail, this situation is detected and the data are handled accordingly (see Carkeet, 2015).
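For the single-measurement case, the classical calculation can be sketched as follows. Note that this is the familiar approximate interval, with the variance of a limit taken as (1/n + z²/(2(n - 1)))·s² and a normal quantile used in place of a t quantile; it is not the exact method of Carkeet (2015), which is not reproduced here. The function name and the returned dictionary keys are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def classical_loa_with_ci(diffs, z=1.96):
    """Bias, LoA and approximate 95% CI of each limit (one pair per subject)."""
    n = len(diffs)
    bias, sd = mean(diffs), stdev(diffs)
    lower, upper = bias - z * sd, bias + z * sd
    # Approximate standard error of a limit of agreement.
    se_limit = sd * sqrt(1 / n + z ** 2 / (2 * (n - 1)))
    # Assumption: normal quantile; a t quantile with n - 1 degrees of
    # freedom would give a wider interval for small n.
    half_width = z * se_limit
    return {
        "bias": bias,
        "loa": (lower, upper),
        "lower_ci": (lower - half_width, lower + half_width),
        "upper_ci": (upper - half_width, upper + half_width),
    }


# Hypothetical differences, one per subject.
print(classical_loa_with_ci([0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.05, -0.15]))
```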
DG Altman and JM Bland, Measurement in medicine: the analysis of method comparison studies. The Statistician 1983; 32:307-317.
JM Bland and DG Altman, Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; i:307-310.
JM Bland and DG Altman, Measuring agreement in method comparison studies, Statistical Methods in Medical Research 1999; 8:135-160.
JM Bland and DG Altman, Agreement between methods of measurement with multiple observations per individual, Journal of Biopharmaceutical Statistics 2007; 17:571-582.
JM Bland and DG Altman, Agreed statistics: measurement method comparison, Anesthesiology 2012; 116:182-185.
A Carkeet, Exact Parametric Confidence Intervals for Bland-Altman Limits of Agreement, Optometry and Vision Science 2015.
C Hamilton and S Lewis, The importance of using the correct bounds on the Bland-Altman limits of agreement when multiple measurements are recorded per patient, Journal of Clinical Monitoring and Computing 2010; 24:163-175.
PS Myles, Using the Bland-Altman method to measure agreement with repeated measures (Editorial), British Journal of Anaesthesia 2007; 99:309-311.
GY Zou, Confidence interval estimation for the Bland-Altman limits of agreement with multiple observations per individual, Statistical Methods in Medical Research 2013; 22:630-642.
Send comments to Erik Olofsen.