Using reliability measures to analyze inter-rater agreement

In response to media criticism, the International Olympic Committee (IOC) wants to test whether scores given by judges trained through the IOC program are "reliable"; that is, although the precise scores given by two judges may differ, good performances should receive higher scores than average performances, and average performances should receive higher scores than poor performances.

You can test for this possibility using the intraclass correlation coefficient (ICC) 1. The ICC is based on an ANOVA-type model in which the judges' scores are the responses. Choosing an appropriate model may take some thought. First, consider the sources of variation. One source is the performances, which you can suppose are a random sample from a large pool of performances. Another source is the judges, who you can suppose are a random sample from a large pool of trained judges. Thus, you should use a two-way random effects model. If this set of judges were unique in some way and could not be considered part of a larger pool of judges, you would use a two-way mixed effects model instead. If you did not know which scores were given by which judge, you would have to use a one-way random effects model.
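The Reliability Analysis procedure computes these coefficients for you, but it can help to see the calculation that the model choice implies. The following Python function is a minimal sketch (not the procedure's implementation) that derives the single-measure coefficients from the ANOVA mean squares of a complete performances-by-judges matrix, following the McGraw and Wong (1996) definitions; the function name and the no-missing-scores assumption are illustrative. The consistency and absolute-agreement variants it returns are discussed in the next paragraph.

```python
import numpy as np

def icc_single_measures(scores):
    """Single-measure intraclass correlations for a complete n x k matrix
    of ratings (rows = performances, columns = judges), following the
    ANOVA-based definitions in McGraw and Wong (1996)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)            # one mean per performance
    col_means = x.mean(axis=0)            # one mean per judge

    # Sums of squares for the two-way layout
    ss_rows = k * np.sum((row_means - grand) ** 2)    # performances
    ss_cols = n * np.sum((col_means - grand) ** 2)    # judges
    ss_total = np.sum((x - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols           # residual

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # One-way design: the judge effect cannot be separated from error,
    # so it is folded into the within-performance mean square
    ms_within = (ss_cols + ss_error) / (n * (k - 1))

    return {
        # One-way random: judge identity unknown
        "one_way": (ms_rows - ms_within) / (ms_rows + (k - 1) * ms_within),
        # Two-way, consistency: the same formula applies under the random-
        # and mixed-effects models; only the interpretation of the judge
        # effect differs
        "consistency": (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error),
        # Two-way, absolute agreement: systematic judge differences also
        # count against the coefficient
        "agreement": (ms_rows - ms_error)
        / (ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n),
    }
```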

Moreover, you require only that the judges have similar patterns of scores, so you will check for consistency rather than absolute agreement. If IOC regulations were stricter and identical (rather than similar) patterns of scores were necessary for successful training, you would instead use the two-way random model with absolute agreement.
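To see the difference this choice makes, consider a small hypothetical example (the judges and scores below are made up, and the snippet reuses the icc_single_measures sketch above): if one judge scores exactly half a point higher than another on every performance, the rank-ordering of performances is identical, so consistency is perfect, but the systematic offset lowers absolute agreement.

```python
import numpy as np

# Hypothetical scores: judge B is exactly 0.5 points more generous than
# judge A on every performance.
judge_a = np.array([7.0, 8.0, 9.0, 6.0])
judge_b = judge_a + 0.5
scores = np.column_stack([judge_a, judge_b])

iccs = icc_single_measures(scores)   # sketch defined above
print(f"Consistency:        {iccs['consistency']:.3f}")   # 1.000
print(f"Absolute agreement: {iccs['agreement']:.3f}")     # about 0.93
```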

Suppose that the IOC has asked seven trained judges to score 300 performances. This information is collected in judges.sav. See the topic Sample Files for more information. Use Reliability Analysis to measure the level of agreement among the judges' scores.
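If you want to check the calculation outside the procedure, a rough Python equivalent of loading the data and computing the two-way coefficients might look like the following. Reading .sav files with pandas requires the pyreadstat package, and the column names judge1 through judge7 are assumed for illustration; substitute the actual variable names used in judges.sav.

```python
import pandas as pd

# Load the sample file (requires pyreadstat); column names are assumed.
df = pd.read_spss("judges.sav")
scores = df[[f"judge{i}" for i in range(1, 8)]].to_numpy()  # 300 x 7 matrix

iccs = icc_single_measures(scores)   # sketch defined above
print(f"Two-way, consistency:        {iccs['consistency']:.3f}")
print(f"Two-way, absolute agreement: {iccs['agreement']:.3f}")
```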


1 McGraw, K. O., and S. P. Wong. 1996. Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1:1, 30-46.