Summary
The reliability estimates produced under the mixed and random ICC models are numerically identical; the difference lies in their interpretation. Results from an analysis under the mixed effects model cannot be generalized to raters other than those who took part in the study. In addition, the Average Measure Intraclass Correlation under the mixed model requires the assumption that no rater-performance interaction exists. That is, judges do not give comparatively higher scores to performances by their own countrymen, and they do not give comparatively lower scores because the gymnast is short or tall, has dark hair, or for any other reason that has nothing to do with the performance.
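As a rough sketch of why the two models give the same numbers, the consistency-type estimates can be written in terms of two-way ANOVA mean squares. The notation here is an assumption and is not taken from the text above: MS_R is the between-performances mean square, MS_E is the residual mean square, and k is the number of raters.

\[
\mathrm{ICC}_{\text{single}} = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E},
\qquad
\mathrm{ICC}_{\text{average}} = \frac{MS_R - MS_E}{MS_R}
\]

Both expressions depend only on the mean squares computed from the data, so the choice between the mixed and random model changes what the estimate may be taken to mean, not its value.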
Although reliability is defined in terms of proportions of variances, it is possible to get negative reliability estimates when the items or raters are very poorly correlated, for example when they are negatively correlated on average. In such cases, the scale is probably unsuitable for its intended purpose. Negative estimates will also result from reverse-coded items, as would happen if one judge scored so that 0.0 was the highest score and 10.0 was the lowest score.
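The reverse-coding effect can be illustrated with a minimal sketch. It assumes the consistency-type average-measure estimate, (MS_R - MS_E) / MS_R, computed from two-way ANOVA mean squares. The function name and the scores are hypothetical and chosen only for illustration.

```python
def icc_average_consistency(ratings):
    """Average-measure, consistency-type ICC.

    ratings: list of rows, one row per performance, one column per judge.
    Assumed formula: (MS_rows - MS_error) / MS_rows.
    """
    n = len(ratings)       # number of performances
    k = len(ratings[0])    # number of judges
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows


# Hypothetical scores: four performances rated by two judges on a 0-10 scale.
scores = [
    [9.0, 8.5],
    [7.0, 7.5],
    [5.0, 5.5],
    [3.0, 2.5],
]
icc_agree = icc_average_consistency(scores)

# Same data, but the second judge now treats 0.0 as best and 10.0 as worst.
flipped = [[a, 10.0 - b] for a, b in scores]
icc_flipped = icc_average_consistency(flipped)

print(icc_agree > 0, icc_flipped < 0)
```

With the second judge's scale reversed, the residual mean square exceeds the between-performances mean square, so the estimate falls below zero even though the two judges rank the performances almost identically.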