# How does SPSS Statistics calculate percentiles in FREQUENCIES or CTABLES?

## Question

I'm running the FREQUENCIES procedure in SPSS Statistics and requesting percentiles. In some cases the values I get do not match those from other procedures, such as CTABLES, or from other programs or sources. How exactly do FREQUENCIES and CTABLES compute percentiles?

## Answer

SPSS has five different methods for computing percentiles (see the statistical algorithms for the EXAMINE procedure, available via Help > Algorithms).

The method used in the FREQUENCIES and MEANS procedures, and the default method in EXAMINE, is known as HAVERAGE, or the weighted average at `X_((w+1)*p)`. We describe it as the weighted average of `X_i` and `X_(i+1)`, where `i` is the integer part of `(w+1)*p` and `w` is the weighted case count (which would often be called N). The fractional part of `(w+1)*p` sets the weights: the closer it is to 1, the more weight `X_(i+1)` receives.
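The rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not SPSS's own code: every case has weight 1 (so `w` equals N), `p` is a proportion strictly between 0 and 1, and positions that fall before the first case or after the last one are clamped to the minimum or maximum. The function name `haverage` is made up for this sketch.

```python
import math

def haverage(values, p):
    """Percentile via the (w+1)*p weighted-average rule (unit weights only)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(values)
    w = len(xs)                # with unit weights, w is simply N
    if w == 0:
        raise ValueError("no data")
    pos = (w + 1) * p          # 1-based target position
    i = math.floor(pos)        # integer part
    g = pos - i                # fractional part, used as the weight
    # Assumption for this sketch: clamp positions outside the data.
    if i < 1:
        return xs[0]
    if i >= w:
        return xs[-1]
    # Weighted average of X_i and X_(i+1); xs is 0-based.
    return (1 - g) * xs[i - 1] + g * xs[i]

print(haverage([10, 20, 30, 40], 0.25))  # 12.5
print(haverage([10, 20, 30, 40], 0.50))  # 25.0
```

With four cases, the 25th percentile sits at position `(4+1)*0.25 = 1.25`, so the result is three quarters of the first value plus one quarter of the second.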

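This method can be confusing, because the estimated percentile can be higher than the value of the case that represents p% of the sample distribution. For example, with the four values 10, 20, 30, and 40, the 25th percentile is 12.5, even though the first case (10) already accounts for 25% of the sample. It is, however, the method yielding an unbiased estimate of a population percentile for p.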

A useful discussion of percentiles that features this version in its primary definition appears in an online engineering statistics text from the U.S. National Institute of Standards and Technology (NIST) at http://www.itl.nist.gov/div898/handbook/prc/section2/prc262.htm .

Another widely used method is AEMPIRICAL, which returns either one of the values in the data set or the value halfway between two adjacent values. CTABLES and the Visual Bander both use the AEMPIRICAL method. It is the "third way" discussed on the NIST web site.
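A minimal Python sketch of this behavior follows. It is not SPSS's implementation. It assumes unit case weights, a proportion `p` strictly between 0 and 1, and the usual reading of the empirical-distribution-with-averaging rule: when `w*p` lands exactly on a case count, the two neighboring values are averaged, and otherwise the next value up is returned. The function name `aempirical` is hypothetical.

```python
import math

def aempirical(values, p):
    """Percentile via the empirical distribution with averaging (unit weights)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(values)
    w = len(xs)
    if w == 0:
        raise ValueError("no data")
    pos = w * p                # cumulative case count targeted by p
    k = round(pos)
    # If w*p is (numerically) a whole case count, average X_k and X_(k+1).
    if 1 <= k <= w - 1 and math.isclose(pos, k):
        return (xs[k - 1] + xs[k]) / 2
    # Otherwise return the first value whose cumulative count exceeds w*p.
    return xs[math.floor(pos)]

print(aempirical([10, 20, 30, 40], 0.25))  # 15.0
print(aempirical([10, 20, 30, 40], 0.30))  # 20
```

The result is always a data value or a midpoint between two adjacent data values, which is why it can differ from an interpolated estimate on the same data.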

Another useful site is http://www.xycoon.com/quartiles.htm , which discusses eight different methods for computation of percentiles. HAVERAGE is method 2 on the xycoon web site, and AEMPIRICAL is method 4.

Many people are disturbed by the existence of different values for the same percentile on the same data, but this is unavoidable. Consider the definition of a percentile from the NIST web site: "The pth percentile is a value, Y(p), such that at most (100p)% of the measurements are less than this value and at most 100(1- p)% are greater." Even if you are working with a population rather than a sample, unless N*p (or W*p) is exactly equal to the cumulative case count up to a certain case, nothing actually satisfies the definition for the given data, and you're left trying to estimate the best approximation.
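The same effect is easy to reproduce outside SPSS. As an illustration only, Python's standard library function `statistics.quantiles` offers two positioning rules, and they disagree on the quartiles of a small data set. Its `"exclusive"` method places the i-th of m sorted points at i/(m+1), which for unweighted data corresponds to the `(w+1)*p` positioning described earlier. Its `"inclusive"` method is a different rule and is not one of the SPSS methods named in this note.

```python
from statistics import quantiles

data = [10, 20, 30, 40]

# Quartiles (n=4 gives the 25th, 50th, and 75th percentiles).
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print(exclusive)  # [12.5, 25.0, 37.5]
print(inclusive)  # [17.5, 25.0, 32.5]
```

Both answers are defensible estimates. They differ because no single value in four cases sits exactly at the 25% or 75% mark.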


Modified date: 16 April 2020

swg21480663