Collation-Order properties

Character ordering or collation refers to the culture-specific ordering of characters. This ordering differs from that based on the ordinal value of a character in a code set.

Character ordering or collation refers to the culture-specific ordering of characters. This ordering differs from that based on the ordinal value of a character in a code set. Collation-based ordering is dependent on the language. Character collation is specified by the LC_COLLATE category. The term collating element refers to one or more characters that have a collation value in a specific locale. The Spanish ll character is an example of a multicharacter collating element.

To sort the characters in any given language in the proper order, a weight is assigned to each character so that the characters sort as expected. However, a character's sort value and code-point value are not necessarily related.

One set of weights is not sufficient to sort strings for all languages. For example, in the case of the German words b<a-umlaut>ch and bane, if there is only one set of weights, and the weight of the letter a is less than that of <a-umlaut>, then bane sorts before b<a-umlaut>ch. However, the opposite result is correct. To satisfy the requirement of this example, two sets of weights, the Primary and Secondary Weights, are given to each character in the language. In the case of the characters a and <a-umlaut>, they have the same Primary Weights, but differ in their Secondary Weights. In the German locale, the Secondary Weight of a is less than that of <a-umlaut>.

The sorting algorithm first compares the two strings based on the Primary Weights of each character. If the Primary Weight values are the same, the two strings are compared again based on their Secondary Weights. In this example, the Primary Weights of the first two characters ba and b<a-umlaut> are the same, but the Primary Weights of the characters that follow (c and n, respectively) differ. As a result of this comparison, b<a-umlaut>ch is sorted before bane.

Here, the Secondary Weights are not used to collate the strings. However, as in the case of the strings bach and b<a-umlaut>ch, Secondary Weights must be used to get the proper order. When compared using Primary Weight values, these two strings are found to be equivalent. To break the tie, the Secondary Weights of a and <a-umlaut> are used. Because the Secondary Weight of a is less than that of <a-umlaut>, the string bach sorts before b<a-umlaut>ch.

Characters having the same Primary Weights belong to the same equivalence class. In this example, the characters a and <a-umlaut> are said to be members of the same equivalence class.

In string collation, each pair of strings is first compared based on Primary Weight. If the two strings are equal, they are compared again based on their Secondary Weights. If still equal, they are compared again based on Tertiary Weights up to the limit set by the COLL_WEIGHTS_MAX collating weight limit specified in the sys/limits.h file.