Message Sets: Using regular expressions to parse data elements

Use regular expressions to identify parts of an input message that are associated with subfields.

If your input messages can contain subfields whose presence or absence can be determined only by examining the actual value of the data (for example, an optional field of numeric digits followed by one or more alphabetic characters), you must use the Data Element Separation method Use Data Pattern.

This situation is particularly relevant to messages that conform to the SWIFT industry standard. To use this method, you must provide regular expressions to identify those portions of an input message that are to be associated with subfields. You must provide a regular expression value for the Data Pattern property of each child of the complex type.

When parsing, data is matched in turn with each child of the complex type. The parser uses the regular expression for the child to determine the number of characters from the message that apply to that child. This number of characters is the length of the longest string, starting from the current position in the message, that matches the regular expression. If the longest string that matches the regular expression is of length zero, the element is present in the message, and the empty string is used for the value. If no string matches the regular expression, the element is not present. This situation might cause a subsequent validation error if the element is required.
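
The matching rule described above can be sketched in Python. This is an illustration only, not the parser's implementation: the function name match_child is hypothetical, and Python's re module is a greedy, backtracking matcher rather than a guaranteed longest-match one, so its result coincides with the longest match only for simple patterns such as the ones used here.

```python
import re

def match_child(pattern, message, pos=0):
    """Return the text matched at pos ("" for a zero-length match),
    or None when the pattern does not match (element not present)."""
    m = re.compile(pattern).match(message, pos)
    return None if m is None else m.group(0)

print(repr(match_child("[0-9]*", "123456ABC")))  # '123456'
print(repr(match_child("[A-Z]*", "123")))        # '' (present, empty value)
print(repr(match_child("[A-Z]+", "123")))        # None (not present)
```

The distinction between the empty string and None mirrors the difference between an element that is present with an empty value and an element that is not present.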

After the number of characters from the input message has been determined, normal data conversion, or further parsing in the case of a complex element, is performed on the text of the input message to assign values to elements. This might lead to data overrun or underrun errors if the length identified by the pattern is not appropriate for the definition of the child.

Message Sets: Regular expression syntax explains the full syntax rules and how to apply them, but the following table gives a few simple examples of parsing using data patterns. A more complex example appears after the table.

Input message      Data Pattern                       Value matched
"123456ABC"        [0-9]*                             "123456"
"123"              [A-Z]*                             ""
"123"              [A-Z]+                             Not present
"0x2A2B"           \x2A+                              X'2A'
"ABCD123"          [A-Z]{1,3} (first field)           "ABC" (first field: the longest string matching the pattern)
                   [A-Z]{2,4} (second field)          Not present (second field: the minimum length of two alphabetic characters is not present)
"ABCDEFGHIJ1234"   [A-Z]{1,3} (first field, repeat)   "ABC" (first field [1])
                                                      "DEF" (first field [2])
                                                      "GHI" (first field [3])
                                                      "J" (first field [4])
                   [0-9]+ (second field)              "1234" (second field: the repeating field is terminated when the data "1234" no longer matches the data pattern specified for the first field)

The following example shows three-field pattern matching.

Message definition:
	Complex type: Data Element Separation=Use Data Pattern
	Field1:	xsd:string minOccurs=1, maxOccurs=1, Length=5, Pad=SPACE,
				Data Pattern=".{5}"
	Field2:	xsd:int minOccurs=0, maxOccurs=1,
				Data Pattern="[0-9]{0,6}"
	Field3:	xsd:string minOccurs=1, maxOccurs=1, minLength=3, maxLength=4,
				Data Pattern="[A-Z][A-Za-z0-9]{2,3}"

Input1:		"ABCDE123F12"
Result1:		Field1="ABCDE", Field2="123", Field3="F12"

Input2:		"ABCDEF12"
Result2:		Field1="ABCDE", Field2=not present, Field3="F12"

Input3:		"ABCDE123456XXXX"
Result3:		Field1="ABCDE", Field2="123456", Field3="XXXX"

Input4:		"ABCDE1234567"
Result4:		Field1="ABCDE", Field2="123456", Field3=not present,
				which causes an exception if validation is enabled. One
				character ("7") remains unassigned to any element, which
				also causes an exception.
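
The three-field example can be traced with a short Python sketch. The patterns are copied from the message definition above; the function name parse_sequence and the returned structure are hypothetical, and the sketch only splits the text. It performs no data conversion, length checking, or validation, and it relies on Python's re module, which is assumed to behave like the parser for these simple patterns.

```python
import re

# (name, Data Pattern) pairs copied from the message definition above.
FIELDS = [
    ("Field1", r".{5}"),
    ("Field2", r"[0-9]{0,6}"),
    ("Field3", r"[A-Z][A-Za-z0-9]{2,3}"),
]

def parse_sequence(message, fields=FIELDS):
    """Match each child in turn; return (values, unassigned_text)."""
    pos = 0
    values = {}
    for name, pattern in fields:
        m = re.compile(pattern).match(message, pos)
        if m is None:
            values[name] = None          # element not present
        else:
            values[name] = m.group(0)    # text assigned to this child
            pos = m.end()                # continue after the matched text
    return values, message[pos:]

for message in ("ABCDE123F12", "ABCDEF12", "ABCDE123456XXXX", "ABCDE1234567"):
    print(message, parse_sequence(message))
```

For Input1, Input3, and Input4 the sketch splits the text as listed in the results above, including the unassigned "7" for Input4. For Input2 it returns an empty string for Field2, because [0-9]{0,6} matches zero characters at that position; the Result2 line above reports that case as not present.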

In the case of a repeating child, instances of the child are parsed as many times as the pattern matches. This applies even if Max Occurs is specified for the repeating element and the number of occurrences exceeds that upper bound. Therefore some terminating condition must be determinable from the regular expression pattern for the element. The table above includes an example of a repeating element.
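
A minimal Python sketch of the repeating case follows, using the repeating example from the table. The function name parse_repeating is hypothetical, and stopping on a zero-length match is an assumption of this sketch, added so that the loop always terminates; it is not a documented parser rule.

```python
import re

def parse_repeating(pattern, message, pos=0):
    """Collect occurrences while the pattern keeps matching at pos."""
    regex = re.compile(pattern)
    occurrences = []
    while True:
        m = regex.match(message, pos)
        # Stop when nothing matches. Also stop on a zero-length match
        # (an assumption of this sketch) so the loop cannot run forever.
        if m is None or m.end() == pos:
            break
        occurrences.append(m.group(0))
        pos = m.end()
    return occurrences, pos

message = "ABCDEFGHIJ1234"
first, pos = parse_repeating(r"[A-Z]{1,3}", message)
second = re.compile(r"[0-9]+").match(message, pos)
print(first)            # ['ABC', 'DEF', 'GHI', 'J']
print(second.group(0))  # 1234
```

The loop ends only because the digits "1234" do not match [A-Z]{1,3}; a pattern that could also match the data of the following field would absorb it.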

When parsing, the data from the input message that matches the Data Pattern, and that is assigned to an element, is not scanned further for delimiters of a higher-level complex type. This behavior is similar to that of Data Element Separation method Fixed Length. However, you can code a regular expression that matches data up to one of a number of possible delimiters.
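
As an illustration of the closing sentence above, a pattern built on a negated character class stops at the first of several delimiter characters. The delimiters (comma, semicolon, and colon), the sample data, and the function name are assumptions made for this Python sketch; check the regular expression syntax topic for the constructs that the parser itself supports.

```python
import re

# Hypothetical delimiters for this sketch: comma, semicolon, and colon.
UP_TO_DELIMITER = re.compile(r"[^,;:]*")

def value_before_delimiter(message, pos=0):
    """Return the text from pos up to, but not including, the next delimiter."""
    return UP_TO_DELIMITER.match(message, pos).group(0)

print(value_before_delimiter("ABC;DEF,GHI"))     # ABC
print(value_before_delimiter("ABC;DEF,GHI", 4))  # DEF
```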

When writing, if a length is specified for a child, the value is padded as appropriate to that length. This behavior is similar to that of Data Element Separation method Variable Length Elements Delimited, but without delimiters.

If the message includes a complex type that has Composition set to Choice, you can set the Data Element Separation method to Use Data Pattern. In this case, the Data Pattern values of the children are used to resolve the choice. Starting with the first child, the first pattern to provide a match determines which child is present. Therefore the order of children in a choice might be important.
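
Choice resolution by first match can be sketched in Python as follows. The child names and patterns are hypothetical and were chosen only to show that reordering the children changes the outcome; the function name resolve_choice is likewise an assumption.

```python
import re

def resolve_choice(children, message, pos=0):
    """Return (name, matched_text) for the first child whose pattern matches."""
    for name, pattern in children:
        m = re.compile(pattern).match(message, pos)
        if m is not None:
            return name, m.group(0)
    return None  # no child of the choice matches

# Hypothetical children of a choice; names and patterns are illustrative only.
children = [("Letters", r"[A-Z]+"), ("Code", r"[A-Z0-9]+")]
print(resolve_choice(children, "AB12"))                  # ('Letters', 'AB')
print(resolve_choice(list(reversed(children)), "AB12"))  # ('Code', 'AB12')
```

Both patterns match the start of "AB12", so the child that is listed first is the one selected.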

A complex type can contain repeating children with Max Occurs unbounded. Length, and other associated properties such as Justification and Padding, can optionally be specified for the children.

See Message Sets: TDS message model integrity for rules that you must follow when using the Data Element Separation method Use Data Pattern, and refer to Message Sets: Combinations of Composition and Content Validation for details of valid settings of Composition and Content Validation.