Design for large files

Processing large files presents many challenges, which are often best addressed at the design level rather than later, during implementation. A large file might:

Require a lot of memory to parse into a logical tree.
Create a large database unit of work during mapping.
Force serialized processing during mapping.

Some of these issues can be addressed during implementation by combining partial parsing with chunking and by making interim database commits. FTM also supports a concept called fragmentation, which can address all of these issues.

The fragmentation feature is intended to support fragmentation services, which can transform a large message to or from smaller fragments. Input fragmentation services take large input files, messages, or both, and split them into a number of fragments for consumption by a standard FTM wrapper flow. Similarly, output fragmentation services consume a number of associated fragments that are produced by standard FTM outbound mappers and combine them into a single large file or message.
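As an illustration of the input side, the following minimal Python sketch splits a payload into fixed-size fragments and records a sequence number, byte offset, and length for each. It is not an FTM API: the names Fragment and split_into_fragments are hypothetical, the 1-based sequence numbering is an assumption, and a real fragmentation service would normally split on batch or record boundaries rather than at arbitrary byte positions.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Fragment:
    sequence: int     # position of the fragment (1-based numbering is an assumption)
    bytesoffset: int  # offset from the start of the transmission to this fragment
    length: int       # size of the fragment data in bytes
    data: bytes


def split_into_fragments(payload: bytes, max_size: int) -> Iterator[Fragment]:
    """Yield fragments of at most max_size bytes; the last one may be shorter."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    for seq, offset in enumerate(range(0, len(payload), max_size), start=1):
        chunk = payload[offset:offset + max_size]
        yield Fragment(seq, offset, len(chunk), chunk)


# A 10-byte payload split into fragments of at most 4 bytes.
fragments = list(split_into_fragments(b"A" * 10, max_size=4))
for f in fragments:
    print(f.sequence, f.bytesoffset, f.length)
```

Because each fragment carries its own offset and length, the original payload can be rebuilt by concatenating the fragment data in sequence order.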

FTM can then process the fragments separately, and even concurrently, while maintaining the relationship between them. The following figure shows how FTM can use a number of fragments to represent a single transmission.

Starting with FTM V2.1.1, fragmentation is supported in the EndMapper V2 interface and includes support for nested batches. Fragmentation with EndMapper V1 does not support nested batches.

The following figure shows how FTM can use a number of fragments to represent a single transmission that includes nested batches.

Figure 2. Inbound fragmentation model with nested batches

Inbound fragmentation services and FTM exchange data that is related to the physical transmission and the fragment in the <usr> folder of an RFH2 header. The RFH2 header should be present on all WebSphere® MQ messages that are exchanged between a fragmentation service and FTM.

The following table shows the fragment data that can be included in the RFH2 header under the path MQRFH2.usr.ibmepp.fraginfo.

Table 1. Fragment data for the RFH2 header
RFH2 header field	Required	Associated database column	Description
transmissionCustomerRef	N	TRANSMISSION_BASE.CID	Customer reference for the transmission object. This value is associated with the CID field of the associated physical transmission object in FTM, and is used to link all related fragments together.
transmissionId	Y	TRANSMISSION_BASE.UID	Unique identifier of the fragmented transmission
totalcount	Y¹	TRANSMISSION_BASE.FRAG_COUNT	Total number of fragments in the transmission
fragmentCustomerRef	N	FRAGMENT_BASE.CID	Customer reference for the fragment
fragmentId	N	FRAGMENT_BASE.UID	The UID of the fragment object
sequence	Y	FRAGMENT_BASE.SEQUENCE	Sequence number of the current fragment
firstBatSeq	Y	FRAGMENT_BASE.FIRST_BAT_SEQ	Specifies the sequence number of the first batch in the fragment
firstBatSpan[]	Y²	FRAGMENT_BASE.FIRST_BAT_SPAN	For batches that complete in this fragment but started in an earlier one, specifies how many fragments the first batches in the fragment span.
lastBatSeq	Y	FRAGMENT_BASE.LAST_BAT_SEQ	Specifies the sequence number of the last batch in the fragment at the top level of the hierarchy only, that is, ignoring any nested batches.
totalBatComplete	Y	n/a	Indicates how many batches complete in this fragment.
lastBatComplete	N³	FRAGMENT_BASE.LAST_BAT_COMPLETE	Indicates whether the last batch in the fragment is complete.
firstBatChildSeq[]	N	FRAGMENT_BASE.TXN_SEQ_OFFSET	Specifies the next sequence number for the first child in the first batch (including hierarchy for nested batches). This allows an offset to be carried over when a batch spans multiple fragments. In earlier versions of FTM, this field was named 'firstBatInitialTxnSeq' and could not repeat. Backwards compatibility is retained for older fragmentation implementations, where EndMapper V1 is used with no batch nesting.
bytesoffset	N	FRAGMENT_BASE.BYTE_OFFSET	Offset, in bytes, from the beginning of the associated physical transmission to the beginning of the fragment data.
length	N	n/a	The size, in bytes, of the fragment.
Notes:
¹ At least one fragment must include this information so that the process can know when all fragments are processed.
² This must be provided for the first batch in a fragment if it completes in that fragment. Where this information is not provided, it is assumed to indicate that the batch continues in a later fragment.
³ This is mandatory if using EndMapper V1. When using EndMapper V2, use 'totalBatComplete', which is compatible with nested batches. A value of Y with EndMapper V2 indicates that all batches (nested or otherwise) are complete.
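To make the folder layout concrete, the following Python sketch builds the nested usr.ibmepp.fraginfo structure as XML text from a dictionary of the fields in Table 1. Treat it as an assumption-laden illustration: the function name build_fraginfo and the sample values are hypothetical, the exact serialization that an MQ client expects for an RFH2 folder is not shown here, and only the unconditionally required fields are checked (the conditional requirements in the notes are not enforced).

```python
import xml.etree.ElementTree as ET

# Fields marked Y without a footnote in Table 1.
REQUIRED = ("transmissionId", "sequence", "firstBatSeq", "lastBatSeq", "totalBatComplete")


def build_fraginfo(fields: dict) -> str:
    """Return the <usr> folder content with fields nested under ibmepp/fraginfo."""
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    usr = ET.Element("usr")
    fraginfo = ET.SubElement(ET.SubElement(usr, "ibmepp"), "fraginfo")
    for name, value in fields.items():
        ET.SubElement(fraginfo, name).text = str(value)
    return ET.tostring(usr, encoding="unicode")


# Hypothetical values for the first of three fragments.
folder = build_fraginfo({
    "transmissionId": "TX-001",
    "totalcount": 3,
    "sequence": 1,
    "firstBatSeq": 1,
    "lastBatSeq": 2,
    "totalBatComplete": 1,
})
print(folder)
```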

The fragments are logged to the FTM database as FRAGMENT objects in the FRAGMENT_BASE table.
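Because fragments can be processed concurrently and arrive in any order, completion can only be detected once the total count is known and every sequence number has been seen. The following Python sketch illustrates that bookkeeping in memory. It is not how FTM implements it: the class TransmissionTracker is hypothetical and stands in for state that FTM keeps in its database.

```python
from typing import Optional, Set


class TransmissionTracker:
    """Tracks which fragment sequence numbers of one transmission were seen."""

    def __init__(self) -> None:
        self.total: Optional[int] = None  # unknown until a fragment supplies totalcount
        self.seen: Set[int] = set()

    def record(self, sequence: int, totalcount: Optional[int] = None) -> None:
        self.seen.add(sequence)
        if totalcount is not None:
            self.total = totalcount

    def is_complete(self) -> bool:
        # Complete only when the total is known and every fragment was recorded.
        return self.total is not None and len(self.seen) == self.total


tracker = TransmissionTracker()
tracker.record(2)                    # fragments may arrive out of order
tracker.record(1)
before = tracker.is_complete()       # total still unknown
tracker.record(3, totalcount=3)      # only this fragment carries totalcount
after = tracker.is_complete()
print(before, after)
```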

FTM does not use the RFH2 interface for outbound fragments. Instead, it uses the standard message grouping features of IBM® MQ, with the TRANSMISSION ID used as the message group identifier.

The defragmenter should be configured to:

Read messages by group / commit by group.
Read messages in sequence.
Wait for all messages in a group to be available.

FTM uses the 'Last Message In Group' option when sending the final fragment.

With this configuration and the 'Last Message In Group' indicator, the defragmenter can ensure that the message is reconstructed correctly from the fragments. When working with nested outbound batches, the BATCH_ID column of the fragment record must be set to the ID of the first batch in the fragment. That information, along with the TXN_SEQ_OFFSET column, allows the outbound mappers to locate the correct set of batches and transactions that must be mapped in each fragment.
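The defragmenter behavior described above (reading by group, in sequence, and waiting until the whole group is available) can be simulated in memory as follows. This Python sketch does not call IBM MQ: GroupMessage and Defragmenter are hypothetical names, and in a real defragmenter the ordering and waiting would normally be delegated to the queue manager through message grouping options rather than coded by hand.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GroupMessage:
    group_id: str        # the transmission ID, used as the message group identifier
    seq: int             # sequence number within the group (1-based, assumed)
    last_in_group: bool  # set only on the final fragment
    body: bytes


class Defragmenter:
    def __init__(self) -> None:
        self._groups: Dict[str, Dict[int, GroupMessage]] = {}

    def add(self, msg: GroupMessage) -> Optional[bytes]:
        """Store a fragment; return the reassembled message once the group is whole."""
        group = self._groups.setdefault(msg.group_id, {})
        group[msg.seq] = msg
        last = [m.seq for m in group.values() if m.last_in_group]
        if not last:
            return None                      # final fragment not seen yet
        expected = last[0]
        if any(i not in group for i in range(1, expected + 1)):
            return None                      # still waiting for earlier fragments
        del self._groups[msg.group_id]       # group is consumed as one unit
        return b"".join(group[i].body for i in range(1, expected + 1))


d = Defragmenter()
first = d.add(GroupMessage("TX-001", 2, True, b"-tail"))   # arrives early
second = d.add(GroupMessage("TX-001", 1, False, b"head"))  # completes the group
print(first, second)
```

The final fragment's flag tells the defragmenter how many messages the group contains, which is why the output is withheld until every sequence number up to that fragment is present.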