Collecting entity and transaction statistics

The MDM statistics collection feature enables you to collect statistics about entities and transactions in the InfoSphere® MDM system.

The MDM statistics collection feature uses Apache Kafka Streams and Kafka Connect applications to compute and process statistics data. Kafka server capabilities allow InfoSphere MDM to store statistics data with automatic persistence management and to process streams online in a time-based window. This approach matters because the amount of collected statistics data can be extremely large in a production system, and that volume grows with every transaction that is run. If the MDM database had to handle this data alone, it would quickly become overwhelmed and performance would deteriorate. To avoid impacting InfoSphere MDM system performance, the statistics are collected in real time, but are aggregated hourly before they are stored in the statistics tables.

Restriction: The MDM statistics collection feature is not fully supported on z/OS operating systems because statistics collection relies on the Apache Kafka Streams application, which does not support z/OS. For more information, including a workaround, see Known issues and limitations of the statistics collection feature.

Data flow and architecture

In simple terms, the MDM statistics collection feature works as follows:

  1. The statistics collection facilities collect entity and transaction statistics data in real time when users run transactions.
  2. The collected data is published into Kafka input topics at the end of each transaction.
  3. The statistics streams application, a long-running Kafka client application built on Kafka Streams, consumes the data and aggregates it in fixed hourly windows.
  4. The computed results are recorded in Kafka output topics.
  5. The statistics connector, a long-running Kafka client application built on Kafka Connect, consumes the results and stores them in two statistics tables: EntityStatistics and TransactionStatistics.
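
The fixed hourly window in step 3 can be pictured with a minimal sketch in plain Java. It is not the statistics streams application and it does not use the Kafka Streams API. The TransactionEvent class, the transaction names, and the "window start|transaction name" key format are hypothetical and chosen only for illustration. The sketch assigns each event to the hour in which it completed and counts events per hour and transaction name, which is roughly the kind of result that a fixed-window aggregation produces.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java with no Kafka dependency.
// All names in this class are hypothetical and are not part of InfoSphere MDM.
final class HourlyWindowSketch {

    // Hypothetical event shape; the real record format is not described in this topic.
    static final class TransactionEvent {
        final String transactionName;
        final Instant completedAt;

        TransactionEvent(String transactionName, Instant completedAt) {
            this.transactionName = transactionName;
            this.completedAt = completedAt;
        }
    }

    // A fixed hourly window is identified by the start of the hour (UTC).
    static Instant windowStart(Instant timestamp) {
        return timestamp.truncatedTo(ChronoUnit.HOURS);
    }

    // Counts events per "window start|transaction name" key.
    static Map<String, Long> countPerHour(List<TransactionEvent> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (TransactionEvent event : events) {
            String key = windowStart(event.completedAt) + "|" + event.transactionName;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<TransactionEvent> events = Arrays.asList(
            new TransactionEvent("addPerson", Instant.parse("2024-01-01T10:05:00Z")),
            new TransactionEvent("addPerson", Instant.parse("2024-01-01T10:59:59Z")),
            new TransactionEvent("addPerson", Instant.parse("2024-01-01T11:00:00Z")),
            new TransactionEvent("updatePerson", Instant.parse("2024-01-01T10:30:00Z")));

        // Prints one line per (hour, transaction name) pair with its count.
        countPerHour(events).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

In this sketch, the two addPerson events that complete between 10:00 and 11:00 fall into the same window, while the event at exactly 11:00 starts a new one. Because only one aggregated row per window and key needs to be stored, the volume written to the statistics tables stays far below the raw transaction volume.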

The MDM statistics collection feature uses the Kafka producer components and its own statistics components on the InfoSphere MDM operational server to collect data. It also includes a statistics streams application and a statistics connector application, both of which run in Kafka environments.

Figure 1. MDM Statistics Collection component architecture
The architecture flow diagram for the statistics feature shows data being collected and published by Kafka streams, connectors, and topics
Important: Continuous availability of Kafka brokers is critical for MDM statistics. Kafka needs Apache ZooKeeper to manage the cluster and to ensure its continuous availability. For more information, see Known issues and limitations of the statistics collection feature.
Tip: To extend the MDM statistics collection feature to support domains or attributes beyond the defaults, you must customize the InfoSphere MDM statistics collection services, the data collection facilities, and the Kafka statistics producers, streams, and connectors. That customization work is outside the scope of this document.