Best practices

Best practice recommendations are based on the experience of IBM® customers, service representatives, and quality assurance testers. These best practices are not requirements and might not fit all environments. The intent is to provide general guidance in areas of concern that have arisen in the practice of using the software.

Use an iterative approach
Arriving at the correct configuration for your site can be an iterative task where you monitor your test environment and adjust. Ensuring that your test environment is a reasonable reflection of the production environment helps you arrive at an optimal configuration before moving to production.
Memory allocation
Set the MESSAGEPOOLSIZE configuration value to 512 MB (536870912) to avoid memory issues. Increase the REGION value for the CACCNTL step accordingly; set it to 768M or larger, or use REGION=0M if your site allows it. This provides ample room for growth in the environment without repeatedly running into memory allocation problems. Virtual storage addresses are backed by real or auxiliary storage only when they are referenced, so a larger size does not necessarily incur more cost when the storage is not needed, but it does make more virtual storage addressing available when it is. With APAR PH42853 applied, the default MESSAGEPOOL and REGION values for newly customized environments were updated to 512 MB (536870912) and 768M.
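A quick arithmetic sanity check for these values (a minimal sketch; the 512 MB and 768M figures come from the text above):

```python
# MESSAGEPOOLSIZE of 512 MB, expressed in bytes as in the text.
MB = 1024 * 1024
messagepoolsize = 512 * MB   # 536870912 bytes
region = 768 * MB            # REGION=768M expressed in bytes

print(messagepoolsize)           # 536870912
print(region / messagepoolsize)  # 1.5 (REGION leaves headroom above the pool)
```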
Use application profiling tool
For performance issues such as high CPU usage, a suspected loop, or low throughput, gather information by using an application profiling tool (for example, Application Performance Analyzer (APA)). Also examine address space settings, such as making the address space non-swappable in the PPT, raising the dispatching priority or service class, or changing the WLM policy.
Set TCP/IP buffers
Ensure that the PTFs for APAR PI99642 are applied and that the TCP/IP configuration allows for at least 1 MB buffers. Use the netstat all (port nnnn) command to verify that the buffers are set as expected.
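For example, from TSO, with 9087 standing in for your server's listen port (the z/OS UNIX shell equivalent is netstat -A -P 9087):

```
NETSTAT ALL (PORT 9087
```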
Monitor storage usage
Monitor real and auxiliary storage usage as part of standard z/OS® health checks. Cache space uses shared memory objects that are outside of the address space region value and contributes to system usage of real and auxiliary storage.
Tips for diagnostic and event logs
The diagnostic log and event log streams can be in either coupling facility or DASD backed log streams. Because these logs should not be shared across multiple servers, use DASD-only to avoid using coupling facility resources. DASD-only has the restriction that if CACPRTLG is run while the server is active, it must run on the same LPAR where the server is connected to the log stream.

Diagnostic and event log sizing

The shipped sizing for the diagnostic and event logs is generally too small. Consider the following practices:

  • The customized members create relatively small log streams for the diagnostic and event logs that the server address spaces use. In larger deployments, it makes sense to increase the size of these log streams. For example, STG_SIZE(3000) provides a staging data set size of around 10 MB. The system logger recommendation is a 10 MB minimum, which is substantially larger than the customized member's shipped default of STG_SIZE(512), or about 2 MB. The system logger also recommends a minimum offload data set of 1 MB, or a value of LS_SIZE(256); the customized member ships with LS_SIZE(1024), which gives an offload data set of around 4 MB. Larger sizes can be used, and are recommended, if the site can support them.
  • Consider the following settings for both diagnostic and event log allocations for servers:
    DATA TYPE(LOGR) REPORT(NO)
    DEFINE LOGSTREAM NAME(???log stream name???)
    DASDONLY(YES) LS_SIZE(2400) STG_SIZE(3000) STG_DATACLAS(????) RETPD(14) AUTODELETE(YES)

    These settings result in a 10 MB staging data set and 8 MB offload data set so that each offload data set can hold a percentage of the log stream that is equal to the HIGHOFFLOAD value. It depends on the site and environment activity, but the goal is to create a staging size that is large enough for a day of logging and then limit offloads and the number of data sets that are needed for offload. Change your test environments to use these values and evaluate.
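Because STG_SIZE and LS_SIZE are interpreted in 4 KB units, you can estimate the resulting data set sizes with a small calculation like the following (a sketch; DFSMS rounds the actual allocations, so real sizes differ somewhat from these raw figures):

```python
def size_mb(units_4k: int) -> float:
    """Convert a system logger size value (in 4 KB units) to megabytes."""
    return units_4k * 4096 / (1024 * 1024)

print(size_mb(512))   # 2.0  -> shipped STG_SIZE(512) staging default
print(size_mb(3000))  # 11.71875 -> STG_SIZE(3000), comfortably above the 10 MB target
print(size_mb(256))   # 1.0  -> logger minimum offload, LS_SIZE(256)
print(size_mb(1024))  # 4.0  -> shipped LS_SIZE(1024) offload default
```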

Classic address spaces
The CACCAT, CACINDX, CACCFGD, and CACCFGX DDs should all be unique to the address space, but should be on shared DASD if the server will move across LPARs. Classic address spaces do not support hot standby. However, you can stop a server on one LPAR in the sysplex and restart it on another, or recover on another LPAR after an LPAR loss, provided that the metadata is available from all systems.
Restarting Classic servers
Classic does not coordinate with automatic restart management (ARM), and Classic is not sysplex-enabled (hot standby). However, you can use dynamic VIPA (DVIPA) or the Sysplex Distributor to restart Classic servers on another LPAR in the sysplex, a form of cold standby through automation. Classic does not support automatic takeover by a standby system that is already running.
DDs that should be backed up
The CECSUB, CECRM, CACCFGD, and CACCFGX DDs should be part of a regular backup.
Use CDA for diagnostic metrics
For performance issues, gather diagnostic metrics by using the Classic Data Architect (CDA).
Number of subscriptions
Start with one or a few subscriptions and add more if:
  • You need more throughput to the target because each subscription will use its own TCP/IP connection.
  • You have different administration requirements for different workloads (that is, you will start or stop the subscriptions at different times).
  • You need to send data to different targets (each subscription can have only one target).

Each subscription requires additional resources and can make management more complex.

Message store cache
Ensure that the capture service is configured to use a single 64-bit memory object to cache changes: the capture service CACHEINDICATOR parameter should be set to 2 (the default). Newly created servers use this setting automatically, but an existing server might still use the previous default (0), which is less efficient than the newer implementation. To switch, use an operator command or CDA to set the CACHEINDICATOR configuration to the default. The next time a capture cache is created for a subscription, the cache starts using a single 64-bit memory object for caching changes.
Order of server shutdown commands
When you want to stop a server, first issue STOP,ALL for a controlled end of subscriptions and quiesce processing. This can take some time if changes are pending at the target. If the shutdown takes too long (a threshold that the site can set), escalate to STOP,ALL,IMMEDIATE and wait at least five minutes before you cancel the address space. Canceling the address space gives the server no chance to end processing and can lead to unpredictable problems; cancel should not be the first option for ending the server.
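From the console, the escalation sequence reads like this, where CACDS is a placeholder for the server's jobname:

```
F CACDS,STOP,ALL
   (controlled stop; wait for subscriptions to quiesce)
F CACDS,STOP,ALL,IMMEDIATE
   (escalate if the shutdown exceeds the site threshold)
C CACDS
   (last resort, at least five minutes after the immediate stop)
```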
Preventing BPXM023I messages
Sites should prevent BPXM023I messages for servers to avoid problems with system automation. If you see BPXM023I messages, grant the server address space READ access to the BPX.CONSOLE resource in the FACILITY class of your external security manager (ESM). See the "Securing a Classic data server" topic for your product.
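If your ESM is RACF, the access can be granted with commands like the following (CACUSER is a placeholder for the server's started-task user ID; other ESMs have equivalent controls):

```
RDEFINE FACILITY BPX.CONSOLE UACC(NONE)
PERMIT BPX.CONSOLE CLASS(FACILITY) ID(CACUSER) ACCESS(READ)
SETROPTS RACLIST(FACILITY) REFRESH
```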
Use log streams that are backed by the coupling facility
Use log streams that are backed by the coupling facility for the replication logs to allow servers to move across systems and for improved performance. Work with site experts to define the log stream resources properly. Outside of a proof-of-concept environment, DASD-only is usually not appropriate for a replication log stream.

The best practice is for a replication log to be defined as coupling facility backed in a sysplex because a DASD-only backed log stream might only be accessed from one system in a sysplex at a time (z/OS System Logger restriction). Most customers cannot accept that restriction in a production environment.

Use the CFSizer application to estimate the size of your replication log. The CICS® documentation provides guidance on forward recovery log streams, and sizing requirements for replication logging are similar. However, a replication log contains both UNDO and REDO records (before and after images) along with transaction semantics (commit/rollback) for UORs against recoverable files. Sizing a replication log can therefore draw on the sizing information for forward recovery logs, provided that you account for updates carrying both images. The CICS documentation includes formulas for calculating structure sizes for forward recovery log streams that are backed by the coupling facility.

The following example shows a sample structure definition. You can use the CFSizer tool to estimate the required sizes.

STRUCTURE NAME(replication_log_structure)
INITSIZE(init_size) SIZE(size)
PREFLIST(cfname) REBUILDPERCENT(1)

If you double the sizes that are recommended by the CFSizer tool for INITSIZE and SIZE, you can arrive at a starting point that accounts for before and after images for replication logging. Over time, experience might help you tune these numbers for the specific application. The number of logs in a structure affects the size of the structure. Smaller structures can be allocated, rebuilt, and recovered more quickly.

The coupling facility is divided into structures, and log stream coupling facility structures are defined to the LOGR policy by using the DEFINE STRUCTURE syntax of the IXCMIAPU utility. Pay particular attention to the LOGR keywords MAXBUFSIZE, AVGBUFSIZE, and LOGSNUM. The MAXBUFSIZE and AVGBUFSIZE values depend on the VSAM clusters that are being captured to the replication log; simple sample values are 64000 and 2048, respectively. A value of 64000 is recommended for MAXBUFSIZE because it keeps the structure list elements allocated in 256-byte units; any value greater than 65276 causes the list elements to be allocated in 512-byte units, which can be wasteful. The AVGBUFSIZE value should be close to the size of the blocks that are written to the log streams, because a large difference between the structure's average buffer size and the block size written for a log in the structure can lead to space problems before the structure is full. For LOGSNUM, specify a value of 10-20 for an optimally defined system (the maximum is 512); this keyword controls how much of the structure space is available to any particular log stream. Having more logs per structure makes tuning more difficult when the log streams have different requirements.

DEFINE STRUCTURE NAME(replication_log_structure) LOGSNUM(10)
MAXBUFSIZE(maximum_buffer_size) AVGBUFSIZE(average_buffer_size)

Define your log stream for the replication log in the structure. Pay particular attention to the definitions for STRUCTNAME, STG_DUPLEX, LS_SIZE, AUTODELETE(YES), RETPD(n), HIGHOFFLOAD and LOWOFFLOAD.
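Putting those keywords together, a coupling-facility-backed definition might look like the following sketch (the log stream and structure names are placeholders, and the duplexing choice is a site decision):

```
DATA TYPE(LOGR) REPORT(NO)
DEFINE LOGSTREAM NAME(???log stream name???)
  STRUCTNAME(replication_log_structure)
  STG_DUPLEX(YES) DUPLEXMODE(COND)
  LS_SIZE(2400) AUTODELETE(YES) RETPD(14)
  HIGHOFFLOAD(50) LOWOFFLOAD(10)
```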

The RETPD value determines how long data is retained for restart after an outage. Set the retention period to be greater than the longest outage that you expect. If the retention period is shorter than the outage, restarting replication might fail because the replication log data cannot be found; in that case, an initial reload is required.

For better performance, avoid "DASD shifts" (switching from one offload data set to another) during peak periods. You can increase the offload data set size by using larger LS_SIZE values; the goal is to limit the number of allocations during peak periods. Also consider using LS_ALLOCAHEAD(1) if your site has the DASD available to support the additional allocation.

You should also consider offloading smaller amounts of data more often. Use your structure size along with the HIGHOFFLOAD and LOWOFFLOAD values to control how much data is offloaded. You might start with HIGHOFFLOAD(50) and LOWOFFLOAD(10) and monitor your environment. Using HIGHOFFLOAD(50) can help ensure that you have space for writing while offloading occurs; you don't want to wait for offloads to complete. Using LOWOFFLOAD(10) helps keep some data in interim storage for the VSAM log reader to browse the replication log.
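As a rough illustration of what those thresholds mean (a sketch only; "interim storage" here stands in for the log stream's share of the structure or staging space):

```python
def offload_amount_mb(interim_mb: float, high_pct: int, low_pct: int) -> float:
    """Data moved per offload: the logger starts offloading at HIGHOFFLOAD
    and drains interim storage down to LOWOFFLOAD."""
    return interim_mb * (high_pct - low_pct) / 100

# With ~10 MB of interim storage and HIGHOFFLOAD(50)/LOWOFFLOAD(10),
# each offload moves about 4 MB and leaves ~1 MB resident for readers.
print(offload_amount_mb(10, 50, 10))  # 4.0
```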

Consider using WARNPRIMARY(YES) and using z/OS health checks to know when you're experiencing issues related to your replication logs.

Isolate VSAM base clusters by workload
Isolate VSAM base clusters by workload (subscription) to their own replication log so that each subscription needs to read only its own log data. Each subscription can have only one replication log, but multiple subscriptions can read the same replication log. When multiple subscriptions share a replication log, each subscription reads all of the data, which is likely to be wasteful (duplicate reading of the log). For low-volume environments, you might choose this setup for simplicity, but as a best practice, isolate the workloads.
Capture cache
For VSAM, a size of 256 MB might be sufficient because each subscription has its own log reader. You can change this setting by using the CAPTURECACHE option of the SET,REPL z/OS operator modify command. For details, see SET,REPL command.
Dedicated AOR
Provide a dedicated AOR in which the target apply writer and utility transactions execute. The AOR can be part of an existing CICSPlex®.
Target PSB scheduling
Scheduling PSBs is costly. To limit PSB scheduling during normal replication, it is recommended that you choose a non-zero PSB scheduling option for the SCHEDULEBEHAVIOR value when suitable for your environment.
Persistent subscriptions
By default, when you create a subscription, it is marked as persistent. For a persistent subscription, if the capture service cannot start replication when the source server starts, it retries at 30-second intervals until it succeeds. Similarly, if the connection with the target is lost after replication starts, the capture service tries to restart replication at 30-second intervals until it succeeds.

In a production environment you generally want this kind of processing enabled. However, during the development phase the persistent subscription retry behavior might not be appropriate. Consider initially disabling persistence and then re-enabling it when you have a production environment established.

Also consider limiting the number of retries that the capture service performs by setting the CONNECTRETRYLMT configuration parameter to a non-zero value.
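The retry behavior described above can be pictured as a simple loop (an illustration only; names such as connect_retry_limit mirror, but are not, the actual server internals):

```python
import time

def start_with_retries(start_fn, connect_retry_limit=0, interval_s=30):
    """Retry start_fn until it succeeds. A limit of 0 retries forever,
    like a persistent subscription; a non-zero CONNECTRETRYLMT-style
    value caps the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        if start_fn():
            return attempts          # replication started
        if connect_retry_limit and attempts >= connect_retry_limit:
            raise RuntimeError("retry limit reached")
        time.sleep(interval_s)       # 30 seconds in the real server

# Example: the target becomes reachable on the third attempt.
results = iter([False, False, True])
print(start_with_retries(lambda: next(results),
                         connect_retry_limit=5, interval_s=0))  # 3
```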