IBM Support

Best practices for configuring a reliable and scalable peer group for quota enforcement

Question & Answer


Question

What are the best practices for configuring a reliable and scalable peer group for quota enforcement?

Cause

Improper changes to the peer group configuration of quota enforcement can cause unexpected results in the peer group.

Answer

To avoid unexpected results when operational failures occur or when the peer group configuration of quota enforcement is changed improperly, administrators can follow these best practices:
Peer group requirements for quota enforcement configuration:
  • A peer group must contain at least three peers so that quota enforcement can fail over when the primary becomes unavailable. Failover occurs automatically when both of the following conditions are met.
    1. Two replicas agree that the primary is not reachable after a timeout of 10 seconds. This agreement is how the primary failure is detected.
    2. More than half of all peers in the peer group are reachable.
  • When failover is performed after 10 seconds, a new primary is selected and the other replicas are informed of the new primary address when they connect to peers. After failover, all data changes are written to the new primary, and the new primary synchronizes the data across the peer group. When the original primary becomes operational again, it works as a replica in the peer group.
  • Make sure that all peers in the peer group use the same GatewayScript files. Identical files ensure that the threshold for a specific traffic type is the same across the peer group.
  • Based on your requirements for quota enforcement, decide whether to enable or disable strict mode. In a peer group, when the primary becomes unavailable, replicas lose their connection to the primary before failover occurs. In this situation, a replica behaves differently depending on its strict mode setting.
    • Enabled strict mode: A replica with strict mode enabled cannot process the request.
    • Disabled strict mode: If service performance and availability are more important than data consistency, you can disable strict mode for the replica so that it can process the request locally. A replica with strict mode disabled writes the threshold and associated metadata to its local data storage, which can affect I/O transactions. After failover occurs, the connection between the replicas and the new primary is restored. When the new primary synchronizes the data to all replicas, it can overwrite the threshold and associated metadata that the replica stored, so data consistency across the peer group can be affected.
  • Decide whether to use memory or the RAID volume for data storage. The threshold, the counter, and their associated metadata can be persisted on the RAID volume or stored in memory. When quota enforcement works in peer group mode, all peers must use the same data storage location: either all peers store data on the RAID volume or all peers store data in memory. A combination of RAID volume and memory in the peer group is not allowed.
    If data storage of all peers is in-memory, the following behaviors occur:
    • After you configure the peer group for quota enforcement, follow these rules when you reconfigure or manually reboot a peer.
      • Before you reconfigure or reboot the primary, switch a replica to primary first. Then, reconfigure or reboot the original primary. In this case, the originally stored data remains in the memory of the new primary.
      • When you reconfigure or reboot a replica, the replica synchronizes data with the primary. In this case, the originally stored data remains in the memory of the primary.
    • If the primary becomes unavailable and is automatically restarted within the primary timeout (10 seconds), before failover occurs, the database in the primary becomes empty. The replicas then synchronize data with the resumed primary, so the originally stored data is lost.
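
The two failover conditions listed above can be expressed as a small check. The following Python sketch is illustrative only: the PeerGroupView type and its field names are hypothetical and are not part of any DataPower interface, and it assumes that "two replicas agree" means at least two replicas.

```python
from dataclasses import dataclass

PRIMARY_TIMEOUT_SECONDS = 10  # primary-failure timeout stated in this document


@dataclass
class PeerGroupView:
    """Hypothetical snapshot of a peer group (not a DataPower data structure)."""
    total_peers: int                    # all configured peers, including the primary
    reachable_peers: int                # peers that can currently be reached
    replicas_reporting_down: int        # replicas that see the primary as unreachable
    seconds_primary_unreachable: float  # how long the primary has been unreachable


def failover_expected(view: PeerGroupView) -> bool:
    """Return True when both failover conditions hold."""
    # Condition 1: two replicas agree that the primary is unreachable
    # after the 10-second timeout (assumed here to mean "at least two").
    primary_failure_detected = (
        view.replicas_reporting_down >= 2
        and view.seconds_primary_unreachable >= PRIMARY_TIMEOUT_SECONDS
    )
    # Condition 2: strictly more than half of all peers are reachable.
    majority_reachable = view.reachable_peers * 2 > view.total_peers
    return primary_failure_detected and majority_reachable
```

Under these assumptions, a three-peer group that loses its primary still has two of three peers reachable, so both conditions can be met. A four-peer group with only two reachable peers does not satisfy the "more than half" condition.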

When you configure a peer group for quota enforcement, you can follow these rules:
  1. Make sure that the administrative state is enabled. If it is not, enable it.
  2. When you create a peer group, add peer members to the Peers list one at a time by starting peers in sequence. A peer connects to other peers in the order that is specified in the Peers list. When you stop part of or the whole peer group, remove peers one at a time by disabling the quota enforcement server. You cannot start or stop all peers at the same time.
  3. You can specify whether data storage is persisted on the RAID volume or is in-memory.
    • For persistent storage, select the RAID volume, which must be the raid0 RAID volume.
    • For in-memory storage, do not select the RAID volume. By default, the data storage is in-memory.
  4. The priority affects only the result of the failover; it does not affect the role of a peer that joins a peer group. When the peer group is down, you must restart the peers in the peer group one at a time. You cannot restart all peers at the same time. Failover occurs when more than half of all peers are running again and reachable. In this situation, the first peer to become active works as the new primary.
  5. All peers in the peer group must use the same SSL configuration. That is, the following settings must be the same on all peers: state of SSL enablement, key alias, and certificate alias.
    • To change SSL from enabled to disabled, follow these steps:
      1. Disable the quota enforcement server on all peers.
      2. Disable SSL on all peers.
      3. Enable the quota enforcement server on the primary.
      4. Enable the quota enforcement server on replicas one at a time.
      5. Check the quota enforcement server status provider on all peers to make sure that all peers are operational and in peer group mode.
    • To change SSL from disabled to enabled, follow these steps:
      1. Disable the quota enforcement server on all peers.
      2. Enable SSL on all peers, and make sure that all peers use the same key alias and certificate alias.
      3. Enable the quota enforcement server on the primary.
      4. Enable the quota enforcement server on replicas one at a time.
      5. Check the quota enforcement server status provider on all peers to make sure that all peers are operational and in peer group mode.
  6. All peers must use the same strict mode.
For more information, see the Quota enforcement topic in IBM Documentation.
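
The consistency rules above (at least three peers, one storage location, one strict mode, one SSL configuration) lend themselves to a pre-deployment check. The following Python sketch is a minimal illustration under stated assumptions: the PeerConfig type, its field names, and the check_peer_group function are hypothetical and do not correspond to DataPower configuration objects or commands.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PeerConfig:
    """Hypothetical summary of one peer's quota enforcement settings."""
    name: str
    raid_volume: Optional[str]  # None means in-memory storage
    strict_mode: bool
    ssl_enabled: bool
    key_alias: Optional[str]
    cert_alias: Optional[str]


def check_peer_group(peers: List[PeerConfig]) -> List[str]:
    """Return a list of rule violations; an empty list means no issue found."""
    problems = []
    if len(peers) < 3:
        problems.append("peer group needs at least three peers for failover")
    # All peers must store data in the same place: RAID volume or memory.
    if len({p.raid_volume is None for p in peers}) > 1:
        problems.append("peers mix RAID volume and in-memory storage")
    # Persistent storage must use the raid0 RAID volume.
    if any(p.raid_volume not in (None, "raid0") for p in peers):
        problems.append("persistent storage must use the raid0 RAID volume")
    if len({p.strict_mode for p in peers}) > 1:
        problems.append("peers use different strict mode settings")
    # SSL enablement, key alias, and certificate alias must all match.
    if len({(p.ssl_enabled, p.key_alias, p.cert_alias) for p in peers}) > 1:
        problems.append("peers use different SSL settings")
    return problems
```

A check like this could run against exported settings before you enable the quota enforcement server on each peer. It covers only the rules that are stated in this document.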

To avoid unexpected results in the following situations, follow these guidelines:
  • If you disable the network interface (Ethernet or VLAN) on a peer where quota enforcement is enabled, in-flight transactions can be blocked. When the network interface is disabled on the primary, it can take a long time for replicas to elect the new primary. To avoid these unexpected results before any maintenance routine, switch the peer from primary to replica, quiesce the service, and then disable the network interface.
  • To safely remove a peer from the peer group, make sure that the target peer is a replica. If the target peer is the primary, you must first change the role of a suitable replica to primary by manually running the quota-enforcement-switch-master command. Then, you can remove the target peer (the original primary) by changing the operational state of its quota enforcement server to down. Removing a peer while the operational state of its quota enforcement server is up can affect the result of the failover procedure. In that case, the removed peer is still considered to be a member of the peer group.
  • If a peer starts or restarts without valid peers to connect to, it fails to join or rejoin the existing peer group and becomes a primary on its own. To avoid such failures, add as many valid peers as possible to the Peers list to increase the chance of a successful connection to existing peers.
  • In strict mode, to protect against an out-of-memory condition, carefully configure the IBM DataPower Gateway throttle-threshold. Use a value that is more conservative than the default of 20% to leave more buffer and lower the risk. For more information about the throttle-threshold, see the Configuring throttle settings topic in IBM Documentation.
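
The order of operations for safely removing a peer, as described above, can be written down as a small planning helper. The following Python sketch only produces a human-readable list of steps. The removal_plan function and its parameters are hypothetical, it does not run any command, and it assumes that the caller lists the preferred replacement primary first.

```python
from typing import List


def removal_plan(target: str, primary: str, replicas: List[str]) -> List[str]:
    """Return the ordered steps for removing `target` from the peer group."""
    steps = []
    if target == primary:
        candidates = [r for r in replicas if r != target]
        if not candidates:
            raise ValueError("no replica available to take over as primary")
        # Assumption: the first listed replica is the preferred new primary.
        new_primary = candidates[0]
        steps.append(
            f"switch {new_primary} to primary with quota-enforcement-switch-master"
        )
    elif target not in replicas:
        raise ValueError(f"{target} is not a member of the peer group")
    # The peer is removed only after its role is replica.
    steps.append(
        f"set the quota enforcement server operational state on {target} to down"
    )
    return steps
```

For a replica, the plan contains only the final step. For the primary, the role switch comes first so that the peer is never removed while it still holds the primary role.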

Tips
Two types of timeout apply under different conditions.
  • 10-second timeout
    Failover occurs when both of the following conditions are met:
    1. Two replicas agree that the primary is not reachable after a timeout of 10 seconds. This agreement is how the primary failure is detected.
    2. More than half of all peers in the peer group are reachable.
  • 30-second timeout
    When there is no TCP-level acknowledgment between the primary and replicas, for example because of a network outage, the transactions are terminated after the 30-second I/O timeout.
Normally, a lost connection between the primary and replicas is detected and failover is triggered (after 10 seconds). When the connection is lost, the quota enforcement server currently does not try to reestablish the connection for the same transaction.
  • If packets are dropped between replicas and the primary, the incoming traffic can wait at the GatewayScript action until the I/O timeout (30 seconds) expires. Then, the ratelimit module API call returns with errors.
  • If packets are rejected between replicas and the primary, the ratelimit module API call returns with errors immediately.
In either case, you can check the response arguments from the ratelimit module API to see whether the ratelimit policy was enforced correctly, and then decide what to do next in your GatewayScript file.
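
The decision that follows a failed ratelimit call can be sketched as follows. This example is written in Python for illustration and is not GatewayScript. The RateLimitOutcome type, its fields, and the allow_on_error switch are hypothetical stand-ins and do not reflect the actual response arguments of the ratelimit module API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateLimitOutcome:
    """Hypothetical stand-in for the result of a rate-limit call."""
    error: Optional[str]      # None when the call completed without error
    over_limit: bool = False  # meaningful only when error is None


def next_action(outcome: RateLimitOutcome, allow_on_error: bool) -> str:
    """Decide how to treat the request after the rate-limit call returns."""
    if outcome.error is not None:
        # The policy was not enforced (for example, after the 30-second
        # I/O timeout or an immediate connection rejection).
        return "allow-unmetered" if allow_on_error else "reject"
    return "reject" if outcome.over_limit else "allow"
```

Choosing allow_on_error favors availability over accurate counting, which is a trade-off similar to the one between disabled and enabled strict mode.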

Product: IBM DataPower Gateway (SS9H2Y); Business Unit: Cloud & Data Platform (BU053); Component: General; Platform: Firmware (PF009); Version: 7.5; Edition: Edition Independent; Line of Business: Automation (LOB45)

Document Information

Modified date:
26 October 2021

UID

swg21981525