A deep dive into the new software-defined converged infrastructure for SAS Foundation mixed workloads

Configure IBM ESS, IBM Spectrum Scale, IBM POWER8, and Mellanox to deliver excellent performance for SAS analytics


In the technical brief titled "A new software-defined converged infrastructure for SAS Foundation mixed workloads" you read about this new software-defined converged infrastructure. The technical brief described an architecture and methodology for delivering exceptional performance for a SAS Mixed Analytics workload used for internal testing at SAS. Key elements of the converged infrastructure include IBM® Elastic Storage Server (ESS), an IBM Power® server, and an Ethernet-based storage fabric from Mellanox.

This article describes the purpose, goals, and results of that testing, including the technical details behind the tests, the specifications of the test environment, the test scenarios, and the performance data from those tests. At the end, you will find guidelines for tuning the converged infrastructure to achieve optimal performance.

System architecture, configuration, tuning, and file system creation

Figure 1 illustrates the architecture and configuration used for testing SAS software with IBM Elastic Storage Server and IBM Power E880 server in the lab environment. Figure 2 shows the ESS network configuration.

Figure 1. Solution architecture for SAS on IBM Power server and IBM Elastic Storage Server
Figure 2. ESS network hardware configuration

Configuration

This section describes the detailed configuration for each component of the architecture.

Software

  • SAS 9.4 TS1M3 64-bit
  • IBM AIX 7.2 (7200-00-02-1614)
  • IBM PowerVM® Enterprise Edition
  • Virtual I/O Server (VIOS) 2.2.3.50
  • IBM Spectrum Scale™ (formerly IBM GPFS) 4.2.1.1
  • IBM ESS version 4.5.1
    • Red Hat 7.1
  • MLNX-OS 3.3.6.1002

Network configuration

  • IBM Switch Model: 8831-NF2 (Mellanox SX1710)
  • Mellanox ConnectX-3 40GbE adapters (IBM Feature Code EC3A)
  • 36-port 40GbE/56GbE switch
  • MLNX-OS Version 3.6.1002
  • Global Pause Flow Control enabled
  • TCP/IP only traffic

IBM Power System E880 server configuration

  • Model: 9119-MHE
  • Firmware version: IBM FW830.00 (SC830_048)
  • Processor architecture: POWER8
  • Clock speed: 4356 MHz
  • SMT: OFF, 2, 4, 8 (SMT4 is the default and was used during the benchmark)
  • Cores: 64 (62 cores for the LPARs under test, 2 cores for VIOS)
  • Memory: 512 GB (384 GB for the LPARs under test, 8 GB for VIOS)
  • Internal drives: Twelve 600 GB (for booting VIOS and LPARs)
  • Four expansion drawers, each with one dual-port 40GbE adapter (IBM Feature Code EC3A) in an x16 slot

ESS configuration

  • Model: 5146-GL4
  • Two IBM Power System S822L servers as I/O servers
  • 256 GB of memory (16 x 16 GB DRAM)
  • An IBM Power System S821L server as the xCAT management server
  • An IBM 7042-CR8 Rack-mounted Hardware Management Console (HMC)
  • Storage interface: three LSI 9206-16e Quad-port 6Gbps SAS adapters (A3F2) per I/O server
  • I/O networking: three dual-port 40GbE Mellanox ConnectX-3 adapters (EC3A) per I/O server
  • Adaptive load balancing (ALB) bonding of three Mellanox adapter ports per ESS I/O server
  • Redundant Array of Independent Disks (RAID) controllers: one IBM PCIe IPR SAS adapter per server for the RAID 10 OS boot drive
  • Switches:
    • One 1GbE switch with two VLANs providing two isolated subnets for service and management networks.
    • IBM 8831-NF2 – 40GbE switch, Mellanox model SX1710
  • Four DCS3700 JBOD 60-drive enclosures (1818-80E, 60 drive slots)
    • Each with fifty-eight 2 TB 7.2K NL-SAS HDDs + two 400 GB solid-state drives (SSDs)
  • 16 SAS cables

ESS Spectrum Scale file system creation

The following Spectrum Scale file system parameters were used to create the SASWORK, SASDATA, and SASUTIL application storage space. Various file system block sizes were initially created and tested for performance.

In general, the Spectrum Scale file system block size can be calculated by taking the application's block size and multiplying it by 32. IBM ESS uses GPFS Native RAID (GNR). A simple explanation is that the GNR divides the file system block size into 32 sub-blocks for staging across to the disk subsystem.

With the SAS BUFSIZE of 256 KB, this rule suggests an 8 MB file system block size (256 KB x 32). Testing the SAS workload in the lab environment with various file system block sizes confirmed that an 8 MB or a 16 MB file system block size performed best.
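
After a file system is created, you can verify its block size with the mmlsfs command; the following is a minimal check and assumes the sasdata_8m file system created in the next step:

# mmlsfs sasdata_8m -B

The -B flag reports the block size in bytes (8388608 for the 8 MB block size used here).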

Sample file system create and mount command:

# gssgenvdisks --vdisk-suffix _sasdata_8m --create-vdisk --create-filesystem --filesystem-name sasdata_8m --data-vdisk-size 4000 --data-blocksize 8M
   
# mmmount all
Fri Jun  3 19:21:25 CDT 2016: mmmount: Mounting file systems ...

# df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda3        246G  3.0G  244G   2% /
devtmpfs          60G     0   60G   0% /dev
tmpfs             60G     0   60G   0% /dev/shm
tmpfs             60G   95M   60G   1% /run
tmpfs             60G     0   60G   0% /sys/fs/cgroup
/dev/sda2        497M  156M  341M  32% /boot
/dev/sasdata_1m   16T  264M   16T   1% /gpfs/sasdata_1m
/dev/saswork_1m   16T  264M   16T   1% /gpfs/saswork_1m
/dev/sasutil_1m  7.9T  264M  7.9T   1% /gpfs/sasutil_1m
/dev/sasutil_4m  7.9T  288M  7.9T   1% /gpfs/sasutil_4m
/dev/saswork_4m   16T  288M   16T   1% /gpfs/saswork_4m
/dev/sasdata_4m   16T  288M   16T   1% /gpfs/sasdata_4m
/dev/sasdata_8m   16T  320M   16T   1% /gpfs/sasdata_8m
/dev/sasutil_8m  7.9T  320M  7.9T   1% /gpfs/sasutil_8m
/dev/saswork_8m   16T   16T     0 100% /gpfs/saswork_8m

Workload, test scenarios, and results

This section describes the workload that was used to perform the test, the test scenarios, and detailed results.

Workload

The workload used during the performance validation is a SAS Foundation mixed analytics workload. The workload consists of a mix of analytics jobs that run concurrently. The jobs stress compute, memory, and I/O capabilities of a given IT infrastructure.

The workload consists of 20 individual SAS program tests: ten compute-intensive, two memory-intensive, and eight I/O-intensive. Some of the tests run against existing data stores and some generate their own data during the test run. The tests are a mix of short-running (minutes) and long-running (hours) jobs. The tests are repeated and run concurrently, serially, or both to achieve an average of 20 or 30 concurrently running tests. The 20-test workload consists of 71 total jobs and the 30-test workload consists of 101 total jobs. At peak load, the 30-test workload can employ 55 processors and run I/O-intensive jobs concurrently.

The performance metric for the workload is workload response time (in minutes), which is the cumulative real time of all the jobs in the workload. Lower response time is better. However, other performance metrics, such as processor time (user + system time), server utilization, and I/O throughput were also studied. These metrics were gathered to understand the impact on performance when compression is enabled.

The workload response time (real time) and processor time (user + system time) are captured from the log files of the SAS jobs. These statistics are logged with the SAS FULLSTIMER option. IBM Power Systems™, starting with the IBM POWER7® processor architecture, use Processor Utilization Resource Register (PURR) accounting for accurate reporting of system usage. The PURR factor for POWER8 processors needs to be applied to the processor time metrics described in this document. For more details about the PURR factor, refer to Appendix B of the "SAS Business Analytics deployment on IBM POWER8 processor-based systems with IBM XIV Storage System and IBM FlashSystem" paper.
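
For reference, the timing statistics can be pulled from the SAS job logs after a run. The sketch below is illustrative only: the log directory is a placeholder, FULLSTIMER line layouts can vary by SAS release, and the 0.6 multiplier is the PURR factor reported with the results later in this article.

# Enable full timing statistics in each SAS program or configuration file:
#   options fullstimer;
# After the run, collect the timing lines from all job logs (log path is a placeholder):
grep -h -E "real time|user cpu time|system cpu time" /gpfs/sasdata_8m/logs/*.log
# The summed user + system time is then adjusted with the POWER8 PURR factor,
# for example: adjusted_cpu_minutes = raw_cpu_minutes * 0.6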

Test scenarios

The following scenarios were run as part of the benchmark:

  • Single node tests: 20-test mixed analytics workload
  • Scalability tests: 20- and 30-test mixed analytics workload
  • Tests with Mellanox fabric running at 56GbE speed: 30-test mixed analytics workload

The tests were performed with no competing workload running on either the server or the storage systems. The test team collected workload, host-side, and storage-side performance metrics and compared them between the baseline and final tests.

Test results

This section describes the results for the test scenarios performed.

Single node test: 20-test mixed analytics workload

The configuration of the logical partition (LPAR) used for testing the 20-test mixed analytics workload includes:

  • 16 cores (dedicated mode) running SMT4
  • 96 GB memory
  • One 40GbE port
  • 16 MB block size for Spectrum Scale file systems (SASWORK, SASDATA, and SASUTIL)
Figure 3. Network I/O throughput for a 20-test mixed analytics workload on a single node
Figure 4. Processor utilization for a 20-test mixed analytics workload on a single node

Figure 3 and Figure 4 show the network I/O throughput and processor utilization for the 20-test single node test. Here are the key results:

  • Real time was 1073 minutes and user + system time was 793 minutes (with the PURR factor of 0.6 applied).
  • Average and peak I/O throughput were 2.75 GBps and 4 GBps, respectively.
  • I/O throughput was approximately 175 MBps per core across the 16 cores.
  • Processor utilization was 60% of the 16 cores allocated to the LPAR.

Scalability test: 20-test mixed analytics workload

Scalability testing was performed by linearly scaling the workload and nodes – scaling a 20-test workload from one node (with 20 concurrent tests) to four nodes (with 80 concurrent tests, that is, 20 concurrent tests per node). The back-end storage and compute remained the same as the workload was scaled. The one-, two-, three-, and four-node tests ran a total of 20, 40, 60, and 80 mixed analytics workload tests respectively. Similar scalability testing was performed with a 30-test mixed analytics workload from one node to four nodes.

The configuration of the LPARs used for scalability testing includes:

  • 16 cores (dedicated mode) for two LPARs and 15 cores (dedicated mode) for the other two LPARs
  • SMT4
  • 96 GB memory per LPAR
  • One 40GbE port per LPAR
  • 16 MB block size for Spectrum Scale file systems (SASWORK, SASDATA, and SASUTIL)
Figure 5. Summary of performance metrics for the 20-test scalability testing

Results from the scalability testing with a 20-test workload are summarized in Figure 5. The graphs in Figures 6 through 10 provide the I/O throughput achieved during the testing.

Figure 6. Average and cumulative real time when scaled to four nodes
Figure 7. I/O throughput for 20-test workload on a single node
Figure 8. I/O throughput when a 20-test workload scaled to two nodes (total 40 tests)
Figure 9. I/O throughput when a 20-test workload scaled to three nodes (total 60 tests)
Figure 10. I/O throughput when a 20-test workload scaled to four nodes (total 80 tests)

Tests with Mellanox fabric running at 56GbE speed

This section describes the Mellanox fabric, its configuration in the environment, and the test results.

Mellanox 56GbE fabric with ESS

Using the IBM Power I/O portfolio, you can build a complete end-to-end Mellanox 40GbE fabric for an ESS-based storage solution with Feature Code EC3A/EC3B adapters and the 8831-NF2 switch, connected by EB40-EB42 and EB4A-EB4G cables. Doing this provides a robust, low-latency (approximately 330 ns port-to-port), Ethernet-based TCP/IP storage fabric.

The interesting part of this storage deployment is that the Mellanox switch can run the fabric in a Mellanox-only 56GbE mode. By changing the line speed entry on each port of the switch to 56000, you can gain an additional 40% of bandwidth from the existing networking hardware without any further investment. The cabling used must be capable of running at 56GbE speed, and the adapters automatically negotiate the speed with the switch.

The ports of the network switches, hosts, and clients were tuned to run at 56GbE speed in the lab environment and the tests were repeated to see performance improvements.

With the Mellanox fabric running at 56GbE speed, the following tests were performed to measure the performance benefits compared to 40GbE speeds.

  • I/O tests using the gpfsperf tool
  • 30-test workload – single node as well as scalability testing with four nodes

I/O test results using gpfsperf tool on four nodes

As part of the Spectrum Scale deployment, there are several as-is performance tools, such as gpfsperf and nsdperf, that can be used to help validate system performance. See the Spectrum Scale documentation referred to in the "Additional reading" section of this article for information on these tools. The gpfsperf tool can be used to measure read, write, and mixed read/write I/O performance on Spectrum Scale (GPFS) file systems. The tool was used to measure I/O throughput when the network ports were running at 40GbE and 56GbE speeds, and it was run simultaneously on all four nodes to stress the network and the ESS storage; a sketch of how such a parallel run can be launched follows the sample commands below. Figure 11 shows a comparison of I/O throughput when the ports were running at 40GbE and 56GbE speeds.

Figure 11. Comparison of I/O throughputs achieved using test tool when fabric is running at 40GbE and 56GbE speeds

The tests with the gpfsperf tool showed that, without any additions or upgrades to the existing network infrastructure, the overall read/write (70:30) I/O throughput improved by 8% to 10%. The ESS GL4 storage I/O throughput limits were reached during these tests.

Sample gpfsperf sequential write command:

/usr/lpp/mmfs/samples/perf/gpfsperf create seq /gpfs/sasdata_1m/data/n1aa -r 1m -th $1 -n 3072M &

Sample gpfsperf sequential read command:

/usr/lpp/mmfs/samples/perf/gpfsperf read seq /gpfs/sasdata_1m/data/n1aa -r 1m -th $1 &

Sample gpfsperf sequential read/write command:

/usr/lpp/mmfs/samples/perf/gpfsperf mixrw seq /gpfs/sasdata_1m/data/n1aa -r 1m -th $1 -n 3072M -readratio 70 &
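
To stress the network and the ESS storage from all four client nodes at the same time, the commands above can be launched in parallel over ssh. The wrapper below is only a sketch: the node names and the default thread count of 8 are illustrative assumptions, and each node reads its own test file, following the n1aa naming used above.

#!/bin/sh
# Launch the sequential read test on all four client nodes in parallel (node names are placeholders).
THREADS=${1:-8}
i=1
for node in sasnode1 sasnode2 sasnode3 sasnode4; do
    ssh "$node" "/usr/lpp/mmfs/samples/perf/gpfsperf read seq /gpfs/sasdata_1m/data/n${i}aa -r 1m -th $THREADS" &
    i=$((i + 1))
done
wait    # all four gpfsperf runs finish before the results are compared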

SAS workload test results with 56GbE fabric

The 20-test mixed analytics workload was not network I/O constrained; hence, the 20-test runs (on a single node or on multiple nodes) did not show any performance improvement over the 40GbE results. However, the 30-test runs showed improved performance when the network ports were tuned to run at 56GbE speed, compared with the ports running at 40GbE speed.

  • 5% reduction in real time for a 30-test workload on a single node with ports running at 56GbE speed.
  • 8% reduction in real time when a 30-test workload was run on all four nodes simultaneously (total 120 tests) with ports running at 56GbE speed.
  • Four node test achieved peak I/O throughput of 16 GBps at 56GbE speed compared to 14 GBps at 40GbE speed. The test achieved an average I/O throughput of 12.15 GBps at 56GbE speed compared to 11 GBps at 40GbE speed.

Figure 12 and Figure 13 show the I/O throughput for the 30-test workload at 40GbE and 56GbE speeds.

Figure 12. I/O throughput when a 30-test workload scaled to four nodes (total 120 tests) with 40GbE speeds
Figure 13. I/O throughput when a 30-test workload scaled to four nodes (total 120 tests) with 56GbE speeds

Tuning

This section provides guidance and suggestions on how to tune each aspect of the environment.

Switch tuning

There are five modified switch tuning parameters (a consolidated configuration example follows the list):

  • Flow control
Interface ethernet 1/n flowcontrol receive on force
Interface ethernet 1/n flowcontrol send on force
  • Interface speed
Interface ethernet 1/n speed 56000, where n= port 1-36
  • Interface MTU size
Interface ethernet 1/n mtu 9000, where n= port 1-36
  • LAG configuration tuning if needed
Interface port-channel y flowcontrol receive on force, where y = 1 – max number of LAG groups
Interface port-channel y flowcontrol send on force, where y = 1 – max number of LAG groups
  • LAG load-balancing
port-channel load-balance ethernet source-destination-ip source-destination-mac source-destination-port
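
For reference, the following MLNX-OS CLI session shows how these settings would be applied to a single port and LAG. The port number (1/1), the port-channel number (1), and the prompt are illustrative assumptions; the commands simply mirror the parameter list above:

switch (config) # interface ethernet 1/1 flowcontrol receive on force
switch (config) # interface ethernet 1/1 flowcontrol send on force
switch (config) # interface ethernet 1/1 speed 56000
switch (config) # interface ethernet 1/1 mtu 9000
switch (config) # interface port-channel 1 flowcontrol receive on force
switch (config) # interface port-channel 1 flowcontrol send on force
switch (config) # port-channel load-balance ethernet source-destination-ip source-destination-mac source-destination-port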

For redundancy, the client nodes (LPARs) had dual-port adapters. Because of the PCIe Gen3 x8 bus that the adapter plugs into, the maximum total bandwidth of the adapter is limited to 56GbE. When both increased bandwidth and redundancy are required, the recommendation is to run the switch ports at 56GbE for the bandwidth and to configure the adapter in mode 1 (active/standby bonding) for the redundancy.

Note: The lab environment had a 36-port 40GbE/56GbE switch with four links from the client nodes (LPARs) and seven links from the ESS storage, for a total of 11 ports in use. Customers might not wish to commit a full switch to only 11 ports. Mellanox offers, through IBM Business Partners, a lower port-count switch (MSX-1012B-2BFS, 12 ports) that uses the same MLNX-OS and ASIC and has the same features as the IBM 8831-NF2.

AIX client network tuning parameters

The following operating system network tunables were changed from the AIX default values; a sample chdev invocation that applies these settings is shown after the two listings below. You can find a full list of the lsattr command output and the no -a command output, as well as the Spectrum Scale tunable parameters, in the "Appendix: Tuning parameters" section.

Changes made to AIX SAS client adapter interface en3 versus default adapter settings

     # en3
     mtu           9000          Maximum IP Packet Size for This Device        True
     rfc1323       1             Enable/Disable TCP RFC 1323 Window Scaling    True
     tcp_nodelay   1             Enable/Disable TCP_NODELAY Option             True
     tcp_recvspace 1048576       Set Socket Buffer Space for Receiving         True
     tcp_sendspace 1048576       Set Socket Buffer Space for Sending           True
     thread        on            Enable/Disable thread attribute               True

Changes made to AIX SAS client adapter device ent3 from default

# ent3
jumbo_frames    yes       Request jumbo frames                            True
jumbo_size      9014      Requested jumbo frame size                      True
large_receive   yes       Request Rx TCP segment aggregation              True
large_send      yes       Request Tx TCP segment offload                  True
tx_comp_cnt     2048      Tx completions before hardware notification     True
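
These interface and device settings can be applied with the AIX chdev command. The sketch below assumes the en3/ent3 names used above; the ent3 change is staged with the -P flag so that it takes effect the next time the device is configured (for example, after a reboot):

# Interface-specific network options on en3
chdev -l en3 -a mtu=9000 -a rfc1323=1 -a tcp_nodelay=1 \
      -a tcp_recvspace=1048576 -a tcp_sendspace=1048576 -a thread=on
# Adapter device attributes on ent3 (staged change, applied at the next device configuration)
chdev -l ent3 -a jumbo_frames=yes -a large_receive=yes -a large_send=yes \
      -a tx_comp_cnt=2048 -P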

Comparison of AIX network environment/adapter parameter changes (no -L -F) with the default values

General network parameters

-------------------------------------------------------------------------------------------------
NAME                      CUR    DEF    BOOT   MIN    MAX    UNIT           TYPE     DEPENDENCIES
-------------------------------------------------------------------------------------------------
fasttimo                  100    200    100    50     200    millisecond       D
-------------------------------------------------------------------------------------------------
sb_max                    32M    1M     32M    4K     8E-1   byte              D
-------------------------------------------------------------------------------------------------
##Restricted tunables
poolbuckets               7      1      1      1      20     numeric           D
-------------------------------------------------------------------------------------------------

TCP network tunable parameters

--------------------------------------------------------------------------------
NAME            CUR    DEF    BOOT   MIN    MAX    UNIT      TYPE  DEPENDENCIES
--------------------------------------------------------------------------------
hstcp           1      0      1      0      1      boolean   D
--------------------------------------------------------------------------------
rfc1323         1      0      1      0      1      boolean   C
--------------------------------------------------------------------------------
sack            1      0      1      0      1      boolean   C
--------------------------------------------------------------------------------
tcp_mssdflt     8960   1460   8960   1      64K-1  byte      C
--------------------------------------------------------------------------------
tcp_recvspace   856K   16K    856K   4K     8E-1   byte      C     sb_max
--------------------------------------------------------------------------------
tcp_sendspace   856K   16K    856K   4K     8E-1   byte      C     sb_max
--------------------------------------------------------------------------------

UDP network tunable parameters

NAME            CUR    DEF    BOOT   MIN    MAX    UNIT   TYPE  DEPENDENCIES
--------------------------------------------------------------------------------
udp_recvspace   768K   42080  768K   4K     8E-1   byte   C     sb_max
--------------------------------------------------------------------------------
udp_sendspace   256K   9K     256K   4K     8E-1   byte   C     sb_max
--------------------------------------------------------------------------------
      
n/a means parameter not supported by the current platform or kernel
      
Parameter types:
S = Static: cannot be changed
D = Dynamic: can be freely changed
B = Bosboot: can only be changed using bosboot and reboot
R = Reboot: can only be changed during reboot
C = Connect: changes are only effective for future socket connections
M = Mount: changes are only effective for future mountings
I = Incremental: can only be incremented
      
Value conventions:
K = Kilo: 2^10       G = Giga: 2^30       P = Peta: 2^50
M = Mega: 2^20       T = Tera: 2^40       E = Exa: 2^60
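
The non-default values shown in these tables can be set persistently with the no command. The following is a sketch using the values from the tables (which match the no -a output in the appendix); the restricted tunable poolbuckets is omitted here because changing restricted tunables requires an additional confirmation step:

no -p -o sb_max=33554432 -o fasttimo=100
no -p -o hstcp=1 -o rfc1323=1 -o sack=1 -o tcp_mssdflt=8960
no -p -o tcp_recvspace=876544 -o tcp_sendspace=876544
no -p -o udp_recvspace=786432 -o udp_sendspace=262144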

Note: The auto-negotiated port speed change is achieved by power cycling the LPAR after the attached switch port has been changed to a speed of 56000.

ESS Linux I/O server adapter bonding changes

After extensive testing, the ESS network adapter bond0 parameters were changed from LACP to ALB.

# vi /etc/sysconfig/network-scripts/ifcfg-bond-bond0
BONDING_OPTS="miimon=100 mode=balance-alb xmit_hash_policy=layer3+4"
MTU=9000
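
For the new bonding mode to take effect, the bond interface has to be reinitialized. The exact activation step is not documented here; on Red Hat Enterprise Linux 7, a typical (disruptive) approach is to restart the network service during a maintenance window:

# systemctl restart network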

ESS Linux I/O server network tuning parameters

The following operating system network tunable parameters were changed from the default values for the Linux ESS I/O Network Shared Disk (NSD) servers.

ppc64_cpu --smt=2
ethtool -G enP4p1s0 rx 8192 tx 8192
ethtool -G enP9p1s0 rx 8192 tx 8192
ethtool -G enp1s0 rx 8192 tx 8192
mlnx_tune -r -c
ethtool -K enP9p1s0d1 tx-nocache-copy off
ethtool -K enP4p1s0d1 tx-nocache-copy off
ethtool -K enp1s0d1 tx-nocache-copy off

Note: ESS node network tunables were already pre-set/tuned as part of the ESS installation process.

ESS Spectrum Scale tuning parameters

The following Spectrum Scale cluster tunables were changed from the default values and used for the mixed AIX/Linux GPFS cluster. Beginning with Spectrum Scale 4.2.0.3, a period "." in the first column means that the parameter was changed by the workerThreads parameter. This is sometimes called the autotune feature, where changing the one parameter workerThreads will also cause other tunables to be automatically changed from the default.

The Spectrum Scale tunables that were changed (listed from most to least performance significance) were pagepool, workerThreads, prefetchPct, maxFilesToCache, maxblocksize, and maxMBpS. These tunables provide the most significant performance gains for SAS Mixed Analytics workloads. In general, the most important Spectrum Scale tunable for SAS workloads is pagepool: increasing pagepool on the client nodes provided the largest performance improvement of any Spectrum Scale tunable in the initial environment testing. Based on follow-on ESS GL4 testing in other environments, we predict that performance would have improved by 5% to 10% over the numbers reported in this article if pagepool had been increased from 32 GB to 64 GB on the client nodes.

Thus, the parameters listed below are deemed to be the more significant tunable changes to focus on first; a sample mmchconfig invocation for the client-node values follows the two lists. Note that for the ESS, many of the default configuration values are changed as part of the ESS installation process; the ESS is highly optimized and required few tuning changes for our testing. For example, the pagepool of the ESS nodes is at a maximum size of 72 GB by default. See the "Appendix: Tuning parameters" section for a full list of Spectrum Scale configuration tunables.

Client nodes running AIX:

  • maxblocksize 16777216
  • maxFilesToCache 50000
  • maxMBpS 24000
  • pagepool 34359738368
  • prefetchPct 40
  • workerThreads 1024

ESS/Linux nodes:

  • maxblocksize 16777216
  • maxFilesToCache 50000
  • maxMBpS 24000
  • prefetchPct 40
  • seqDiscardThreshhold 1073741824
  • workerThreads 1024

Note: Many of the non-default parameters are already set by the ESS installation process with Spectrum Scale performance scripts.
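
As a reference for how such values are applied, the sketch below uses the mmchconfig command with the client-node values listed above. The node class name sasclients is an illustrative assumption, and some tunables (for example, pagepool and maxblocksize) require the GPFS daemon to be recycled before the new values take effect:

# Apply the client-node tunables to the AIX SAS client nodes (node class name is a placeholder)
mmchconfig maxblocksize=16M,maxFilesToCache=50000,maxMBpS=24000 -N sasclients
mmchconfig pagepool=32G,prefetchPct=40,workerThreads=1024 -N sasclients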

Summary

IBM and Mellanox have delivered an effective and relatively inexpensive Ethernet and JBOD nearline disk storage solution for SAS workloads, with performance comparable to more expensive mid-tier Fibre Channel attached flash storage. The AIX LPARs on the IBM POWER8 processor-based server used as the SAS clients proved to be powerful work engines for this successful proof of concept (POC). The Mellanox-only high-speed Ethernet storage network was crucial in delivering the full ESS I/O throughput, and it allowed the 40GbE fabric to run at 56GbE at no additional cost beyond configuring the switch ports to 56GbE and letting the 40GbE adapters auto-negotiate to 56GbE, enabled by the 56GbE-capable cables available in the IBM portfolio.

The performance metrics collected from this SAS Mixed Analytics workload proof of concept demonstrate that the solution achieved the full potential of the IBM Elastic Storage Server, the Mellanox network, and the Power E880 server with the SAS application. The cross-discipline team achieved excellent performance by working together to tune and optimize all parts of the system, combining server, network, storage, and application expertise. This combination, in addition to the industry-leading hardware and software, was the key to the success of this POC.

Additional reading

IBM and SAS white papers

IBM Power Systems

IBM storage solutions

Networking

Appendix: Tuning parameters

The following operating system network tunable parameters were used for the AIX SAS clients. The lsattr AIX command can be used to display the attribute characteristics and the possible values of the attributes for a specific device. For example:

lsattr -El ent3

Note: Interface-specific tunables, such as rfc1323 enabled on the interface, take precedence over the corresponding parameters set with the no command.

[root@brazos06]> # lsattr -El ent3
alt_addr        0x000000000000     Alternate Ethernet address                       True
bar0            0x88100000         Bus memory address 0                             False
bar1            0x80000000         Bus memory address 1                             False
bar2            0x88000000         Bus memory address 2                             False
chksum_offload  yes                Request checksum offload                         True
delay_open      no                 Delay open until link state is known             True
devid           0xb31503101410b504 Device ID                                        False
eeh_cfgsp_delay 999                EEH config space delay (miliseconds)             False
eeh_reset_delay 5                  EEH reset delay (seconds)                        False
flow_ctrl       yes                Request flow control                             True
flow_ctrl_rx    yes                Receive pause frames                             True
flow_ctrl_tx    yes                Transmit pause frames                            True
intr_cnt        10                 Interrupt event coalesce counter                 True
intr_priority   3                  Interrupt priority                               False
intr_time       5                  Interrupt event coalesce timer (microseconds)    True
ipv6_offload    yes                Request IPV6 stateless offloads                  True
jumbo_frames    yes                Request jumbo frames                             True
jumbo_size      9014               Requested jumbo frame size                       True
large_receive   yes                Request Rx TCP segment aggregation               True
large_send      yes                Request Tx TCP segment offload                   True
link_delay_mode logging            Link status delay mode                           True
link_delay_time 5                  Link status delay timer (seconds)                True
lro_threshold   2                  Rx TCP segment aggregation minimum pkt threshold True
media_speed     40000_Full_Duplex  Requested Media speed                            False
queue_pairs     8                  Requested number of queue pairs                  True
queues_rdma     1                  Requested number of RDMA event queues            True
rdma            desired            Request RDMA                                     True
rom_mem         0x0                ROM memory address                               False
rsp_comp_cnt    128                RSP Completions Before Hardware Notification     True
rsp_limit       1000               Response queue entries processed per interrupt   True
rsp_max_events  512                Max RSP events that can be received              True
rx_buffer_low   90                 Rx queue buffer replenish threshold              True
rx_chain        16                 Rx packets chained for stack processing          True
rx_comp_limit   128                Response queue entries processed per interrupt   True
rx_max_pkts     2048               Rx queue maximum packet count                    True
rx_notify_cnt   128                Rx packets per Rx complete notification          True
rx_send_cnt     8                  Rx Immediate Data mode                           True
systrc_enable   no                 Enable config debug tracing                      True
timer_eeh       1                  EEH event poll timer (seconds)                   True
timer_error     1                  Error poll timer (seconds)                       True
timer_link      1                  Link poll timer (seconds)                        True
timer_stats     0                  Statistics poll timer (seconds)                  True
tx_comp_cnt     2048               Tx completions before hardware notification      True
tx_comp_limit   1                  Tx completions processed per event               False
tx_free_delay   no                 Delay free of Tx packet mbufs                    True
tx_limit        1024               Tx packets sent per transmit thread              True
tx_max_pkts     1024               Tx queue maximum packet count                    True
tx_notify_cnt   64                 Tx packets per Tx complete notification          True
tx_swq_max_pkts 8192               Software Tx queue maximum packet count           True
use_alt_addr    no                 Request alternate Ethernet address               True
vpd_missing     no                 VPD is not present                               True

You can use the no AIX command to manage the tuning parameters of the network. For example:

no -a output

arpqsize = 1024
arpt_killc = 20
arptab_bsiz = 7
arptab_nb = 149
bcastping = 0
bsd_loglevel = 3
clean_partial_conns = 0
delayack = 0
delayackports = {}
dgd_flush_cached_route = 0
dgd_packets_lost = 3
dgd_ping_time = 5
dgd_retry_time = 5
directed_broadcast = 0
fasttimo = 100
hstcp = 1
icmp6_errmsg_rate = 10
icmpaddressmask = 0
ie5_old_multicast_mapping = 0
ifsize = 256
igmpv2_deliver = 0
init_high_wat = 0
ip6_defttl = 64
ip6_prune = 1
ip6forwarding = 0
ip6srcrouteforward = 1
ip_ifdelete_notify = 0
ip_nfrag = 200
ipforwarding = 0
ipfragttl = 2
ipignoreredirects = 0
ipqmaxlen = 100
ipsendredirects = 1
ipsrcrouteforward = 1
ipsrcrouterecv = 0
ipsrcroutesend = 1
limited_ss = 0
llsleep_timeout = 3
lo_perf = 1
lowthresh = 90
main_if6 = 0
main_site6 = 0
maxnip6q = 20
maxttl = 255
medthresh = 95
mpr_policy = 1
multi_homed = 1
nbc_limit = 12582912
nbc_max_cache = 131072
nbc_min_cache = 1
nbc_ofile_hashsz = 12841
nbc_pseg = 0
nbc_pseg_limit = 25165824
ndd_event_name = {all}
ndd_event_tracing = 0
ndogthreads = 0
ndp_mmaxtries = 3
ndp_umaxtries = 3
ndpqsize = 50
ndpt_down = 3
ndpt_keep = 120
ndpt_probe = 5
ndpt_reachable = 30
ndpt_retrans = 1
net_buf_size = {all}
net_buf_type = {all}
net_malloc_frag_mask = {0}
netm_page_promote = 1
nonlocsrcroute = 0
nstrpush = 8
passive_dgd = 0
pmtu_default_age = 10
pmtu_expire = 10
pmtu_rediscover_interval = 30
psebufcalls = 20
psecache = 1
psetimers = 20
rfc1122addrchk = 0
rfc1323 = 1
rfc2414 = 1
route_expire = 1
routerevalidate = 0
rtentry_lock_complex = 1
rto_high = 64
rto_length = 13
rto_limit = 7
rto_low = 1
sack = 1
sb_max = 33554432
send_file_duration = 300
site6_index = 0
sockthresh = 85
sodebug = 0
sodebug_env = 0
somaxconn = 1024
strctlsz = 1024
strmsgsz = 0
strthresh = 85
strturncnt = 15
subnetsarelocal = 1
tcp_bad_port_limit = 0
tcp_cwnd_modified = 0
tcp_ecn = 0
tcp_ephemeral_high = 65535
tcp_ephemeral_low = 32768
tcp_fastlo = 0
tcp_fastlo_crosswpar = 0
tcp_finwait2 = 1200
tcp_icmpsecure = 0
tcp_init_window = 0
tcp_inpcb_hashtab_siz = 24499
tcp_keepcnt = 8
tcp_keepidle = 14400
tcp_keepinit = 150
tcp_keepintvl = 150
tcp_limited_transmit = 1
tcp_low_rto = 0
tcp_maxburst = 0
tcp_mssdflt = 8960
tcp_nagle_limit = 65535
tcp_nagleoverride = 0
tcp_ndebug = 100
tcp_newreno = 1
tcp_nodelayack = 1
tcp_pmtu_discover = 1
tcp_recvspace = 876544
tcp_sendspace = 876544
tcp_tcpsecure = 0
tcp_timewait = 1
tcp_ttl = 60
tcprexmtthresh = 3
tcptr_enable = 0
thewall = 50331648
timer_wheel_tick = 0
tn_filter = 1
udp_bad_port_limit = 0
udp_ephemeral_high = 65535
udp_ephemeral_low = 32768
udp_inpcb_hashtab_siz = 24499
udp_pmtu_discover = 1
udp_recv_perf = 0
udp_recvspace = 786432
udp_sendspace = 262144
udp_ttl = 30
udpcksum = 1
use_sndbufpool = 1

Spectrum Scale tuning parameters

The following Spectrum Scale cluster tunables for the mixed AIX/Linux GPFS cluster are listed as a reference. The tunables that were changed from the default values are indicated by the "!" mark before the parameter. Recent versions of Spectrum Scale have the autotune feature, where changing workerThreads also causes other tunables to be automatically changed from their defaults. The parameters called out in the "Tuning" section are deemed to be the more significant tunable changes to focus on first. Note that for the ESS, many of the default configuration values are changed as part of the ESS installation process; the ESS is highly optimized and required few tuning changes.

AIX nodes

 ! ccrEnabled 0
 ! cipherList AUTHONLY
 ! deadlockDataCollectionDailyLimit 10
 ! deadlockDetectionThreshold 0
 ! dmapiFileHandleSize 32
 ! expelDataCollectionDailyLimit 10
 ! logBufferCount 20
 ! logWrapThreads 128
 ! maxblocksize 16777216
 ! maxBufferDescs 32768
 ! maxFilesToCache 50000
 ! maxMBpS 24000
 ! maxReceiverThreads 128
 ! maxStatCache 10000
 ! minReleaseLevel 1502
 ! pagepool 34359738368
 ! prefetchPct 40
 ! scatterBuffers 0
 ! seqDiscardThreshhold 1073741824
 ! socketMaxListenConnections 512
 ! worker1Threads 1024
 ! workerThreads 1024

ESS/Linux nodes

Note: Many of these non-default parameters are already set by the ESS installation process.

 ! ccrEnabled 0
 ! cipherList AUTHONLY
 ! deadlockDataCollectionDailyLimit 10
 ! deadlockDetectionThreshold 0
 ! dmapiFileHandleSize 32
 ! envVar MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 MLX5_USE_MUTEX 1
 ! expelDataCollectionDailyLimit 10
 ! flushedDataTarget 1024
 ! flushedInodeTarget 1024
 ! ioHistorySize 65536
 ! logBufferCount 20
 ! logWrapAmountPct 10
 ! logWrapThreads 128
 ! maxAllocRegionsPerNode 32
 ! maxBackgroundDeletionThreads 16
 ! maxblocksize 16777216
 ! maxBufferCleaners 1024
 ! maxBufferDescs 2097152
 ! maxFileCleaners 1024
 ! maxFilesToCache 50000
 ! maxGeneralThreads 1280
 ! maxInodeDeallocPrefetch 128
 ! maxMBpS 24000
 ! maxReceiverThreads 128
 ! maxStatCache 10000
 ! minReleaseLevel 1502
 ! myNodeConfigNumber 1
 ! nsdClientCksumTypeLocal NsdCksum_Ck64
 ! nsdClientCksumTypeRemote NsdCksum_Ck64
 ! nsdInlineWriteMax 32768
 ! nsdMaxWorkerThreads 3072
 ! nsdMinWorkerThreads 3072
 ! nsdMultiQueue 512
 ! nsdRAIDBlockDeviceMaxSectorsKB 8192
 ! nsdRAIDBlockDeviceNrRequests 32
 ! nsdRAIDBlockDeviceQueueDepth 16
 ! nsdRAIDBlockDeviceScheduler deadline
 ! nsdRAIDBufferPoolSizePct (% of PagePool) 80
 ! nsdRAIDEventLogToConsole all
 ! nsdRAIDFastWriteFSDataLimit 262144
 ! nsdRAIDFastWriteFSMetadataLimit 1048576
 ! nsdRAIDFlusherBuffersLimitPct 80
 ! nsdRAIDFlusherBuffersLowWatermarkPct 20
 ! nsdRAIDFlusherFWLogHighWatermarkMB 1000
 ! nsdRAIDFlusherFWLogLimitMB 5000
 ! nsdRAIDFlusherThreadsHighWatermark 512
 ! nsdRAIDFlusherThreadsLowWatermark 1
 ! nsdRAIDFlusherTracksLimitPct 80
 ! nsdRAIDFlusherTracksLowWatermarkPct 20
 ! nsdRAIDMaxTransientStale2FT 1
 ! nsdRAIDMaxTransientStale3FT 1
 ! nsdRAIDReconstructAggressiveness 1
 ! nsdRAIDSmallBufferSize 262144
 ! nsdRAIDSmallThreadRatio 2
 ! nsdRAIDThreadsPerQueue 16
 ! nsdRAIDTracks 131072
 ! nspdQueues 64
 ! numaMemoryInterleave yes
 ! pagepool 76168560640
 ! prefetchPct 40
 ! prefetchThreads 341
 ! scatterBuffers 0
 ! scatterBufferSize 262144
 ! seqDiscardThreshhold 1073741824
 ! socketMaxListenConnections 512
 ! syncWorkerThreads 256
 ! worker1Threads 1024
 ! worker3Threads 32
 ! workerThreads 1024
