Turbocharging WebSphere Commerce web sites with POWER7 technology

The retail industry recognizes that a significant shift is occurring in consumer shopping behavior. IBM’s recent Institute for Business Value study, "Meeting the Demands of the Smarter Consumer", found that shoppers demand to interact with retailers in a way that is both relevant and timely. IBM® WebSphere® Commerce provides software for developing and delivering high-quality, best-of-breed web sites. Retailers are concerned about the reliability and performance of their web presence, because consumers expect access 24 hours a day, 7 days a week, and meeting that expectation depends on the underlying hardware infrastructure. This article discusses the architecture of the POWER7® chip, details key characteristics of the WebSphere Commerce workload, and explains how POWER7 technology provides key advantages in meeting the needs of the smarter consumer.

Mikhail Genkin (genkin@ca.ibm.com), Performance Architect, WebSphere Commerce, IBM

Mikhail Genkin is the Performance Architect for WebSphere Commerce. He has 15 years of software development experience and has contributed to many IBM products including WebSphere Commerce, WebSphere Process Server, WebSphere Integration Developer, Rational Application Developer, and Visual Age for Java-Enterprise Edition.



Boyd Dimmock (bkd@us.ibm.com), Distinguished Engineer, CTO for Distribution Industry, IBM

Boyd Dimmock is a Distinguished Engineer and the Systems and Technology Group (STG) Chief Technology Officer for the Distribution Industry. She works worldwide and across STG brands to differentiate systems value into business solutions, including distributed store environments as well as web commerce and enterprise analytics. She has 37 years of experience developing and marketing retail systems and application software.



20 April 2011


Introduction

IBM brings a solution to this space that has been finely tuned to deliver a seamless and branded shopping experience across all channels, including digital and physical touch points within each channel. WebSphere Commerce V7 drives improved customer loyalty and increased shopping cart sizes by delivering rich, personalized, and contextually relevant content at each stage of the shopping experience. The software infrastructure is based on service-oriented architecture (SOA) with WebSphere Application Server (hereafter called Application Server) and DB2® as its underlying solution components.

The architectural base of Application Server and DB2 delivers optimized performance and scalability on IBM’s POWER7 systems. The solution is based on software and hardware architectures that deliver the speed of data access needed for near real-time, interactive use of information. Shoppers benefit directly: the new hardware has reduced query times by up to about 50%, based on the high performance of the Power servers. POWER7 brings unique virtualization, advanced memory management, breakthrough workload support, and world-class availability to this solution. These characteristics are explained below.

Understanding POWER7 architecture

This section uses P750 and P780 systems as examples when discussing POWER7 chip architecture and capabilities.

POWER7 chip and cache differentiation

The POWER7 chip architecture continues IBM’s differentiation by building upon the performance leadership of POWER6®. POWER7 provides increased processor core density per chip or socket, improved multithreading support, and improved core memory bandwidth (discussed below). This chip design results in increased performance when compared to POWER6.

The availability of multicore and dynamic threading allows POWER7 to support a large number of Java Virtual Machines (JVMs) running the WebSphere Commerce Server application. The POWER7 chip features up to 8 processor cores with a 4-way SMT in each core. This is equivalent to 32 logical processors on a single chip or socket. The Power 750 Express is a one- to four-socket server, with up to 32 cores that can deliver up to 128 simultaneous compute threads.
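The thread counts quoted above follow from simple multiplication. The short sketch below reproduces them; the function name and its parameters are illustrative and do not come from any IBM tool.

```python
def logical_processors(sockets: int, cores_per_socket: int, smt_threads: int) -> int:
    """Return the number of hardware threads the operating system can schedule on."""
    return sockets * cores_per_socket * smt_threads


# One POWER7 chip: up to 8 cores, 4-way SMT per core.
single_chip = logical_processors(sockets=1, cores_per_socket=8, smt_threads=4)

# A fully populated Power 750 Express: four sockets of 8 cores each.
power_750_max = logical_processors(sockets=4, cores_per_socket=8, smt_threads=4)

print(single_chip)    # 32
print(power_750_max)  # 128
```

The same function can be used with 4-core or 6-core sockets to see how many hardware threads a smaller configuration exposes.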

Because of the latency difference between main memory and on-chip cache, POWER7 was designed with three levels of on-chip cache (see Figure 1). The chip includes 32 MB of on-chip L3 cache implemented in embedded Dynamic Random Access Memory (eDRAM), instead of the off-chip L3 cache that was used with all the prior dual-core Power chips. The POWER7 chip also has two dual-channel DDR3 memory controllers implemented on the chip, which together deliver 100 GB/sec of sustained bandwidth per chip. This provides a significant advantage for workloads that use the cache heavily.

Figure 1. Architecture of the POWER7 chip

POWER7 virtualization and simultaneous thread support

PowerVM is built into the hardware and provides higher performance, more scalability, and higher resource utilization than the leading Intel® virtualization platforms. PowerVM offers the capability to dynamically adjust system resources based on workload demands so that each partition gets the resources it needs. Logical partitions (LPARs) allow you to run multiple operating system instances on the same system without interference. Micro-partitioning allows an LPAR to share processors with other partitions, which delivers tremendous flexibility when planning web site deployments. PowerVM can adjust the use of CPU and memory among different applications without interruption, giving the retailer flexibility in scaling to variable business requirements.

POWER7 provides a technology breakthrough and can run a larger application workload on the same sized machine (with the same number of sockets or cores). This drives better value from the retailer’s hardware investment by allowing higher Central Processing Unit (CPU) utilization, based on new Simultaneous Multi-Threading (SMT) support. SMT is a processor technology that allows multiple threads to issue instructions in each cycle. SMT permits all thread instances to simultaneously compete for and share processor resources. SMT4, newly supported in POWER7, provides more hardware-supported threads than other solutions and therefore supports more concurrent instances of the application. AIX takes advantage of the additional threads based on its understanding of the application. Application Server V7 also has optimizations that take advantage of the SMT4 capability, so applications built on Application Server V7 (for example, WebSphere Commerce V7) can receive the SMT4 advantage without modification.

Memory advancements

Memory advancements bring value to this POWER7 solution by providing more in-memory data for the software. Active Memory Expansion is a new POWER7 technology that enables the effective maximum memory capacity to be larger than the true physical memory. Innovative compression and decompression of memory content enables memory expansion of up to 100 percent. This enables an application partition to do significantly more work, or enables a server to run more partitions with the same physical amount of memory. Utilizing Active Memory Expansion can improve system utilization and increase a system’s throughput.
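As a rough illustration of what "expansion up to 100 percent" means for capacity planning, the sketch below computes the effective memory for a given expansion percentage. The function name and the 64 GB figure are hypothetical, and the expansion actually achieved depends on how compressible the workload's memory content is.

```python
def effective_memory_gb(physical_gb: float, expansion_percent: float) -> float:
    """Effective capacity seen by the partition for a given expansion percentage.

    expansion_percent is limited to 0-100, the range quoted in the article.
    """
    if not 0 <= expansion_percent <= 100:
        raise ValueError("expansion_percent must be between 0 and 100")
    return physical_gb * (1 + expansion_percent / 100)


# Hypothetical partition with 64 GB of physical memory.
for percent in (0, 50, 100):
    print(percent, effective_memory_gb(64, percent))
```

At the 100 percent upper bound, the partition sees twice its physical memory.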

POWER7 memory architecture uses high reliability, availability, and serviceability (RAS), high performance, and low power consumption memory. The P750 and P780 use 1066 Dynamic Random Access Memory (DRAM) bus rate technology. The DRAM interface is double-ported to provide double the bandwidth of POWER6. Spare DRAM and selective mirroring provide increased memory RAS.

When a similarly powerful configuration is built on Intel platforms, the scenario creates more images for customers to manage. Power solutions have the advantage of simplified system management.

Availability and serviceability of the platform

Downtime can cause a serious impact to business continuity, and for Smarter Commerce, an unavailable commerce platform quickly impacts the retailer’s revenue. Sales will go down if the web site is unavailable. POWER7 brings a highly reliable platform to the solution. The key to the solution is the option of high availability configurations based on Power virtualization, which allows applications to automatically be moved off of a failing machine to a backup.

The retailer achieves reduced operational cost related to maintenance, space allocation, and power consumption. With midrange and high end POWER7 systems, concurrent maintenance support enables continuous application availability. Concurrent maintenance support allows fixes to be applied without taking the systems down. AIX supports hot kernel patches. This capability helps the retailer keep their systems up all the time.

Compatibility

It is important to know that the new capabilities of the POWER7 processor are utilized by Application Server V7.x, so moving to this application environment provides advantages over preserving binary versions of earlier existing applications. This is the preferred environment for WebSphere Commerce.

However, POWER7 does support compatibility modes, which allow applications that run on POWER5® or POWER6 processors to run unmodified on POWER7. This means that the code for these applications does not have to be changed or recompiled to run in compatibility mode. This provides a smooth transition path from older systems to the latest platform and minimizes the costs of migration to a new system. Compatibility mode is significant in that it allows legacy applications to be preserved where necessary, which might be the case where thousands of applications are supported by the deployment.

Also, although logical partitions that use the earlier processor compatibility modes can run on POWER7 servers, a POWER7 processor does not emulate all features of a POWER6 or a POWER5 processor. For example, certain types of performance monitoring might not be available for a logical partition if the current processor compatibility mode of a logical partition is set to the POWER5 mode.


Industry trends

Despite challenging economic times, web retail has been experiencing significant growth. Many high volume WebSphere Commerce customers are expecting significant (up to 20%) compound growth in order volume over two years (2011 and 2012). Web retail is also becoming more sophisticated and targeted, extensively leveraging promotions and marketing campaigns to entice online shoppers. To accomplish this, retailers are leveraging a rich set of features available in WebSphere Commerce. They are also integrating their WebSphere Commerce deployments with a variety of external systems, typically those providing up-to-the-minute pricing and inventory information.

The following numbers are indicative of overall direction for web retail (numbers taken from actual customer deployments):

  • For a web-intensive retailer, the highest order rate achieved by a WebSphere Commerce site was about 850 orders per minute in 2009. In 2012, it is anticipated to be 1,100 orders per minute.
  • In 2009, the highest order rate was achieved when about 9,000 shoppers were concurrently shopping on the site.
  • In 2009, the largest WebSphere Commerce deployment involved about 100 JVMs. In 2010, this number was up to 150. In 2011, it is anticipated to involve 196 JVMs.
  • In 2009, the largest WebSphere Commerce catalog in use in the field was about 7,000,000 SKUs. In 2010, this was over 10,000,000 SKUs. In 2011, this number is anticipated to grow to over 30,000,000 SKUs.
  • In 2010, the largest WebSphere Commerce database observed in the field was about 1 TB.
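To compare these trends on a common basis, the figures above can be converted into annualized growth rates. The sketch below does this with the standard compound-growth formula; the function name is illustrative, and the inputs are the first and last data points from the list.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two observations `years` apart."""
    return (end / start) ** (1 / years) - 1


# (label, first value, last value, years between them), taken from the list above.
trends = [
    ("peak orders per minute, 2009 to 2012", 850, 1_100, 3),
    ("JVMs in largest deployment, 2009 to 2011", 100, 196, 2),
    ("SKUs in largest catalog, 2009 to 2011", 7_000_000, 30_000_000, 2),
]

for label, start, end, years in trends:
    rate = annual_growth_rate(start, end, years)
    print(f"{label}: {rate:.1%} per year")
```

On these numbers, deployment size and catalog size are growing considerably faster than the peak order rate, which is consistent with the concern about operational cost raised below.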

This growth not only increases the overall footprint of the WebSphere Commerce deployment, as expressed by the number of required cores and the number of required WebSphere Commerce JVMs, but also poses new types of performance challenges:

  • Increasing operational costs due to increasing numbers of JVMs.
  • Spikes in workload triggered by promotions and sales events.
  • Rapid changes in CPU, network, and disk I/O triggered by external systems.

Below, we will examine the typical WebSphere Commerce workload and then discuss how POWER7 features can help resolve many of these challenges.


Understanding the WebSphere Commerce workload

WebSphere Commerce is a J2EE application that is deployed on and runs in Application Server. WebSphere Commerce employs the standard three logical tier application architecture:

  • The HTTP tier is typically implemented using the IBM HTTP Server (IHS). IHS hosts the Application Server HTTP plug-in. The HTTP plug-in performs the second level of load balancing for the Application Server and WebSphere Commerce server cluster. It also performs an important caching function: caching static content, such as images, closer to the edge of the network.
  • The application tier is the Application Server and WebSphere Commerce server cluster. For WebSphere Commerce sites, it is here that most of the CPU-intensive computing happens.
  • The database tier also plays a key role in the overall WebSphere Commerce performance equation. For WebSphere Commerce sites, computation on the database tier is typically not CPU intensive, but involves high rates of disk I/O.

Another aspect to consider is that the conventional view of a workload focuses on the steady-state characteristics of the system. Real-world WebSphere Commerce workloads tend to be dynamic. Consider the case of a major Internet retailer preparing for a major sales event. In preparation for the sales event, the retailer clears the content of the cache, recycles the WebSphere Commerce JVMs, loads a sales catalog containing items offered only for the duration of the event, and activates promotions offering discounts and gifts.

Promotions include those commonly called “door-crasher specials”, which are promotions providing special discounts to the first online shoppers to enter the site after the start of the event. Large numbers of shoppers attempt to log on to the retailer’s site to take advantage of the door-crasher specials. As a result, at the start of the sales event, the retailer’s systems experience a tremendous spike in volume, which places tremendous stress on the relatively cold system.

We will look at WebSphere Commerce workloads and at the characteristics that the underlying hardware infrastructure needs in order to deliver top performance for the site. We will focus on the application and database tiers. The simpler HTTP tier is out of scope for the purposes of this article.

The application tier

Figure 2 shows a logical view of a typical WebSphere Commerce cluster. A cluster of WebSphere Commerce JVMs retrieves data from and stores data in a single database instance. Although WebSphere Commerce is an Online Transaction Processing (OLTP) application, typically over 90% of database access operations are "reads" performed as shoppers browse the catalog on the web site. Caching on the application tier plays a pivotal role in WebSphere Commerce deployments. It reduces roundtrips to the database and alleviates database input/output (I/O) bottlenecks. Extensive use of caching strongly influences the demands that WebSphere Commerce makes on the hardware infrastructure on the application tier.

Figure 2. Logical view of a typical WebSphere Commerce deployment

When examining the contents of each WebSphere Commerce JVM heap, you find that 60% to 80% of the heap is occupied by long-lived (tenured) objects, while 20% to 40% of the heap is occupied by relatively short-lived objects, typically created in the JVM nursery. The longer-lived objects are primarily cacheable objects stored in the Application Server dynacache: JSPs, JSP page fragments, commands extending the WebSphere Command Framework classes, and distributed maps.

For a typical WebSphere Commerce customer, the size of the cache is about 6 GB. For some customers, the size of the cache can reach up to 20 GB. Today, most WebSphere Commerce deployments use the 32-bit version of the JVM. This JVM is a bit more performant than the 64-bit version, but has a limitation on the maximum size of the JVM heap. This limit (the -Xmx parameter) varies a bit for each platform, but is generally about 2 GB. This means that most of the cache has to be stored outside of the JVM.

To help with this problem, dynacache provides a feature called "disk offload". When the cache gets too large to fit into the JVM heap, dynacache disk offload writes the contents of the cache out to a file on disk. In a well-tuned site, this disk offload file fits entirely into the file system cache, and is served out of RAM. In those cases where the disk offload file does not fit into RAM, WebSphere Commerce needs to perform a large number of disk I/O operations. WebSphere Commerce workloads on the application tier generally benefit from fast access to RAM and disk.
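The figures above give a feel for how much of the cache ends up outside the JVM. The sketch below is a back-of-the-envelope estimate only: it assumes, as an upper bound, that the entire tenured portion of a single JVM's heap holds cache entries, and the function name is illustrative rather than part of any dynacache API.

```python
def cache_split_gb(cache_gb: float, heap_gb: float, tenured_fraction: float):
    """Estimate how much of the cache fits in the heap and how much is offloaded.

    Assumes the whole tenured space is available for cache entries, which
    overstates the in-heap share in practice.
    """
    in_heap = min(cache_gb, heap_gb * tenured_fraction)
    offloaded = cache_gb - in_heap
    return in_heap, offloaded


# Typical figures from the text: 6 GB cache, about 2 GB maximum heap,
# 60% to 80% of the heap occupied by tenured objects.
for fraction in (0.6, 0.8):
    in_heap, offloaded = cache_split_gb(cache_gb=6, heap_gb=2, tenured_fraction=fraction)
    print(f"tenured {fraction:.0%}: {in_heap:.1f} GB in heap, {offloaded:.1f} GB offloaded")
```

Even under this generous assumption, well over 4 GB of a 6 GB cache is served from the disk offload file, which is why file system cache and RAM speed matter so much on this tier.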

Another important aspect common to all WebSphere Commerce workloads is a high degree of parallelism. During key sales events, high-volume WebSphere Commerce sites can experience up to 10,000 concurrent shoppers browsing and placing orders. The Application Server web container uses a thread pool to service the large number of concurrent requests. Typically, under high-volume conditions, each WebSphere Commerce JVM executes between 30 and 50 web container threads simultaneously.

In addition to the web container threads, WebSphere Commerce has a scheduler feature and a number of utilities (such as dataload and stagingprop), which are executing in parallel. These features typically use up to 30 additional threads per JVM. WebSphere Commerce benefits significantly from hardware platforms that support large numbers of concurrently executing threads.
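To put these thread counts in context, the sketch below compares the software threads in one JVM with the hardware threads available to it. It assumes SMT4 and, for illustration, 4 cores per JVM (the starting point suggested later in this article); the names are hypothetical.

```python
SMT_THREADS_PER_CORE = 4  # SMT4 on POWER7


def software_threads_per_hw_thread(web_threads: int, background_threads: int,
                                   cores_per_jvm: int) -> float:
    """Ratio of runnable software threads to hardware threads for one JVM."""
    hardware_threads = cores_per_jvm * SMT_THREADS_PER_CORE
    return (web_threads + background_threads) / hardware_threads


# 30 to 50 web container threads plus up to 30 scheduler and utility threads.
low = software_threads_per_hw_thread(30, 30, cores_per_jvm=4)
high = software_threads_per_hw_thread(50, 30, cores_per_jvm=4)
print(low, high)  # 3.75 5.0
```

With several software threads competing for each hardware thread, a platform that exposes more hardware threads per core has more opportunity to keep the cores busy.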

The database tier

On the database tier, the demands that WebSphere Commerce workloads place on the hardware are a bit different than on the application tier. WebSphere Commerce supports DB2 and Oracle® databases. On a properly sized and tuned site, database CPU utilization does not exceed 40%. However, the rate of I/O operations to disk is quite high.

Figure 3. NMON output showing I/O characteristics of a typical WebSphere Commerce site

Figure 3 shows a sample NMON output taken during a performance test. In this case, an LPAR was running an instance of DB2. In this test, 25% of the WebSphere Commerce cache (on the application tier) was invalidated every 20 minutes. This simulates a real world situation where an inventory feed from an external Enterprise Information System (EIS) updates the inventory levels in a WebSphere Commerce database.

You can see that the I/O rate (yellow line in Figure 3) remains consistently high throughout the test. Another important aspect to note is that despite the fact that WebSphere Commerce is an OLTP application, I/O activity is dominated by read operations (shown in blue in Figure 3). Both DB2 and Oracle databases provide buffers that allow table data to be cached in and served from RAM.


WebSphere Commerce performance on POWER7

This section takes a look at some of the performance results achieved on POWER7 hardware at the WebSphere Commerce performance lab. We take a look at how highly dynamic workloads perform on POWER7, and then discuss the scaling characteristics of the WebSphere Commerce JVM on this platform.

Handling highly dynamic workloads

The "Black Friday Cold Start" benchmark is modeled after the real world scenario, where a retailer needs to fully restart their site immediately prior to a major sales event to perform maintenance and load event-specific catalog and promotions. We used three p750s to host the web and application tiers, and one p780 to host the database tier. The load was ramped up from 0 to 9,000 concurrent shoppers in 1 second and was sustained without functional errors for one hour.

Results of our runs are summarized in Figure 4. Despite cold caches and extremely rapid load ramp-up, we achieved a maximum throughput of 1,026 orders/min at 9,000 concurrent shoppers. Response times were good and CPU utilizations quickly reached steady state, settling at or below 65%.

Figure 4. Executing the Black Friday Cold Start benchmark – maximum throughput vs. total number of virtual concurrent shoppers

Figure 5 shows the results of a reliability test simulating a full-day Black Friday sales event. In this case, we executed the Black Friday Cold Start benchmark described above and sustained an order rate of 1,000 orders/minute for a full hour, simulating the Black Friday opening-hour spike. Then we reduced the order rate to about 500 orders/minute and sustained the load for another 5 hours, simulating a full day of Black Friday shopping activity.

Figure 5. NMON output showing CPU utilization for one of the p750 WebSphere Commerce application tier LPARs during full day Black Friday reliability test

Despite the extreme ramp up of workload at the start, and the high order rate sustained over 6 hours of the tests, you can see that the CPU utilization on the application tier regains steady state quickly, and remains stable at a reasonable 30-35% range for the duration of the test. This attests to the strong fit of the POWER7 platform for WebSphere Commerce workloads and strong throughput and reliability characteristics that this combination of products provides.
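The load profile of the reliability test described above (one hour at 1,000 orders/minute followed by five hours at about 500 orders/minute) can be turned into a total order count with a few lines of arithmetic. The sketch below is illustrative only; it treats the rates as exactly constant within each phase, which the real test only approximates.

```python
def total_orders(phases):
    """Sum orders over (orders_per_minute, hours) phases."""
    return sum(rate * hours * 60 for rate, hours in phases)


def average_rate(phases):
    """Average orders per minute across all phases."""
    total_minutes = sum(hours * 60 for _, hours in phases)
    return total_orders(phases) / total_minutes


# (orders per minute, duration in hours) for the two phases of the test.
black_friday = [(1_000, 1), (500, 5)]

print(total_orders(black_friday))  # 210000
print(round(average_rate(black_friday), 1))
```

Under these assumptions the six-hour run processes roughly 210,000 orders.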

Scalability

Figure 6 shows the results of a WebSphere Commerce scalability investigation. In this case, a POWER7 LPAR was created and a single WebSphere Commerce JVM was deployed on it. We varied the number of cores available to the LPAR. For each number of available cores, a step-up test was performed to establish the maximum possible throughput. We assigned whole-core values as well as micro-partitioned half-core values to the LPAR. We plotted whole-core and micro-partitioned data points in different colors to analyze possible overhead associated with micro-partitioning.

Figure 6. Scaling characteristics of a single WebSphere Commerce JVM on a POWER7 platform

Results show near-perfect scalability characteristics up to 4.5 cores per JVM. Results also show that micro-partitioning overhead is small when 2.5 cores or more are assigned to a single JVM. Micro-partitioning provides for significant flexibility when planning WebSphere Commerce deployments.

You can consolidate all of the additional environments required for a WebSphere Commerce deployment (staging, performance, integration, and general quality assurance) as LPARs on a single POWER7 server or several servers. These additional environments do not always require entire cores to be assigned to them. Micro-partitioning is a powerful feature that allows you to leverage your processor resources with maximum efficiency.

Figure 7 demonstrates another important aspect of WebSphere Commerce performance on POWER7. The number of cores per socket (a commonly used name for a chip) for POWER7 models can be 4, 6, or 8. The model of p750 that we used in our test had a 6-core socket. As you can see, up to 6 cores per JVM, the maximum throughput increases at a near-linear rate. However, as the number of cores per JVM goes from 6 to 7, we observe a drop in throughput. This is because process execution now needs to be coordinated across the bus, which operates at half the clock frequency of the chip.

Figure 7. Effect of crossing the chip socket boundary

Caveats and best practices

The following caveats will help you get the best experience from your POWER7-based WebSphere Commerce deployment:

  • Run the latest version of AIX. At minimum, V6.1.5.1 is required for full P7 mode. On earlier versions of AIX, you only get P6 compatibility mode performance, which carries a performance penalty of up to 30%. We recommend that you use the latest version of the AIX operating system (version 7.1 at the time of writing).
  • Size LPARs to avoid spanning chip socket boundaries. When planning your deployments, determine your socket size first. Consult the technical documentation for your model to do so. It will be 4, 6, or 8 cores. Once you understand your socket size, plan to size your LPARs in such a way that they are not likely to span across socket boundaries. If you are using dynamic LPARs, take extra care to ensure the LPARs do not grow across socket boundaries.
  • Use an optimal core/JVM ratio. A good starting point is to use 4 cores/JVM. If you are micro-partitioning, it is a good idea to use at least 2.5 cores/JVM.
  • Use enough web container and scheduler threads to fully leverage SMT4. If you do not use enough threads to fully load the POWER7 cores, WebSphere Commerce does not fully leverage the available computing power. The WebSphere Commerce workload benefits from parallel processing. For optimal performance, you likely need to use more threads than on an SMT2 system.
  • Make sure VIOS has sufficient CPU, memory, and network bandwidth. Insufficient resources in the VIO server can bottleneck the whole system. Take care when micro-partitioning the VIOS.
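The socket-boundary and core/JVM guidelines above can be checked with a small planning helper. The sketch below is a simplification under stated assumptions: it only tests whether an LPAR is too large to fit in one socket, and it does not model how the hypervisor actually places partitions. All names are hypothetical.

```python
import math

VALID_SOCKET_SIZES = (4, 6, 8)  # cores per socket for POWER7 models


def _check_socket_size(cores_per_socket: int) -> None:
    if cores_per_socket not in VALID_SOCKET_SIZES:
        raise ValueError("POWER7 sockets have 4, 6, or 8 cores")


def must_span_sockets(lpar_cores: float, cores_per_socket: int) -> bool:
    """True if the LPAR is larger than one socket and so cannot avoid spanning."""
    _check_socket_size(cores_per_socket)
    return lpar_cores > cores_per_socket


def lpars_per_socket(lpar_cores: float, cores_per_socket: int) -> int:
    """Number of equally sized LPARs that fit inside a single socket."""
    _check_socket_size(cores_per_socket)
    return math.floor(cores_per_socket / lpar_cores)


# 6-core sockets, as on the p750 used in the tests described in this article.
print(must_span_sockets(7, 6))   # True: 7 cores cannot fit in a 6-core socket
print(lpars_per_socket(4, 6))    # 1 (2 cores left over)
print(lpars_per_socket(2.5, 6))  # 2 (1 core left over)
```

On a 6-core socket, the 2.5 cores/JVM micro-partitioned size packs more evenly than 4 cores/JVM, which is the kind of trade-off worth examining before fixing LPAR sizes.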

POWER7 provides the best hardware platform for high-volume WebSphere Commerce deployments. Understanding these caveats will help you get the most value from POWER7.


Conclusion

WebSphere Commerce workloads benefit from processor architectures that provide for a high degree of concurrency and fast access to large amounts of RAM. Key POWER7 features, such as SMT4, 32 MB L3 on-chip cache, and two dual-channel DDR3 memory controllers that deliver 100 GB/sec throughput, are a perfect fit. For comparison, Intel chips based on Nehalem architecture provide hardware support for half the number of concurrently executing threads (equivalent to SMT2), generally a smaller on-chip L3 cache (depending on processor model), and significantly lower RAM throughput. In terms of raw performance, POWER7 is the best choice for high-volume WebSphere Commerce customers.

While it is possible to launch a high-performance WebSphere Commerce web site using Intel-based hardware, POWER7 features provide an important performance edge. Additionally, the ability of POWER7 PowerVM to dynamically allocate resources to LPARs allows you to co-locate your production, performance, and other QA environments on a single POWER7 server. PowerVM can rapidly shift resources from QA environments to production to help contain unforeseen spikes.
