Lessons learned from past Holiday Seasons
The holiday season is fast approaching, and it's time to start thinking about your site stability and how to minimize any downtime. A few things to consider: Is your site configured to handle the increased load? Are the appropriate measures in place to effectively manage your site for peak efficiency? I want to highlight some areas of concern that I have seen in past Holiday Seasons.
Common Sense Approach:
The number one goal for any Commerce site is to ensure site stability and a successful peak shopping season. Happy customers mean more revenue. A successful shopping experience starts with capacity planning and validating that your site is able to handle the increased load. Too often I see performance testing done with sample data. Bad Idea! You have to use real data and realistic test scenarios during performance testing to be able to tune the different layers of your site for optimal performance.
It is important to minimize potential problems prior to peak load by hardening your site. This can be done by implementing a freeze on new function / code. Take a proactive approach to new problems that may arise. Prepare the environment to collect the required data before an issue occurs, so there is no need to take an additional outage just to collect data for troubleshooting.
One issue that I continue to see each year is a misconfigured WebContainer pool, which can lead to issues and possible outages. The WebContainer pool is one of the most critical tuning configurations for your site. It determines the level of concurrency in the server. I have noticed a common misconception around this configuration (more WebContainer threads equals better performance). Wrong! More WebContainer threads can cause higher overhead, i.e. increased use of system resources, increased CPU utilization and more frequent garbage collection, all of which can affect the overall performance of a site. For more information on WebContainer configuration, check out this post.
Another issue that I routinely see is a misconfigured datasource connection pool. If this is not tuned properly, you could overwhelm the database and cause threads to hang waiting on a db connection. Assume that you have added JVMs to your cluster to handle the increased load. If you do not also update the number of db connections allowed, you could end up with threads waiting for connections during peak load. There is a simple formula that you can use to help determine the correct settings; take a look at this post.
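As a rough illustration of why cluster size matters here, the sketch below multiplies the number of JVMs by each JVM's maximum pool size and compares the total with the database's connection limit. This is not the formula from the linked post, only the basic arithmetic behind the scenario above; the class name, method names and sample numbers are all hypothetical.

```java
/** Back-of-the-envelope check: can the database accept every connection the cluster may open? */
final class ConnectionPoolCheck {

    /** Worst case: every JVM in the cluster fills its connection pool at the same time. */
    static int worstCaseConnections(int jvmCount, int maxPoolSizePerJvm) {
        return Math.multiplyExact(jvmCount, maxPoolSizePerJvm);
    }

    /** True when the database connection limit covers the worst case. */
    static boolean databaseCanServe(int jvmCount, int maxPoolSizePerJvm, int dbMaxConnections) {
        return worstCaseConnections(jvmCount, maxPoolSizePerJvm) <= dbMaxConnections;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 8 JVMs, 50 connections per pool, database allows 300.
        System.out.println(worstCaseConnections(8, 50));   // 400
        System.out.println(databaseCanServe(8, 50, 300));  // false
    }
}
```

With these made-up numbers, growing the cluster from 4 to 8 JVMs pushes the worst case from 200 to 400 connections, past a limit of 300, which is the situation where threads start waiting.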
Memory is always a big hitter, but especially during peak load. There is a general misconception in regards to memory: "More heap space means the application will perform better." This is not entirely true. More space on the heap will help with caching and reduce the frequency of garbage collection, but it will also increase the time spent in each garbage collection cycle. Every GC that occurs will pause the JVM, queuing up requests until it completes. A longer GC cycle under high load could actually overwhelm the JVM and cause spikes in CPU utilization, which is a perfect storm for a site outage.
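One way to keep an eye on this is to watch how much of the JVM's uptime goes to garbage collection. The sketch below uses the standard java.lang.management MXBeans to do that. It is only an illustration: the class and method names are hypothetical, and the collection time reported by the JVM is an approximate accumulated figure, not an exact measure of pause time. Verbose GC logs remain the more detailed source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Rough view of how much JVM uptime has been spent in garbage collection. */
final class GcOverhead {

    /** Approximate milliseconds spent in GC since JVM start, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report time
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** GC time as a percentage of uptime; returns 0 when uptime is not positive. */
    static double percentOfUptime(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead: %.2f%% of uptime%n", percentOfUptime(totalGcMillis(), uptime));
    }
}
```

Sampling this value before and after a heap size change during a load test gives a quick signal of whether the larger heap is trading fewer collections for longer ones.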
Things to Consider:
I have also noticed that occasionally during peak load customers see locking on the keys table. Locking on the keys table is particularly sensitive, because while it is locked, servers cannot fetch new IDs and the site can hang. Most often we see this with the ctxmgmt table, as it is updated most often. With large clusters we recommend increasing the prefetch size to 5000 or 10000, which will reduce the overhead against the database. Here is a quick example to help explain how it works.
You need to insert 10 keys into the database.
If the prefetch size is 2, then it will make 5 trips to the db. After every 2 keys, it will need to go back to the db to get the next 2 unique keys.
If the prefetch size is 10, then it will only need to make 1 initial trip, since it will not need to get more keys until after these 10 are used.
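The same arithmetic can be written as a small sketch: the number of trips is the number of keys divided by the prefetch size, rounded up. The class and method names are hypothetical and only model the counting, not how Commerce actually fetches keys.

```java
/** Counts database round trips needed to obtain a given number of keys. */
final class KeyPrefetch {

    /** Ceiling of keysNeeded / prefetchSize: each trip reserves prefetchSize keys. */
    static int roundTrips(int keysNeeded, int prefetchSize) {
        if (keysNeeded < 0 || prefetchSize <= 0) {
            throw new IllegalArgumentException("keysNeeded must be >= 0 and prefetchSize > 0");
        }
        // long arithmetic avoids overflow when both values are large
        return (int) (((long) keysNeeded + prefetchSize - 1) / prefetchSize);
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(10, 2));   // 5
        System.out.println(roundTrips(10, 10));  // 1
    }
}
```

Applied to a larger, made-up workload of 1,000,000 keys, a prefetch size of 5000 needs 200 trips and a size of 10000 needs 100.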
For those who have OMS integrated with Commerce, the HotSku feature is something you will definitely want to take advantage of. It improves performance on the Sterling side, which in turn reduces the wait time on transactions. Another benefit is that it prevents deadlocks and lock contention. When an item is considered hot, the system does not lock it. Instead, the changes are inserted into two additional tables, one for demand and one for supply. To learn more about this feature, take a look at this link.
Cache invalidation, as you can imagine, can have a huge impact on performance, especially if it occurs during peak business hours. You will want to avoid this if at all possible, as it can dramatically affect user experience. When cache invalidation occurs, there is added overhead to clear the cache as well as to replicate the invalidation across the nodes in the cluster. It also forces additional queries to the database to retrieve data that is no longer in cache. There are some settings that can be found here, which will help reduce invalidation. A couple of tasks within Commerce that can generate thousands of records in the cacheivl table are dataload and stage propagation. There is a parameter, "MaxInvalidationDataIds", that I would recommend setting to prevent overwhelming DRS; you can read more here.
In almost all environments there are external interfaces that get referenced. These can potentially become bottlenecks which affect performance and can lead to site outages. When I say external interface, I am referring to services like OMS, Search Providers and SMTP Servers. If you are making synchronous calls to an external service and it becomes unavailable, then all the threads that are trying to use this service will end up hanging. Therefore it is critical to include the external services in your performance test and ensure that your scenarios actually exercise them. The test cases should include scenarios where the service is unavailable or slow. See how it affects the Commerce threads and develop a plan for how to recover in case this occurs. Most importantly, avoid synchronous calls to external services.
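Where a call to an external service cannot be made fully asynchronous, one common mitigation is to at least bound how long the calling thread will wait. The sketch below shows that idea with a plain java.util.concurrent executor. It is an assumption-laden illustration, not a Commerce API: the class name, pool size, timeouts and fallback values are all hypothetical, and it is a bounded wait, not a truly asynchronous design.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Wraps an external call so the caller waits at most a fixed time. */
final class BoundedExternalCall {

    // Small dedicated pool (hypothetical size). Daemon threads so the demo JVM can exit.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "external-call");
        t.setDaemon(true);
        return t;
    });

    /** Runs the call; returns fallback if it times out, fails, or the caller is interrupted. */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it does not linger
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the external call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A responsive service versus one that takes 2 seconds against a 100 ms budget.
        System.out.println(callWithTimeout(() -> "in stock", 500, "unknown"));
        System.out.println(callWithTimeout(() -> { Thread.sleep(2_000); return "in stock"; }, 100, "unknown"));
    }
}
```

Note that a service that ignores interrupts can still occupy all of the pool's worker threads, so a slow or unavailable service should still be part of the performance test scenarios described above.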
One of the most overlooked areas when it comes to performance and stability of a site is the network. Network latency dramatically impacts a site's performance, and network problems are some of the most difficult to troubleshoot. It's important that you have network monitoring in place and that the firewalls, etc. are configured properly.
To ensure a successful holiday season, be prepared and don't get caught off guard. Implement the necessary monitoring, validate your failover and recovery procedures and put your data collection script in place in case there is a problem. It's important to remember that Commerce is an application running in a very complex multi-layer architecture. Just tuning the application is not enough; you need to tune the entire environment and all of its moving parts. Here's to a great, successful Holiday Season!