Techniques on how to address through-put/performance and memory constraints when dealing with complex regular expressions on DataPower SOA appliances.
Appliance restarts due to insufficient compiler options policy stylesheet stack size.
Long strings (for example, URLs) as input to complex regular expressions
Environment that houses the DataPower SOA appliance estate.
Diagnosing The Problem
Intermittent appliance reloads. To address will require work with IBM support and providing the must gather documentation, such as error reports, latency logs, debug log files. Code reviews on stylesheets that are using complex PCRE expressions for the potential matching against long strings.
Resolving The Problem
When matching against regular expressions, there could be many branching search paths and in many cases, the algorithm saves state to be able to backtrack to a previous point if necessary.
Regular expressions can be constructed such that a very large of number of branches are created (for example, when using unlimited repetition of groups). When such expressions are faced with long strings provided as input, a stack overflow could occur if the required stack size is much larger than the user configurable compiler options policy stylesheet stack size for executing the stylesheet. The name for this configurable parameter is on the Compiler Options Policy tab of the XML Manager called "Maximum Stack Size".
Once a stack overflow happens, the memory required for processing can severely impact the appliance capacity and stability.
There are mitigation strategies against this type of event and are noted as the following:
- Increase the stack size on the compile options policy field of the XML Manager that is associated to the DataPower service.
- Limit the incoming URL string length in bytes on the HTTP(s) front-side protocol handler (if the incoming URL is used in a regular expression).
- Craft a string length check into the applicable stylesheet(s).
Let's take a closer look at these preventive measures.
First, let's make the presumption that we are working with a DataPower WAF (Web Application Firewall). You can increase the stack size of the Compile Options Policy. To do so, from the Web-GUI control panel, you would navigate to Objects->Service Configuration->Web Application Firewall.
Please note the following:
Once you click on the applicable Web Application Firewall, you will be presented with the following where you can then click on the ellipse next to the XML Manager.
Once the ellipse is clicked upon you will be brought to the XML Manager page, and the next action would be to click on the ellipse for the Compile Options Policy field. Please note the following screen-shots:
The field we are most concerned about is the Maximum Stack Size field. The value there is the DataPower default.
*** IMPORTANT *** After the change of the Maximum Stack Size, apply, and save of the configuration, the user still needs to force all stylesheets to re-compile. This can be done by reloading the appliance or clearing the XML Manager's cache. If neither is done, the old value is used at xslt compilation time.
In the determination of the stack needs, the recommendation would be to utilize pcretest as a "guiding tool" that allows compiling regular expressions and running regular expression matches and displaying various information about the compilation and match process. You can obtain it by installing "pcre" package on a Linux system or by building it from source. See http://www.pcre.org/. Two of the features are displaying the number of executions of the internal match() routine and the recursion depth needed.
It is used by appending the match string with \M as follows:
Minimum match() limit = 3
Minimum match() recursion limit = 2
Its output for a complex input would be as follows:
data> /personal/onevu (.. part omitted ..) mU9w\M
Minimum match() limit = 405423
Minimum match() recursion limit = 1789
Please note that the \M above used with the the pcretest tool is indeed the best measure in understanding the complexity of the match. In addition, timing the PCRE compilation and execution (match) can be performed with pcretest as well. From the man page:
|Run each compile, study, and match many times with a timer, and output resulting time per compile or match (in milliseconds). Do not set -m with -t, because you will then get the size output a zillion times, and the timing will be distorted. You can control the number of iterations that are used for timing by following -t with a number (as a separate item on the command line). For example, "-t 1000" would iterate 1000 times. The default is to iterate 500000 times.|
|This is like -t except that it times only the matching phase, not the compile or study phases.|
The stack size needed for PCRE match operation can be estimated at approximately 1-2KB per unit of recursion depth (the longer the recursion stack depth will mean the more computing resources used). In addition, it is important that we be clear on the string length limitation. For some PCREs, the recursion depth is proportional to the input string length. It should also be said that with an arbitrary input, no stack size will really be sufficient. The options to resolve this are:
- Rewrite the Regular Expression to require less stack space, preferably constant. This typically also improves the performance of the match algorithm (however, it is important to note that what is derived as through-put for a single match in msec, should take into account the high number of transactions performed by a typical production DataPower system. This may map into a substantial through-put/performance cost) .
- Limit the maximum input string length. In general, this needs to be programmed in the stylesheet using the Regular Expression.
- In a common special case of the input being the request URL or its part, its length can be limited in the service configuration.
each transform would require a contiguous stack before it runs; i.e. the default stack is 500KB so each transaction action would allocate 500KB block when it runs. If you set the max stack size to be 10MB ; for example if there is concern running into the issue that this technote talks about; then each transform would allocate 10MB. If you couple this with a lot of concurrency; that would really cause you to run out of memory quickly (e.g. if you currently expect to normally have 100 concurrent transactions ; each runs 1 AAA action ; by default that would use at least 5MB of memory just for AAA; now if you make the stack size 10MB ; now the same load would use at least 1GB just for AAA). So, it is important to make the stack only as big as it needs to be
A note on CPU optimization: There are regular expressions and input strings for which the match cannot be run in linear time. Again, the simplest method of estimating the complexity of a match is using the performance and timing features of the pcretest utility.
The second mitigation strategy would be to limit the incoming URL, which could be used as input, for the regular expression. This can be accomplished on the front-side protocol handler. In this example, we would address the HTTP front-side handler for the DataPower MPGW. Please note the following screen shots:
The field we are concerned with is the the Maximum Allowed URL Length field, which has a default value of 16384.
Our third mitigation strategy, would be the check for size in bytes with the XSLT function called "string-length". The length of the input can be determined and based upon the size, the desired action can be taken from within the stylesheet.
This link provides more information about using PCRE.
There is always the option of employing IBM Software Services for WebSphere (ISSW) for fee based consultation on using and the impact of regular expressions.
Also, for a more graceful way of handling this type of a situation IBM will be implementing APAR IC86959. Essentially the reload that occurs with this deep PCRE recursion is fixed from 126.96.36.199, 188.8.131.52, 184.108.40.206 and 220.127.116.11 firmware and the behavior changes to an orderly transform failure with an error message logged.
Additionally, the exact value is used for the stack size, but the actual memory block size allocated is rounded up / quantized. In the large block range (128kB-32MB) DataPower memory allocator allocates blocks whose size in bytes is exact power of two. The usable size is a few bytes less - a small part of the memory is used by the allocator itself. So, the following stack sizes will make full the use of the quantized memory blocks:
128kB - 8B = 131064 B
256kB - 8B = 262136 B
8MB - 8B = 8388300 B
We recommend that you use the above values reduced by another 64kB for XSL stack sizes, for compatibility with future firmware releases and with a diagnostic mode of the memory allocator, should you need to use that sometimes.
In 6.0.0, the bucket sizes are "power of 2" and "3/2 * power of 2".
For example, in the interesting range:
196 kB = 3/2 * 128 kB
384 kB = 3/2 * 256 kB
The stacks have a 20kB sentinel area, plus a page for alignment, plus (small) overhead - overall, 32 kB should be subtracted from the bucket size to get the "optimum" value.
Lastly, for those interested in understanding regular expressions more in depth here is the detail for a recommended book.
Title: Mastering Regular Expressions, 3rd Edition
By: Jeffrey E.F. Friedl
Publisher: O'Reilly Media
Safari Books Online
Print: August 2006
Ebook: June 2009
Print ISBN: 978-0-596-52812-6 | ISBN 10: 0-596-52812-4
Ebook ISBN: 978-0-596-55899-4| ISBN 10: 0-596-55899-6
Copyright © 2006, 2002, 1997 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly Media, Inc. books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (safari.oreilly.com). For more information contact
our corporate/institutional sales department: 800-998-9938 or email@example.com.
THIS DOCUMENTATION IS PROVIDED FOR REFERENCE PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS DOCUMENTATION, THIS DOCUMENTATION IS PROVIDED "AS IS" WITHOUT ANY WARRANTY WHATSOEVER AND TO THE MAXIMUM EXTENT PERMITTED, IBM® DISCLAIMS ALL IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE, WITH RESPECT TO THE SAME. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION, DIRECT, INDIRECT, CONSEQUENTIAL OR INCIDENTAL DAMAGES, ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS DOCUMENTATION OR ANY OTHER DOCUMENTATION. NOTWITHSTANDING ANYTHING TO THE CONTRARY, NOTHING CONTAINED IN THIS DOCUMENTATION OR ANY OTHER DOCUMENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS).
15 June 2018