Classification Process Performance

Classification processes are handled with sampling routines and timeout parameters that ensure minimal performance impact on database servers.

When the classifier runs, you have the option of specifying how it samples records. The default behavior takes a random sample of rows using an appropriate statement for the database platform in question; for example, the classifier samples SQL databases using a rand() statement. The alternative behavior is sequential sampling, which reads rows, in order, up to the specified sample size. Random sampling is the default and is generally recommended because it produces more representative results, although it may incur a slight performance penalty compared to sequential sampling.
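The difference between the two modes can be sketched in a few lines of Python (an illustrative helper, not Guardium code): random sampling draws rows from anywhere in the table, while sequential sampling simply takes the first N rows.

```python
import random

def sample_rows(rows, sample_size=2000, use_random=True):
    """Return at most sample_size rows, randomly or sequentially.

    Illustrative sketch only; the real classifier pushes sampling down
    to the database (e.g. with a rand() statement) rather than fetching
    every row first.
    """
    n = min(sample_size, len(rows))       # never more rows than the table holds
    if use_random:
        return random.sample(rows, n)     # rows drawn from anywhere in the table
    return rows[:n]                       # the first n rows, in order

table = list(range(10_000))
print(len(sample_rows(table)))                    # 2000 in either mode
print(sample_rows(table, use_random=False)[:3])   # [0, 1, 2]
```

Either mode caps the sample at the table's row count, which is why a table smaller than the sample size is simply scanned in full.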

For both random and sequential sampling, the default sample size is 2000 rows or the total number of available rows, whichever is fewer. Larger or smaller sample sizes may be specified. If you check the random sampling box, the classifier randomly selects 2000 rows from the table or view and scans them; if the table contains fewer than 2000 rows, it scans all of the rows. If you clear the random sampling box, the classifier selects the first 2000 rows from the table or view and scans them. The default query time-out value is 3 minutes (180 seconds). If a classification process is running but remains stuck for 30 minutes, the entire process is halted.

To further minimize the impact of classification processes on the database server, long-running queries are cancelled and logged, and the remainder of the table is skipped. Any rows acquired up to that point are used when evaluating rules for the table. Similarly, if a classification process runs for an extended period without completing, the entire process is halted and logged with the process statistics, and the next classification process is started. This is uncommon and usually happens only on servers that are already experiencing performance problems.
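In rough pseudo-Python, the per-query behavior amounts to: cancel a query that exceeds the timeout, log it, skip the rest of the table, and evaluate rules against whatever rows were already fetched. This is a sketch under stated assumptions; the function and generator shape are illustrative, and in the real product the database driver enforces the timeout.

```python
QUERY_TIMEOUT_SECONDS = 180  # default query time-out described in this section

def sample_with_timeout(fetch_batches, timeout=QUERY_TIMEOUT_SECONDS):
    """Collect rows until done or until the time budget is exhausted.

    fetch_batches yields (elapsed_seconds, batch_of_rows) pairs; this is an
    illustrative stand-in for a database cursor with a query timeout.
    """
    rows = []
    for elapsed, batch in fetch_batches:
        if elapsed > timeout:
            # Long-running query: cancel, log, skip the rest of the table,
            # but keep the rows acquired so far for rule evaluation.
            print(f"query cancelled after {elapsed}s; evaluating {len(rows)} rows")
            break
        rows.extend(batch)
    return rows

batches = [(10, [1, 2]), (60, [3, 4]), (200, [5, 6])]
print(sample_with_timeout(batches))   # [1, 2, 3, 4]
```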

The classifier periodically throttles itself to idle so it does not overwhelm the database server with requests. If many classification rules are sampling data, the load on the database server should remain constant but the process may take additional time to run.
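The throttling behavior can be pictured as an idle pause inserted between sampling queries, which keeps database load flat at the cost of a longer total run. This is an illustrative sketch only; the classifier's actual throttle interval is internal.

```python
import time

def run_sampling_queries(queries, idle_seconds=0.01):
    """Run each sampling query, idling between them to cap database load.

    Illustrative only: 'queries' is any list of callables standing in for
    individual sampling statements.
    """
    results = []
    for query in queries:
        results.append(query())      # one sampling query's worth of work
        time.sleep(idle_seconds)     # throttle to idle between queries
    return results

print(run_sampling_queries([lambda: "t1 sampled", lambda: "t2 sampled"]))
```

More rules sampling data means more pauses, which is why load stays roughly constant while total runtime grows.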

The classifier handles false positives by using excluded groups for schemas, tables, and table columns. Previously, setting up Guardium to ignore false positive results in future classification scans could be a complex process. Now, when you review classifier results, you can easily add false positive results to an exclusion group, and add that group to the classification policy to ensure those results are ignored in future scans.
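Conceptually, an exclusion group acts as a simple filter over classifier hits. The sketch below is illustrative only; the group contents and result structure are invented for the example, not Guardium's internal representation.

```python
# Hypothetical exclusion group: fully qualified schema.table.column names
excluded = {"SALES.ORDERS.ORDER_ID", "HR.EMPLOYEES.BADGE_NO"}

hits = [
    {"column": "SALES.ORDERS.ORDER_ID", "rule": "Credit Card"},  # known false positive
    {"column": "SALES.ORDERS.CARD_NO",  "rule": "Credit Card"},
]

# With the group attached to the classification policy,
# future scans ignore anything it contains.
kept = [h for h in hits if h["column"] not in excluded]
print([h["column"] for h in kept])   # ['SALES.ORDERS.CARD_NO']
```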

Multi-thread classifier

Guardium can run more than one classifier process on a server, based on the number of cores the server is defined with. You can run multiple classifier processes almost concurrently; each process still starts roughly 10 seconds after the previous one.

Run these commands to find out the number of cores on your server. For example:
  • Use "nproc" or "lscpu" to see the number of cores, or run grep -c processor /proc/cpuinfo, or grep "cpu cores" /proc/cpuinfo | sort -u | cut -d":" -f2
  • Multiply the number of cores by 2; that gives you the number of concurrent classifier processes you can define and run at the same time.
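The same arithmetic in Python (a hypothetical helper; os.cpu_count() reports the logical processor count, which should match the nproc output above):

```python
import os

def max_classifier_processes():
    """Concurrent classifier processes = number of cores * 2, per this section."""
    cores = os.cpu_count() or 1   # same value nproc reports on the server
    return cores * 2

print(max_classifier_processes())
```

On a 2-core server, for example, this yields 4, which matches the limit=4 example below.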

Use these CLI commands to set, and to display, the classifier concurrency limit:

grdapi set_classification_concurrency_limit limit=4 (set up to 4 classifier processes to run at the same time)

grdapi get_classification_concurrency_limit (display the current concurrency limit; the default on any server is 1)