Classification process performance

Classification processes are handled with sampling routines and timeout parameters that ensure minimal performance impact on database servers.

When the classifier runs, you have the option of specifying how it samples records. The default behavior takes a random sampling of rows by using an appropriate statement for the database platform in question. For example, the classifier samples using a rand() statement for SQL databases. The alternative behavior is sequential sampling, which reads rows, in order, up to the specified sample size. Random sampling is the default behavior and is recommended because it provides more representative results. However, random sampling may incur a slight performance penalty when compared to sequential sampling.

Attention:
  • Random sampling is not supported for Sybase or PostgreSQL: sequential sampling is always used, even if random sampling is selected.
  • If random sampling fails on Oracle datasources, sequential sampling is used instead. Random sampling may fail if the classifier encounters certain error conditions, for example selecting ROWID from a join view without a key-preserved table (ORA-1445).
  • In some instances, classification rules are applied only to the first 3000 characters, regardless of how large the sample data is. This 3000-character limit applies to the following data types and data sources:
    • Text data for Sybase and PostgreSQL
    • XML data for MS SQL
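
To make the two sampling modes and the platform caveats concrete, the following Python sketch shows how a platform-specific sampling query might be chosen. It is illustrative only; the function, platform names, and SQL statements are assumptions, not Guardium's actual implementation.

  # Hypothetical sketch of platform-specific sample-query selection.
  def sampling_query(platform: str, table: str, n: int, random: bool) -> str:
      # Sybase and PostgreSQL always fall back to sequential sampling.
      if platform in ("SYBASE", "POSTGRESQL"):
          random = False
      if random:
          if platform == "ORACLE":
              # Can fail (for example, ORA-1445 on some join views); the
              # caller then retries with sequential sampling.
              return (f"SELECT * FROM (SELECT * FROM {table} "
                      f"ORDER BY DBMS_RANDOM.VALUE) WHERE ROWNUM <= {n}")
          if platform == "MSSQL":
              return f"SELECT TOP {n} * FROM {table} ORDER BY NEWID()"
          return f"SELECT * FROM {table} ORDER BY rand() LIMIT {n}"
      # Sequential sampling: read rows, in order, up to the sample size.
      if platform in ("MSSQL", "SYBASE"):
          return f"SELECT TOP {n} * FROM {table}"
      return f"SELECT * FROM {table} LIMIT {n}"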

For both random and sequential sampling, the default sample size is 2000 rows or the total number of available rows, whichever is fewer; larger or smaller sample sizes can be specified. When the random sampling box is checked, the classifier randomly selects up to 2000 rows from the table or view and then scans them. When the box is cleared, the classifier selects the first 2000 rows from the table or view and then scans them. In either mode, if the table contains fewer than 2000 rows, all rows are scanned.

The default query timeout is 3 minutes (180 seconds). If a classification process is running but makes no progress for 30 minutes, the entire process is halted.
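
A minimal sketch of these defaults (the constant and function names are illustrative, not actual Guardium settings):

  DEFAULT_SAMPLE_SIZE = 2000        # rows, unless overridden
  QUERY_TIMEOUT_SECONDS = 180       # default query timeout: 3 minutes
  PROCESS_STUCK_SECONDS = 30 * 60   # halt the whole process after 30 stuck minutes

  def effective_sample_size(requested: int, total_rows: int) -> int:
      # Scan the requested number of rows or every row, whichever is fewer.
      return min(requested, total_rows)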

To further minimize the impact of classification processes on the database server, long-running queries are canceled and logged, and the remainder of the table is skipped. Any rows acquired up to that point are still used when evaluating rules for the table. Similarly, if a classification process runs for an extended period without completing, the entire process is halted and logged with its process statistics, and the next classification process is started. This behavior is uncommon and usually occurs only on servers that are already experiencing performance problems.
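
A rough sketch of this cancel-and-keep-partial-rows behavior, with hypothetical helper names:

  import logging
  import time

  def sample_table(fetch_rows, table: str, sample_size: int,
                   timeout: float = 180.0):
      """Collect up to sample_size rows; keep partial results on timeout."""
      rows, deadline = [], time.monotonic() + timeout
      for row in fetch_rows(table):  # fetch_rows is a hypothetical row iterator
          rows.append(row)
          if len(rows) >= sample_size:
              break
          if time.monotonic() > deadline:
              # Cancel the long-running query, log it, and skip the rest of
              # the table; rows fetched so far still feed rule evaluation.
              logging.warning("timeout sampling %s; keeping %d rows",
                              table, len(rows))
              break
      return rows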

The classifier periodically throttles itself to idle so it does not overwhelm the database server with requests. If many classification rules are sampling data, the load on the database server should remain constant but the process may take additional time to run.
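
As a sketch, the throttle amounts to idling between batches of work; the function and the idle interval below are invented for illustration:

  import time

  def throttled_scan(batches, evaluate_rules, idle_seconds: float = 0.5):
      """Evaluate each batch of sampled rows, idling between batches."""
      for batch in batches:
          evaluate_rules(batch)
          # Periodic idle: elapsed time grows, but database load stays steady.
          time.sleep(idle_seconds)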

The classifier handles false positives by using exclusion groups for schemas, tables, and table columns. Previously, setting up Guardium to ignore false positive results in future classification scans could be a complex process. Now, when you review classifier results, you can easily add false positive results to an exclusion group and then add that group to the classification policy so that those results are ignored in future scans.
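
Conceptually, the exclusion groups act as a filter over classifier results. A minimal sketch, assuming the groups are plain sets of names (the field names are invented):

  def filter_false_positives(results, excluded_schemas, excluded_tables,
                             excluded_columns):
      """Drop results whose schema, table, or column is in an exclusion group."""
      return [r for r in results
              if r["schema"] not in excluded_schemas
              and r["table"] not in excluded_tables
              and f'{r["table"]}.{r["column"]}' not in excluded_columns]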

Multi-thread classifier

Guardium can run multiple classification threads in parallel to optimize performance and CPU utilization. The maximum number of threads that can run in parallel is twice the number of CPU cores in the machine.

For example, if the machine has 4 CPU cores, then a maximum of 8 classifier processes can be defined and run concurrently. Irrespective of the number of CPU cores, the limit is capped at 100.
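
Expressed as a calculation (illustrative only):

  import os

  def classifier_concurrency_limit(cpu_cores=None, cap=100):
      # Two classifier processes per CPU core, capped at 100 overall.
      cores = cpu_cores or os.cpu_count() or 1
      return min(cores * 2, cap)

  print(classifier_concurrency_limit(4))  # -> 8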

To retrieve or define the concurrency limit, use set_job_process_concurrency_limit.