Unleash the value of Guardium data by using the Guardium Big Data Intelligence solution

The power of a big data platform: Purpose built for data security requirements

Note from editor: This article reflects changes as a result of the announcement that the SonarG solution from JSonar is now available from IBM as part of the IBM Guardium product portfolio under the name Guardium Big Data Intelligence.

Organizations that use IBM Security Guardium activity-monitoring for data security and compliance struggle with the quantity of collected audit data, especially if they have 10 or more Guardium collectors. IBM Security Guardium Big Data Intelligence provides the power of a big data platform that is purpose-built for data security requirements. It helps augment existing Guardium deployments with the ability to quickly create an optimized security data lake that can retain large quantities of historical data over long time horizons.

This article describes the Guardium Big Data Intelligence solution, how it fits into a Guardium enterprise deployment, and how to integrate the Big Data Intelligence solution with the Guardium data and file activity monitoring solution.

How much data does Guardium's activity monitoring generate?

IBM Security Guardium provides a comprehensive data protection solution to help secure data and files. Capabilities include vulnerability assessments, activity monitoring, and at-rest encryption. Guardium actively monitors all access to sensitive data by using a host-based probe. Data events are then collected and processed at the Guardium collector (see Figure 1).

Figure 1: S-TAP-to-collector data flow

Because so much data is sent at high velocity to a collector, the solution scales horizontally to allow many collectors to handle the workload. In a typical Guardium deployment, there is a hierarchy of components as shown in Figure 2. S-TAPs intercept database activity and send it to collectors. Collectors analyze the activity and log it to local storage. Collectors then dump the data and copy it to aggregators, where it is merged for reporting across the enterprise. Finally, a Central Manager manages the entire federated system.

Figure 2: Appliances in an enterprise Guardium deployment

Guardium collects a vast number of data-related events such as session data, query data, exceptions, authorization failures, and other violations. A typical Guardium system will capture many billions of complex records. Unlike other security events, database activity includes complex data, such as objects that are being operated on, what is being done, the context of the operation within the session, and much more. Because of the quantity and velocity of data, Guardium is a prime candidate to take advantage of big data technologies.

Guardium Big Data Intelligence

As shown in Figure 3, Guardium Big Data Intelligence is a big data solution for consolidating, storing, and managing all of the complex data that is captured by Guardium activity monitoring, while also providing the means to enrich that data for expanded context. By doing so, it optimizes the value and capabilities of Guardium activity monitoring and paves the way to new functionality and benefits. It takes advantage of big data technologies to streamline the collection and analysis of the large, growing pools of Guardium activity-monitoring data.

Figure 3: Guardium Big Data Intelligence provides a big data solution for data security insights

Guardium Big Data Intelligence augments the activity monitoring platform in four primary areas:

  1. By simplifying data collection and reducing the cost and time of aggregation and reporting.
  2. By enabling long-term, online retention of Guardium data in a low-cost and efficient manner, so that organizations can easily and rapidly interact with several years' worth of data. As an example, the following extract from an internal log shows a complex query that found 19 matches in a data set of 27 billion full SQLs (representing data from 100 collectors over a period of 12 months). While examining both session-level information (such as client IPs and DB user names) and a regular expression on the full SQL itself, this query took less than 5 minutes:

    Listing 1: Sample data from a Guardium Big Data Intelligence log
    command gdm3.$cmd command:{"aggregate":"full_sql","pipeline":[{"$match":{"Full 
    {"$joined":"session","$as":"session","$match":{"Session Id":"$session.$Session Id"},"$project":
    {"Client IP":"$session.$Analyzed Client IP","Server IP":"$session.$Server IP","Server 
    Type":"$session.$Server Type","Service Name":"$session.$Service Name","Source
    Program":"$session.$Source Program","DB User":"$session.$DB User... ntoreturn:0 keyUpdates:0
    locks(micros) w:0 reslen:130    Execution Time = 245501ms   
  3. By creating faster and broader access to valuable activity data that provides self-service access for different groups in the organization. This self-service capability reduces the need for Guardium administrators to be involved in every report that is required by the business and empowers multiple stakeholders to better leverage the high value data that Guardium captures.
  4. By enabling high-performance analytics across expanded data sets. For example, Figure 4 shows a graph of daily session throughput for a period of 6 months from a large collector environment. Figure 5 shows analysis of all exceptions for a period of 6 months, clearly highlighting several periods of high exception events. Because the analytics run on all collector data and over a large period, they are very precise in identifying outliers. For example, Figure 5 displays a visualization of unusual User Connections over time via the Outliers model.
    Guardium Big Data Intelligence delivers increased functionality, fast reporting, and advanced analytics. At the same time, it simplifies Guardium deployment and reduces the total cost of the solution through hardware reduction and operational efficiencies.
Figure 4: Session throughput for a period of 6 months
Figure 5: Exception analytics for a period of 6 months

From an architectural perspective, Guardium Big Data Intelligence integrates with the traditional Guardium architecture as an alternative to the aggregation tier. Collectors are configured to create hourly data extracts of various activities/data by using the Guardium data mart mechanism. These data mart extracts are copied by the collectors to the Big Data Intelligence solution over SCP where they are consumed and merged into a single big data store of all Guardium data, even when they are coming from hundreds of collectors. We'll cover that in more detail in the next section.

Impact of Big Data Intelligence on a Guardium enterprise deployment

As mentioned previously, Guardium enterprise deployments use collectors and aggregators. Collectors get feeds from S-TAPs (the host-based probes) and store the data locally first. Every 24 hours, each collector then moves its daily data to an aggregator where the data is merged, as shown in Figure 6. When enterprise-level queries are needed, organizations either use distributed reports that run on these aggregators (by using a reporting server) or, in older deployments, use a second tier of aggregation. A typical enterprise Guardium deployment has more than one aggregator because the recommended ratio is between 8 and 10 collectors per aggregator.

Figure 6: Guardium infrastructure architecture

With a Guardium Big Data Intelligence architecture, collectors communicate directly to the big data lake, as shown in Figure 7. This communication greatly simplifies the data collection mechanics and facilitates much more efficient collection of larger data sets while also using less hardware infrastructure.

Figure 7: Guardium Big Data Intelligence can simplify a Guardium Deployment

The collectors push data to the Guardium Big Data Intelligence data lake on an hourly basis where the data is merged with previous data, avoiding the 24-48 hour data lag that is common in previous architectures.

Guardium Big Data Intelligence can even consolidate activity data across multiple Central Manager domains. This is especially important for larger enterprise deployments that typically use multiple Central Managers to consolidate data, since it provides a true enterprise-wide view of activity data.

The interface between Guardium collectors and the Big Data Intelligence warehouse is based on the Guardium data mart mechanism. Data marts were introduced in Guardium V9 and are a powerful mechanism for enabling query reports to be generated on an ongoing basis without the need for aggregation processes and with very low collector processor usage. Data marts also provide the most efficient way to extract data from Guardium.

Guardium includes several data marts to feed the Big Data Intelligence data lake. These data marts run on an hourly schedule and copy files to the data lake by using scheduled jobs that generate extract files and copy the files over an encrypted channel (by using SCP). These files are processed by ETL and put in a form that makes reporting and analytics fast and simple as shown in Figure 8.

Figure 8: Guardium collectors generate data that is processed by ETL
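Conceptually, the ETL step is a merge: many per-collector extract files become one consolidated row stream. The following sketch is illustrative only; the field names and file layout are assumptions, not the actual SonarG ETL format.

```python
import csv
import io

def merge_extracts(extracts):
    """Merge per-collector CSV extracts into one row stream.

    `extracts` maps a collector name to the text of its extract file
    (header row plus data rows). Each merged row is tagged with its
    source collector, mimicking how hourly data mart files from many
    collectors are consolidated into a single store.
    """
    merged = []
    for collector, text in extracts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["Collector"] = collector
            merged.append(row)
    return merged

# Two collectors, each sending an hourly session extract.
extracts = {
    "collector01": "Session Id,DB User\n101,APPUSER\n102,DBA\n",
    "collector02": "Session Id,DB User\n201,APPUSER\n",
}
rows = merge_extracts(extracts)
print(len(rows))  # 3 rows, drawn from 2 collectors
```

Because every row carries its source collector, enterprise-wide reports can still be filtered back down to a single appliance when needed.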

Advantages of a big data architecture that uses the Guardium Big Data Intelligence solution

From an infrastructure perspective, there are many advantages to this approach, including the following potential benefits:

  • Reduced hardware and operational costs of aggregation.
  • Reduced latency of data consolidation, down from 24 hours to 1 hour.
  • Reduced collector storage footprint by purging data more frequently from the collectors.

Importantly, both lab testing and production deployments indicate that reports run anywhere from 10-100 times faster on the big data platform than on aggregators, making the system even more valuable for both compliance reporting and security analytics. The system uses various compression techniques to keep the cost of long-term retention down while keeping data in a form that allows reporting over extended periods of time without having to restore archive files.

Guardium Big Data Intelligence Requirements

Guardium Big Data Intelligence requires Linux® to run and is typically installed on top of the organization's standard Linux build. It can run on various Linux flavors but is most often installed over Red Hat Enterprise Linux.

A typical Guardium Big Data Intelligence node has 2 CPUs, 64-128 GB of memory (depending on load and concurrency requirements), and 2-30 TB of disk (depending on the number of collectors, retention needs, and Guardium policy). The system can be physical or virtual. Virtual systems are easier to deploy, but physical systems with local disks tend to be less expensive due to access to lower-cost local storage.

Guardium Big Data Intelligence allows the use of inexpensive commodity spinning disks and is specifically optimized to leverage new cloud and object store technologies. These allow enterprises to easily construct large-scale security data lakes at a fraction of the cost of traditional enterprise storage.

Data marts: Integrating the Big Data Intelligence data lake with Guardium collectors and Central Managers

Integration between Guardium activity collection and the big data lake relies on the use of data marts on both Guardium collectors and Central Managers in order to organize and package the appropriate data to be exported on a recurring basis from Guardium into the Big Data Intelligence data lake.

These data-extract files are typically pushed hourly, although it varies depending on the specific data sets. For example, operational data such as S-TAP health is published every 5 minutes in order to further reduce the latency of information and improve the ability to respond to issues as they arise. Classification and Vulnerability Assessment (VA) results are published on a daily schedule.

The data mart push mechanism is highly resilient. If data is not pushed on time due to a network outage or for any other reason, it is pushed during a future cycle. All data is counted and cross-referenced by using an additional, independent data-mart extract, which allows Guardium Big Data Intelligence to validate every transfer and confirm its completeness. If data does not arrive or there is an inconsistency, the operator is immediately alerted with details.
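At its core, that completeness check is a reconciliation of expected versus received record counts per collector and data mart. A minimal sketch of the idea follows; the real validation uses the independent extraction-log data mart (DM 31), and these structures are illustrative.

```python
def verify_transfers(expected, received):
    """Compare record counts the collectors report having extracted
    (`expected`: {(collector, datamart): rows}) against what actually
    arrived in the data lake (`received`, same shape). Returns the
    discrepancies so an operator can be alerted with details.
    """
    issues = []
    for key, want in sorted(expected.items()):
        got = received.get(key, 0)
        if got != want:
            issues.append({"source": key, "expected": want, "received": got})
    return issues

expected = {("collector01", "Export:Session Log"): 5000,
            ("collector02", "Export:Session Log"): 4200}
received = {("collector01", "Export:Session Log"): 5000}  # collector02 never arrived
print(verify_transfers(expected, received))
```

A missed hour shows up as a nonzero discrepancy and is simply re-pushed on a later cycle, per the retry behavior described above.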

The data marts that are used in the integration are associated with the following types of data:

  • Collector-specific data, such as activity data and S-TAP status data. Activity data includes session data, exception data, full SQL data, query data, and more, and makes up the vast majority of the data that is transferred into the big data lake. Both activity data and S-TAP status data are activated only on collectors. (S-TAPs connect only to collectors.)
  • Group data: Guardium groups are copied to the big data lake transparently so that you are able to use groups within reports. Schedule group data marts on the Central Managers. 
    If you have a multi-CM environment, schedule the group data marts on each of the CMs and change /etc/sonar/sonargd.conf on your Guardium Big Data Intelligence system to have:
     - group_members
  • Non-activity data, such as Vulnerability Assessment and classifier data. Activate these data marts on the Guardium appliances that perform the scans. It does not matter if they are collectors, aggregators, or Central Managers.
  • System-level data marts include buffer usage data, system info data, and more that are used by Guardium Big Data Intelligence to provide an enterprise view of the operational aspects of the Guardium system. Activate these system-level data marts on all Guardium appliances.

The data marts that form the integration layer are available for Guardium V9 and V10, but not all are available in both. For example, outlier data looks different in V9 versus V10.1.2. Table 1 shows the different data mart names. Use DMs 49 and 50 for V10.1.2 and up, and DMs 27 and 28 for versions preceding V10.1.2 (do not use both pairs). The DMs available as of the end of 2016 are shown in Table 1.

Important: The precise set of DMs are typically chosen based on the implementation scope and the Guardium versions.

Table 1: Available data marts for Guardium Big Data Intelligence
| Data mart name | Report title | Unit type | Guardium version | Data mart ID |
| Export: Access log | Export: Access log | Collector | All | 22 |
| Export: Session log | Export: Session log | Collector | All | 23 |
| Export: Session log ended | Export: Session log | Collector | All | 24 |
| Export: Exception log | Export: Exception log | Any | All | 25 |
| Export: Full SQL | Export: Full SQL | Collector | All | 26 |
| Export: Outliers list | Analytic outliers list | Any | Versions preceding V10.1.2 | 27 |
| Export: Outliers summary by hour | Analytic outliers summary by date | Any | Versions preceding V10.1.2 | 28 |
| Export: Export extraction log | User-defined extraction log | Any | All | 31 |
| Export: Group members | Export: Group members | Any | All | 29 |
| Export: Policy violations | Export: Policy violations | Collector | All | 32 |
| Export: Buff usage monitor | Buff usage monitor | Any | All | 33 |
| Export: VA results | Security assessment export | Any | All | 34 |
| Export: Policy violations - detailed | Export: Policy violations | Collector | All | 38 |
| Export: Access log - detailed | Export: Access log | Collector | All | 39 |
| Export: Discovered instances | Discovered instances | Any | All | 40 |
| Export: Databases discovered | Databases discovered | Any | All | 41 |
| Export: Classifier results | Classifier results | Any | All | 42 |
| Export: Datasources | Data-sources | Central Manager | All | 43 |
| Export: S-TAP status | S-TAP status monitor | Collector | All | 44 |
| Export: Installed patches | Installed patches | Any | All | 45 |
| Export: System info | System info | Any | All | 46 |
| Export: User - role | User - role | Central Manager | All | 47 |
| Export: Classification process log | Classification process log | Any | All | 48 |
| Export: Outliers list - enhanced | Analytic outliers list - enhanced | Any | V10.1.2 and up | 49 |
| Export: Outliers summary by hour - enhanced | Analytic outliers summary by date - enhanced | Any | V10.1.2 and up | 50 |

These data marts are usually bundled into the latest Guardium Patch Updates (GPUs) but are also provided as separate individual patches for customers that have not yet applied the GPUs. Depending on your patch level, install the appropriate patches:

  • V9:
    • GPU750
  • V10:
    • V10.1 (p120) and p172, p174, p175
    • V10.1.2 (p200) and p175
    • Releases above 10.1.2 do not have specific dependencies as of the publication of this article. You will need to check for appropriate prerequisites for the release of Guardium Big Data Intelligence that you use.

Configuring the Guardium appliances

There are three primary steps that you execute on the Guardium appliances to enable integration with the Big Data Intelligence solution:

  • Ensure that you are on the right patch level.
  • Enable and schedule data-mart extraction by using GuardAPI (grdapi) commands that are described in the following section.
  • Adjust purge schedules on collectors to reduce storage footprint (optional).

Enabling and scheduling data mart extraction

Enabling and scheduling the various data-mart extracts also involves three primary steps that are described below:

  1. Enable the appropriate DMs and point their output to the Guardium Big Data Intelligence system.
  2. Schedule the extract.
  3. Determine the extract data start (optional).

The following sample grdapi command enables session data to be passed to the Guardium Big Data Intelligence solution. It tells the Guardium collectors where to copy the data mart data via an SCP process. Configuring any of the other data marts requires changing only the Name field.

grdapi datamart_update_copy_file_info destinationHost="yourhosthere"
destinationPassword="yourpwdhere" destinationPath="/local/raid0/sonargd/incoming"
destinationUser="sonargd" Name="Export:Session Log" transferMethod="SCP"

You need to execute the data mart configuration commands only once per CM, since all collectors then receive this information from the CM. Replace the hostname, password, and data path to reflect the details of your Guardium Big Data Intelligence installation.

Once the data mart is enabled, you need to schedule the extracts. Since Guardium schedulers are local to the appliance, you need to run the grdapi scheduling command on each appliance from which data is extracted. For example, for each collector that needs to send session data you would run:

grdapi schedule_job jobType=dataMartExtraction cronString="0 45 0/1 ? * 1,2,3,4,5,6,7" 
objectName="Export:Session Log"
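Because the scheduler is local to each appliance, it can help to generate the exact command string for every appliance up front and then execute them one by one (for example, via SonarCLI, described later). The helper below is a hypothetical sketch; the grdapi syntax is the one shown above.

```python
def schedule_command(datamart, cron):
    """Build the grdapi scheduling command shown above for one data mart."""
    return ('grdapi schedule_job jobType=dataMartExtraction '
            f'cronString="{cron}" objectName="{datamart}"')

def schedule_plan(appliances, datamart, cron):
    """One command per appliance: Guardium schedulers are local, so the
    same job must be scheduled on every collector that sends this data."""
    return {host: schedule_command(datamart, cron) for host in appliances}

plan = schedule_plan(["collector01", "collector02"],
                     "Export:Session Log",
                     "0 45 0/1 ? * 1,2,3,4,5,6,7")
print(plan["collector01"])
```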

Here is an example of how to delete a schedule for session data mart:

grdapi delete_schedule deleteJob=true jobGroup="DataMartExtractionJobGroup"

The job name is "DataMartExtractionJob_" concatenated with the data mart ID shown in Table 1 (for example, "DataMartExtractionJob_23" for the session log).

Because there are many grdapi calls to issue, use the SonarCLI Expect script to automate the process and reduce work (see next section).

The recommended schedules when you enable all data marts are shown in Table 2.

Table 2: Recommended data mart schedules
| Data mart | Cron string | Schedule |
| Export: Access log | 0 40 0/1 ? * 1,2,3,4,5,6,7 | 00:40 |
| Export: Session log | 0 45 0/1 ? * 1,2,3,4,5,6,7 | 00:45 |
| Export: Session log ended | 0 46 0/1 ? * 1,2,3,4,5,6,7 | 00:46 |
| Export: Exception log | 0 25 0/1 ? * 1,2,3,4,5,6,7 | 00:25 |
| Export: Full SQL | 0 30 0/1 ? * 1,2,3,4,5,6,7 | 00:30 |
| Export: Outliers list | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
| Export: Outliers summary by hour | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
| Export: Export extraction log | 0 50 0/1 ? * 1,2,3,4,5,6,7 | 00:50 |
| Export: Group members | 0 15 0/1 ? * 1,2,3,4,5,6,7 | 00:15 |
| Export: Policy violations | 0 5 0/1 ? * 1,2,3,4,5,6,7 | 00:05 |
| Export: Buff usage monitor | 0 12 0/1 ? * 1,2,3,4,5,6,7 | 00:12 |
| Export: VA results | 0 0 2 ? * 1,2,3,4,5,6,7 | Daily at 2 AM |
| Export: Policy violations - detailed | 0 5 0/1 ? * 1,2,3,4,5,6,7 | 00:05 |
| Export: Access log - detailed | 0 40 0/1 ? * 1,2,3,4,5,6,7 | 00:40 |
| Export: Discovered instances | 0 20 0/1 ? * 1,2,3,4,5,6,7 | 00:20 |
| Export: Databases discovered | 0 20 0/1 ? * 1,2,3,4,5,6,7 | 00:20 |
| Export: Classifier results | 0 20 0/1 ? * 1,2,3,4,5,6,7 | 00:20 |
| Export: Data sources | 0 0 7 ? * 1,2,3,4,5,6,7 | Daily at 7 AM |
| Export: S-TAP status | 0 0/5 0/1 ? * 1,2,3,4,5,6,7 | Every 5 minutes |
| Export: Installed patches | 0 0 5 ? * 1,2,3,4,5,6,7 | Daily at 5 AM |
| Export: System info | 0 0 5 ? * 1,2,3,4,5,6,7 | Daily at 5 AM |
| Export: User - role | 0 5 0/1 ? * 1,2,3,4,5,6,7 | 00:05 |
| Export: Classification process log | 0 25 0/1 ? * 1,2,3,4,5,6,7 | 00:25 |
| Export: Outliers list - enhanced | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
| Export: Outliers summary by hour - enhanced | 0 10 0/1 ? * 1,2,3,4,5,6,7 | 00:10 |
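All of the hourly entries in Table 2 share one Quartz-style cron shape and vary only the minute offset, which staggers the extracts so they do not all hit the data lake at once. A small illustrative helper makes the pattern explicit:

```python
def hourly_cron(minute):
    """Quartz-style cron string for 'every hour at :minute', in the form
    used throughout Table 2 (seconds, minutes, hours, day-of-month,
    month, day-of-week)."""
    return f"0 {minute} 0/1 ? * 1,2,3,4,5,6,7"

# Recreate a few Table 2 entries from their minute offsets.
offsets = {"Export:Session Log": 45,
           "Export:Full SQL": 30,
           "Export:Policy violations": 5}
for name, minute in offsets.items():
    print(name, "->", hourly_cron(minute))
```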

In most cases, once you schedule a data mart, it starts to export data "from now" into the future. If you already have data on the collector (for example, from the past 10 days) and you want that data moved to the big data lake as well, you can set the start date for the data mart to a date in the past, as shown in Figure 9. Edit the data mart in the CM GUI and set the desired start date before you issue the grdapi schedule commands. If you have GPU 200 (V10.1.2 p200) or later, you can set the start date by using a grdapi command instead of the GUI:

grdapi update_datamart Name="Export:User - Role" initial_start="2016-12-01 00:00:00"
Figure 9: Enabling a start date in the past

Optional: Reduce collector storage footprint

Once you enable the integration between Guardium and the Guardium Big Data Intelligence solution, data is moved off the collector more frequently than with aggregation: hourly versus daily. This creates the opportunity to purge the collectors more aggressively and reduce the collector storage footprint. For example, rather than allocate 300 GB or 600 GB per collector, you can allocate 100 GB per collector.

Note: This can only be done if you adjust your retention per collector appropriately (that is, keep less data on each collector).

The simplest path to this storage reduction is to build new collector VMs with the smaller storage footprint and define the purge schedule to keep only three days of data on the new collectors. Redirect the S-TAPs to point at the new collectors, which in turn point to the Big Data Intelligence system. After a one-day period during which both the old and new collectors point to the Big Data Intelligence system concurrently, the old collectors can be backed up and decommissioned, completing the transition. This method can also be used to simplify and accelerate Guardium upgrades since you do not have to worry about data management on the collectors.

Use SonarCLI to automate data mart setup

SonarCLI is a utility that combines a customer-provided list of Guardium appliances with a set of predefined data marts and then communicates with all Guardium appliances to execute and validate the grdapi commands necessary to establish this communication (see Figure 10). Script execution takes minutes and once completed, the big data lake will begin receiving data mart data. Note that SonarCLI is a general-purpose script execution framework and can also be used to automate grdapi executions that are unrelated to the big data solution.

Figure 10: SonarCLI scripting

To use SonarCLI, you set up a file that tells the system what scripts to run on collectors and CMs. The script then opens a CLI session per appliance, runs the appropriate script as defined by the config file, stores all output in a log file, and creates a summary log. Once finished you review the summary to see if everything ran to completion and you're done.
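The overall flow of such a run can be sketched as follows. This is Python rather than Expect, and the real SonarCLI script differs; `execute` here is a stand-in for opening a CLI session to an appliance.

```python
def run_on_appliances(plan, execute):
    """Run a list of commands on each appliance, keeping per-host logs
    plus a pass/fail summary, mirroring SonarCLI's behavior: one CLI
    session per appliance, full output logged, summary at the end.

    `plan` maps host -> list of grdapi command strings; `execute(host,
    cmd)` performs one command and returns its output (in a real run it
    would drive an SSH/CLI session).
    """
    logs, summary = {}, {}
    for host, cmds in plan.items():
        logs[host], ok = [], True
        for cmd in cmds:
            try:
                logs[host].append(execute(host, cmd))
            except Exception as exc:
                logs[host].append(f"ERROR: {exc}")
                ok = False
        summary[host] = "OK" if ok else "FAILED"
    return logs, summary

# Dry run with a stub executor instead of a live CLI session.
plan = {"cm01": ['grdapi datamart_update_copy_file_info ...'],
        "collector01": ['grdapi schedule_job ...']}
logs, summary = run_on_appliances(plan, lambda host, cmd: f"{host}: ok")
print(summary)
```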


Creating custom data marts

In addition to the prebuilt data marts that are used by the data lake, you can push additional data to the data lake from any report built in your Guardium environment. Any report that is executed on the Guardium collector or central manager can be converted into a data mart and its results piped directly into the data lake by using the standard data transfer process. Figures 11 and 12 show how to convert a query into a data mart from within the query builder. As shown in Figure 11, click on the Data Mart button for the query that you want to use for the data mart.

Figure 11: Converting a query to a data mart

Figure 12 shows defining a file name with the prefix EXP for a data mart that runs on an hourly basis. The EXP prefix informs the appliance that this data mart is being created for delivery to the Big Data Intelligence application. The data mart name must begin with EXPORT, and the EXP prefix must appear at the start of the file name for the transfer to the Big Data Intelligence solution to complete successfully.

Figure 12: Scheduling data mart delivery from Guardium to Big Data Intelligence

As with the standard data marts, grdapi commands must be executed to configure the SCP transfer of the file to the data lake and also to schedule this transfer on an hourly basis. Define the SCP transfer configuration by using:

grdapi datamart_update_copy_file_info destinationHost="yourhosthere"
destinationPassword="yourpwdhere" destinationPath="/local/raid0/sonargd/incoming"
destinationUser="sonargd" Name="Export:GUARD_USER_ACTIVITY" transferMethod="SCP"

Schedule the extract/push by using:

grdapi schedule_job jobType=dataMartExtraction cronString="0 40 0/1 ? * 1,2,3,4,5,6,7"
objectName="Export:GUARD_USER_ACTIVITY"


IBM Guardium Big Data Intelligence allows you to optimize your Guardium environment by using a true big data solution for managing and accessing Guardium data. When you use data marts, you can move data off Guardium appliances faster than ever before, reduce your hardware footprint and costs, and enable fast reporting, long-term online retention of data, and advanced analytics. Augmenting Guardium with a purpose-built big data solution creates a very powerful platform for expanding the use cases and benefits of IBM Security Guardium's data protection solutions.

