Historical Analysis of Cloud Observability Data

How to use IBM Watson Studio to analyze historical IBM Cloud logs using Apache Spark via the IBM Data Engine service.

Security breaches affecting major companies and the government are occurring with increasing frequency. Bugs can go undiscovered for long periods of time, and historical analysis of observability data can provide crucial insights. When a bug or security issue is discovered it is important to answer questions like the following:

When did this first occur?
How often has this been happening?
What might have introduced the concern?

This post will explain how to configure an IBM Log Analysis or IBM Activity Tracker instance to archive data to IBM Cloud Object Storage (COS). Then, IBM Watson Studio can be used with the IBM Cloud Data Engine service to gain insights into the historical data:

Architecture diagram.

Notes:

The numbers 1, 2 and 3 in the diagram above correspond to Steps 1, 2 and 3 below.
COS is the IBM Cloud Object Storage service.
IAM is the Identity and Access Management service.
The IBM Log Analysis and Activity Tracker services use the same underlying archive system.
Jupyter notebook is used to interactively visualize data.

Step 1: Create resources

The provided GitHub repository contains the Terraform configuration required to create the resources. Inspect the main.tf file to preview the resources that will be created. Notice that there is a Terraform resource for each of the items in the Resources box in the architecture diagram above.

Create resources using IBM Cloud Schematics

Log in to IBM Cloud.
Navigate to Create Schematics Workspaces.
Under the Specify Template section:
- Set the GitHub, GitLab or Bitbucket repository URL to https://github.com/IBM-Cloud/log-archive-analysis.
- Set the Terraform version drop down to terraform_v1.1.
- Click Next.
Under Workspace details:
- Provide a workspace name: log-archive.
- Choose a Resource Group. Remember the resource group name for use below.
- Set the Location.
- Click on Next.
Verify the details and then click Create.
Under Variables, change the resource-group-name and the other defaults as desired. The regions us-south and eu-de supported the services of interest at the time this post was authored.
Scroll to the top of the page and click Apply plan. Check the logs to see the status of the services created.

Get the Schematics output

Open an IBM Cloud Shell to run the following commands.
List the workspaces and note the ID column for your workspace:

ibmcloud schematics workspace list

Set the WORKSPACE_ID variable:

WORKSPACE_ID=YOUR_WORKSPACE_ID

Get the configuration for the Log Analysis dashboard settings for archiving (to be used in Step 2):

ibmcloud schematics output --id $WORKSPACE_ID --output json | jq -r '.[0].output_values[].logging_dashboard_settings_archiving.value'

Get the configuration for the Jupyter notebook configuration for Python (to be used in Step 3):

ibmcloud schematics output --id $WORKSPACE_ID --output json | jq -r '.[0].output_values[].jupyter_notebook_configuration_python.value'
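If you would rather pull these values out in Python than with jq, the same path can be walked with the standard json module. The following is a minimal sketch, not part of the repository: the function name schematics_output_value is made up for illustration, and the sample document is a hypothetical stand-in shaped only after the jq path used above (a list whose first element has an output_values list of name-to-value mappings).

```python
import json


def schematics_output_value(raw_json, name):
    """Return the value found at .[0].output_values[].<name>.value.

    Unlike the jq filter, which emits one result per output_values entry,
    this returns the first entry that contains the requested name.
    """
    templates = json.loads(raw_json)
    for outputs in templates[0]["output_values"]:
        if name in outputs:
            return outputs[name]["value"]
    raise KeyError(name)


# Hypothetical sample shaped like the CLI output; real values will differ.
sample = json.dumps([
    {
        "output_values": [
            {"jupyter_notebook_configuration_python": {"value": "PLACEHOLDER = 'value'"}}
        ]
    }
])

print(schematics_output_value(sample, "jupyter_notebook_configuration_python"))
```

To use it with real data, save the output of the ibmcloud schematics output command to a file and pass the file contents as raw_json.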

Step 2: Enable archiving

The IBM Observability Service provides a high-performance interactive dashboard for viewing IBM Log Analysis and IBM Activity Tracker data. Service plans include 7, 14 and 30 days of search capability. Archiving sends the data to a COS bucket so that it remains available after the search window expires. To enable archiving, you must open the dashboard for the Log Analysis or Activity Tracker instance of choice and configure archiving.

This post uses the Activity Tracker service.

Open the Activity Tracker instance list.
Create an Activity Tracker in the same region as your resources from the previous step (if one does not exist).
Open the dashboard of the Activity Tracker.
Click the Settings cog, click Archiving and then click Manage.
Click Enable Archiving.
Select IBM Cloud Object Storage in the Provider drop-down menu.
Fill in the values with the items generated in Step 1.
Click Save.

Examine resources

It can take 24 hours for data to flow to the bucket. While waiting, navigate to the resources created in the previous step. All resource names start with the same prefix string, and the default is log-archive. Opening the resources will have the side effect of sending events to the Activity Tracker and on to the bucket.

Data Engine — open Resource list:
- Open the Services and Software section.
- Click the log-archive-de link.
- Click on the Launch Data Engine UI button.
- In the next step, queries will be started from the Jupyter notebook, and they will be displayed in the jobs list of this view.
Watson Studio — open Resource list:
- Open the Services and Software section.
- Click the log-archive-watson link.
- Keep this around for the next step.
Cloud Object Storage — open Resource list:
- Open the Storage section.
- Click the log-archive-cos link.
- Click the bucket without data-engine in the name to see the Activity Tracker data.
- Click the empty bucket with data-engine in the name to see where Data Engine will store query results in the next step.
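Once data has arrived, you may want to look at a single archive object outside of Data Engine. The sketch below assumes the archive objects are gzip-compressed files with one JSON event per line, which this post does not confirm, so verify against an object downloaded from your own bucket. The function name read_archive_lines and the sample events (including the _source and _host field names) are hypothetical stand-ins.

```python
import gzip
import json
import os
import tempfile


def read_archive_lines(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in archive so the sketch runs without a real bucket.
sample_events = [
    {"_source": {"_host": "cloud-object-storage"}},
    {"_source": {"_host": "sql-query"}},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        for event in sample_events:
            handle.write(json.dumps(event) + "\n")
    loaded = list(read_archive_lines(sample_path))

print(len(loaded), "events read")
```

For anything beyond a spot check, the Data Engine queries in Step 3 are the better tool because they scan the whole bucket without downloading it.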

Step 3: Jupyter notebook

Watson Studio in IBM Cloud Pak for Data provides data science capabilities, including Jupyter notebook editors that reside in a project.

Create a project

Open the Watson Studio resource (described in the previous step).
Click Launch in IBM Cloud Pak for Data.
If a pop-up screen/overlay page is displayed, dismiss it. It is not needed for this post.
Click the + in the Project section to create a new project. Click Create an empty project.
Provide a name. In the Select storage service section, select log-archive-cos from the drop-down menu and click Create.

In the project, create a Jupyter notebook

Click the Assets panel at the top.
Click New asset.
Type “jupyter” in the search, which should display a Jupyter notebook editor card. Click the card.
Name the notebook and click the From URL panel at the top.
Leave the default runtime (IBM Runtime 22.1 on Python 3.9 XS 2vCPU 8 GB RAM for me) and paste this string for the Notebook URL: https://github.com/IBM-Cloud/log-archive-analysis/blob/master/logarchive.ipynb. Then, click Create.

Using the Jupyter notebook

A Jupyter notebook is made up of cells, and the cells in this notebook are written in Python. At the top, notice the Run triangle button. It executes the currently selected cell. The first cell installs the required packages using pip commands. Run it.

The second cell is a placeholder that contains a reminder that you must replace its content with the Terraform output produced in Step 1 (the jupyter_notebook_configuration_python value). Copy and paste that output into the notebook cell.

Use the Run button repeatedly to execute the rest of the cells in the notebook.

You will notice that the last few cells look something like this:

# find all records that are not from the sql-query (data engine) service
df = sql_r(f'SELECT * FROM FLATTEN({logsurl} STORED AS JSON) WHERE _source__host NOT RLIKE "sql-query" LIMIT 10')
# show full column contents instead of truncating wide values
with pandas.option_context('display.max_colwidth', None):
    print(df)
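
The column name _source__host in the query above comes from flattening nested JSON. As far as I understand it, FLATTEN joins each nested field name to its parent with an underscore, so a _host field inside _source becomes _source__host. The sketch below imitates that naming rule in plain Python to make the double underscore less mysterious. It is an illustration of the assumed rule, not Data Engine's implementation, and the sample event is hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts, joining parent and child names with '_'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so that deeper levels keep accumulating the prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical event with a nested _source object.
event = {"_source": {"_host": "sql-query"}}
print(flatten(event))  # {'_source__host': 'sql-query'}
```

Running SELECT * with LIMIT 1 in the notebook is a quick way to see which flattened column names your own data actually has.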

Even if you are not an SQL expert, you will probably be able to make your own queries. Jupyter notebook is a great environment to experiment. Hit the plus sign to add a new cell and copy/paste one of the queries and start exploring.
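One way to start experimenting is to generate variations of the notebook's query from a small helper. This is a rough sketch under some assumptions: build_queries is a name invented here, the logsurl value is a hypothetical placeholder (the notebook defines the real one), and the strings it returns are meant to be passed to the notebook's sql_r function. The pattern is interpolated directly into the SQL text, which is fine for interactive exploration but not for untrusted input.

```python
def build_queries(logsurl, column, pattern, limit=10):
    """Build a sample query and a count query for rows where column matches pattern."""
    source = f"FLATTEN({logsurl} STORED AS JSON)"
    where = f'{column} RLIKE "{pattern}"'
    return {
        # A handful of matching rows, to see what the events look like.
        "sample": f"SELECT * FROM {source} WHERE {where} LIMIT {limit}",
        # How often the match occurs across the whole archive.
        "count": f"SELECT COUNT(*) AS matches FROM {source} WHERE {where}",
    }


logsurl = "cos://us-south/example-bucket/"  # hypothetical; use the notebook's value
queries = build_queries(logsurl, "_source__host", "sql-query")
for name, text in queries.items():
    print(name, "->", text)
```

In a notebook cell, something like df = sql_r(queries["count"]) would then answer the "how often has this been happening?" question for the chosen pattern.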

Clean up

Navigate to Schematics Workspaces.
Click your workspace to open.
Click Actions > Destroy resources and follow instructions.
Wait for successful completion. If it fails, try again.
Click Actions > Delete workspace and follow instructions.

Summary and next steps

The interactive search capabilities in the IBM Log Analysis and IBM Cloud Activity Tracker dashboards are great. If you need a large amount of historical log data to satisfy the requirements of your business, this post will help you perform the required historical research.

Turn on archiving for the Activity Tracker and Log Analysis instances in the IBM Observability service and direct the data to your COS buckets. The data is secure and long-term storage prices are attractive. Pay for what you use and let IBM handle scaling, reliability, accessibility, search, etc.

Next steps:

Get started with IBM Cloud Data Engine. This post just scratched the surface.
IBM Cloud Pak for Data as a Service has notebooks and so much more.
Use IBM Cloud SQL Query to Analyze VPC Network Traffic from IBM Cloud Flow Logs for VPC
IBM Cloud Object Storage is flexible, cost-effective and scalable. Check it out.
Build a data lake using object storage

If you have feedback, suggestions or questions about this post, please email me or reach out to me on Twitter (@powellquiring).

Powell Quiring

Offering Manager
