Getting started with Open Data for Industries

You can use the Open Data for Industries service on IBM® Cloud Pak for Data to govern your oil and gas data.

The Open Data for Industries service is installed on top of the Cloud Pak for Data platform on a Red Hat® OpenShift® cluster. Use the following resources to learn about the tasks to complete and the guidance to consider to set up and use your environment for Open Data for Industries.

Planning your deployment

Because Open Data for Industries is installed on top of the Cloud Pak for Data platform and Red Hat OpenShift, your planning tasks include learning about the resources that are required by both Open Data for Industries and the Cloud Pak for Data platform.

For more information, see Planning.

Security

Open Data for Industries complies with a set of focused security and privacy practices: vulnerability management, threat modeling, penetration testing, privacy assessments, security testing, and patch management. To learn about the security mechanisms that are implemented at the platform level and the practices that are specific to Open Data for Industries, complete the following tasks.
Platform-specific security mechanisms
Step Action Guidance
1 Review IBM Security in Development practices. IBM Security in Development provides an overview of the platform-level security and privacy practices with which Open Data for Industries complies.
2 Get familiar with the IBM Secure Engineering Framework. Open Data for Industries development teams are encouraged to follow the Security in Development - The IBM Secure Engineering Framework practices and procedures.
3 Learn about the basic security features on the Red Hat OpenShift Container Platform. Cloud Pak for Data builds on the security features that Red Hat OpenShift provides by creating security context constraints (SCCs), service accounts, and roles so that Open Data for Industries pods and users have the lowest level of privileges on the Red Hat OpenShift Container Platform. Furthermore, Cloud Pak for Data is installed in a secure and transparent manner, in keeping with the basic security features of the Red Hat OpenShift Container Platform.
4 Ensure that your data security is hardened for the storage solution that you selected. In general, data security is managed by your remote data sources. Only users with the appropriate credentials can access the data in a remote data source. However, if you use shared credentials to access your remote data sources, you put your data security at risk.

Review the data at rest details in the Storage comparison section in Storage considerations.

5 Encrypt your storage partition by using the Linux Unified Key Setup-on-disk-format (LUKS). Another way to ensure that your data in Open Data for Industries is stored securely is to encrypt your storage partition. If you use Linux Unified Key Setup-on-disk-format (LUKS), you must enable LUKS and format the partition with XFS before you install Cloud Pak for Data. A small verification sketch follows this table.
6 Choose whether to enable FIPS on your Red Hat OpenShift cluster. Open Data for Industries supports FIPS (Federal Information Processing Standard) compliant encryption for all encryption needs.

Read how to enable FIPS on your Red Hat OpenShift cluster.

7 Configure the communication ports used by the Cloud Pak for Data cluster.
8 If your Red Hat OpenShift cluster is configured to use a custom name for the DNS service, update the DNS service name to prevent performance problems. When you install the Cloud Pak for Data platform, the installation points to the default Red Hat OpenShift DNS service name. If your cluster uses a custom name for the DNS service, a project administrator or cluster administrator must update the DNS service name to prevent performance problems.
9 Prepare for privacy and compliance assessments. What regulations does Cloud Pak for Data comply with? summarizes the platform features that you can use to prepare for privacy and compliance assessments. The applicable features vary according to your configuration and usage.
10 Isolate the Red Hat OpenShift project (Kubernetes namespace) where Cloud Pak for Data is deployed. As an extra security measure, you can use network isolation to isolate the Red Hat OpenShift project where Cloud Pak for Data is deployed.
11 Protect against DDoS attacks. To filter out unwanted network traffic, use an elastic load balancer that accepts only complete HTTP connections, as described in Protecting Against DDoS Attacks on OpenShift. An elastic load balancer that is configured with an HTTP profile inspects the packets and forwards only complete HTTP requests to Open Data for Industries.
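
The following Python sketch is one informal way to spot-check steps 5 and 6 from a node or bastion host that can reach the storage device. It is a minimal example under stated assumptions: the device path is a placeholder, the cryptsetup utility must be installed, and the host is assumed to expose the standard Red Hat Enterprise Linux FIPS indicator at /proc/sys/crypto/fips_enabled.

import subprocess
from pathlib import Path

DEVICE = "/dev/sdb1"  # placeholder: the partition that backs your Cloud Pak for Data storage

def fips_enabled() -> bool:
    """Return True if the kernel reports FIPS mode (RHEL exposes 1 in this file)."""
    indicator = Path("/proc/sys/crypto/fips_enabled")
    return indicator.exists() and indicator.read_text().strip() == "1"

def is_luks(device: str) -> bool:
    """Return True if cryptsetup reports that the device is a LUKS volume (typically requires root)."""
    result = subprocess.run(["cryptsetup", "isLuks", device], check=False)
    return result.returncode == 0

if __name__ == "__main__":
    print(f"FIPS mode enabled: {fips_enabled()}")
    print(f"{DEVICE} is LUKS encrypted: {is_luks(DEVICE)}")
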
Open Data for Industries security
Step Action Guidance
1 Get to know the available authentication and authorization mechanisms. Authentication and authorization in Open Data for Industries are managed with an identity and access management solution that is called Keycloak, which is an open source component that is included in Red Hat OpenShift.
2 Configure Keycloak and manage users and roles. After you configure Keycloak, it creates the necessary entities to secure resources. These entities can then be maintained with Keycloak or by using the Open Data for Industries Entitlements API.
3 Review the different methods of using Keycloak to manage access to Open Data for Industries. You can use Keycloak to manage access to Open Data for Industries through the following methods (a minimal REST API sketch follows this table):
  • Keycloak administrative console
  • The REST API
  • Keycloak Java™ SDK
  • A set of methods that are provided by IBM®.
4 Configure the idle web client session timeout in accordance with your security and compliance requirements. You can specify the length of time that users can leave their session idle before they are automatically logged out of the web client. For more information about configuring the session timeout, see the Configure token settings step in Managing Open Data for Industries users through Keycloak.
5 Encrypt communications to and from Open Data for Industries with TLS or SSL. The TLS certificate and private key (both in PEM format) can be used to enable an HTTPS connection to the Cloud Pak for Data web client. For more information, see Using a custom TLS certificate for HTTPS connections.

Open Data for Industries uses a service mesh to encrypt intra-service communication.

The service mesh automatically configures workload sidecars to use mutual TLS when they call other workloads.

By default, the service mesh configures destination workloads in PERMISSIVE mode.

When PERMISSIVE mode is enabled, a service can accept both plain text and mutual TLS traffic.

To allow only mutual TLS traffic, you must change the configuration to STRICT mode. For more information, see Peer authentication. A configuration sketch follows this table.

6 Understand the auditing mechanisms available for you to use. Audit logging provides accountability, traceability, and regulatory compliance concerning access to and modification of data.

IBM Open Data for Industries supports the auditing process on the platform by creating lineage among the different processes and the data states.

A correlation ID tracks each request end to end and also reflects the authorization that is done during the request cycle.

The different kinds of logs that are generated at various abstraction levels can be analyzed in the EFK stack, where they reveal patterns and support the audit key performance indicators.

The EFK stack is implemented as part of the Open Data for Industries installation. The EFK stack consists of:
  • Elasticsearch (ES): An object store where all logs are stored.
  • Fluentd: Gathers logs from the nodes and feeds them to Elasticsearch.
  • Kibana: A web UI for Elasticsearch.

For more information, see Logs and monitoring in Open Data for Industries.

7 Use multitenancy to ensure that you make effective use of infrastructure. That way you reduce the operational expenses while still maintaining security, compliance, and independent operability. IBM® Open Data for Industries administrators can support many tenants on a shared platform either by resource separation or by logical separation. For more information, see Multitenancy in IBM Open Data for Industries.
8 Disable the external route that is used to push images to the registry server when you are not installing IBM Open Data for Industries. You can disable the external route to the registry server whenever you are not installing Open Data for Industries. However, if the route is still unavailable when you try to install Open Data for Industries, the installation fails.
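
As a companion to steps 2 and 3, the following sketch shows the general shape of Keycloak Admin REST API calls for creating a user. It is a sketch under stated assumptions, not the IBM-provided method: the host, realm, and credentials are placeholders, and depending on your Keycloak configuration the base path might also include an /auth prefix, so confirm the endpoints against your deployment before use.

import requests

KEYCLOAK_URL = "https://keycloak.example.com"  # placeholder route to your Keycloak instance
REALM = "example-realm"                        # placeholder realm used by Open Data for Industries
ADMIN_USER = "admin"                           # placeholder administrator credentials
ADMIN_PASSWORD = "change-me"

# 1. Obtain an administrator access token from the master realm.
token_resp = requests.post(
    f"{KEYCLOAK_URL}/realms/master/protocol/openid-connect/token",
    data={
        "grant_type": "password",
        "client_id": "admin-cli",
        "username": ADMIN_USER,
        "password": ADMIN_PASSWORD,
    },
    timeout=30,
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# 2. Create a user in the realm that Open Data for Industries uses.
user_resp = requests.post(
    f"{KEYCLOAK_URL}/admin/realms/{REALM}/users",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"username": "new.user@example.com", "enabled": True},
    timeout=30,
)
user_resp.raise_for_status()
print("User created:", user_resp.status_code)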
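
For step 5, one way to switch the service mesh from PERMISSIVE to STRICT mode is to apply an Istio PeerAuthentication resource in the project where Open Data for Industries runs. The following sketch applies such a resource with the Kubernetes Python client; the namespace is a placeholder, and the same resource can be applied with oc apply, so treat this as one possible approach rather than the documented procedure.

from kubernetes import client, config

NAMESPACE = "open-data-industries"  # placeholder: the project where the service is deployed

config.load_kube_config()  # uses your current oc/kubectl login context
custom_api = client.CustomObjectsApi()

# PeerAuthentication in STRICT mode rejects plain text traffic and accepts only mutual TLS.
peer_authentication = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "PeerAuthentication",
    "metadata": {"name": "default", "namespace": NAMESPACE},
    "spec": {"mtls": {"mode": "STRICT"}},
}

custom_api.create_namespaced_custom_object(
    group="security.istio.io",
    version="v1beta1",
    namespace=NAMESPACE,
    plural="peerauthentications",
    body=peer_authentication,
)
print("PeerAuthentication applied in STRICT mode.")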

System requirements for Open Data for Industries

Version support

Use the following information to see what releases of Open Data for Industries are supported on the version of IBM Cloud Pak for Data you are running:

Service Cloud Pak for Data version 4.5.x
Open Data for Industries
  • 3.0.1 (OpenShift 4.6, 4.8, 4.10)
  • 3.0.2 (OpenShift 4.6, 4.8, 4.10)
Hardware requirements
Use the following information to determine whether you have the minimum required resources to install Open Data for Industries.
Important: The information in this table represents the minimum resources that you need to successfully install the service. Work with your IBM Sales representative to generate more accurate calculations based on your expected workload.
Service: Open Data for Industries

vCPU:
  • Operator pods: 1 vCPU
  • Catalog pods: 1 vCPU
  • Operand: 32 vCPU

Memory:
  • Operator pods: 2 GB RAM
  • Catalog pods: 0.1 GB RAM
  • Operand: 85 GB RAM

Storage: 4 TB

Notes and additional requirements: These values are the minimum resources for an installation with a single replica per service. Work with IBM Sales to get a more accurate sizing based on your expected workload.

Minimum recommended configuration.
  • 3 master nodes, each with 8 vCPU and 16 GB RAM, shared on the cluster.
  • 3 worker nodes, each with 16 vCPU and 48 GB RAM, dedicated to Open Data for Industries.
Note: The minimum requirements for worker nodes also take into consideration the requirements for system resources.
Recommended configuration for worker nodes: The minimum requirement of 48 vCPU and 144 GB RAM applies only to worker nodes.
Storage requirements
Use the following information to determine which persistent storage options are available for Open Data for Industries.
Open Data for Industries supports the following persistent storage options:
  • OpenShift Data Foundation
  • Portworx
  • Amazon Elastic Block Store
  • Amazon Elastic File System
  • IBM Cloud Block Storage
  • IBM Cloud File Storage

IBM Spectrum Fusion, IBM Spectrum Scale Container Native, and NFS are not supported.
Recommended storage classes for services
If you use different storage class names on your cluster, ensure that you specify equivalent storage classes.
The following storage classes are recommended for Open Data for Industries.
Note: You must specify information about the storage that you want to use when you install the service.
Storage Storage classes
OpenShift Data Foundation ocs-storagecluster-ceph-rbd
IBM Spectrum Fusion Not supported.
IBM Spectrum Scale Container Native Not supported.
Portworx portworx-metastoredb-sc
NFS Not supported.
Amazon Elastic Block Store gp2-csi or gp3-csi
Amazon Elastic File System efs-nfs-client
IBM Cloud Block Storage ibmc-block-gold
IBM Cloud File Storage ibmc-file-gold-gid or ibm-file-custom-gold-gid
Software dependencies
Use the following information to determine whether the service depends on the availability of other software.
  • External dependencies are software that must be installed in addition to the Cloud Pak for Data platform software.
  • Service dependencies are other services that must be installed on Cloud Pak for Data.
Service External dependencies Service dependencies
Open Data for Industries
To use this service, you must have:
  • Apache Airflow 2.1.4
  • Apache CouchDB 3.2.1
  • Elasticsearch 7.11.1
  • Keycloak 19.0.3
  • MinIO 4.5
  • Red Hat AMQ Broker 7.10.1
  • Redis 6.0.9
  • Red Hat OpenShift Service Mesh 2.2.3-0
Service dependencies: None
Multitenancy support
Use the following information to determine the level of multitenancy support for Open Data for Industries.
Service: Open Data for Industries
  • Install the service in separate projects: Yes
  • Install the service multiple times in the same project: No
  • Install the service once and deploy multiple instances in the same project: No. One instance only.
  • Install the service in separate tethered projects: Yes
  • Deploy multiple instances in the same tethered project: No

Installing

After planning your environment, you are ready to complete the tasks to install Cloud Pak for Data and Open Data for Industries on your cluster.

  1. Set up a client workstation. To install IBM Cloud Pak for Data, you must have a client workstation that can connect to the Red Hat OpenShift Container Platform cluster. For more information, see Setting up a client workstation.
  2. To successfully install IBM® Cloud Pak for Data, you must have specific information about your environment. Complete the following tasks to ensure that you have the information that you need (a small pre-flight check sketch follows this list).
    1. Obtaining your IBM entitlement API key
    2. Determining which components to install
    3. Setting up installation environment variables
  3. Prepare your cluster.
  4. Before you can install IBM® Cloud Pak for Data, you must set up persistent storage on your Red Hat® OpenShift® cluster. For more information on the supported storage and the corresponding storage classes, see "Storage requirements".
  5. Create and configure OpenShift projects (Kubernetes namespaces) where you plan to deploy the Cloud Pak for Data software. For more information, see Setting up projects (namespaces) on Red Hat OpenShift Container Platform.
  6. Access the software images. To pull images from a private container registry:
    1. Complete Updating the global image pull secret.
    2. Complete Mirroring images to a private container registry.
  7. Complete the appropriate tasks for your environment in Installing the IBM Cloud Pak for Data platform and services.
    Note: Follow the Specialized installations instructions.
  8. Complete the appropriate tasks for your environment in Post-installation setup (Day 1 operations).
  9. Install Open Data for Industries. For more information, see Installing Open Data for Industries.
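
Before you start the platform installation, it can help to confirm that the client workstation is ready. The following pre-flight sketch checks that the oc CLI is available, that you are logged in to the cluster, and that an entitlement key is exported; the IBM_ENTITLEMENT_KEY variable name is an assumption, so use whatever name your installation environment variables define.

import os
import shutil
import subprocess
import sys

def check(label: str, ok: bool) -> bool:
    print(f"[{'OK' if ok else 'MISSING'}] {label}")
    return ok

results = []

# The oc CLI must be installed on the client workstation.
oc_path = shutil.which("oc")
results.append(check("oc CLI found on PATH", oc_path is not None))

# You must be logged in to the Red Hat OpenShift cluster.
logged_in = False
if oc_path:
    whoami = subprocess.run([oc_path, "whoami"], capture_output=True, text=True, check=False)
    logged_in = whoami.returncode == 0
results.append(check("logged in to the cluster (oc whoami)", logged_in))

# The entitlement key should be exported; the variable name here is an assumption.
results.append(check("IBM_ENTITLEMENT_KEY environment variable set", bool(os.environ.get("IBM_ENTITLEMENT_KEY"))))

sys.exit(0 if all(results) else 1)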

Upgrading from Version 3.5

Open Data for Industries does not support operator-based upgrades from Cloud Pak for Data Version 3.5. For more information, see Upgrading Open Data for Industries from the Version 3.5 release.

Upgrading from Version 4.0.x

For more information, see Upgrading Open Data for Industries from the Version 4.0 release.

The instructions for upgrading IBM Cloud Pak for Data assume that you already have your entitlement API key. If you don't have your entitlement API key, see Obtaining your IBM entitlement API key.

Administering

Use the following tasks to manage your deployment of Open Data for Industries and keep it running smoothly.

Step Action Guidance
1 Audit your Cloud Pak for Data environment. Refer to the guidance in Auditing your Cloud Pak for Data environment.
Before you let users access the platform, determine what you want to audit:
  • System access *
  • Sensitive data on remote databases
  • Database traffic
Important: Some types of auditing require IBM Guardium® and extra services.

* The Cloud Pak for Data platform supports auditing system access through security information and event management (SIEM) software. If you want to audit system access, determine which events create audit records based on the services that are installed in your environment.

2 Set up regular backups. Ensure that your deployment is prepared for loss of data or unplanned downtime.

Follow the guidance in Set up regular backups.

3 Add users to the service.

Open Data for Industries uses Keycloak for user, group, and role management. Follow the guidance in Add users to the service.

Repeat this process as needed.

4 Monitor the platform.

From the Cloud Pak for Data web client, you can monitor the services that are running on the platform, understand how you are using cluster resources, and stay aware of issues as they arise.

Follow the guidance in Monitoring the platform.

Repeat this process routinely to ensure you are aware of the status of your deployment.

Tip: You can automate parts of this monitoring process.
5 Scale your deployment.

Adjust the capacity and resiliency of your deployment based on your workload.

Follow the guidance in Scaling services.
Open Data for Industries uses a different scaling method: horizontal scaling is supported by OpenShift Container Platform.

Before you scale up a service, ensure that your cluster can support the additional workload. If necessary, contact your IBM Support representative.

Repeat this process as needed.

6 View logs and monitor your deployment. Open Data for Industries uses an Elasticsearch, Fluentd, Kibana (EFK) stack for logging and monitoring.

Follow the guidance in Accessing and querying logs in Open Data for Industries. A minimal query sketch follows this table.

Additional information is also available externally at https://community.opengroup.org/osdu, which is the Open Group community project for the Open Subsurface Data Universe (OSDU). For example, you can find additional information about audit and metrics at this site.
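
Because audit and application logs flow into the EFK stack, you can query them programmatically as well as through Kibana. The following sketch issues a standard Elasticsearch search request that filters on a correlation ID; the endpoint, index pattern, field names, and credentials are all placeholders, so match them to the values described in Accessing and querying logs in Open Data for Industries.

import requests

ES_URL = "https://elasticsearch.example.com:9200"  # placeholder Elasticsearch route
INDEX_PATTERN = "app-logs-*"                       # placeholder index pattern for service logs
CORRELATION_ID = "replace-with-a-real-correlation-id"

# Standard Elasticsearch _search request: filter on the correlation ID and sort by time.
query = {
    "query": {"match": {"correlation-id": CORRELATION_ID}},  # field names are assumptions
    "sort": [{"@timestamp": {"order": "asc"}}],
    "size": 50,
}

response = requests.post(
    f"{ES_URL}/{INDEX_PATTERN}/_search",
    json=query,
    auth=("elastic-user", "change-me"),  # placeholder credentials
    verify=False,                        # replace with the path to your CA bundle in production
    timeout=30,
)
response.raise_for_status()
for hit in response.json()["hits"]["hits"]:
    print(hit["_source"].get("message", hit["_id"]))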

Using

After you install and set up Open Data for Industries on Cloud Pak for Data, you can use API methods to deliver, store, search, and govern your oil and gas data. See the following resources; a minimal search request sketch follows the table.

What you can do Where to look
Learn about the Open Data for Industries service and each of the available API endpoints. Introduction to the API
Access the API reference information for the Open Data for Industries tasks you want to do. The site also includes an interactive pane for you to test curl commands. API reference on IBM Cloud Docs
Ingest oil and gas data files into the Open Data for Industries metadata repositories so that the data can be managed. Ingesting and governing oil and gas data with Open Data for Industries
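
To give a feel for the API-driven workflow, the following sketch submits a search request in the OSDU style that the Open Data for Industries APIs follow. It is a sketch under stated assumptions: the host, route, data partition, token, and query values are placeholders, so confirm the exact paths and headers in the API reference on IBM Cloud Docs before you rely on them.

import requests

ODI_URL = "https://odi.example.com"        # placeholder route to the Open Data for Industries APIs
ACCESS_TOKEN = "replace-with-a-keycloak-bearer-token"
DATA_PARTITION = "opendes"                 # placeholder data partition ID

# OSDU-style search request: query all kinds and return at most 10 records.
payload = {
    "kind": "*:*:*:*",
    "query": "*",
    "limit": 10,
}

response = requests.post(
    f"{ODI_URL}/api/search/v2/query",      # path follows the OSDU convention; confirm in the API reference
    json=payload,
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "data-partition-id": DATA_PARTITION,
        "Content-Type": "application/json",
    },
    timeout=30,
)
response.raise_for_status()
for record in response.json().get("results", []):
    print(record.get("id"))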

Troubleshooting

The following resources will help you troubleshoot problems with the Open Data for Industries service on Cloud Pak for Data:

What to do Where to look
Troubleshoot general issues with the Cloud Pak for Data platform. Troubleshooting the platform
Troubleshoot errors that occur during installation and upgrade. Troubleshooting installation
If you need assistance from IBM Support, you can run a job to gather diagnostic information that you can send to IBM to help diagnose the problem. Gathering diagnostic information