Getting started with Open Data for Industries

You can use the Open Data for Industries service on IBM® Cloud Pak for Data to govern your oil and gas data.

The Open Data for Industries service is installed on top of the Cloud Pak for Data platform on a Red Hat® OpenShift® cluster. Use the following resources to learn about the tasks to complete and the guidance to consider to set up and use your environment for Open Data for Industries.

Planning your deployment

Because Open Data for Industries is installed on top of the Cloud Pak for Data platform and Red Hat OpenShift, your planning tasks include learning about the resources that are required by both Open Data for Industries and the Cloud Pak for Data platform.

Complete the following tasks before you install Open Data for Industries.

Step Action Guidance
1 Review the system requirements for the Cloud Pak for Data platform. You can skip this step if you have an existing Cloud Pak for Data deployment. Make sure that Open Data for Industries can be deployed on your environment.
As you review the requirements, consider the following factors:
  • Software requirements, including the version of Red Hat OpenShift that you plan to install or are already running.
  • Whether you have an existing private container registry.
  • Whether you already have persistent storage that is supported.
  • The size of your cluster.
  • The amount of vCPU and memory that is required to install the Cloud Pak for Data platform on your cluster.
Important: The version of Red Hat OpenShift that you choose determines:
  • Which hardware architectures are supported.
  • Which types of persistent storage are supported.
  • Which container runtimes are supported.

When you review the requirements for the Cloud Pak for Data platform, remember that more requirements might apply depending on the services you plan to install.

2 Review System requirements for Open Data for Industries in the section below.

These are the minimum resources that are needed to install Open Data for Industries on top of the Cloud Pak for Data platform.

If you plan to install other services in your environment, review the requirements for services in Hardware requirements and Storage requirements to ensure that you identify common hardware and storage requirements.

3 Review the hardware requirements. Review the Hardware requirements and assess the following items:
vCPU, memory, and storage
Calculate the amount of vCPU, memory, and storage that you need based on the requirements for:
  • The Cloud Pak for Data platform.
  • The shared cluster components.
  • The services that you plan to install.

You can use the minimum requirements that are listed for the Cloud Pak for Data platform to run Open Data for Industries. However, if the service requires more vCPU, memory, or storage than the Cloud Pak for Data platform, add the difference between the requirements to the sizing of your cluster.

Work with IBM Sales to get a more accurate sizing based on your expected workload.

Hardware architecture
Ensure that the hardware architecture is supported by:
  • Open Data for Industries.
  • Any other services that you plan to install.
  • The persistent storage that you plan to use.
  • The deployment environment that you plan to use.
4 Choose your deployment environment. Ensure that your deployment environment supports the hardware architecture that you want to use.

Consider whether you have an existing cluster environment that you want to use.

You can skip this step if you have an existing Cloud Pak for Data deployment.

5 Choose your shared persistent storage. Ensure that the shared persistent storage is supported by:
  • Open Data for Industries.
  • Any other services that you plan to install.
  • The deployment environment that you plan to use.
6 Choose your image-hosting location. Decide whether you plan to pull images directly from the IBM Entitled Registry or mirror software images to a private container registry.

IBM Cloud Pak for Data images are accessible from the IBM Entitled Registry. In most situations, it is strongly recommended that you mirror the necessary software images from the IBM Entitled Registry to a private container registry.

You must mirror the necessary images to your private container registry in the following situations:
  • Your cluster is air-gapped (also called an offline or disconnected cluster)
  • Your cluster uses an allowlist to permit direct access by specific sites and the allowlist does not include the IBM Entitled Registry
  • Your cluster uses a blocklist to prevent direct access by specific sites and the blocklist includes the IBM Entitled Registry

See Mirroring images to your private container registry.
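The full procedure is described in that topic. As an illustration of the underlying mechanism only (not the documented Cloud Pak for Data mirroring procedure, which uses the cloudctl case commands), a single image can be copied to a private registry with the oc image mirror command. The image name, tag, and registry host below are placeholders.

oc image mirror \
  cp.icr.io/cp/<example-image>:<tag> \
  registry.example.com:5000/cp/<example-image>:<tag>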

7 Decide on the namespace scope for Cloud Pak for Data operators. The deployment architecture that you choose determines whether the operators are installed in the same project (namespace):
  • In an express installation, the IBM Cloud Pak® foundational services operators and the Cloud Pak for Data operators are in the same project.
  • In a specialized installation, the IBM Cloud Pak foundational services operators and the Cloud Pak for Data operators are installed in separate projects. With separate projects, you can specify different security settings for the IBM Cloud Pak foundational services and for the Cloud Pak for Data operators.

For more information, see Creating projects (namespaces).
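For example, a specialized installation typically uses separate projects for the IBM Cloud Pak foundational services operators, the Cloud Pak for Data operators, and the Cloud Pak for Data instance. The project names below are common conventions, not requirements; take the exact names and any operator group steps from Creating projects (namespaces).

oc new-project ibm-common-services   # IBM Cloud Pak foundational services operators
oc new-project cpd-operators         # Cloud Pak for Data operators
oc new-project cpd-instance          # Cloud Pak for Data control plane and services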


Security

Open Data for Industries complies with a set of focused security and privacy practices: vulnerability management, threat modeling, penetration testing, privacy assessments, security testing, and patch management. To learn about the security mechanisms that are implemented at the platform level and the practices that are specific to Open Data for Industries, complete the following tasks.
Platform-specific security mechanisms
Step Action Guidance
1 Review IBM Security in Development practices. IBM Security in Development provides an overview of the platform-level security and privacy practices that Open Data for Industries adheres to.
2 Get familiar with the IBM Secure Engineering Framework. Open Data for Industries development teams are encouraged to follow the Security in Development - The IBM Secure Engineering Framework practices and procedures.
3 Learn about the basic security features on the Red Hat OpenShift Container Platform. Cloud Pak for Data builds on the security features that are provided by Red Hat OpenShift by creating Security Context Constraints (SCCs), service accounts, and roles so that the Open Data for Industries pods and users have the lowest level of privileges on the Red Hat OpenShift Container Platform. Furthermore, Cloud Pak for Data is installed in a secure and transparent manner, which corresponds with the Basic security features on Red Hat OpenShift Container Platform.
4 Ensure that your data security is hardened for the storage solution that you selected. In general, data security is managed by your remote data sources. Only users with the appropriate credentials can access the data in a remote data source. However, if you use shared credentials to access your remote data sources, you put your data at risk.

Review the data at rest details in the Storage comparison section in Storage considerations.

5 Encrypt your storage partition by using the Linux Unified Key Setup-on-disk-format (LUKS). Another mechanism to ensure that your data in Open Data for Industries is stored securely is to encrypt your storage partition. If you use Linux Unified Key Setup-on-disk-format (LUKS), you must enable LUKS and format the partition with XFS before you install Cloud Pak for Data.
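A minimal sketch of preparing a LUKS-encrypted partition with an XFS file system follows, assuming an unused block device at /dev/sdX; the device and mapper names are placeholders, and the partition must be prepared before you install Cloud Pak for Data.

cryptsetup luksFormat /dev/sdX                  # initialize LUKS encryption on the partition (destroys existing data)
cryptsetup luksOpen /dev/sdX odi_encrypted      # open the partition as /dev/mapper/odi_encrypted
mkfs.xfs /dev/mapper/odi_encrypted              # format the opened device with XFS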
6 Choose whether to enable FIPS on your Red Hat OpenShift cluster. Open Data for Industries supports FIPS (Federal Information Processing Standard) compliant encryption for all encryption needs.

Read how to enable FIPS on your Red Hat OpenShift cluster.
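FIPS mode is enabled when the Red Hat OpenShift cluster is installed. A hedged sketch of the relevant install-config.yaml setting is shown below; the rest of the installation configuration is omitted.

# install-config.yaml (fragment)
apiVersion: v1
fips: true    # enable FIPS-validated cryptographic modules on the cluster nodes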

7 Configure the communication ports used by the Cloud Pak for Data cluster.
8 If your Red Hat OpenShift cluster is configured to use a custom name for the DNS service, update the DNS service name to prevent performance problems. When you install the Cloud Pak for Data platform, the installation points to the default Red Hat OpenShift DNS service name. If your Red Hat OpenShift cluster is configured to use a custom name for the DNS service, a project administrator or cluster administrator must update the DNS service name to prevent performance problems.
9 Prepare for privacy and compliance assessments. What regulations does Cloud Pak for Data comply with? summarizes the platform features that you can use to prepare for privacy and compliance assessments. The features that apply vary according to your configuration and usage.
10 Isolate the Red Hat OpenShift project (Kubernetes namespace) where Cloud Pak for Data is deployed. As an extra security measure, you can use network isolation to restrict network traffic to the Red Hat OpenShift project where Cloud Pak for Data is deployed.
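One illustrative approach, not a prescribed configuration, is a NetworkPolicy that allows ingress to pods in the Cloud Pak for Data project only from pods in that same project. The project name is a placeholder, and a production policy also needs to allow traffic from the OpenShift router and monitoring projects.

oc apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: cpd-instance          # placeholder: project where Cloud Pak for Data is deployed
spec:
  podSelector: {}                  # applies to all pods in the project
  ingress:
  - from:
    - podSelector: {}              # allow traffic only from pods in the same project
  policyTypes:
  - Ingress
EOF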
11 Protect against DDoS attacks. To filter out unwanted network traffic, use an elastic load balancer that accepts only complete HTTP connections. An elastic load balancer that is configured with an HTTP profile inspects packets and forwards only complete HTTP requests to Open Data for Industries. For more information, see Protecting Against DDoS Attacks on OpenShift.
Open Data for Industries security
Step Action Guidance
1 Get to know the available authentication and authorization mechanisms. Authentication and authorization in Open Data for Industries are managed with an identity and access management solution that is called Keycloak, which is an open source component that is included in Red Hat OpenShift.
2 Configure Keycloak and manage users and roles. After you configure Keycloak, it creates the necessary entities to secure resources. These entities can then be maintained with Keycloak or by using the Open Data for Industries Entitlements API.
3 Review the different methods of using Keycloak to manage access to Open Data for Industries. You can use Keycloak to manage access to Open Data for Industries by:
  • Keycloak administrative console
  • The REST API
  • Keycloak Java™ SDK
  • A set of methods that are provided by IBM®.
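For example, REST API access typically requires an access token that is issued by Keycloak. A hedged sketch of requesting a token with the password grant is shown below; the host, realm, client ID, and credentials are placeholders, and the /auth path prefix depends on how your Keycloak instance is configured.

curl -X POST "https://<keycloak-host>/auth/realms/<realm>/protocol/openid-connect/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=password" \
  -d "client_id=<client-id>" \
  -d "username=<user-name>" \
  -d "password=<password>"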
4 Configure the idle web client session timeout in accordance with your security and compliance requirements. You can specify the length of time that users can leave their session idle before they are automatically logged out of the web client. For more information about configuring the session timeout, see the Configure token settings step in Managing Open Data for Industries users through Keycloak.
5 Encrypt communications to and from Open Data for Industries with TLS or SSL. The TLS certificate and private key (both in PEM format) can be used to enable an HTTPS connection to the Cloud Pak for Data web client. For more information, see Using a custom TLS certificate for HTTPS connections.

Open Data for Industries uses Red Hat OpenShift Service Mesh to encrypt intra-service communication.

Service Mesh automatically configures workload sidecars to use mutual TLS when they call other workloads.

By default, Service Mesh configures destination workloads in PERMISSIVE mode.

When PERMISSIVE mode is enabled, a service can accept both plain text and mutual TLS traffic.

To allow only mutual TLS traffic, change the configuration to STRICT mode. For more information, see Peer authentication.
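A hedged sketch of a namespace-wide PeerAuthentication policy that enforces STRICT mode follows; the project name is a placeholder, and the exact policy that applies to your deployment is described in Peer authentication.

oc apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: <odi-project>     # placeholder: project where the Open Data for Industries workloads run
spec:
  mtls:
    mode: STRICT               # accept only mutual TLS traffic; PERMISSIVE also accepts plain text
EOF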

6 Understand the auditing mechanisms available for you to use. Audit logging provides accountability, traceability, and regulatory compliance that concern access to and modification of data.

IBM Open Data for Industries supports the auditing process on the platform by creating lineage among the different processes and data states.

A correlation ID tracks each request end to end and also reflects the authorization that is done during the request cycle.

The logs that are generated at various levels of abstraction can be analyzed with the EFK stack. They depict various patterns and support the audit key performance indicators.

The EFK stack is deployed as part of the Open Data for Industries installation. The EFK stack consists of:
  • Elasticsearch (ES): An object store where all logs are stored.
  • Fluentd: Gathers logs from the nodes and feeds them to Elasticsearch.
  • Kibana: A web UI for Elasticsearch.

For more information, see Logs and monitoring in Open Data for Industries.
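As an illustration of how the stack can be queried, you can port-forward to the Elasticsearch service and search for log documents by correlation ID with curl. The service name, project, index pattern, and field name below are assumptions; see Logs and monitoring in Open Data for Industries for the actual values.

oc port-forward svc/elasticsearch 9200:9200 -n <odi-project>    # service and project names are placeholders
curl -s "http://localhost:9200/<index-pattern>/_search" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match": {"correlation-id": "<correlation-id>"}}}'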

7 Use multitenancy to ensure that you make effective use of infrastructure. That way you reduce the operational expenses while still maintaining security, compliance, and independent operability. IBM® Open Data for Industries administrators can support many tenants on a shared platform either by resource separation or by logical separation. For more information, see Multitenancy in IBM Open Data for Industries.
8 Disable the external route that is used to push images to the registry server when you are not installing IBM Open Data for Industries. You can disable the external route that is used to push images to the registry server when you are not installing Open Data for Industries. However, if the route is unavailable when you install Open Data for Industries, the installation fails.

System requirements for Open Data for Industries

Resources in this section are for guidance only.

Work with IBM Sales to get a more accurate sizing based on your expected workload.

Version support

Use the following information to see what releases of Open Data for Industries are supported on the version of IBM Cloud Pak for Data you are running:

Open Data for Industries releases supported on Cloud Pak for Data Version 4.0.x:
  • 3.0.0 (OpenShift 4.6, 4.8)
Hardware requirements
Use the following information to determine whether you have the minimum required resources to install Open Data for Industries.
Important: The information in this table represents the minimum resources that you need to successfully install the service. Work with your IBM Sales representative to generate more accurate calculations based on your expected workload.
Service vCPU Memory Storage
Open Data for Industries 48 vCPU 144 GB RAM 4 TB

Notes and additional requirements:

Minimum resources for an installation with a single replica per service.

Work with IBM Sales to get a more accurate sizing based on your expected workload.

Minimum recommended configuration:
  • 3 master nodes, each with 8 vCPU and 16 GB RAM, shared on the cluster.
  • 3 worker nodes, each with 16 vCPU and 48 GB RAM, dedicated to Open Data for Industries.
Recommended configuration for worker nodes: The minimum requirement of 48 vCPU and 144 GB RAM applies only to worker nodes.
Storage requirements
Use the following information to determine which persistent storage options are available for Open Data for Industries.
Open Data for Industries supports the following persistent storage options and storage classes:
  • OpenShift Container Storage: ocs-storagecluster-ceph-rbd
  • NFS: managed-nfs-storage
  • IBM Cloud File Storage: ibmc-file-gold-gid
Software dependencies
Use the following information to determine whether the service depends on the availability of other software.
  • External dependencies are software that must be installed in addition to the Cloud Pak for Data platform software.
  • Service dependencies are other services that must be installed on Cloud Pak for Data.
Open Data for Industries
External dependencies: To use this service, you must have:
  • Apache Airflow 2.1.1
  • Apache CouchDB 3.1.1
  • Elasticsearch 7.11.1
  • Keycloak 17.0.0
  • MinIO 2020-04-15T00:39:01Z
  • Red Hat AMQ Broker 7.7.0
  • Redis 5.0.3
  • Red Hat OpenShift Service Mesh 2.1.1-0
Service dependencies: None
Multitenancy support
Use the following information to determine the level of multitenancy support for Open Data for Industries.
Open Data for Industries:
  • Install the service in separate projects: Yes
  • Install the service multiple times in the same project: No
  • Install the service once and deploy multiple instances in the same project: No. One instance only.
  • Install the service in separate tethered projects: Yes
  • Deploy multiple instances in the same tethered project: No

Installing

After you plan your environment, you are ready to complete the tasks to install Cloud Pak for Data and Open Data for Industries on your cluster.

Step Action Guidance
1 Install Red Hat OpenShift Container Platform. Follow the guidance for the statement that applies to you:
You already have an OpenShift 4.6 or 4.8 cluster
Go to the next step.
You have an older version of OpenShift
Upgrade your cluster.
You don't have an OpenShift cluster
Deploy OpenShift on your chosen environment.
2 Set up shared persistent storage. If the shared persistent storage you selected is already set up on your cluster, review Setting up shared persistent storage. Make sure you complete any additional tasks to configure the storage for Cloud Pak for Data.

If the shared persistent storage you selected is not set up on your cluster, follow the guidance in Setting up shared persistent storage to install and configure the storage.

3 Create the required OpenShift projects on your cluster. Review the guidance in Creating projects (namespaces) on Red Hat OpenShift Container Platform to determine whether:
  • You have the necessary projects on your cluster.
  • You need to create operator groups for the projects.
4 Obtain your API key. You need an IBM entitlement API key to access the Cloud Pak for Data images, which are hosted on the IBM Entitled Registry.

If you have your API key, go to the next step.

If you don't have your API key, follow the guidance in Obtaining your IBM entitlement API key.

5 Set up your cluster to pull the software images.
To pull images from the IBM Entitled Registry
Complete the appropriate steps for your environment in Configuring your cluster to pull Cloud Pak for Data images.
To mirror images to a private container registry
  1. Review the guidance in Mirroring images to your private container registry to ensure you have a private container registry that meets the minimum requirements.
    Important: To download the Open Data for Industries CASE package, run the command:
      cloudctl case save \
        --repo ${CASE_REPO_PATH} \
        --case ibm-osdu \
        --version 3.0.0 \
        --outputdir ${OFFLINEDIR}
  2. Complete the appropriate steps for your environment in Configuring your cluster to pull Cloud Pak for Data images.
6 Create the catalog source. Follow the guidance in Creating catalog sources.
Important: To create the catalog source for Open Data for Industries:
Run the following command to create the Open Data for Industries catalog source for the latest refresh:
cloudctl case launch \
  --case ${OFFLINEDIR}/ibm-osdu-3.0.0.tgz \
  --inventory osduOperatorSetup \
  --namespace openshift-marketplace \
  --action install-catalog \
  --args "--inputDir ${OFFLINEDIR} --recursive"
Verify that ibm-osdu-operator-catalog is READY:
oc get catalogsource -n openshift-marketplace ibm-osdu-operator-catalog \
  -o jsonpath='{.status.connectionState.lastObservedState}{"\n"}'

Ensure that the Operator Lifecycle Manager (OLM) can use the Cloud Pak for Data operators to install the software.

7 Install IBM Cloud Pak foundational services. Follow the guidance in Installing IBM Cloud Pak foundational services.
You can skip this step in either of the following situations:
  • IBM Cloud Pak foundational services are already installed.
  • IBM Cloud Pak foundational services are not installed and you are using the specialized installation method.
8 Create operator subscriptions.

An operator subscription tells the cluster where to install an operator and gives information about the operator to Operator Lifecycle Manager (OLM).

Follow the guidance in Creating operator subscriptions. For the Open Data for Industries operator subscription commands, see the "Installing Open Data for Industries" topic.
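For orientation only, an OLM Subscription has the following general shape; the operator name, project, and channel below are placeholders, so take the exact manifest from the "Installing Open Data for Industries" topic.

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ibm-osdu-operator            # placeholder subscription name
  namespace: cpd-operators           # placeholder project for Cloud Pak for Data operators
spec:
  name: ibm-osdu-operator            # placeholder package name
  channel: <channel>                 # placeholder channel
  source: ibm-osdu-operator-catalog  # catalog source created in step 6
  sourceNamespace: openshift-marketplace
EOF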

9 Install the scheduling service. The scheduling service is required if you plan to use Watson™ Machine Learning Accelerator or the quota enforcement feature in Cloud Pak for Data.

Follow the guidance in Installing the scheduling service.

10 Install IBM Cloud Pak for Data. Follow the guidance in Installing IBM Cloud Pak for Data. Depending on the number of OpenShift projects you created, you can install one or more instances of Cloud Pak for Data on your cluster.
11 Review the postinstallation tasks. Make sure that your cluster is secure and complete applicable post-installation tasks that impact how users interact with Cloud Pak for Data.
12 Install Open Data for Industries. Follow the guidance in Installing Open Data for Industries.

Ensure that you complete the installation tasks in order.

Upgrading from Version 3.5

Open Data for Industries does not support upgrade from Version 3.5 to Version 4.0.x. If you want to use Open Data for Industries on Version 4.0.x, you must install the service on 4.0.x.

For more information, see Upgrading Open Data for Industries from the Version 3.5 release.

Administering

Use the following tasks to manage your deployment of Open Data for Industries and keep it running smoothly.

Step Action Guidance
1 Audit your Cloud Pak for Data environment. Refer to the guidance in Auditing your Cloud Pak for Data environment.
Before you let users access the platform, determine what you want to audit:
  • System access *
  • Sensitive data on remote databases
  • Database traffic
Important: Some types of auditing require IBM Guardium® and extra services.

* The Cloud Pak for Data platform supports auditing system access through security information and event management (SIEM) software. If you want to audit system access, determine which events create audit records based on the services that are installed in your environment.

2 Set up regular backups. Ensure that your deployment is prepared for loss of data or unplanned downtime.

Follow the guidance in Set up regular backups.

3 Add users to the service.

Open Data for Industries uses Keycloak for user, group, and role management. Follow the guidance in Add users to the service.

Repeat this process as needed.

4 Monitor the platform.

From the Cloud Pak for Data web client, you can monitor the services that are running on the platform, understand how you are using cluster resources, and stay aware of issues as they arise.

Follow the guidance in Monitoring the platform.

Repeat this process routinely to ensure you are aware of the status of your deployment.

5 Scale your deployment.

Adjust the capacity and resiliency of your deployment based on your workload.

Follow the guidance in Scaling services.
Scaling method for Open Data for Industries: horizontal scaling is supported by OpenShift Container Platform.
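For example, an individual Open Data for Industries deployment can be scaled horizontally with standard OpenShift commands; the deployment and project names are placeholders, and the supported replica counts are described in Scaling services.

oc scale deployment <deployment-name> --replicas=3 -n <odi-project>
oc get deployment <deployment-name> -n <odi-project>    # verify the new replica count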

Before you scale up a service, ensure that your cluster can support the additional workload. If necessary, contact your IBM Support representative.

Repeat this process as needed.

6 View logs and monitor your deployment. Open Data for Industries uses an Elasticsearch, Fluentd, Kibana (EFK) stack for logging and monitoring.

Follow the guidance in Accessing and querying logs in Open Data for Industries.


Additional information is also available externally at https://community.opengroup.org/osdu, the Open Group community project for the Open Subsurface Data Universe (OSDU). For example, you can find additional information about audit and metrics at this site.

Using

After Open Data for Industries is installed and set up on Cloud Pak for Data, you can use API methods to deliver, store, search, and govern your oil and gas data. See the following resources:

What you can do Where to look
Learn about the Open Data for Industries service and each of the available API endpoints. Introduction to the API
Access the API reference information for the Open Data for Industries tasks you want to do. The site also includes an interactive pane for you to test curl commands. API reference on IBM Cloud Docs
Ingest oil and gas data files into the Open Data for Industries metadata repositories so that the data can be managed. Ingesting and governing oil and gas data with Open Data for Industries
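For example, the OSDU-style search endpoint can be called with curl after you obtain an access token from Keycloak. The host name, data partition ID, and record kind below are placeholders; use the API reference on IBM Cloud Docs for the exact paths, headers, and request bodies.

curl -X POST "https://<odi-host>/api/search/v2/query" \
  -H "Authorization: Bearer <access-token>" \
  -H "data-partition-id: <data-partition-id>" \
  -H "Content-Type: application/json" \
  -d '{"kind": "<authority>:<source>:<entity>:<version>", "limit": 10}'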

Troubleshooting

Use the following resources to help you troubleshoot problems with the Open Data for Industries service on Cloud Pak for Data:

What to do Where to look
Troubleshoot general issues with the Cloud Pak for Data platform. Troubleshooting the platform
Troubleshoot errors that occur during installation and upgrade. Troubleshooting installation
If you need assistance from IBM Support, you can run a job to gather diagnostic information that you can send to IBM to help diagnose the problem. Gathering diagnostic information
