Data Refinery environments (Watson Studio and IBM Knowledge Catalog)
In Data Refinery, a runtime is started when you shape your data in Data Refinery and when you run a Data Refinery flow in a job.
- Shaping data in Data Refinery
- Running a Data Refinery flow
- Environment options in jobs
- Default environment templates
- Runtime logs for jobs
Shaping data in Data Refinery
When you choose to refine data in Data Refinery, a Data Refinery runtime is started under the covers and is listed as an active runtime under Tool runtimes on the Environments page on the Manage tab of your project. You can stop the runtime from this page.
Running a Data Refinery flow
You can create a job in which to run your Data Refinery flow:
- Directly in Data Refinery, by clicking the Jobs icon from the Data Refinery toolbar and creating a job
- From your project's Assets page by selecting the Data Refinery flow and clicking ACTIONS > Create job
Environment options in jobs
When you create a job in which to run a Data Refinery flow, you can select one of the following environments:
- Spark & R environments

  Spark environments are not available by default. An administrator must install the Analytics Engine powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

  With a Spark & R environment, the Data Refinery flow runs in its own Spark cluster. Each Spark environment consists of one SparkR kernel as a service. The kernel has a dedicated Spark cluster and Spark executors. Use a Spark & R environment only if you are working with a large data set. If your data set is small, select the Default Data Refinery XS runtime instead: although the SparkR cluster in a Spark & R environment is fast and powerful, it takes time to create, which is noticeable when you run a Data Refinery job on a small data set.

  You can select a default environment template included in Watson Studio or create your own Spark & R environment template. If you create your own template, you can configure the size of the Spark driver and the size and number of the executors to match the size of your data set.

  Always select a Spark & R environment to run Data Refinery flows that operate on large data sets.
- Default Data Refinery XS

  The Default Data Refinery XS runtime is used when you refine data in Data Refinery, and it can also be selected as the environment runtime when you create a job in which to run your Data Refinery flow. Select the Default Data Refinery XS runtime to run Data Refinery flows that operate on small data sets because the runtime is instantly available and doesn't first have to be started before the job can run.

  The Default Data Refinery XS runtime is HIPAA ready.
- Hadoop cluster

  Hadoop environments are not available by default. An administrator must install the Execution Engine for Apache Hadoop service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

  If you want to refine HDFS data on a Hadoop cluster, you can run your Data Refinery jobs directly on the Hadoop cluster.
After the runtime is started, it is listed as an active runtime on the Environments page of your project. The runtime is stopped when the Data Refinery job stops running.
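To make the sizing trade-off concrete, the following sketch models a custom Spark & R environment template's hardware configuration as plain data. The field names (`driver`, `executors`, `vcpu`, `memory_gb`) are illustrative only and are not the actual Cloud Pak for Data template schema; the point is that total cluster resources grow with executor count and size, which is why a larger configuration pays off only on large data sets:

```python
# Illustrative sketch of driver/executor sizing for a custom
# Spark & R environment template. Field names are hypothetical,
# not the real Cloud Pak for Data schema.
custom_template = {
    "name": "My Spark & R template",
    "driver": {"vcpu": 2, "memory_gb": 8},
    "executors": {"count": 4, "vcpu": 2, "memory_gb": 8},
}

# Total resources the Spark cluster would request:
# driver plus (count x per-executor size).
total_vcpu = (custom_template["driver"]["vcpu"]
              + custom_template["executors"]["count"]
              * custom_template["executors"]["vcpu"])
total_mem_gb = (custom_template["driver"]["memory_gb"]
                + custom_template["executors"]["count"]
                * custom_template["executors"]["memory_gb"])
print(total_vcpu, total_mem_gb)  # prints: 10 40
```

For a small data set, the Default Data Refinery XS runtime avoids paying the cluster start-up cost that this kind of configuration implies.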
Default environment templates
Watson Studio offers the following default Spark & R environment templates that you can use when you create a job in which to run a Data Refinery flow. Selecting a Spark environment template helps you to quickly get started running Data Refinery jobs without having to create your own Spark & R environment template. The included environment templates are listed under Templates on the Environments page on the Manage tab of your project.
- R-based runtimes for Data Refinery do not work on the IBM Z (s390x) platform.
- Up to and including Cloud Pak for Data release 5.1.1, R-based runtimes for Data Refinery do not work on the IBM Power® (ppc64le) platform. From Cloud Pak for Data release 5.1.2, R-based runtimes for Data Refinery that are based on R4.3 and above work on the IBM Power® (ppc64le) platform.
| Name | Hardware configuration |
|---|---|
| Default Spark 3.4 & R 4.3 | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
| Default Spark 3.4 & R 4.2 (Deprecated) | 2 Executors each: 1 vCPU and 4 GB RAM; Driver: 1 vCPU and 4 GB RAM |
Runtime logs for jobs
To view the accumulated logs for a Data Refinery job:
- From the project's Jobs page, click the job that ran the Data Refinery flow for which you want to see logs.
- Click the job run. You can view the log tail or download the complete log file.
Next steps
- Creating your own environment template
- Creating jobs in Data Refinery
- Running a Data Refinery flow in a Hadoop environment
- Managing jobs
Parent topic: Environments