Installing Transformer
The method that you use to install Transformer depends on the location of your Spark installation and where you choose to install Transformer.
- Local Spark installation - To get started with Transformer in a development environment, you can simply install both Transformer and Spark on the same machine and run Spark locally on that machine. You develop and run Transformer pipelines on the single machine.
- Cluster Spark installation - In a production environment, use a Spark installation that runs on a cluster to leverage the performance and scale that Spark offers. Install Transformer on a machine that is configured to submit Spark jobs to the cluster. You develop Transformer pipelines locally on the machine where Transformer is installed. When you run Transformer pipelines, Spark distributes the processing across nodes in the cluster.
- Cloud Spark installation - When you have Spark installed in the cloud, you can use the Azure marketplace to install Transformer as a cloud service. When you run Transformer as a cloud service, you can easily run pipelines on cloud vendor Spark clusters such as Databricks and EMR.
After installing Transformer from a tarball, as a best practice, configure Transformer to use directories outside of the runtime directory.
If you use Docker, you can also run the Transformer image from Docker Hub.
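For example, a minimal way to pull and start the image might look like the following sketch. The streamsets/transformer repository name, the tag, and the port are assumptions; check Docker Hub for the image name and available tags.

    # Pull the Transformer image (tag is a placeholder; choose one from Docker Hub)
    docker pull streamsets/transformer:latest
    # Start a container, publishing the assumed default Transformer port 19630
    docker run -d --name transformer -p 19630:19630 streamsets/transformer:latest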
Choosing an Installation Package
You can use Transformer prebuilt with the following Scala version:
- Scala 2.12 - Use with Spark 3.x. Requires Java JDK 11.
The Scala version that Transformer is built with determines the Java JDK version that must be installed on the Transformer machine and the Spark versions that you can use with Transformer. The Spark version that you choose determines the cluster types and the Transformer features that you can use.
For example, Amazon EMR 7.5 clusters use Spark 3.5. To run Transformer pipelines on those clusters, you use Transformer prebuilt with Scala 2.12. And since Transformer prebuilt with Scala 2.12 requires Java JDK 11, you install that JDK version on the Transformer machine.
For more information, see Cluster Compatibility Matrix, Scala, Spark, and Java JDK Requirements, and Spark Versions and Available Features.
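For example, before installing, you can confirm that the machine meets these version requirements. This sketch assumes that Java and Spark are already on the PATH:

    # Confirm that JDK 11 is installed, as required by Transformer prebuilt with Scala 2.12
    java -version
    # Confirm that the local Spark installation is a 3.x version
    spark-submit --version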
Installing when Spark Runs Locally
To get started with Transformer in a development environment, install both Transformer and Spark on the same machine. This allows you to easily develop and test local pipelines, which run on the local Spark installation.
All users can install Transformer from a tarball and run it manually. Users with an enterprise account can install Transformer from an RPM package and run it as a service. Installing an RPM package requires root privileges.
When you install from the RPM package, Transformer runs as a service started by the system user and group named transformer. If a transformer user and a transformer group do not exist on the machine, the installation creates the user and group for you and assigns them the next available user ID and group ID.
To use specific IDs for the transformer user and group, create the user and group before installation and specify the IDs that you want to use. For example, if you're installing Transformer on multiple machines, you might want to create the system user and group before installation to ensure that the user ID and group ID are consistent across the machines.
Before you start, ensure that the machine meets all installation requirements and choose the installation package that you want to use.
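For a tarball installation, the basic steps are extracting the archive and starting Transformer manually. The following is a sketch; the archive name and launch script are assumptions, so match them to your downloaded package:

    # Extract the tarball to the desired base directory (file name is illustrative)
    tar xzf streamsets-transformer-all_2.12-<version>.tgz -C /opt
    cd /opt/streamsets-transformer-<version>
    # Start Transformer manually (launch script name assumed)
    bin/streamsets transformer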
Installing when Spark Runs on a Cluster
All users can install Transformer from a tarball and run it manually. Users with an enterprise account can install Transformer from an RPM package and run it as a service. Installing an RPM package requires root privileges.
When you install from the RPM package, Transformer runs as a service started by the system user and group named transformer. If a transformer user and a transformer group do not exist on the machine, the installation creates the user and group for you and assigns them the next available user ID and group ID.
To use specific IDs for the transformer user and group, create the user and group before installation and specify the IDs that you want to use. For example, if you're installing Transformer on multiple machines, you might want to create the system user and group before installation to ensure that the user ID and group ID are consistent across the machines.
Before you start, ensure that the machine meets all installation requirements and choose the installation package that you want to use.
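For example, to keep the IDs consistent across machines before an RPM installation, you might create the system group and user with a fixed ID on every machine. The ID below is an arbitrary example:

    # Create the transformer group and user with the same fixed ID on each machine
    # (20159 is an arbitrary example; choose an ID that is free on all machines)
    sudo groupadd --system --gid 20159 transformer
    sudo useradd --system --uid 20159 --gid transformer --no-create-home transformer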
Enabling Kerberos for Hadoop YARN Clusters
When a Hadoop YARN cluster uses Kerberos authentication, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and to access files in the Hadoop system, unless you configure a Kerberos principal and keytab for the pipeline. The Kerberos keytab source can be defined in the Transformer properties file or in the pipeline configuration.
Using a Kerberos principal and keytab enables Spark to renew Kerberos tokens as needed, and is strongly recommended. For more information about how to configure pipelines when Kerberos is enabled, see Kerberos Authentication.
Before pipelines can use proxy users or use the keytab source defined in the Transformer properties file, you must enable these options in the Transformer installation.
Enabling Proxy Users
Before pipelines can use proxy users with Kerberos authentication, you must install the required Kerberos client packages on the Transformer machine and then configure the environment variables used by the k5start program.
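As a rough sketch on a RHEL-based machine, the client packages and the k5start utility might be installed as follows. Package names vary by distribution, and kstart may come from an extra repository such as EPEL:

    # Install the Kerberos client tools (RHEL/CentOS package name; varies by distro)
    sudo yum install -y krb5-workstation
    # k5start is provided by the kstart package
    sudo yum install -y kstart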
Enabling the Properties File as the Keytab Source
Before pipelines can use the keytab source defined in the Transformer configuration properties, you must configure a Kerberos keytab and principal for Transformer.
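Before pointing Transformer at a keytab, it can help to verify that the keytab and principal work. This is a generic Kerberos check; the paths and principal below are placeholders:

    # List the principals stored in the keytab
    klist -kt /etc/security/keytabs/transformer.keytab
    # Confirm that the keytab can obtain a ticket for the principal
    kinit -kt /etc/security/keytabs/transformer.keytab transformer/host.example.com@EXAMPLE.COM
    klist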
Installation through the Azure Marketplace
You can install Transformer as a service through the Microsoft Azure marketplace.
If you have an account with Databricks, you can install StreamSets for Databricks through the Azure marketplace. StreamSets for Databricks includes both Data Collector and Transformer on the same virtual machine.
Installing Transformer on Azure
You can install Transformer on Microsoft Azure.
Transformer is installed as an RPM package on a Linux virtual machine hosted on Microsoft Azure. Transformer is available as a service on the instance after the deployment is complete.
Installing StreamSets for Databricks on Azure
You can install StreamSets for Databricks on Microsoft Azure. StreamSets for Databricks includes both Data Collector and Transformer.
Data Collector and Transformer are installed as RPM packages on a Linux virtual machine hosted on Microsoft Azure. Data Collector and Transformer are available as services on the instance after the deployment is complete.
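After a deployment completes, you can confirm that the services are running on the virtual machine. The systemd service names below are assumptions; check the deployment output for the actual names:

    # Check the Transformer service (service name assumed)
    sudo systemctl status transformer
    # For StreamSets for Databricks, also check the Data Collector service (name assumed)
    sudo systemctl status sdc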
Configuring Directories (Tarball Installation)
As a best practice, after installing Transformer from a tarball, configure Transformer to use directories outside of the runtime directory. This enables you to keep using those directories after Transformer upgrades.
Configure the directories that store files used by Transformer so they are outside of the $TRANSFORMER_DIST directory, the base Transformer runtime directory.
You can use the default locations within the $TRANSFORMER_DIST runtime directory. However, if you use the default values, make sure the user who starts Transformer has write permission on the base Transformer runtime directory.
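As a minimal sketch, the directories are typically relocated by overriding environment variables in the Transformer environment file. The variable names and locations below follow the common StreamSets pattern but are assumptions; verify them against your Transformer version:

    # Example overrides, e.g. in the Transformer environment file in the runtime directory
    # (variable names assumed; verify against your Transformer version)
    export TRANSFORMER_CONF=/etc/transformer
    export TRANSFORMER_DATA=/var/lib/transformer
    export TRANSFORMER_LOG=/var/log/transformer
    export TRANSFORMER_RESOURCES=/var/lib/transformer-resources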
Run Transformer from Docker
How you access the Transformer image depends on your account type:
- Enterprise account - Users with an enterprise account register the Transformer image with Control Hub and use Control Hub authentication to access the Transformer image.
- IBM StreamSets - Users without an enterprise account can create a free account with IBM StreamSets, then create a self-managed deployment and use it to manage a Transformer image.