Provision Data Collectors
Step 1. Create a Custom Image
Use Docker to customize the public StreamSets Data Collector Docker image as needed, and then store the private image in your private repository. For example, you might customize the image to include the following, as illustrated in the sketch after this list:
- Customized configuration files
- Resource files
- External libraries, such as JDBC drivers
- Custom stages
- Additional stage libraries - The public Data Collector Docker image includes the basic, development, and Windows stage libraries only.
- Packages and files required to enable Kerberos authentication for Data Collector:
  - On Linux, the krb5-workstation and krb5-client Kerberos client packages.
  - The Hadoop or HDFS configuration files required by the Kerberos-enabled stage, for example:
    - core-site.xml
    - hdfs-site.xml
    - yarn-site.xml
    - mapred-site.xml
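For example, a Dockerfile along the following lines could layer several of these customizations onto the public image. This is a minimal sketch: the configuration file, resource directory, and stage library shown are hypothetical, and the target paths and stage library installer reflect the public image's documented conventions, so verify them against the image version you use.

    # Sketch only. <version> is the Data Collector version to use as the base.
    FROM streamsets/datacollector:<version>

    # Customized configuration file, copied into the configuration directory
    # (the public image defines SDC_CONF=/etc/sdc).
    COPY sdc.properties /etc/sdc/sdc.properties

    # Resource files, copied into the resources directory
    # (the public image defines SDC_RESOURCES=/resources).
    COPY resources/ /resources/

    # Additional stage library, installed with the Package Manager CLI.
    RUN "${SDC_DIST}/bin/streamsets" stagelibs -install=streamsets-datacollector-google-cloud-lib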
Each deployment managed by a Provisioning Agent specifies the Data Collector Docker image to deploy, so you can create a unique Data Collector Docker image for each deployment, or you can use one Docker image for all deployments.
For example, let's say that one deployment of provisioned Data Collectors reads from web server logs, so the Data Collector Docker image used by that deployment requires only the basic and statistics stage libraries. Another deployment of provisioned Data Collectors reads from Google Cloud Platform, so the Data Collector Docker image used by that deployment requires the Google Cloud stage library in addition to the basic and statistics stage libraries. You can create and manage two separate Data Collector Docker images for the deployments, or you can create and manage a single image that meets the needs of both deployments.
For more information about running Data Collector from Docker, see https://hub.docker.com/r/streamsets/datacollector/.
For more information about creating private Docker images and publishing them to a private repository, see the Docker documentation.
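Building and publishing the customized image might then look like the following sketch, where the registry, repository, and tag are hypothetical:

    docker build -t registry.example.com/myorg/datacollector:custom .
    docker push registry.example.com/myorg/datacollector:custom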
Step 2. Create a Provisioning Agent
You can use either of the following methods to create a Provisioning Agent:
- Using Helm
- Helm is a tool that streamlines installing and managing Kubernetes applications.
- Without using Helm
- If you do not want to use Helm, you can define a Provisioning Agent YAML specification file, and then use Kubernetes commands to create and deploy the Provisioning Agent.
With either method, you can configure the Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication. However, StreamSets recommends using Helm to enable Kerberos authentication.
Creating an Agent Using Helm
To create a Provisioning Agent using Helm, install Helm and download the Control Agent Helm chart that StreamSets provides. After modifying values in the Helm chart, use the Helm install command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.
Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.
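As a sketch, assuming Helm 3 and that the Control Agent chart has been downloaded to a local control-agent directory with its values already modified, the install might look like the following; the release name is hypothetical:

    helm install control-agent ./control-agent --namespace <agentNamespace>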
Creating an Agent without Using Helm
To create a Provisioning Agent without using Helm, configure a Provisioning Agent YAML specification file, and then use the Kubernetes create command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.
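For example, assuming the Provisioning Agent specification was saved as a hypothetical control-agent.yaml file, you might run:

    kubectl create -f control-agent.yaml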
Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.
Step 3. Define a Deployment YAML Specification
Define a deployment in a YAML specification file. Each file can define a single deployment. The file can optionally define a Kubernetes Horizontal Pod Autoscaler associated with the deployment.
In most cases, you define a single Data Collector container within a deployment specification. To define multiple containers, use the StreamSets Control Agent Docker image version 5.1.0 or later and define the Data Collector container as the first element in the list of containers.
Use the Kubernetes apps/v1 API version to define each deployment. The YAML specification file can contain the following components:
- Deployment
- Use for a deployment of one or more Data Collectors that can be manually scaled. To manually scale a deployment, you modify a deployment in the Control Hub UI to increase the number of Data Collector instances.
- Deployment associated with a Kubernetes Horizontal Pod Autoscaler
- Use for a deployment of one or more Data Collectors that must automatically scale during times of peak performance. Define the deployment and Horizontal Pod Autoscaler in the same YAML specification file. The Kubernetes Horizontal Pod Autoscaler automatically scales the deployment based on CPU utilization. For more information, see the Kubernetes Horizontal Pod Autoscaler documentation.
Deployment Sample
Define only a deployment in the YAML specification file when creating a deployment for one or more Data Collectors that can be manually scaled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app: <deploymentLabel>
        kerberosEnabled: "true"
        krbPrincipal: <KerberosUser>
    spec:
      containers:
      - name: datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        volumeMounts:
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
          readOnly: true
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
      imagePullSecrets:
      - name: <imagePullSecrets>
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
To provision Data Collectors enabled for Kerberos authentication, include the following attributes from the sample in the specification:

...
kerberosEnabled: "true"
krbPrincipal: <KerberosUser>
...
volumeMounts:
- name: krb5conf
  mountPath: /etc/krb5.conf
  subPath: krb5.conf
  readOnly: true
...
volumes:
- name: krb5conf
  secret:
    secretName: krb5conf
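The krb5conf volume in the sample expects a Kubernetes Secret that holds your krb5.conf file. As a sketch, assuming a krb5.conf file in the current directory, you might create the Secret in the Provisioning Agent namespace as follows:

    kubectl create secret generic krb5conf --from-file=krb5.conf --namespace <agentNamespace>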
Variable | Description |
---|---|
agentNamespace | Namespace used for the Provisioning Agent that manages this deployment. |
deploymentLabel | Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent. |
KerberosUser | User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional; if you remove it, the Provisioning Agent uses a default user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container based on this user. For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates a separate Kerberos principal for each of the two containers based on the marketing user. |
privateImage | Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, set the attribute to image: streamsets/datacollector:<version>, where <version> is the Data Collector version to deploy. |
imagePullSecrets | Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines. |
Deployment and Horizontal Pod Autoscaler Sample
Define a deployment and Horizontal Pod Autoscaler in the YAML specification file when creating a deployment for one or more Data Collectors that automatically scale during times of peak performance.
apiVersion: v1
kind: List
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: datacollector-deployment
    namespace: <agentNamespace>
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: <deploymentLabel>
    template:
      metadata:
        labels:
          app: <deploymentLabel>
          kerberosEnabled: "true"
          krbPrincipal: <KerberosUser>
      spec:
        containers:
        - name: datacollector
          image: <privateImage>
          ports:
          - containerPort: 18630
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
          env:
          - name: HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: PORT0
            value: "18630"
        imagePullSecrets:
        - name: <imagePullSecrets>
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: datacollector-hpa
    namespace: <agentNamespace>
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: datacollector-deployment
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 50
To provision Data Collectors enabled for Kerberos authentication, include the following attributes from the sample in the specification:

...
kerberosEnabled: "true"
krbPrincipal: <KerberosUser>
...
volumeMounts:
- name: krb5conf
  mountPath: /etc/krb5.conf
  subPath: krb5.conf
  readOnly: true
...
volumes:
- name: krb5conf
  secret:
    secretName: krb5conf
Variable | Description |
---|---|
agentNamespace | Namespace used for the Provisioning Agent that manages this deployment. |
deploymentLabel | Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent. |
KerberosUser | User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional; if you remove it, the Provisioning Agent uses a default user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container based on this user. For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates a separate Kerberos principal for each of the two containers based on the marketing user. |
privateImage | Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, set the attribute to image: streamsets/datacollector:<version>, where <version> is the Data Collector version to deploy. |
imagePullSecrets | Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines. |
In the Horizontal Pod Autoscaler definition, the scaleTargetRef attribute must reference the deployment by kind and name:

kind: Deployment
name: datacollector-deployment

You also might want to modify the minimum and maximum replica values and the target CPU utilization percentage value. For more information on these values, see the Kubernetes Horizontal Pod Autoscaler documentation.
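To observe the autoscaler after deployment, you might check its current and target CPU utilization with a standard Kubernetes command, for example:

    kubectl get hpa datacollector-hpa --namespace <agentNamespace>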
Attributes for AWS Fargate with EKS
When provisioning Data Collectors to AWS Fargate with Amazon Elastic Kubernetes Service (EKS), add the following additional attributes to the deployment YAML specification file:
- Required attribute
- Add the following required environment variable to avoid having to configure the maximum open file limit on the virtual machines provisioned by AWS Fargate:

  - name: SDC_FILE_LIMIT
    value: "0"

- Optional attribute
- Add the following optional resources attribute to define the size of the virtual machines that AWS Fargate provisions. Set the values of the cpu and memory attributes as needed:

  resources:
    limits:
      cpu: 500m
      memory: 2G
    requests:
      cpu: 200m
      memory: 2G

For example, the following deployment YAML specification file includes these attributes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app: <deploymentLabel>
    spec:
      containers:
      - name: datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
        - name: SDC_FILE_LIMIT
          value: "0"
        resources:
          limits:
            cpu: 500m
            memory: 2G
          requests:
            cpu: 200m
            memory: 2G
Step 4. Create a Deployment
After defining the deployment YAML specification file, use Control Hub to create a deployment.
You can create multiple deployments for a single Provisioning Agent. For example, for the Provisioning Agent running in the production cluster, you might create one deployment dedicated to running jobs that read web server logs and another deployment dedicated to running jobs that read data from Google Cloud.
Step 5. Start the Deployment
When you start a deployment, the Provisioning Agent deploys the Data Collector containers to the Kubernetes cluster and starts each Data Collector container.
If you configured the Provisioning Agent for Kerberos authentication, the Provisioning Agent works with Kerberos to dynamically create and inject Kerberos credentials (a service principal and keytab) into each deployed Data Collector container.
The agent deploys each container to a Kubernetes pod. So if the deployment specifies three Data Collector instances, the agent deploys three containers to three Kubernetes pods.
During the startup of each Data Collector container, the Data Collector registers itself with Control Hub.
To start the deployment:
- In the Deployments view, select the inactive deployment and then click the Start Deployment icon.
  It can take the Provisioning Agent up to a minute to provision the Data Collector containers. When complete, Control Hub indicates that the deployment is active.
- To verify that the Data Collector containers were successfully registered and are up and running, click Data Collectors in the Navigation panel.
  The Data Collectors view displays all registered Data Collectors, either manually administered or automatically provisioned.
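To inspect the provisioned containers from the Kubernetes side as well, you might list the pods by deployment label, for example:

    kubectl get pods --namespace <agentNamespace> -l app=<deploymentLabel>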