Defining a connection to Cassandra data source
To access Cassandra tables, you must define a connection by using the properties in the Connection section on the Properties page.
Before you begin
By default, the Cassandra Connector provides a complete library with the DataStax Java™ Driver for Apache Cassandra (version 4.12.0 or newer) and its dependencies, which allows the job to run without extra configuration. You can either keep the default value $(DSHOME)/../DSComponents/bin/cassandra/CassandraConnector-ThirdParty.jar or provide your own set of libraries. The Cassandra client jars configuration property can contain both JAR files and directories, separated by semicolons.
Note: Currently, only the DataStax Java Driver for Apache Cassandra is supported by the Cassandra Connector.
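For reference, a Cassandra client jars value that keeps the default library and adds a directory of custom JARs might look like the following; the custom directory is a hypothetical example:

    $(DSHOME)/../DSComponents/bin/cassandra/CassandraConnector-ThirdParty.jar;/opt/custom/cassandra-codecs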
Procedure
- In the job design canvas, double-click the Cassandra Connector stage icon to open the stage editor.
- On the Properties page, specify values for the connection properties.
- Cluster contact points - the list of cluster seed nodes. It should contain the IP addresses or hostnames of the Cassandra cluster nodes, optionally with ports if they differ from the default Cassandra port (9042). Separate the contact points with semicolons.
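For example, a value with two hostnames (one on a non-default port) and one IP address might look like the following; all node names and addresses are placeholders:

    cass-node1.example.com;cass-node2.example.com:9043;10.10.1.15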
- Local data center - the name of the data center that is local to the defined contact points.
- Protocol version – choose the protocol version that you want to use to connect to the cluster.
- Cassandra client jars – the list of folders or JAR files that contain the Cassandra client JARs and, optionally, custom JARs (for example, with custom type codecs). The list uses a semicolon as the separator.
- Authentication – choose the authentication method that is configured in the target Cassandra cluster. The following authentication types are currently supported:
- Allow all authenticator – unrestricted access to the entire database for all users.
- Password authentication – uses a Username and Password to connect to the database.
- Kerberos - uses a JAAS login configuration file with Krb5LoginModule to connect by using Kerberos.
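A minimal sketch of such a JAAS login configuration file follows; the entry name, keytab path, and principal are assumptions that must match your environment:

    CassandraClient {
      com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        keyTab="/path/to/client.keytab"
        principal="dsuser@EXAMPLE.COM";
    };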
- Use SSL/TLS - use SSL/TLS to secure the connection between the client and the Cassandra cluster.
- Use client-to-node encryption - The traffic between the client and the cluster nodes is encrypted, and the client verifies the identity of the Cassandra nodes that it connects to.
- Keystore path - The path to your keystore file.
- Keystore password - Provide the password that was used when generating the keystore.
- Use client certificate authentication - With this option, the Cassandra nodes verify the identity of the client.
- Truststore path - The path to your truststore file.
- Truststore password - Provide the password that was used when generating the truststore.
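As a sketch of preparing these files, you might import a Cassandra node certificate into the client truststore with the JDK keytool; the alias, file names, and password are placeholders:

    keytool -importcert -alias cassandra-node -file node-cert.cer -keystore client-truststore.jks -storepass changeit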
- Compression type – the driver can compress data in transit (to and from the server) by using one of the methods that are supported by Cassandra: LZ4 and Snappy. You must add the JAR files that support the chosen compression type to the connector class path (by using the Cassandra client jars property).
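For example, to enable LZ4 compression you might append the lz4-java library to the Cassandra client jars property; the path and version are hypothetical:

    $(DSHOME)/../DSComponents/bin/cassandra/CassandraConnector-ThirdParty.jar;/opt/libs/lz4-java-1.8.0.jar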
- Keyspace name – the name of the keyspace that contains the target table.
- Table name – the name of the source or target table in the Cassandra database.
- Use JSON encoded map for a single row – when you select Yes, the entire row is represented as a single string that contains all column names and values encoded as JSON. This mode is supported for reading data (connector as a source) and for inserting data (connector as a target), but not for updates or deletes. It is especially useful when the table contains columns that are defined as collections or nested collections. Reading data directly from collections is currently not supported, but with the JSON feature all collection types can be handled (for SELECT and INSERT): set, list, map, user-defined types, and tuples.
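To illustrate the representation, the following CQL statement inserts a row with a collection column as a single JSON string; the users table and its columns are hypothetical:

    INSERT INTO users JSON '{"id": 42, "name": "Alice", "emails": ["alice@example.com", "a.smith@example.com"]}';

When the connector reads with this option, each row arrives as one string value in a similar JSON form.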
- Use parallel read - If you select Yes and run with a multi-player configuration, the data to be read is divided between the players so that each player reads only a part of it. Two division strategies are available (a sketch of the idea follows this list):
- Host Aware - token ranges are grouped by their host, the groups are divided between the players, and adjacent ranges are merged where possible. This strategy should be optimal in most cases.
- Equal Splitter - token ranges are split evenly between the players. This strategy might perform better than Host Aware when the number of Cassandra vnodes is low.
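The connector's internal partitioning is not exposed, but the following minimal Java sketch illustrates the idea behind the Equal Splitter strategy: dividing the full Murmur3 token ring evenly between the players. The class and method names are assumptions for illustration only:

    import java.math.BigInteger;

    // Illustrative sketch: splits the Murmur3 token ring
    // [-2^63, 2^63 - 1] into one contiguous range per player.
    public class EqualTokenSplitter {

        public static void printRanges(int players) {
            BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
            BigInteger span = BigInteger.valueOf(2).pow(64); // total number of tokens

            for (int p = 0; p < players; p++) {
                BigInteger start = min.add(
                    span.multiply(BigInteger.valueOf(p)).divide(BigInteger.valueOf(players)));
                BigInteger end = min.add(
                    span.multiply(BigInteger.valueOf(p + 1)).divide(BigInteger.valueOf(players)))
                    .subtract(BigInteger.ONE);
                // Conceptually, each player then reads only rows whose
                // partition key token falls between start and end.
                System.out.printf("player %d: tokens %s .. %s%n", p, start, end);
            }
        }

        public static void main(String[] args) {
            printRanges(4); // for example, a four-player parallel read
        }
    }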
- Data Consistency: Check schema agreement – when set to Yes, upon connecting to the Cassandra cluster the connector checks whether all cluster nodes agree on the same version of the database schema.
- Consistency level – choose one of the consistency levels that are supported by Cassandra, for example ONE, QUORUM, LOCAL_QUORUM, or ALL.
- Options: Enable CQL statement tracing – with this option enabled, all CQL statements that are executed by the connector are traced by the driver, and the trace data is written to the job's log. Tracing can create a huge amount of data even for a small data sample, so use it with caution. It can help you investigate performance problems when processing data from Cassandra or when storing or modifying data in Cassandra.
- Custom type codecs - Cassandra enables you to provide custom codecs for chosen types. In DataStage®, all types that are not supported natively are handled by custom type codecs and mapped to strings. Those types are: uuid, timeuuid, varint, and inet. They are handled by the respective custom type codec classes: com.ibm.is.cc.cassandra.codec.UuidToStringCodec, com.ibm.is.cc.cassandra.codec.TimeUuidToStringCodec, com.ibm.is.cc.cassandra.codec.VarIntToStringCodec, and com.ibm.is.cc.cassandra.codec.InetToStringCodec. Those classes are contained in the Cassandra connector's JAR files. You can, however, use the Custom type codecs property to map those types in a different way, or to provide additional custom type codecs. For example, you can map Boolean values to special string representations, or decode data from BLOB fields and map it to string values. The property can also be used to map collection types to other primitive types (mainly string), for example to produce a string that represents the collection's elements. However, when the driver connects to the target database, it loads its own type codecs and refuses to register an extra codec when a mapping for the selected types already exists. For example, you cannot register two type codecs that both map BLOB to string.
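As a minimal sketch of what such a codec can look like with the DataStax Java Driver 4.x API, the following class maps the CQL uuid type to Java strings by extending the driver's MappingCodec. This is an illustration only, not the connector's shipped UuidToStringCodec:

    import com.datastax.oss.driver.api.core.type.codec.MappingCodec;
    import com.datastax.oss.driver.api.core.type.codec.TypeCodecs;
    import com.datastax.oss.driver.api.core.type.reflect.GenericType;
    import java.util.UUID;

    // Maps the CQL uuid type to java.lang.String by delegating to the
    // driver's built-in UUID codec and converting at the boundaries.
    public class UuidToStringCodec extends MappingCodec<UUID, String> {

        public UuidToStringCodec() {
            super(TypeCodecs.UUID, GenericType.STRING);
        }

        @Override
        protected String innerToOuter(UUID value) {
            return value == null ? null : value.toString();
        }

        @Override
        protected UUID outerToInner(String value) {
            return value == null ? null : UUID.fromString(value);
        }
    }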
- Click OK to save.