IBM Support

Apache Cassandra Data Migrator and Table Mapping

Troubleshooting


Problem

Summary

When implementing Apache-cassandra-data-migrator, the table schema must be mapped in the configuration file for the migration to work.


Applies to

Apache-cassandra-data-migrator up to 3.2.3


Symptoms

The typical error when the mapping is not correctly in place is:

ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 10000)
java.lang.ArrayIndexOutOfBoundsException: 3


Cause

The mapping required in the sparkConf.properties file is not correctly set up.

In order to assess the mapping configuration, we require the following:

  • full sparkConf.properties file
  • Table schema (source and target)


Solution

Take as an example the following table schema, assuming it is the same for both the source and target database:

CREATE TABLE cycling.cyclist_name (
    id uuid PRIMARY KEY,
    firstname text,
    lastname text
)
 

Our mapping in the sparkConf.properties file should match this:

# comma-separated-partition-key,comma-separated-clustering-key,comma-separated-other-columns
spark.query.origin                                id,firstname,lastname
# comma-separated-partition-key
spark.query.origin.partitionKey                   id
# comma-separated-partition-key,comma-separated-clustering-key
spark.query.target.id                             id
# comma separated numeric data-type mapping (e.g. 'text' will map to '0') for all columns listed in "spark.query.origin"
spark.query.types                                 9,0,0
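The value of spark.query.types can be derived mechanically from the schema. A minimal sketch in Python (the type-code table is a subset of the one documented in sparkConf.properties; the function name is illustrative, not part of the tool):

```python
# Map Cassandra data types to cassandra-data-migrator numeric type codes
# (subset of the table documented in sparkConf.properties).
TYPE_CODES = {
    "ascii": 0, "text": 0, "varchar": 0,
    "int": 1,
    "bigint": 2, "counter": 2,
    "double": 3,
    "timestamp": 4,
    "uuid": 9, "timeuuid": 9,
    "boolean": 10,
}

def query_types(columns):
    """columns: list of (name, cql_type) pairs in spark.query.origin order."""
    return ",".join(str(TYPE_CODES[cql_type]) for _, cql_type in columns)

# cycling.cyclist_name: id uuid, firstname text, lastname text
print(query_types([("id", "uuid"), ("firstname", "text"), ("lastname", "text")]))
# prints 9,0,0
```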
 

More information about values for the spark.query.types parameter can be found in cassandra-data-migrator/sparkConf.properties:

#############################################################################################################
# Following are the supported data types and their corresponding [Cassandra data-types] mapping
#  0: ascii, text, varchar
#  1: int
#  2: bigint, counter
#  3: double
#  4: timestamp
#  5: map (separate type by %) - Example: 5%1%0 for map<int, text>
#  6: list (separate type by %) - Example: 6%0 for list<text>
#  7: blob
#  8: set (separate type by %) - Example: 8%0 for set<text>
#  9: uuid, timeuuid
# 10: boolean
# 11: tuple
# 12: float
# 13: tinyint
# 14: decimal
# 15: date
# 16: UDT [any user-defined-type created using 'CREATE TYPE']
# 17: varint
# 18: time
# 19: smallint
# Note: Ignore "Frozen" while mapping Collections (Map/List/Set) - Example: 5%1%0 for frozen<map<int, text>>
#############################################################################################################
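For collection columns, codes are composed with % as shown in the table above. As a hypothetical example (these columns are not part of the cyclist_name schema), a table with columns id uuid, tags frozen<set<text>>, and scores map<int, double> would map as:

```
# id uuid -> 9, tags frozen<set<text>> -> 8%0 ("frozen" is ignored), scores map<int, double> -> 5%1%3
spark.query.types                                 9,8%0,5%1%3
```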

 

Furthermore, tuning should also include modifying the following parameters accordingly:

spark.query.ttl.cols
spark.query.writetime.cols


First, number the table columns starting from ZERO, in the order they appear in spark.query.origin. In the example above:

id        ---> 0
firstname ---> 1
lastname  ---> 2
 

Columns that are part of the Primary Key do not carry a TTL or writetime (Primary Key = Partition Key(s) + Clustering Column(s)). Since the example table's primary key is just the single partition key id (index 0), leave it out and include only the remaining columns:

spark.query.ttl.cols                              1,2
spark.query.writetime.cols                        1,2
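The counting rule above can be sketched in Python (column names and the primary-key set follow the cyclist_name example; the helper name is illustrative only):

```python
def ttl_writetime_cols(columns, primary_key):
    """Zero-based indices of columns eligible for TTL/writetime:
    every column, in spark.query.origin order, that is not part of the primary key."""
    return ",".join(str(i) for i, c in enumerate(columns) if c not in primary_key)

# cycling.cyclist_name: the primary key is just the partition key `id`
print(ttl_writetime_cols(["id", "firstname", "lastname"], {"id"}))
# prints 1,2
```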
 

Also, set the following properties correctly for the migration into Astra DB to succeed:

spark.target.scb                                  file:///tmp/scb/target.zip
spark.target.username                             *****************
spark.target.password                             *****************
 

Finally, the tool is designed for migrating millions or billions of records; if the volume is low, use DSBulk instead. That said, for smaller tables the default values of the following parameters may be reduced (otherwise the migration runs slowly for a small amount of data):

spark.numSplits 
spark.batchSize
 

As a general rule, use 1 split for every 10K rows. For example, for a table with 37K rows a numSplits value of 4 is ideal. The default numSplits is 10K because the tool is expected to be used for 1 billion or more rows.
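The one-split-per-10K-rows rule of thumb amounts to a ceiling division (the 37K-row figure is the example above; the function name is illustrative):

```python
import math

def suggested_num_splits(row_count, rows_per_split=10_000):
    # One split per 10K rows, rounded up so a small remainder still gets its own split.
    return max(1, math.ceil(row_count / rows_per_split))

print(suggested_num_splits(37_000))
# prints 4
```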

batchSize (default 10) is not related to data volume but to the schema: it enables batch writes while avoiding multi-partition writes. When the primary key and the partition key are the same, batchSize should always be 1 to avoid multi-partition writes.
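For the cyclist_name example, the primary key consists only of the partition key id, so by this rule a safe setting would be:

```
spark.batchSize                                   1
```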

For spark.origin.host, a single contact point should be specified.

 


Additional Resources

Clearer mapping information was added starting with version 3.2.3: https://github.com/datastax/cassandra-data-migrator/blob/3.2.3/src/reso…


Working with Apache cassandra-data-migrator: how to execute it and increase --driver-memory and --executor-memory for bigger tables: cassandra-data-migrator

Downloading Secure Connect Bundle: Working with secure connect bundle

Creating an application token. Ensure the token is created with the proper role (e.g. a Database Administrator role): Managing your Astra DB organisation

 

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB76","label":"Data Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSCR56","label":"IBM DataStax Enterprise"},"ARM Category":[{"code":"","label":""}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)"}]

Historical Number

ka0Ui0000000LndIAE

Document Information

Modified date:
30 January 2026

UID

ibm17258644