Set up a basic Apache Cassandra architecture


Although you've probably been hearing about big data and NoSQL databases for some time, maybe you have not yet had a chance to start working with them. While you may have an idea that a NoSQL database could help in some of your projects, you may not feel confident in approaching the task. If so, this tutorial is here to help.

I will guide you through the setup of one of the coolest and most resilient big data stores available: Apache Cassandra. This is a hands-on tutorial targeted to developers and database administrators with a basic knowledge of relational databases. I will cover the main aspects of Cassandra in detail and direct you to other sources for additional information.

To test your environment's setup, I use the Cassandra utilities and a Python script to consume the data stored in Cassandra. Don't worry if you are not familiar with these tools, as they are not required for the setup but are used only to validate the configuration and to serve as additional resources.

Basic concepts

In this section, I explain some of the characteristics Cassandra inherits as a distributed database. If you already have some knowledge of these concepts, or if you are not interested in the theory right now, you can jump to Build the plan.

CAP theorem

CAP stands for "consistency, availability, and partition tolerance." The CAP theorem, first formulated by Eric Brewer in 2000, states that a shared-data system can guarantee at most two of these three properties at the same time. So you must choose two; you cannot have it all. For more about the theorem, see "Related topics" below, but I will give you an overview here.

The CAP theorem, as it relates to Cassandra, is important to understand because it might cause you to conclude that Cassandra isn't the best fit for your NoSQL database solution. In any case, it will help you to start thinking about the constraints of your solution in terms of consistency and availability.

According to the theorem, for any distributed system you must choose the two guarantees that matter most to your system (see Figure 1). Cassandra can offer any of these guarantees, but not all three at the same time. Cassandra focuses on availability and partition tolerance, so it is a good fit when you need a highly available database without downtime and you don't want to be caught out by occasional hardware failures.

This is in contrast to the ACID (atomicity, consistency, isolation, durability) properties of a traditional relational database management system (RDBMS), such as MySQL, DB2®, Oracle, and Sybase. I don't mean to suggest that in Cassandra you don't have atomic operations and that Cassandra data are not isolated or durable. I simply mean that those are not Cassandra's main concerns. The database was born to be natively distributed and to be easily scaled as data and application transactions grow.

Figure 1. CAP theorem guarantees and Cassandra

Distributed database

Cassandra is a distributed database by nature. It was designed to run as a single logical server on a network of computer nodes, with different parts running on different machines and without any special hardware or software to manage or coordinate them. All node coordination and data distribution are handled inside Cassandra's own architecture. This is one of the reasons a Cassandra network is easier, and often cheaper, to scale horizontally than most common relational database systems.

The typical Cassandra network topology is composed of a cluster of nodes, also called a Cassandra ring, running in different network addresses located on different physical servers.

Figure 2. Cassandra cluster of nodes in different network hosts

This topology increases the network's availability in case of node failure. Each node can coordinate a client's request without a master node, so there is no single point of failure. It also lets you configure replication strategies that are aware of node locations (racks and data centers), increasing system availability even further.

Figure 3. Cassandra cluster of eight nodes receiving a client connection writing data to a keyspace configured with a replication factor of 3

All data are distributed evenly across the Cassandra ring (the nodes) according to a hash of each row's partition key, and each piece of data is stored as many times as required; these copies are called replicas. The replication factor, which defines how many replicas are created, is an important aspect of the cluster configuration and is set per keyspace (schema).
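
To make the idea of hash-based distribution more concrete, here is a deliberately simplified Python sketch of how a partition key could be mapped to a position on a ring of nodes, with the following nodes on the ring holding the extra replicas. It is only an illustration of the concept; real Cassandra uses the Murmur3 partitioner and virtual nodes, not this code.

import hashlib

NODES = ["node1", "node2", "node3"]   # the Cassandra ring
REPLICATION_FACTOR = 3                # number of copies (replicas) per row

def replicas_for(partition_key):
    # Hash the partition key to a position on the ring
    # (illustrative only; Cassandra actually uses Murmur3 tokens).
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    first = token % len(NODES)
    # The owning node plus the next nodes on the ring hold the replicas.
    return [NODES[(first + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

print(replicas_for("patient-42"))     # for example: ['node2', 'node3', 'node1']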

All the information about the cluster's data, topology, node availability, and performance is exchanged between nodes through the gossip protocol, a kind of peer-to-peer protocol. This information helps client connections decide which node is the best one to write to or read from at any given time.

Cassandra clients can communicate with the server through two protocols: the newer CQL binary protocol, which is preferred, or an RPC protocol called Thrift. Cassandra Query Language (CQL) is an SQL-like language used to define Cassandra's schema structure and to manipulate its data (DDL and DML).
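
As a quick example of a client speaking the CQL binary protocol, the following minimal Python sketch connects to a node on the default port 9042 and runs one CQL statement. It assumes the DataStax Python driver is installed (pip install cassandra-driver) and a node is reachable at 127.0.0.1.

from cassandra.cluster import Cluster

# Connect over the CQL binary protocol (default port 9042).
cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect()

# CQL looks a lot like SQL.
row = session.execute("SELECT release_version FROM system.local").one()
print("Cassandra release:", row.release_version)

cluster.shutdown()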

Basic data structure and modeling

An important and sometimes tricky aspect of Cassandra is its data modeling approach. First you need to understand how data are organized inside its architecture, and then how to model your application's data structure to get the best performance out of it.

In Cassandra, all data are organized by partitions with a primary key (row key), which gives you access to all columns or sets of key/value pairs as shown below.

Figure 4. Cassandra data structure partition

The primary key in Cassandra can comprise two special keys: the partition key and (optionally) the clustering key. The purpose of the partition key is to spread data evenly around the cluster. The job of the clustering key (also called "clustering columns") is to cluster and organize data of a partition to allow efficient queries. Consider the following example.

When you create a table in Cassandra, you use a CQL command similar to the following:

CREATE TABLE movie_catalog (category text, year int, title text, 
PRIMARY KEY (category));

The first column is considered implicitly to be the partition key for the movie_catalog table. There is no clustering key. However, assume you add the year column inside the primary key like this:

CREATE TABLE movie_catalog (category text, year int, title text, 
PRIMARY KEY (category,year));

Now the category continues to be the partition key, while the year column is the clustering key. Both columns are part of the primary key.

If you find all this confusing, don't overthink it! The important thing to know is that all Cassandra tables must have a primary key to locate the node where the data are in the cluster. That key is composed of at least a partition key. As indicated above, a clustering key, used to locate data inside the node (partition), can also be part of the primary key.

To model your tables, you must choose your partition key carefully so that Cassandra can distribute data well across the nodes. It is not a good idea to have all your application data (rows) in a single partition; by the same token, you can also end up with too many tiny partitions. You therefore need to find a good balance when grouping your data to satisfy your application requirements.
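
To see why the clustering key matters for queries, the following Python sketch (assuming the DataStax driver and the movie_catalog table above, created in a hypothetical keyspace named catalog) reads a whole partition by its partition key and then a slice of it filtered by the clustering column, both of which Cassandra can serve efficiently.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("catalog")   # hypothetical keyspace holding movie_catalog

# All rows of one partition, located by the partition key (category).
for row in session.execute(
        "SELECT title, year FROM movie_catalog WHERE category = %s", ["Drama"]):
    print(row.title, row.year)

# A slice of that partition: the clustering key (year) orders rows inside
# the partition, so range filters on it are efficient.
for row in session.execute(
        "SELECT title FROM movie_catalog WHERE category = %s AND year >= %s",
        ["Drama", 2010]):
    print(row.title)

cluster.shutdown()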

The most commonly used modeling technique in Cassandra is called query-based modeling. With this approach, you first think about the queries that your application's user interface will issue, and then model your tables based on those queries. That topic is a possible subject for a later tutorial.

Build the plan

Imagine you are asked to design a database tier for a critical application architecture that will store patient exam information for a large hospital. This system will require 24x7 uptime and will serve many users. Your first concern is that the database must have high availability and must be fault-tolerant so as not to adversely affect its users and hospital operations. In the next section, I describe a possible solution.

Solution overview

You decide to initially set up a basic three-node cluster for the testing (UAT) environment. Later, in production, these nodes will be deployed on three different server machines in the hospital data center.

Figure 5. Hospital application accessing database tier of servers running a Cassandra cluster of three nodes

The idea of having a three-node cluster instead of a single node is to increase the availability of your database system. If one node fails, you still have two nodes operating to answer all the application's requests. You can also balance requests under high load, giving the application lower latency during data reads and writes.

Cassandra client drivers can auto-discover all available nodes and choose the best coordinator node, which is responsible for writing all the copies (replicas) of your data. All this is possible because of Cassandra's gossip protocol implementation, which exchanges health information about each node among its peers.
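
Here is a brief sketch of that behavior with the DataStax Python driver: you pass one or more contact points, and the driver discovers the rest of the ring from the cluster metadata (the IP addresses below are only examples matching the Docker addresses used later in this tutorial).

from cassandra.cluster import Cluster

# Only contact points are listed; the driver discovers the remaining nodes.
cluster = Cluster(["172.17.0.2", "172.17.0.3"])
session = cluster.connect()

for host in cluster.metadata.all_hosts():
    print(host.address, "up" if host.is_up else "down")

cluster.shutdown()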

Replication factor

The next step is to decide which replication factor and consistency level you will use for the application. After that, you can start to install and configure the database runtime environment and create schemas and tables for your application.

In the schema definition, you set up a replication factor of 3. This means that you configure the database schema (keyspace) to create three copies of the data in your three nodes. Each time the application connects to one node and inserts one item of data into a table, it is replicated to the other two nodes automatically. This will give you more reliability in storing your data safely.

CREATE KEYSPACE patient WITH replication = {'class': 'SimpleStrategy',
    'replication_factor' : 3};

Consistency level

You also need to define the consistency level that the application will use for reads and writes during a client session. This determines how strongly consistent your queries must be with the current state of the data. You might, for example, decide to use a QUORUM consistency level for writing and reading data, which means that Cassandra must write and read data on a majority of the nodes (two nodes) before returning a response.

The consistency level is defined per client session and can be changed at any time. For example, in the cqlsh client shell shown below, you can set the consistency level at any time before issuing queries, and most Cassandra drivers let you do the same.

cqlsh:patient> consistency QUORUM ;
Consistency level set to QUORUM.
cqlsh:patient>

There is a trade-off here between consistency and latency: the stronger the consistency, the higher the latency of read/write operations. In our basic cluster setup this doesn't make any significant time difference, but it is an important concept for large Cassandra clusters, where you may be dealing with large amounts of data and still need fast response times.
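
As mentioned, most drivers let you choose the consistency level per session or per statement. Here is a minimal sketch with the DataStax Python driver that reads the patient.exam table (created later in this tutorial) at QUORUM.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("patient")

# QUORUM: a majority of the replicas (2 of 3) must answer before
# the result is returned to the client.
query = SimpleStatement(
    "SELECT * FROM exam WHERE patient_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)

for row in session.execute(query, [1]):
    print(row.patient_id, row.id, row.details)

cluster.shutdown()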

What you'll need

Cassandra is built on the Java™ platform, so it can run on any of the many operating systems that support Java technology, and it needs only a small amount of disk space and memory to start working. For the application described in this tutorial, I recommend the following:

  • 2GB RAM minimum available— To install and run a Cassandra database instance, I recommend that you have a machine with at least 4 GB of RAM, with at least 2 GB available. An 8GB RAM machine would be even better. If you decide to run the Cassandra instances on Docker, each container must have at least 1 GB of RAM available to run each Cassandra node.
  • Java 8— Since the release of Apache Cassandra V3, you need the Java Standard Edition 8 installed on your machine because Cassandra runs on the Java Virtual Machine (JVM). Older Cassandra versions (such as V2.2) can run with Java 7. You can check your Java version by typing
    java -version
    in your OS prompt shell.
  • Python 2.7— A Python installation is required to use the Cassandra shell utility cqlsh, which is a Python script. Together with the node management tool nodetool, it is useful for getting information about and managing a Cassandra instance and its databases. You can check which Python version you have installed by typing
    python --version.
  • Docker CE— This is optional; use it if you want to run all the Cassandra nodes as containers on the same machine. I recommend it for creating a testing cluster environment. Don't worry if you are new to Docker containers; below, I guide you through the commands required to set up your Cassandra cluster. Download the latest Docker CE version for your platform from the Docker website.

Installation

You can choose to install Cassandra manually from the Cassandra website or automatically via Docker containers. If you choose to use Docker containers to create your Cassandra cluster, you can skip the "Download the package" section.

Download the package

If you are using Linux, you might find a specific package for your distribution, but in most cases you will download a compressed tar.gz file of the latest version available (V3.11 as of this writing).

  1. After the download, uncompress the package with the TAR utility (or similar tool):
    $ tar -xvf apache-cassandra-3.11.0-bin.tar.gz
  2. Extract the file contents to any desired location. The extraction creates an apache-cassandra-3.11.0 directory containing all Cassandra binaries, configuration files, documentation, libraries, and utility tools, such as the following:
    $ ls  
    CHANGES.txt  LICENSE.txt  NEWS.txt  NOTICE.txt  bin  conf  doc  interface  javadoc  lib  pylib  tools

Configuration

This section covers the manual setup of the first Cassandra node. If you are using Docker, you may still want to read this section to understand the main Cassandra configuration parameters. Otherwise, skip to "Set up a testing cluster using Docker."

All of Cassandra's main configuration is in the cassandra.yaml file, located in the conf directory.

Configuration parameters

Edit the cassandra.yaml file to change the following basic parameters:

  • cluster_name— The name of your three-node Cassandra cluster. It is important that all node configurations use the same name.

    cluster_name: 'Hospital Test Cluster'

  • seeds— The list of IP addresses or hostnames of the cluster's seed node(s). For your testing cluster, set this to the IP address of the first node.

    seeds: "127.0.0.1"

  • listen_address— The hostname or IP address that other nodes and clients use to connect to this node. Instead of localhost (as shown here), set the real hostname or address that the machine uses on its network.

    listen_address: localhost

  • native_transport_port— The TCP port on which the node listens for client connections (the CQL binary protocol). Make sure to use a port that is not blocked by firewalls. The default is 9042.

    native_transport_port: 9042

To set up basic authentication and authorization for this instance, change these optional additional parameters (a client connection sketch using these settings follows the list):

  • authenticator— Enables user authentication. Change this parameter to require users to provide a username and password when connecting to the cluster.

    authenticator: PasswordAuthenticator

  • authorizer— Enables user authorization and limits users' permissions. If you change this parameter, you also need to increase the system_auth keyspace replication factor so that copies of the authorization data exist on other nodes in case a node becomes unavailable.

    authorizer: CassandraAuthorizer
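
Once PasswordAuthenticator is enabled, clients must supply credentials when they connect. Here is a minimal Python sketch using the DataStax driver; it assumes the default cassandra/cassandra superuser account, which you should change in any real deployment.

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# Credentials are required once authenticator is set to PasswordAuthenticator.
auth = PlainTextAuthProvider(username="cassandra", password="cassandra")
cluster = Cluster(["127.0.0.1"], auth_provider=auth)
session = cluster.connect()

print(session.execute("SELECT cluster_name FROM system.local").one().cluster_name)

cluster.shutdown()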

Start the first node manually

Now that all the configuration is set, you can run the cassandra script located inside the bin directory, as shown below. The -f option outputs all the bootstrap logs in the foreground, which is useful the first time you start Cassandra so you can check for errors during startup.

$ bin/cassandra -f

If you see the following log information after some seconds of initialization, it means the Cassandra node is up, running, and ready to receive client connections.

INFO  [main] 2017-08-20 18:04:58,329 Server.java:156 - Starting listening for CQL
    clients on localhost/127.0.0.1:9042 (unencrypted)...

To double-check the node status, you can use the nodetool utility, located in the bin directory. It gives you information about the Cassandra cluster and its nodes. To check the cluster status, just issue the following command:

$ nodetool status

The command prints information about the cluster, including the name of the data center where the cluster is running (in this case, the default configuration) and the status of each member node:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address	Load   	Tokens   	Owns (effective)  Host ID                           	Rack
UN  127.0.0.1  103.67 KiB  256      	100.0%        	6aae6c1f-cf06-4874-9507-a43025c312d1  rack1

The UN letters before the IP address mean that the node is Up (U) and Normal (N). After the first startup, the data and logs directories will be created to store all the keyspace/tables data and logs, respectively.

If you don't want to use Docker containers, you can repeat the previous steps to create other nodes. Remember to use the IP address of the first node as the seeds configuration for the other nodes. If you do want to use Docker containers, follow the next steps.

Set up a testing cluster using Docker

Instead of installing and configuring Cassandra on different physical server machines, you can use Docker containers to create a cluster of three nodes running on the same testing machine. Make sure you have enough RAM to run three Cassandra instances; otherwise, reduce the number of nodes to two.

If you have Docker installed in your testing machine, you can use the official images available on the Docker hub. To download, install, and run Cassandra 3.11, enter the following Docker command:

docker run --name node1 -d cassandra:3.11

This command searches the Docker Hub registry for an image called cassandra with the version tag 3.11. It then downloads the image and creates and starts a container named node1. The container comes with a default Cassandra configuration, similar to the one described above.

You can check if the new container is up and running by using the docker ps command:

$ docker ps
CONTAINER ID    	IMAGE           	COMMAND              	CREATED          	STATUS          	PORTS                                     	NAMES
803135731d1a    	cassandra:3.11  	"/docker-entrypoint.s"   About a minute ago   Up About a minute   7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp   node1

Now you can start other instances, telling each new node the location of the first node. Do that by setting the seed node IP address for the new node with the CASSANDRA_SEEDS environment variable, which automatically updates the seeds setting in the cassandra.yaml file inside the new container. To create and start the second node container (node2), enter the following:

$ docker run --name node2 -d -e CASSANDRA_SEEDS="$(docker inspect --format='{{
    .NetworkSettings.IPAddress }}' node1)" cassandra:3.11

To determine how many nodes are inside your cluster, execute the nodetool utility inside the node1 container:

$ docker exec -it node1 nodetool status

This command displays the status of each node of the cluster configured so far, so the expected result is similar to the following:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address 	Load   	Tokens   	Owns (effective)  Host ID                           	Rack
UN  172.17.0.3  103.25 KiB  256      	100.0%        	f1bbd4d1-8930-45a3-ba43-4a2416617c7f  rack1
UN  172.17.0.2  108.64 KiB  256      	100.0%        	bec1b022-a397-4401-bd42-676c60397fe1  rack

If node2 started successfully, you can create a third node (node3) in the same way. When all three nodes are running, go to "Testing" below. If for some reason node2 failed to start, continue with the troubleshooting procedure here.

Troubleshooting cluster on Docker

If one of your nodes doesn't start properly, check the Cassandra logs:

$ docker logs node2

If you see an error message similar to "Unable to gossip with any seeds," you need to add a parameter by following the instructions below. Otherwise, go to "Testing."

  1. To add the necessary parameter, first capture the IP address of node1 using the Docker inspect command:
    $ docker inspect --format='{{ .NetworkSettings.IPAddress }}' node1
  2. Let's assume that the IP address returned is 172.17.0.2. Run the following commands to stop and remove the node1 container and then recreate it with exposed gossip broadcast address and port parameters:
    $ docker stop node1
    $ docker rm node1
    $ docker run --name node1 -d -e CASSANDRA_BROADCAST_ADDRESS=172.17.0.2 -p 7000:7000 cassandra:3.11
  3. Then create node2 with a broadcast IP address of 172.17.0.3, reusing the node1 address as the seed node:
    $ docker run --name node2 -d -e CASSANDRA_BROADCAST_ADDRESS=172.17.0.3 -p 7001:7000 -e CASSANDRA_SEEDS=172.17.0.2 cassandra:3.11

    This configuration allows the two nodes to exchange the gossip protocol information, which is configured on port 7000, with each other through the mapped container ports 7000 and 7001.
  4. Next, use docker ps to check if the two Docker processes are running. Then use the nodetool utility again to confirm the status of the cluster:
    $ docker exec -it node1 nodetool status
    Datacenter: datacenter1
    =======================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address 	Load   	Tokens   	Owns (effective)  Host ID                           	Rack
    UN  172.17.0.3  108.29 KiB  256      	100.0%        	fd135375-711a-471a-b4e5-409199bbaaa5  rack1
    UN  172.17.0.2  108.66 KiB  256      	100.0%        	5db97fc3-70e9-48e5-b63b-0be67e35daea  rack1

Container environment variables

To change the default Cassandra cluster name configuration inside the Docker container, you can use the container environment variables.

When you run a Cassandra Docker image, you can set Cassandra-specific configurations by passing one or more environment variables on the docker run command line using the -e option. The image uses these variables to change the corresponding Cassandra parameters inside the container. For more information, see the Cassandra image documentation on Docker Hub.
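
For example, the image documents a CASSANDRA_CLUSTER_NAME variable. A sketch of using it when creating a fresh node (the value must be the same on every node of the cluster):

$ docker run --name node1 -d -e CASSANDRA_CLUSTER_NAME="Hospital Test Cluster" cassandra:3.11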

Testing

The first step in testing your cluster configuration is to connect to it using the CQL shell utility (cqlsh). This is a Python command line script that creates a client that can connect to any cluster host. All you need to do is to issue the cqlsh command. By default, the script will try to connect to an instance running on localhost. You can change the host by passing the host parameter. See the cqlsh help for details (cqlsh --help).

CQL shell tool

If you are using Docker, you can execute cqlsh from inside the container:

$ docker exec -it node1 cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>

This shell lets you issue CQL commands similar to SQL to create and define your keyspace (schema), tables, and manipulate data. You can find more information in the Cassandra documentation.

Create the testing keyspace

Let's create your first keyspace to store all the patient exam information:

  1. Issue the CQL command CREATE KEYSPACE to create the patient schema.
    cqlsh> CREATE KEYSPACE patient WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
  2. Now you can access the created keyspace and create the first table to store exam data. The CQL command to create a table is similar to a SQL DDL command:
    CREATE TABLE patient.exam (
    patient_id int,
    id int,
    date timeuuid,
    details text,
    PRIMARY KEY (patient_id, id));

    The command above creates a table whose primary key comprises the patient ID (the partition key) and the exam ID (the clustering key).

Insert data

  1. Now that the structure of your keyspace is created, switch to it (USE patient;) and insert some sample data for three patients:
    INSERT INTO exam (patient_id,id,date,details) values (1,1,now(),'first exam patient 1');
    INSERT INTO exam (patient_id,id,date,details) values (1,2,now(),'second exam patient 1');
    INSERT INTO exam (patient_id,id,date,details) values (2,1,now(),'first exam patient 2');
    INSERT INTO exam (patient_id,id,date,details) values (3,1,now(),'first exam patient 3');
  2. Next, run a testing query to capture all exams of patient 1 (a Python version of the same insert-and-query flow appears after this list):
    cqlsh:patient> select * from exam where patient_id=1;
    Figure 6. Executing a query for patient 1
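
The introduction mentioned using a Python script to consume the data stored in Cassandra. The following minimal sketch does just that, assuming the DataStax Python driver and the patient.exam table above: it writes one more exam and reads back every exam of a patient.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # or the address of one of your Docker nodes
session = cluster.connect("patient")

# Write one exam row; now() generates the timeuuid on the server side.
session.execute(
    "INSERT INTO exam (patient_id, id, date, details) "
    "VALUES (%s, %s, now(), %s)",
    [1, 3, "third exam patient 1"])

# Read back every exam of patient 1 (a single-partition query).
for exam in session.execute(
        "SELECT id, date, details FROM exam WHERE patient_id = %s", [1]):
    print(exam.id, exam.date, exam.details)

cluster.shutdown()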

Test the cluster

Now it's time to test the cluster's availability, consistency, and partition tolerance. Because the patient keyspace is configured with a replication factor of 3, any data written to the exam table has a copy on each of the three nodes.

Node replication test

You can insert data on node1 and query it on node3, because data written to node1 is automatically replicated to node2 and node3, giving three replicas of the data. To insert a patient on node1 and confirm that its information is available on node3, enter the following:

$ docker exec -it node1 cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> INSERT INTO patient.exam (patient_id,id,date,details) values (9,1,now(),'first exam patient 9');
cqlsh> quit;
$ docker exec -it node3 cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select * from patient.exam where patient_id=9;
 patient_id | id | date                                 | details
------------+----+--------------------------------------+----------------------
          9 |  1 | 9cf570b0-8e9d-11e7-a592-6d2c86545d91 | first exam patient 9
(1 rows)

Node failure test

Now stop node2 and node3, then insert patient data on node1. Start node2 and node3 again and verify that the data inserted on node1 while they were down has been replicated to them.

$ docker stop node2
$ docker stop node3
$ docker exec -it node1 cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> INSERT INTO patient.exam (patient_id,id,date,details) values (10,1,now(),'first exam patient 10');
cqlsh> quit;
$ docker start node2
$ docker start node3
$ docker exec -it node3 cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select * from patient.exam where patient_id=10;
 patient_id | id | date                                 | details
------------+----+--------------------------------------+-----------------------
         10 |  1 | 76439070-8f04-11e7-a592-6d2c86545d91 | first exam patient 10
(1 rows)

Node consistency test

If you want strong read consistency, you must set the consistency level to QUORUM, so that any query will check the data on at least two available nodes.

$ docker stop node1
node1
$ docker stop node2
node2
$ docker exec -it node3 cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select * from patient.exam where patient_id=10;
 patient_id | id | date                                 | details
------------+----+--------------------------------------+-----------------------
         10 |  1 | 76439070-8f04-11e7-a592-6d2c86545d91 | first exam patient 10
(1 rows)
cqlsh> consistency quorum
Consistency level set to QUORUM.
cqlsh> select * from patient.exam where patient_id=10;
NoHostAvailable:

In this case, you must have a majority of the nodes (two) up and running, or the query will fail. This is a trade-off: if you want more availability together with stronger consistency, you need more nodes in your cluster so that it can keep operating when some nodes fail. If you set the consistency to ONE, your query succeeds because there is a local copy available on node3, where you are connected.

cqlsh> consistency one
Consistency level set to ONE.
cqlsh> select * from patient.exam where patient_id=10;
 patient_id | id | date                                 | details
------------+----+--------------------------------------+-----------------------
         10 |  1 | 76439070-8f04-11e7-a592-6d2c86545d91 | first exam patient 10
(1 rows)
cqlsh> quit;

If you start node1 again, you will have a majority of the nodes up, so QUORUM consistency can be satisfied and the query no longer fails.

$ docker start node1
node1
$ docker exec -it node3 cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> consistency quorum
Consistency level set to QUORUM.
cqlsh> select * from patient.exam where patient_id=10;
 patient_id | id | date                                 | details
------------+----+--------------------------------------+-----------------------
         10 |  1 | 76439070-8f04-11e7-a592-6d2c86545d91 | first exam patient 10
(1 rows)

If you check your nodes' status, you can see there are two nodes up (UN) and one node down (DN).

$ docker exec -it node3 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns (effective)  Host ID                               Rack
DN  172.17.0.3  306.19 KiB  256          100.0%            fd135375-711a-471a-b4e5-409199bbaaa5  rack1
UN  172.17.0.2  365.72 KiB  256          100.0%            5db97fc3-70e9-48e5-b63b-0be67e35daea  rack1
UN  172.17.0.4  285.42 KiB  256          100.0%            4deb44f8-9253-4bff-b74b-239085e3a912  rack1

You can also explore other testing scenarios, such as latency testing or load-balancing testing with more nodes, to confirm the previously described characteristics of a distributed Cassandra cluster.
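
For instance, a rough latency comparison between consistency levels could look like the following Python sketch, which assumes the DataStax driver and the patient keyspace from this tutorial. It is only a starting point, not a proper benchmark.

import time

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("patient")

def timed_read(level, name, runs=100):
    # Read the same partition repeatedly at the given consistency level.
    stmt = SimpleStatement(
        "SELECT * FROM exam WHERE patient_id = %s", consistency_level=level)
    start = time.time()
    for _ in range(runs):
        session.execute(stmt, [10])
    average = (time.time() - start) / runs
    print("%s: %.2f ms per read" % (name, average * 1000))

timed_read(ConsistencyLevel.ONE, "ONE")        # one replica answers
timed_read(ConsistencyLevel.QUORUM, "QUORUM")  # a majority (2 of 3) must answer

cluster.shutdown()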

Conclusion

In this tutorial, my goal has been to guide you through a basic installation and configuration of a Cassandra cluster and to explain its most important characteristics. I also sought to give you some hands-on practice to help you become more familiar with this kind of NoSQL database. Perhaps now you are ready to start your project and use Cassandra if your requirements match Cassandra's features.

There are other details I haven't covered, such as modeling techniques for different data sets, administration tasks, and fine-tuning for better performance. There are also many development-related aspects of working with client drivers that were outside this tutorial's scope. These can perhaps be covered in future tutorials. Meanwhile, to learn more about Cassandra, I recommend two sources of information: the official Apache Cassandra website and the DataStax documentation website. See "Related topics" below.


Related topics