Cassandra in Spring

I have recently been learning Cassandra, and found that while installation and configuration tutorials abound, very little has been written about actual usage; what does exist is limited to very simple examples, or to EasyCassandra-based approaches that could not be taken further. The article below is a rare find, so I am reposting it here in the hope that it helps others.

Reposted from: http://middlewaresnippets.blogspot.com/2015/02/cassandra-in-spring.html



Cassandra in Spring

In this post we set up Apache Cassandra. In order to do this in a consistent manner across multiple hosts (using a tarball), we create templates. After we have set up a Cassandra cluster, we use Spring Data to create Java clients.

Introduction

Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes where data is distributed among all nodes in the cluster. A commit log on each node captures write activity to ensure data durability. Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache. Once the memory structure is full, the data is written to disk in an SSTable (Sorted String Table) data file. All writes are automatically partitioned and replicated throughout the cluster. Using a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete data and tombstones (an indicator that data was deleted). More information can be read in the write path to compaction.

Cassandra uses a storage structure similar to a Log-Structured Merge Tree (The Log-Structured Merge-Tree (LSM-Tree)), unlike a typical relational database that uses a B-Tree (B-Tree Visualization). Cassandra avoids reading before writing. Read-before-write, especially in a large distributed system, can produce stalls in read performance and other problems. Reading before writing also corrupts caches and increases IO requirements. To avoid a read-before-write condition, the storage engine groups the inserts/updates to be made, and sequentially writes only the updated parts of a row in append mode. Cassandra never re-writes or re-reads existing data, and never overwrites rows in place. More information can be read in the database internals.

Install and configure Cassandra

Choosing appropriate hardware depends on selecting the right balance of the following resources: memory, CPU, disks, number of nodes, and network:
  • Memory: the more memory a Cassandra node has, the better the read performance. More RAM also allows memory tables (memtables - a table-specific in-memory data structure that resembles a write-back cache) to hold more recently written data. Larger memtables lead to fewer SSTables being flushed to disk and fewer files to scan during a read. The ideal amount of RAM depends on the anticipated size of the hot data.
  • CPU: insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound (all writes go to the commit log - a file to which Cassandra appends changed data for recovery in the event of a hardware failure).
  • Disk: disk space depends on usage, so it is important to understand the mechanism. Cassandra writes data to disk when appending data to the commit log for durability and when flushing the memtable to SSTable data files for persistent storage. The commit log has a different access pattern (read/write ratio) than the pattern for accessing data from SSTables. SSTables are periodically compacted. Compaction improves performance by merging and rewriting data and discarding old data. Depending on the type of compaction strategy and the size of the compactions, compaction increases disk utilization, so we should leave an adequate amount of free disk space available on the node.
  • Network: since Cassandra is a distributed data store, it puts load on the network to handle read/write requests and replication of data across nodes. Be sure that the network can handle traffic between nodes without bottlenecks. Cassandra routes requests to replicas that are geographically closest to the coordinator node.
We set up our environment using the steps described in the post Notes on Virtualizing Java EE Systems. The document Selecting hardware for enterprise implementations provides more information on selecting hardware resources. Implementation or design patterns that are ineffective and/or counterproductive in Cassandra production installations are described in the document Anti-patterns in Cassandra.

Understanding the performance characteristics of the Cassandra cluster is critical to diagnosing issues and planning capacity. Cassandra exposes a number of statistics and management operations via JMX. During operation, Cassandra outputs information and statistics that can be monitored using JMX-compliant tools, for example,
  • the nodetool command-line utility (see the example below);
  • the DataStax OpsCenter management console; or
  • JMX clients such as JConsole, Java VisualVM, and Java Mission Control. More information can be found in the post Monitoring with JMX.
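As a quick illustration (a sketch that reuses the host and JMX port configured later in this post), nodetool can, for example, report the thread pool statistics of a node over JMX:

    [weblogic@machine1 bin]$ ./nodetool --host machine1.com --port 7199 tpstats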
In order to make the installation and configuration of Cassandra consistent across different hosts when using a tarball installation, we create a template. To this end, we parameterize the following configuration files:
  • conf/cassandra.yaml, the main configuration file.
  • conf/cassandra-env.sh, which contains Java Virtual Machine configuration settings.
  • bin/cassandra.in.sh, which sets up environment variables such as CLASSPATH and JAVA_HOME.
For the cassandra.yaml file we have
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: '<%= @CLUSTER_NAME %>'

# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
# that this node will store. You probably want all nodes to have the same number
# of tokens assuming they have equal hardware capability.
#
# If you leave this unspecified, Cassandra will use the default of 1 token for legacy compatibility,
# and will use the initial_token as described below.
#
# Specifying initial_token will override this setting on the node's initial start,
# on subsequent starts, this setting will apply even if initial token is set.
#
# If you already have a cluster with 1 token per node, and wish to migrate to
# multiple tokens per node, see http://wiki.apache.org/cassandra/Operations
num_tokens: <%= @NUM_TOKENS %>

...

# any class that implements the SeedProvider interface and has a
# constructor that takes a Map<String, String> of parameters will do.
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "<%= @SEED_ADDRESSES %>"

...

# TCP port, for commands and data
storage_port: <%= @STORAGE_PORT %>

# SSL port, for encrypted communication. Unused unless enabled in
# encryption_options
ssl_storage_port: <%= @SSL_STORAGE_PORT %>

# Address or interface to bind to and tell other Cassandra nodes to connect to.
# You _must_ change this if you want multiple nodes to be able to communicate!
#
# Set listen_address OR listen_interface, not both. Interfaces must correspond
# to a single address, IP aliasing is not supported.
#
# Leaving it blank leaves it up to InetAddress.getLocalHost(). This
# will always do the Right Thing _if_ the node is properly configured
# (hostname, name resolution, etc), and the Right Thing is to use the
# address associated with the hostname (it might not be).
#
# Setting listen_address to 0.0.0.0 is always wrong.
listen_address: <%= @LISTEN_ADDRESS %>
# listen_interface: eth0

...

# Whether to start the native transport server.
# Please note that the address on which the native transport is bound is the
# same as the rpc_address. The port however is different and specified below.
start_native_transport: true
# port for the CQL native transport to listen for clients on
native_transport_port: <%= @NATIVE_TRANSPORT_PORT %>

...

# The address or interface to bind the Thrift RPC service and native transport
# server to.
#
# Set rpc_address OR rpc_interface, not both. Interfaces must correspond
# to a single address, IP aliasing is not supported.
#
# Leaving rpc_address blank has the same effect as on listen_address
# (i.e. it will be based on the configured hostname of the node).
#
# Note that unlike listen_address, you can specify 0.0.0.0, but you must also
# set broadcast_rpc_address to a value other than 0.0.0.0.
rpc_address: <%= @RPC_ADDRESS %>
# rpc_interface: eth1

# port for Thrift to listen for clients on
rpc_port: <%= @RPC_PORT %>

# RPC address to broadcast to drivers and other Cassandra nodes. This cannot
# be set to 0.0.0.0. If left blank, this will be set to the value of
# rpc_address. If rpc_address is set to 0.0.0.0, broadcast_rpc_address must
# be set.
broadcast_rpc_address: <%= @BROADCAST_RPC_ADDRESS %>

...

# endpoint_snitch -- Set this to a class that implements
# IEndpointSnitch. The snitch has two functions:
# - it teaches Cassandra enough about your network topology to route
#   requests efficiently
# - it allows Cassandra to spread replicas around your cluster to avoid
#   correlated failures. It does this by grouping machines into
#   "datacenters" and "racks." Cassandra will do its best not to have
#   more than one replica on the same "rack" (which may not actually
#   be a physical location)
#
# IF YOU CHANGE THE SNITCH AFTER DATA IS INSERTED INTO THE CLUSTER,
# YOU MUST RUN A FULL REPAIR, SINCE THE SNITCH AFFECTS WHERE REPLICAS
# ARE PLACED.
#
# Out of the box, Cassandra provides
#  - SimpleSnitch:
#    Treats Strategy order as proximity. This can improve cache
#    locality when disabling read repair. Only appropriate for
#    single-datacenter deployments.
#  - GossipingPropertyFileSnitch
#    This should be your go-to snitch for production use. The rack
#    and datacenter for the local node are defined in
#    cassandra-rackdc.properties and propagated to other nodes via
#    gossip. If cassandra-topology.properties exists, it is used as a
#    fallback, allowing migration from the PropertyFileSnitch.
#  - PropertyFileSnitch:
#    Proximity is determined by rack and data center, which are
#    explicitly configured in cassandra-topology.properties.
#  - Ec2Snitch:
#    Appropriate for EC2 deployments in a single Region. Loads Region
#    and Availability Zone information from the EC2 API. The Region is
#    treated as the datacenter, and the Availability Zone as the rack.
#    Only private IPs are used, so this will not work across multiple
#    Regions.
#  - Ec2MultiRegionSnitch:
#    Uses public IPs as broadcast_address to allow cross-region
#    connectivity. (Thus, you should set seed addresses to the public
#    IP as well.) You will need to open the storage_port or
#    ssl_storage_port on the public IP firewall. (For intra-Region
#    traffic, Cassandra will switch to the private IP after
#    establishing a connection.)
#  - RackInferringSnitch:
#    Proximity is determined by rack and data center, which are
#    assumed to correspond to the 3rd and 2nd octet of each node's IP
#    address, respectively. Unless this happens to match your
#    deployment conventions, this is best used as an example of
#    writing a custom Snitch class and is provided in that spirit.
#
# You can use a custom Snitch by setting this to the full class name
# of the snitch, which will be assumed to be on your classpath.
endpoint_snitch: <%= @ENDPOINT_SNITCH %>
As we will be running on Java 8, we use the Garbage First Collector and TieredCompilation (which is the default in Java 8). More information on how TieredCompilation works and how to tune the Garbage First Collector can be found in the post Java Virtual Machine Code Generation and Optimization. For the cassandra-env.sh file we have (the options that are not shown are disabled)
MAX_HEAP_SIZE=<%= @HEAP_SIZE %>
HEAP_NEWSIZE=<%= @NURSERY_SIZE %>

# Specifies the default port over which Cassandra will be available for JMX connections.
JMX_PORT=<%= @JMX_PORT %>

JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"

# Larger interned string table, for gossip's benefit (CASSANDRA-6410)
JVM_OPTS="$JVM_OPTS -XX:StringTableSize=1000003"

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=<%= @PAUSE_TIME_GOAL_MILLIS %>"
# Settings to be able to take flight recordings (optional, needs the Java SE Advanced license in production environments)
JVM_OPTS="$JVM_OPTS -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dname=<%= @SERVER_NAME %>"

JVM_OPTS="$JVM_OPTS -Djava.net.preferIPv4Stack=true"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.rmi.port=$JMX_PORT"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
Note that we have also disabled the new generation setting (-Xmn). As we are working with a pause time goal, we must avoid explicitly setting the new generation size, as doing so overrides the pause time goal. The experimental flags G1NewSizePercent (default 5%) and G1MaxNewSizePercent (default 60%) can be used to set a minimum and a maximum for the new generation size, respectively. Values for experimental flags can be changed by enabling UnlockExperimentalVMOptions, as sketched below. To see the default values of command-line options (such as the thread stack size), refer to the command-line options reference. For information on the use of thread priorities, refer to the document Map Thread priorities to system thread/process priorities.
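As an illustration (not part of the template above), bounding the G1 new generation with these experimental flags could look as follows in cassandra-env.sh; the percentages shown are simply the documented defaults:

    # experimental options must be unlocked before they can be set
    JVM_OPTS="$JVM_OPTS -XX:+UnlockExperimentalVMOptions"
    # lower and upper bound for the new generation, as a percentage of the heap
    JVM_OPTS="$JVM_OPTS -XX:G1NewSizePercent=5"
    JVM_OPTS="$JVM_OPTS -XX:G1MaxNewSizePercent=60"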

In the cassandra.in.sh file we only change the JAVA_HOME variable (the rest remains unchanged)
...
# JAVA_HOME can optionally be set here
JAVA_HOME="<%= @JAVA_HOME_DIRECTORY %>"
...
To install and configure Cassandra on a particular host, we can use
#!/bin/sh

# directory where the tar file is located
SOFTWARE_DIRECTORY="/u01/software/apache/cassandra/2.1.2"

# user and group that will be the owner of Apache Cassandra
USER_NAME="weblogic"
GROUP_NAME="javainstall"

# directory where Apache Cassandra will be installed
CASSANDRA_ROOT="/u01/app/apache/cassandra2.1.2"
# the name of the cluster
CLUSTER_NAME="TestCluster"
# the name of the cassandra instance (only used for flight recording purposes)
SERVER_NAME="cassandra"
# the number of tokens randomly assigned to the node on the ring
NUM_TOKENS="256"
# address to bind to and tell other Cassandra nodes to connect to
LISTEN_ADDRESS=$(hostname)
# address to bind the Thrift RPC service and native transport server to
RPC_ADDRESS="0.0.0.0"
# if rpc_address is set to 0.0.0.0, broadcast_rpc_address must be set to a value other than 0.0.0.0
BROADCAST_RPC_ADDRESS=$(hostname)
# comma-delimited list of hosts that are deemed contact points (this list is used to find nodes and learn the topology of the ring)
SEED_ADDRESSES="machine1.com"
# endpoint snitch, the snitch has two functions:
# - teach Cassandra about the network topology to route requests efficiently
# - allow Cassandra to spread replicas around the cluster to avoid correlated failures
ENDPOINT_SNITCH="SimpleSnitch"
# storage port
STORAGE_PORT="7080"
# ssl storage port
SSL_STORAGE_PORT="7443"
# native transport server port
NATIVE_TRANSPORT_PORT="9042"
# port for Thrift to listen for clients on
RPC_PORT="9160"

# Java settings
# directory such that JAVA_HOME_DIRECTORY/bin contains the java executable
JAVA_HOME_DIRECTORY="/u01/app/oracle/weblogic12.1.3/jdk1.8.0_31"
# heap size (Xmx and Xms are set to equal sizes)
HEAP_SIZE="1024m"
# nursery size (Xmn)
NURSERY_SIZE="256m"
# as we are using the garbage first collector give a goal for the pause times
PAUSE_TIME_GOAL_MILLIS="200"
# port over which JMX connections are accepted
JMX_PORT="7199"

extract_template() {
    echo "EXTRACTING TEMPLATE"
    # the apache-cassandra-2.1.2.tar.gz contains the following directory structure
    # /CASSANDRA_ROOT
    #   /bin
    #     cassandra (used to start apache cassandra)
    #     cassandra.in.sh (sets environment variables, such as CASSANDRA_HOME and JAVA_HOME)
    #   /conf
    #     cassandra.yaml (apache cassandra storage configuration)
    #     cassandra-env.sh (sets the JVM options)
    #   /data (contains data files)
    #   /lib (contains libraries)

    if [ ! -d "${CASSANDRA_ROOT}" ]; then
        mkdir -p ${CASSANDRA_ROOT}
    fi

    tar xf ${SOFTWARE_DIRECTORY}/apache-cassandra-2.1.2.tar.gz -C ${CASSANDRA_ROOT}
    ROOT_CASSANDRA_ROOT=$(echo ${CASSANDRA_ROOT} | cut -d "/" -f2)
    chown -R ${USER_NAME}:${GROUP_NAME} /${ROOT_CASSANDRA_ROOT}
}

edit_files() {
    echo "EDITING cassandra.in.sh"
    sed -i -e '/<%= @JAVA_HOME_DIRECTORY %>/ s;<%= @JAVA_HOME_DIRECTORY %>;'${JAVA_HOME_DIRECTORY}';g' ${CASSANDRA_ROOT}/bin/cassandra.in.sh

    echo "EDITING cassandra.yaml"
    sed -i -e '/<%= @CLUSTER_NAME %>/ s;<%= @CLUSTER_NAME %>;'${CLUSTER_NAME}';g' \
           -e '/<%= @NUM_TOKENS %>/ s;<%= @NUM_TOKENS %>;'${NUM_TOKENS}';g' \
           -e '/<%= @SEED_ADDRESSES %>/ s;<%= @SEED_ADDRESSES %>;'${SEED_ADDRESSES}';g' \
           -e '/<%= @STORAGE_PORT %>/ s;<%= @STORAGE_PORT %>;'${STORAGE_PORT}';g' \
           -e '/<%= @SSL_STORAGE_PORT %>/ s;<%= @SSL_STORAGE_PORT %>;'${SSL_STORAGE_PORT}';g' \
           -e '/<%= @LISTEN_ADDRESS %>/ s;<%= @LISTEN_ADDRESS %>;'${LISTEN_ADDRESS}';g' \
           -e '/<%= @NATIVE_TRANSPORT_PORT %>/ s;<%= @NATIVE_TRANSPORT_PORT %>;'${NATIVE_TRANSPORT_PORT}';g' \
           -e '/<%= @RPC_ADDRESS %>/ s;<%= @RPC_ADDRESS %>;'${RPC_ADDRESS}';g' \
           -e '/<%= @BROADCAST_RPC_ADDRESS %>/ s;<%= @BROADCAST_RPC_ADDRESS %>;'${BROADCAST_RPC_ADDRESS}';g' \
           -e '/<%= @RPC_PORT %>/ s;<%= @RPC_PORT %>;'${RPC_PORT}';g' \
           -e '/<%= @ENDPOINT_SNITCH %>/ s;<%= @ENDPOINT_SNITCH %>;'${ENDPOINT_SNITCH}';g' ${CASSANDRA_ROOT}/conf/cassandra.yaml

    echo "EDITING cassandra-env.sh"
    sed -i -e '/<%= @HEAP_SIZE %>/ s;<%= @HEAP_SIZE %>;'${HEAP_SIZE}';g' \
           -e '/<%= @NURSERY_SIZE %>/ s;<%= @NURSERY_SIZE %>;'${NURSERY_SIZE}';g' \
           -e '/<%= @JMX_PORT %>/ s;<%= @JMX_PORT %>;'${JMX_PORT}';g' \
           -e '/<%= @PAUSE_TIME_GOAL_MILLIS %>/ s;<%= @PAUSE_TIME_GOAL_MILLIS %>;'${PAUSE_TIME_GOAL_MILLIS}';g' \
           -e '/<%= @SERVER_NAME %>/ s;<%= @SERVER_NAME %>;'${SERVER_NAME}';g' ${CASSANDRA_ROOT}/conf/cassandra-env.sh
}

create_boot_script() {
    echo "CREATING BOOT SCRIPT"
    touch /etc/rc.d/init.d/cassandra
    chmod u+x /etc/rc.d/init.d/cassandra

    echo '#!/bin/sh
#
# chkconfig: - 95 20
#
# description: controls the Cassandra runtime lifecycle
# processname: cassandra
#

# Source function library.
. /etc/rc.d/init.d/functions

RETVAL=0
SERVICE="cassandra"
USER="'${USER_NAME}'"
CASSANDRA_HOME="'${CASSANDRA_ROOT}'"
LOCK_FILE="/var/lock/subsys/${SERVICE}"

start() {
    echo "Starting Cassandra"
    su - ${USER} -c "${CASSANDRA_HOME}/bin/cassandra -p ${CASSANDRA_HOME}/cassandra.pid" >/dev/null 2>/dev/null &
    RETVAL=$?
    [ $RETVAL -eq 0 ] && success || failure
    echo
    [ $RETVAL -eq 0 ] && touch ${LOCK_FILE}
    return $RETVAL
}

stop() {
    echo "Stopping Cassandra"
    su - ${USER} -c "kill -TERM $(cat ${CASSANDRA_HOME}/cassandra.pid)"
    RETVAL=$?
    [ $RETVAL -eq 0 ] && success || failure
    echo
    [ $RETVAL -eq 0 ] && rm -f ${LOCK_FILE}
    return $RETVAL
}

check() {
    echo "Checking Cassandra"
    PID=$(pgrep -of org.apache.cassandra.service.CassandraDaemon)
    RETVAL=$?
    if [ $RETVAL -eq 0 ]; then
        echo "Cassandra is running on this host with PID=$PID"
        netstat -anp | grep $PID
    else
        echo "Cassandra is not running on this host"
    fi
    return $RETVAL
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        check
        stop
        start
        check
        ;;
    check)
        check
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|check}"
        exit 1
esac

exit $?' > /etc/rc.d/init.d/cassandra

    chkconfig --add cassandra
    chkconfig cassandra on
}

extract_template

edit_files

create_boot_script
The last step in the script creates a boot script, such that Cassandra starts when the host is started.
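Once the boot script is installed, it can be driven like any other SysV service; for example (the output shown is illustrative, produced by the check() function defined above):

    [root@machine1 ~]# service cassandra start
    [root@machine1 ~]# service cassandra check
    Checking Cassandra
    Cassandra is running on this host with PID=2783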

Set-up data model

Cassandra's data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns can be indexed separately from the primary key. Tables can be created, dropped, and altered at runtime without blocking updates and queries.
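For example (an illustrative table, not the one used below), a compound primary key separates the partition key from the clustering columns:

    -- user_id is the partition key; rows within a partition are clustered by posted_at
    CREATE TABLE timeline (
        user_id   int,
        posted_at timestamp,
        body      text,
        PRIMARY KEY (user_id, posted_at)
    );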

A keyspace is the CQL counterpart to an SQL database, but a little different. The Cassandra keyspace is a namespace that defines how data is replicated on nodes. Typically, a cluster has one keyspace per application. Replication is controlled on a per-keyspace basis, so data that has different replication requirements typically resides in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model; they are designed to control data replication for a set of tables.

Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. Note that we can increase the replication factor and add the desired number of nodes later. Two replication strategies are available:
  • SimpleStrategy, for use with a single data center only. In the case of more than one data center, use the NetworkTopologyStrategy.
  • NetworkTopologyStrategy, recommended for most deployments because it is much easier to expand to multiple data centers when necessary (see the sketch below).
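As an illustrative sketch (the keyspace and data center names are assumptions; in practice they must match the data center names reported by the snitch), a NetworkTopologyStrategy keyspace specifies a replication factor per data center:

    -- three replicas in DC1, two in DC2
    CREATE KEYSPACE IF NOT EXISTS test_multi_dc
        WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};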
To create a keyspace and a table (table properties) we can use the cqlsh command-line utility (the CQL reference can be found here)
[weblogic@machine1 bin]$ ./cqlsh machine1.com 9042
Connected to TestCluster at machine1.com:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.

# create keyspace
cqlsh> CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

# create table
cqlsh> CREATE TABLE IF NOT EXISTS test.persons ( sofinummer int PRIMARY KEY , naam text ) WITH compression = {'sstable_compression': 'LZ4Compressor' } AND compaction = {'class': 'LeveledCompactionStrategy'};

# check created keyspace
cqlsh> SELECT * FROM system.schema_keyspaces;

 keyspace_name | durable_writes | strategy_class                              | strategy_options
---------------+----------------+---------------------------------------------+----------------------------
          test |           True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"2"}
        system |           True | org.apache.cassandra.locator.LocalStrategy  | {}
 system_traces |           True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"2"}

(3 rows)

# check created table
cqlsh> SELECT * FROM system.schema_columnfamilies WHERE columnfamily_name = 'persons' ALLOW FILTERING;

 keyspace_name | columnfamily_name | bloom_filter_fp_chance | caching | cf_id | column_aliases | comment | compaction_strategy_class | compaction_strategy_options | comparator | compression_parameters | default_time_to_live | default_validator | dropped_columns | gc_grace_seconds | index_interval | is_dense | key_aliases | key_validator | local_read_repair_chance | max_compaction_threshold | max_index_interval | memtable_flush_period_in_ms | min_compaction_threshold | min_index_interval | read_repair_chance | speculative_retry | subcomparator | type | value_alias
---------------+-------------------+------------------------+---------------------------------------------+--------------------------------------+----------------+---------+--------------------------------------------------------------+-----------------------------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------+----------------------+-------------------------------------------+-----------------+------------------+----------------+----------+----------------+-------------------------------------------+--------------------------+--------------------------+--------------------+-----------------------------+--------------------------+--------------------+--------------------+-------------------+---------------+----------+-------------
 test | persons | 0.1 | {"keys":"ALL", "rows_per_partition":"NONE"} | 296a2de0-a6ea-11e4-a8b4-b97fe3ba0f40 | [] | | org.apache.cassandra.db.compaction.LeveledCompactionStrategy | {} | org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type) | {"sstable_compression":"org.apache.cassandra.io.compress.LZ4Compressor"} | 0 | org.apache.cassandra.db.marshal.BytesType | null | 864000 | null | False | ["sofinummer"] | org.apache.cassandra.db.marshal.Int32Type | 0.1 | 32 | 2048 | 0 | 4 | 128 | 0 | 99.0PERCENTILE | null | Standard | null

(1 rows)

cqlsh> SELECT * FROM system.schema_columns WHERE columnfamily_name = 'persons' ALLOW FILTERING;

 keyspace_name | columnfamily_name | column_name | component_index | index_name | index_options | index_type | type          | validator
---------------+-------------------+-------------+-----------------+------------+---------------+------------+---------------+-------------------------------------------
          test |           persons |        naam |               0 |       null |          null |       null |       regular |  org.apache.cassandra.db.marshal.UTF8Type
          test |           persons |  sofinummer |            null |       null |          null |       null | partition_key | org.apache.cassandra.db.marshal.Int32Type

(2 rows)

# to check the cluster we can use nodetool (nodetool help status provides more information)
[weblogic@machine1 bin]$ ./nodetool --host machine1.com --port 7199 status test
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.231.110  77.89 KB   256     100.0%            6f563cc6-3e2f-4089-b28c-bdb58959c214  rack1
UN  192.168.231.100  104.46 KB  256     100.0%            0eddc3ea-6692-4c62-9135-67ef37d655b5  rack1

Test

To test the set-up we use a Java client. For the Java client to be able to connect, we need the Cassandra Java Driver. The driver uses the binary protocol (start_native_transport: true, native_transport_port: 9042, and rpc_address: <hostname reachable from the client> must be set in the cassandra.yaml file). The document Writing your first client describes a step-by-step example.

The driver discovers the nodes that constitute a cluster by querying the contact points used in building the Cluster object. After this, it is up to the cluster's load balancing policy to keep track of node events. If a Cassandra node fails or becomes unreachable, the Java driver automatically and transparently tries other nodes in the cluster and schedules reconnections to the dead nodes in the background. More information on the driver can be found in the driver reference.
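These behaviors can also be made explicit when the Cluster is built. Below is a minimal sketch using policies that ship with driver 2.1; the host names and delay values are assumptions, not settings taken from this post:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class ClusterFactory {
        public static Cluster build() {
            return Cluster.builder()
                    .addContactPoints("machine1.com", "machine2.com")
                    // prefer replicas that own the token, round-robin within the local data center
                    .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                    // retry dead nodes with exponentially growing delays (1 s base, 60 s maximum)
                    .withReconnectionPolicy(new ExponentialReconnectionPolicy(1000L, 60000L))
                    .build();
        }
    }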

As an example, we can use the following
// obtain a Cluster and a Session instance, and make those available for future use
package model.utils;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ProtocolOptions;
import com.datastax.driver.core.Session;

import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class CassandraUtil {

    private static final Cluster cluster;
    private static final Session session;

    private static final String[] HOST_NAMES = {"machine1.com", "machine2.com"};
    private static final Integer PORT = ProtocolOptions.DEFAULT_PORT;
    private static final String KEYSPACE = "test";

    private CassandraUtil() {
    }

    static {
        try {
            List<InetSocketAddress> addresses = new ArrayList<>();
            for (String host : HOST_NAMES) {
                InetSocketAddress address = new InetSocketAddress(host, PORT);
                addresses.add(address);
            }

            System.out.println("initializing");
            long start_time = System.nanoTime();

            //cluster = Cluster.builder().addContactPoints(HOST_NAMES).build();
            cluster = Cluster.builder().addContactPointsWithPorts(addresses).build();
            session = cluster.connect(KEYSPACE);

            long end_time = System.nanoTime();
            System.out.println("done initializing");
            long total_time = TimeUnit.NANOSECONDS.toMillis(end_time - start_time);
            System.out.println("initializing took: " + total_time + " ms.");
        } catch (Throwable ex) {
            ex.printStackTrace();
            throw new ExceptionInInitializerError(ex);
        }
    }

    public static Cluster getCluster() {
        return cluster;
    }

    public static Session getSession() {
        return session;
    }
}
// entity representing our table (used by Spring Data)
package model.entities;

import org.springframework.data.cassandra.mapping.PrimaryKey;
import org.springframework.data.cassandra.mapping.Table;

@Table("persons")
public class Person implements Comparable<Person> {

    @PrimaryKey
    private Integer sofinummer;

    private String naam;

    public Integer getSofinummer() {
        return sofinummer;
    }

    public void setSofinummer(Integer sofinummer) {
        this.sofinummer = sofinummer;
    }

    public String getNaam() {
        return naam;
    }

    public void setNaam(String naam) {
        this.naam = naam;
    }

    @Override
    public boolean equals(Object object) {
        if (this == object) {
            return true;
        }

        if (object == null) {
            return false;
        }

        if (!(object instanceof Person)) {
            return false;
        }

        Person person = (Person) object;
        return getSofinummer().equals(person.getSofinummer());
    }

    @Override
    public int hashCode() {
        return getSofinummer().hashCode();
    }

    @Override
    public String toString() {
        return getNaam() + " " + getSofinummer();
    }

    @Override
    public int compareTo(Person other) {
        return this.getNaam().compareTo(other.getNaam());
    }
}
package test;

import com.datastax.driver.core.*;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import com.datastax.driver.core.querybuilder.Select;
import model.entities.Person;
import model.utils.CassandraUtil;
import org.springframework.data.cassandra.core.CassandraOperations;
import org.springframework.data.cassandra.core.CassandraTemplate;

import java.util.List;

public class Test {

    public static void main(String[] args) {
        try {
            Test test = new Test();
            test.getMetaData();
            test.useCQL();
            test.useSpring();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            CassandraUtil.getSession().close();
            CassandraUtil.getCluster().close();
        }
    }

    public void getMetaData() {
        Configuration configuration = CassandraUtil.getCluster().getConfiguration();
        Metadata metadata = CassandraUtil.getCluster().getMetadata();
        System.out.println("Connected to cluster: " + metadata.getClusterName());
        for (Host host : metadata.getAllHosts()) {
            System.out.println("- host: " + host.getDatacenter() + ", " +
                    host.getRack() + ", " +
                    host.getAddress() + ", " +
                    configuration.getPolicies().getLoadBalancingPolicy().distance(host));
        }

        Metrics metrics = CassandraUtil.getCluster().getMetrics();
        System.out.println("- connected to hosts: " + metrics.getConnectedToHosts().getValue());
        System.out.println("- open connections: " + metrics.getOpenConnections().getValue());

        TableMetadata.Options options = metadata.getKeyspace("test").getTable("persons").getOptions();
        System.out.println("Table options:");
        System.out.println("- " + options.getCaching());
        System.out.println("- " + options.getCompaction());
        System.out.println("- " + options.getCompression());
    }

    public void useCQL() {
        CassandraUtil.getSession().execute("INSERT INTO test.persons (sofinummer, naam) VALUES (123456789, 'John Zorn');");
        CassandraUtil.getSession().execute("INSERT INTO test.persons (sofinummer, naam) VALUES (987654321, 'Frank Zappa');");

        ResultSet results = CassandraUtil.getSession().execute("SELECT * FROM test.persons");
        for (Row row : results) {
            System.out.println(row.getInt("sofinummer") + ", " + row.getString("naam"));
        }
    }

    public void useSpring() {
        Person johnzorn = new Person();
        johnzorn.setSofinummer(123456789);
        johnzorn.setNaam("John Zorn");

        Person frankzappa = new Person();
        frankzappa.setSofinummer(987654321);
        frankzappa.setNaam("Frank Zappa");

        CassandraOperations operations = new CassandraTemplate(CassandraUtil.getSession());
        operations.insert(johnzorn);
        operations.insert(frankzappa);

        Select select = QueryBuilder.select().from("test", "persons");
        List<Person> persons = operations.select(select, Person.class);
        persons.forEach(System.out::println);
    }
}
in which we have also used Spring Data, and in particular Spring Data Cassandra. In order to run the example above, we need the following .jar files on the class path:
aopalliance-1.0.jar
cassandra-driver-core-2.1.4.jar
commons-logging-1.1.3.jar
guava-14.0.1.jar
metrics-core-3.0.2.jar
slf4j-api-1.7.5.jar
spring-aop-4.1.4.RELEASE.jar
spring-beans-4.1.4.RELEASE.jar
spring-context-4.1.4.RELEASE.jar
spring-core-4.1.4.RELEASE.jar
spring-cql-1.1.2.RELEASE.jar
spring-data-cassandra-1.1.2.RELEASE.jar
spring-data-commons-1.9.2.RELEASE.jar
spring-expression-4.1.4.RELEASE.jar
spring-tx-4.1.4.RELEASE.jar
The Cassandra Driver (and related .jar files) can be obtained here. The Spring .jar files can be obtained from the Spring Repository. We can also obtain the .jar files using Maven; for example, when using IntelliJ IDEA, click File, Project Structure and choose Libraries; click + and choose From Maven. In the Download Library from Maven Repository screen, enter the artifact id (spring-data-cassandra), press search, select the version to be used (org.springframework.data:spring-data-cassandra:1.1.2.RELEASE), optionally choose to download the library to a project directory, and click OK.
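In a Maven build, the equivalent is a single dependency (a sketch; spring-data-cassandra pulls in spring-cql, spring-data-commons, and the Spring modules transitively, but the driver version it resolves may differ from the 2.1.4 listed above, in which case pin cassandra-driver-core explicitly):

    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-cassandra</artifactId>
        <version>1.1.2.RELEASE</version>
    </dependency>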

When the example is run, we get the following output
initializing
...
done initializing
initializing took: 745 ms.
Connected to cluster: TestCluster
- host: datacenter1, rack1, machine2.com/192.168.231.110, LOCAL
- host: datacenter1, rack1, machine1.com/192.168.231.100, LOCAL
- connected to hosts: 2
- open connections: 3
Table options:
- {keys=ALL, rows_per_partition=NONE}
- {class=org.apache.cassandra.db.compaction.LeveledCompactionStrategy}
- {sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor}
123456789, John Zorn
987654321, Frank Zappa
John Zorn 123456789
Frank Zappa 987654321

Spring

Spring offers the very handy CassandraOperations API, which we can use to create a generic DAO approach as was done in the post Fun with Spring. The Spring Data Commons module offers so-called repositories. The central interface in the Spring Data repository abstraction is the Repository interface. It takes the domain class to manage as well as the id type of the domain class as type arguments. This interface acts primarily as a marker interface to capture the types to work with and to help discover interfaces that extend this one. The CrudRepository interface provides CRUD functionality for the entity class that is being managed. To implement the CrudRepository we can use something like
package model.logic;

import org.springframework.cassandra.core.util.CollectionUtils;
import org.springframework.data.cassandra.core.CassandraOperations;
import org.springframework.data.repository.CrudRepository;

import java.io.Serializable;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

public abstract class GenericCassandraRepository<T, ID extends Serializable> implements CrudRepository<T, ID> {

    private Class<T> persistentClass;
    private CassandraOperations operations;

    public GenericCassandraRepository(CassandraOperations operations) {
        Type type = getClass().getGenericSuperclass();
        if (type instanceof ParameterizedType) {
            ParameterizedType parameterizedType = (ParameterizedType) type;
            setPersistentClass((Class<T>) parameterizedType.getActualTypeArguments()[0]);
        } else {
            System.out.println("Not an instance of parameterized type: " + type);
        }

        setOperations(operations);
    }

    public Class<T> getPersistentClass() {
        return persistentClass;
    }

    public void setPersistentClass(Class<T> persistentClass) {
        this.persistentClass = persistentClass;
    }

    public CassandraOperations getOperations() {
        return operations;
    }

    public void setOperations(CassandraOperations operations) {
        this.operations = operations;
    }

    @Override
    public <S extends T> S save(S entity) {
        return getOperations().insert(entity);
    }

    @Override
    public <S extends T> Iterable<S> save(Iterable<S> entities) {
        return getOperations().insert(CollectionUtils.toList(entities));
    }

    @Override
    public T findOne(ID id) {
        return getOperations().selectOneById(getPersistentClass(), id);
    }

    @Override
    public boolean exists(ID id) {
        return getOperations().exists(getPersistentClass(), id);
    }

    @Override
    public Iterable<T> findAll() {
        return getOperations().selectAll(getPersistentClass());
    }

    @Override
    public Iterable<T> findAll(Iterable<ID> ids) {
        return getOperations().selectBySimpleIds(getPersistentClass(), ids);
    }

    @Override
    public long count() {
        return getOperations().count(getPersistentClass());
    }

    @Override
    public void delete(ID id) {
        getOperations().deleteById(getPersistentClass(), id);
    }

    @Override
    public void delete(T entity) {
        getOperations().delete(entity);
    }

    @Override
    public void delete(Iterable<? extends T> entities) {
        getOperations().delete(CollectionUtils.toList(entities));
    }

    @Override
    public void deleteAll() {
        getOperations().deleteAll(getPersistentClass());
    }
}
In the constructor we use generics to retrieve information about the persistent entity class, i.e., to find the class of the T type argument. If we look at the JavaDocs, the following can be expected (a minimal sketch follows the list):
  • If the superclass is a parameterized type, the Type object returned must accurately reflect the actual type parameters used in the source code. The parameterized type representing the superclass is created if it had not been created before.
  • A parameterized type is created the first time it is needed by a reflective method, as specified in this package. When a parameterized type p is created, the generic type declaration that p instantiates is resolved, and all type arguments of p are created recursively.
  • A type variable is created the first time it is needed by a reflective method, as specified in this package. If a type variable t is referenced by a type (i.e., class, interface or annotation type) T, and T is declared by the n-th enclosing class of T, then the creation of t requires the resolution of the i-th enclosing class of T, for i = 0 to n, inclusive.
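A minimal sketch of what the constructor relies on (the Repo and PersonRepo classes are hypothetical): the actual type argument is only recoverable when a concrete subclass fixes it in source code.

    import model.entities.Person;

    import java.lang.reflect.ParameterizedType;
    import java.lang.reflect.Type;

    public class TypeArgDemo {

        static abstract class Repo<T> {
        }

        // the type argument Person is fixed at compile time in this declaration
        static class PersonRepo extends Repo<Person> {
        }

        public static void main(String[] args) {
            // the generic superclass of PersonRepo is the ParameterizedType Repo<Person>
            Type type = PersonRepo.class.getGenericSuperclass();
            Class<?> entity = (Class<?>) ((ParameterizedType) type).getActualTypeArguments()[0];
            System.out.println(entity); // prints: class model.entities.Person
            // a raw subclass (class RawRepo extends Repo) would yield a plain Class instead,
            // which is why GenericCassandraRepository guards with instanceof ParameterizedType
        }
    }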
The CrudRepository interface can be extended for particular entities, for example,
package model.logic;

import model.entities.Person;
import org.springframework.data.repository.CrudRepository;

public interface PersonRepository extends CrudRepository<Person, Integer> {
    void sortAllPersons();
}
with the corresponding implementation:
package model.logic;

import model.entities.Person;
import org.springframework.cassandra.core.util.CollectionUtils;
import org.springframework.data.cassandra.core.CassandraOperations;

import java.util.List;
import java.util.concurrent.TimeUnit;

public class PersonRepositoryImpl extends GenericCassandraRepository<Person, Integer> implements PersonRepository {

    public PersonRepositoryImpl(CassandraOperations operations) {
        super(operations);
    }

    @Override
    public void sortAllPersons() {
        List<Person> persons = CollectionUtils.toList(super.findAll());

        long sequential_start_time = System.nanoTime();
        persons.stream().sorted().count();
        long sequential_end_time = System.nanoTime();
        long sequential_total_time = TimeUnit.NANOSECONDS.toMillis(sequential_end_time - sequential_start_time);
        System.out.println("sequential sort took: " + sequential_total_time + " ms.");

        long parallel_start_time = System.nanoTime();
        persons.parallelStream().sorted().count();
        long parallel_end_time = System.nanoTime();
        long parallel_total_time = TimeUnit.NANOSECONDS.toMillis(parallel_end_time - parallel_start_time);
        System.out.println("parallel sort took: " + parallel_total_time + " ms.");
    }
}

Using Spring configuration

We can also configure Cassandra (create a Cluster and Session instance, create a CassandraOperations instance, and inject that into our PersonRepository) by using Spring configuration, for example,
# cassandra.properties
cassandra.contactpoints=machine1.com,machine2.com
cassandra.port=9042
cassandra.keyspace=test

<!-- spring configuration -->
<beans:beans xmlns:beans="http://www.springframework.org/schema/beans"
             xmlns:cassandra="http://www.springframework.org/schema/data/cassandra"
             xmlns:context="http://www.springframework.org/schema/context"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
                                 http://www.springframework.org/schema/data/cassandra http://www.springframework.org/schema/data/cassandra/spring-cassandra.xsd
                                 http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">

    <context:property-placeholder location="classpath:cassandra.properties"/>

    <cassandra:cluster id="cluster" contact-points="${cassandra.contactpoints}" port="${cassandra.port}"/>

    <cassandra:session id="session" cluster-ref="cluster" keyspace-name="${cassandra.keyspace}"/>

    <cassandra:mapping entity-base-packages="model.entities"/>
    <cassandra:converter/>

    <cassandra:template id="operations" session-ref="session"/>

    <beans:bean id="personrepository" class="model.logic.PersonRepositoryImpl">
        <beans:constructor-arg ref="operations"/>
    </beans:bean>
</beans:beans>
// SpringUtil class used to obtain instances from the Spring ApplicationContext
package model.util;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import model.logic.PersonRepository;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

import java.util.concurrent.TimeUnit;

public class SpringUtil {

    private static ApplicationContext applicationContext;

    private SpringUtil() {
    }

    static {
        System.out.println("initializing");
        long start_time = System.nanoTime();

        applicationContext = new ClassPathXmlApplicationContext("spring-config.xml");

        long end_time = System.nanoTime();
        System.out.println("done initializing");
        long total_time = TimeUnit.NANOSECONDS.toMillis(end_time - start_time);
        System.out.println("initializing took: " + total_time + " ms.");
    }

    public static PersonRepository getPersonRepository() {
        return applicationContext.getBean("personrepository", PersonRepository.class);
    }

    public static Cluster getCluster() {
        return applicationContext.getBean("cluster", Cluster.class);
    }

    public static Session getSession() {
        return applicationContext.getBean("session", Session.class);
    }
}
This can be used in the following manner
package test;

import model.entities.Person;
import model.logic.PersonRepository;
import model.util.SpringUtil;

import java.util.Random;

public class Test {

    private Random generator = new Random();

    public static void main(String[] args) {
        Test test = new Test();
        try {
            test.doSomeTest(SpringUtil.getPersonRepository());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            SpringUtil.getSession().close();
            SpringUtil.getCluster().close();
        }
    }

    private void doSomeTest(PersonRepository repository) {
        for (int i = 0; i < 10; i++) {
            // generate random data
            Person person = createPerson();
            // insert or update a person
            repository.save(person);
            if (generator.nextDouble() < 0.001) {
                // remove a person
                repository.delete(generateSofinummer());
                // count all persons, then sort them
                System.out.println("number of persons: " + repository.count());
                repository.sortAllPersons();
            } else {
                // find a person by ID
                repository.findOne(generateSofinummer());
            }
        }
    }

    private Person createPerson() {
        Person person = new Person();
        person.setSofinummer(generateSofinummer());
        person.setNaam(Long.toString(Math.abs(generator.nextLong()), 36));
        return person;
    }

    private Integer generateSofinummer() {
        return generator.nextInt(100000);
    }
}
When the example is run, we get the following output
initializing
...
done initializing
initializing took: 6511 ms.
Initialization using the Spring configuration is much slower than the direct approach used in the CassandraUtil class, so in the test below we stick to the CassandraUtil approach.

Test

To test it all we can use something like
package test;

import model.entities.Person;
import model.logic.PersonRepository;
import model.logic.PersonRepositoryImpl;
import model.utils.CassandraUtil;
import org.springframework.data.cassandra.core.CassandraOperations;
import org.springframework.data.cassandra.core.CassandraTemplate;

import java.util.Random;

public class LoadTest {

    private Random generator = new Random();

    public static void main(String[] args) {
        LoadTest test = new LoadTest();
        try {
            CassandraOperations operations = new CassandraTemplate(CassandraUtil.getSession());
            PersonRepository repository = new PersonRepositoryImpl(operations);
            test.doRandomReadWriteTest(repository);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            CassandraUtil.getSession().close();
            CassandraUtil.getCluster().close();
        }
    }

    private void doRandomReadWriteTest(PersonRepository repository) {
        while (true) {
            // generate random data
            Person person = createPerson();
            // insert or update a person
            repository.save(person);
            if (generator.nextDouble() < 0.001) {
                // remove a person
                repository.delete(generateSofinummer());
                // count all persons, then sort them
                System.out.println("number of persons: " + repository.count());
                repository.sortAllPersons();
            } else {
                // find a person by ID
                repository.findOne(generateSofinummer());
            }
        }
    }

    private Person createPerson() {
        Person person = new Person();
        person.setSofinummer(generateSofinummer());
        person.setNaam(Long.toString(Math.abs(generator.nextLong()), 36));
        return person;
    }

    private Integer generateSofinummer() {
        return generator.nextInt(100000);
    }
}
We also start a flight recording on the Cassandra nodes, in order to see how the JVM is doing. To create flight recordings we use the script presented in the post Java Virtual Machine Code Generation and Optimization (this is also the reason why we set the extra parameter -Dname=<%= @SERVER_NAME %> in the cassandra-env.sh file, such that we can easily find the process id of the Cassandra process by using the provided name), i.e.,
[weblogic@machine1 monitor]$ ./FlightRecording.sh
Provide an <ACTION> and a <SERVER_NAME>
Usage FlightRecording.sh <ACTION> <SERVER_NAME>, <ACTION> must be one of {start|stop|dump|check|clean}

[weblogic@machine1 monitor]$ ./FlightRecording.sh start cassandra
Starting Flight Recording for server: cassandra with PID 2783
2783:
Started recording 1. The result will be written to: /home/weblogic/cassandra-29-01-2015_16:39:44.jfr

[weblogic@machine1 monitor]$ ./FlightRecording.sh check cassandra
2783:
Recording: recording=1 name="cassandra" duration=20m filename="/home/weblogic/cassandra-29-01-2015_16:39:44.jfr" compress=false (running)
When the test is run, the following output is observed (nice to see what parallel sorting can do)
number of persons: 1377
sequential sort took: 11 ms.
parallel sort took: 12 ms.
number of persons: 1694
sequential sort took: 3 ms.
parallel sort took: 4 ms.
number of persons: 4916
sequential sort took: 8 ms.
parallel sort took: 9 ms.
...
number of persons: 10128
sequential sort took: 6 ms.
parallel sort took: 6 ms.
...
number of persons: 20018
sequential sort took: 13 ms.
parallel sort took: 6 ms.
...
number of persons: 30236
sequential sort took: 17 ms.
parallel sort took: 6 ms.
...
number of persons: 40829
sequential sort took: 20 ms.
parallel sort took: 7 ms.
...
number of persons: 51534
sequential sort took: 26 ms.
parallel sort took: 9 ms.
...
number of persons: 60884
sequential sort took: 36 ms.
parallel sort took: 13 ms.
...
number of persons: 70080
sequential sort took: 51 ms.
parallel sort took: 17 ms.
...
number of persons: 80022
sequential sort took: 60 ms.
parallel sort took: 21 ms.
...
number of persons: 90167
sequential sort took: 56 ms.
parallel sort took: 20 ms.
...
number of persons: 95031
sequential sort took: 66 ms.
parallel sort took: 22 ms.
...
number of persons: 99390
sequential sort took: 59 ms.
parallel sort took: 19 ms.
Monitoring results. To set up an MBean Browser we have to provide a JMX URL, i.e., service:jmx:rmi:///jndi/rmi://<HOSTNAME>:<JMX_PORT>/jmxrmi (here, for example, service:jmx:rmi:///jndi/rmi://machine1.com:7199/jmxrmi). The following shows snapshots from the JMX Console in Java Mission Control.
[Screenshots: JMX Console MBean Browser, Memory, and Threads tabs]
On Linux we can use lsof to get insight into the open files, and sar to obtain statistics related to the disk.
# list open files
[weblogic@machine1 cassandra2.1.2]$ lsof -i -a -p 2783
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 2783 weblogic 53u IPv4 14637 0t0 TCP *:7199 (LISTEN)
java 2783 weblogic 54u IPv4 14640 0t0 TCP *:44168 (LISTEN)
java 2783 weblogic 59u IPv4 18410 0t0 TCP machine1.com:35531->machine2.com:empowerid (ESTABLISHED)
java 2783 weblogic 60u IPv4 18411 0t0 TCP machine1.com:empowerid->machine1.com:39417 (ESTABLISHED)
java 2783 weblogic 61r IPv6 19642 0t0 TCP machine1.com:9042->192.168.231.1:53436 (ESTABLISHED)
java 2783 weblogic 75r IPv4 19619 0t0 TCP machine1.com:39418->machine1.com:empowerid (ESTABLISHED)
java 2783 weblogic 88u IPv4 14894 0t0 TCP machine1.com:empowerid (LISTEN)
java 2783 weblogic 90u IPv6 15012 0t0 TCP machine1.com:9042 (LISTEN)
java 2783 weblogic 91u IPv4 15014 0t0 TCP machine1.com:apani1 (LISTEN)
java 2783 weblogic 92u IPv4 18409 0t0 TCP machine1.com:empowerid->machine2.com:57281 (ESTABLISHED)
java 2783 weblogic 95u IPv4 19603 0t0 TCP machine1.com:35536->machine2.com:empowerid (ESTABLISHED)
java 2783 weblogic 96u IPv4 19606 0t0 TCP machine1.com:39417->machine1.com:empowerid (ESTABLISHED)
java 2783 weblogic 97u IPv4 19607 0t0 TCP machine1.com:empowerid->machine1.com:39418 (ESTABLISHED)
java 2783 weblogic 99u IPv4 19620 0t0 TCP machine1.com:empowerid->machine2.com:57291 (ESTABLISHED)

# i/o statistics
[weblogic@machine1 cassandra2.1.2]$ sar -b
Linux 2.6.32-504.3.3.el6.x86_64 (machine1.com) 01/29/2015 _x86_64_ (4 CPU)
04:30:01 PM tps rtps wtps bread/s bwrtn/s
04:40:01 PM 4.43 1.82 2.61 70.90 29.22
04:50:01 PM 18.95 0.04 18.91 1.82 279.62
05:00:01 PM 31.86 0.45 31.41 132.36 478.27
Average: 18.15 0.79 17.36 68.57 257.92

# individual block device i/o
[weblogic@machine1 cassandra2.1.2]$ sar -p -d
Linux 2.6.32-504.3.3.el6.x86_64 (machine1.com) 01/29/2015 _x86_64_ (4 CPU)
04:30:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
04:40:01 PM sda 1.57 35.45 14.67 31.86 0.03 16.66 8.14 1.28
04:40:01 PM vg_machine1-lv_root 2.86 35.45 14.55 17.49 0.07 24.30 4.48 1.28
04:40:01 PM vg_machine1-lv_swap 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
04:50:01 PM sda 1.45 0.91 139.81 96.75 0.03 23.39 13.28 1.93
04:50:01 PM vg_machine1-lv_root 17.49 0.91 139.81 8.04 0.84 48.16 1.10 1.93
04:50:01 PM vg_machine1-lv_swap 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:00:01 PM sda 1.74 66.18 239.14 175.20 0.11 62.79 15.40 2.68
05:00:01 PM vg_machine1-lv_root 30.12 66.18 239.14 10.14 8.58 284.98 0.89 2.68
05:00:01 PM vg_machine1-lv_swap 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sda 1.59 34.29 128.98 102.68 0.06 35.24 12.28 1.95
Average: vg_machine1-lv_root 16.56 34.29 128.94 9.86 3.11 187.95 1.18 1.95
Average: vg_machine1-lv_swap 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

# swap space
[weblogic@machine1 cassandra2.1.2]$ sar -S
Linux 2.6.32-504.3.3.el6.x86_64 (machine1.com) 01/29/2015 _x86_64_ (4 CPU)
04:30:01 PM kbswpfree kbswpused %swpused kbswpcad %swpcad
04:40:01 PM 4194300 0 0.00 0 0.00
04:50:01 PM 4194300 0 0.00 0 0.00
05:00:01 PM 4194300 0 0.00 0 0.00
Average: 4194300 0 0.00 0 0.00

# CPU and memory usage by the JVM
[weblogic@machine1 cassandra2.1.2]$ ps -p 2783 -o %cpu,%mem,cmd
%CPU %MEM CMD
39.8 26.3 /u01/app/oracle/weblogic12.1.3/jdk1.8.0_25/bin/java -Xms1024m -Xmx1024m -XX:StringTableSize=1000003 -XX:+UseG1GC

# run queue and load average
[weblogic@machine1 cassandra2.1.2]$ sar -q
Linux 2.6.32-504.3.3.el6.x86_64 (machine1.com) 01/29/2015 _x86_64_ (4 CPU)
04:30:01 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
04:40:01 PM 0 379 0.09 0.10 0.10
04:50:01 PM 0 382 0.00 0.04 0.06
05:00:01 PM 0 383 0.29 0.11 0.06
05:10:01 PM 0 379 0.00 0.04 0.05
Average: 0 381 0.10 0.07 0.07

# after we shut down machine1.com, we check the status of the cluster
[weblogic@machine2 bin]$ ./nodetool --host machine2.com --port 7199 status test
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.231.110  136.14 KB  256     100.0%            6f563cc6-3e2f-4089-b28c-bdb58959c214  rack1
DN  192.168.231.100  141.25 KB  256     100.0%            0eddc3ea-6692-4c62-9135-67ef37d655b5  rack1

[weblogic@machine2 bin]$ ./cqlsh machine2.com 9042
Connected to TestCluster at machine2.com:9042.
[cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native protocol v3]
Use HELP for help.
cqlsh> SELECT count(*) FROM test.persons;

 count
-------
 99399

(1 rows)
From the Flight Recording, we can obtain detailed information on how the Java Virtual Machine is doing. The General, Overview tab provides a first peek (the JVM Information tab provides information about the Java Virtual Machine settings).
[Screenshots: Flight Recorder General, Overview and JVM Information tabs]
The Memory, Garbage Collections tab provides information regarding the individual garbage collections (such as pause times)
[Screenshot: Flight Recorder Memory, Garbage Collections tab]
The Memory, Allocations tab shows information about object allocations
[Screenshots: Flight Recorder Memory, Allocations tabs]
The Code, Overview tab shows where the hot spots are.
[Screenshot: Flight Recorder Code, Overview tab]
The Code, Compilations tab gives insight into compilation times.
[Screenshot: Flight Recorder Code, Compilations tab]
The Events, Graph tab gives insight into how the threads are doing. But first we obtain the top 5 threads from the Threads, Hot Threads tab.
[Screenshots: Flight Recorder Threads, Hot Threads tab and Events, Graph tabs]
Great stuff! As a last remark: SANs were designed to solve problems that Cassandra does not have; Cassandra was designed from the start for commodity hardware.

References

[1] Cassandra Documentation.
[2] Spring Data Cassandra - Reference Documentation.