Cassandra (2) Understanding the Architecture: Planning a cluster deployment

When planning a Cassandra cluster deployment, you should have a good idea of the initial volume of data
you plan to store and a good estimate of your typical application workload.
The following topics provide information for planning your cluster:
Selecting hardware for enterprise implementations
Choosing appropriate hardware depends on selecting the right balance of the following resources:
memory, CPU, disks, number of nodes, and network.
Memory
The more memory a Cassandra node has, the better the read performance. More RAM allows for larger cache
sizes and reduces disk I/O for reads. More RAM also allows memory tables (memtables) to hold more
recently written data. Larger memtables lead to fewer SSTables being flushed to disk and
fewer files to scan during a read. The ideal amount of RAM depends on the anticipated size of your hot
data.
• For dedicated hardware, the optimal price-performance sweet spot is 16GB to 64GB; the minimum is
8GB.
• For virtual environments, the optimal range may be 8GB to 16GB; the minimum is 4GB.
• For testing light workloads, Cassandra can run on a virtual machine as small as 256MB.
• For setting Java heap space, see Tuning Java resources.
CPU
Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound. Cassandra is
highly concurrent and uses as many CPU cores as available:
• For dedicated hardware, 8-core processors are the current price-performance sweet spot.
• For virtual environments, consider using a provider that allows CPU bursting, such as Rackspace Cloud Servers.

Disk
Disk space depends a lot on usage, so it's important to understand the mechanism. Cassandra writes
data to disk when appending data to the commit log for durability and when flushing memtables to
SSTable data files for persistent storage. SSTables are
periodically compacted. Compaction improves performance by merging and rewriting data and discarding
old data. However, depending on the type of compaction_strategy and size of the compactions,
compaction can substantially increase disk utilization and data directory volume. For this reason,
you should leave an adequate amount of free disk space available on a node: 50% (worst case) for
SizeTieredCompactionStrategy and large compactions, and 10% for LeveledCompactionStrategy. The
following links provide information about compaction:
• Configuring compaction and compression
• The Apache Cassandra storage engine
• Leveled Compaction in Apache Cassandra
• When to Use Leveled Compaction
For information on calculating disk size, see Calculating usable disk capacity.
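As a rough illustration of the headroom guidance above, here is a minimal Python sketch. The 50% and 10% figures come from the compaction guidance in this section; the 1000 GB disk size is a hypothetical example value, not a recommendation.

# Rough per-node headroom check based on the compaction guidance above.
HEADROOM = {
    "SizeTieredCompactionStrategy": 0.50,  # worst case, large compactions
    "LeveledCompactionStrategy": 0.10,
}

def max_data_per_node(disk_size_gb, strategy):
    """Largest data volume to keep per node while preserving compaction headroom."""
    return disk_size_gb * (1.0 - HEADROOM[strategy])

print(max_data_per_node(1000, "SizeTieredCompactionStrategy"))  # ~500 GB
print(max_data_per_node(1000, "LeveledCompactionStrategy"))     # ~900 GB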
Recommendations:
Capacity per node
Ideal capacity for Cassandra 1.2 and later is 3-5TB per node. For Cassandra 1.1, it is 500-800GB per node.

Capacity and I/O
When choosing disks, consider both capacity (how much data you plan to store) and I/O (the write/read
throughput rate). Some workloads are best served by using less expensive SATA disks and scaling disk
capacity and I/O by adding more nodes (with more RAM).
Solid-state drives
SSDs are the recommended choice for Cassandra. Cassandra's sequential, streaming write patterns
minimize the undesirable effects of write amplification associated with SSDs. This means that Cassandra
deployments can take advantage of inexpensive consumer-grade SSDs. Enterprise-level SSDs are not
necessary because Cassandra's SSD access wears out consumer-grade SSDs in the same time frame as
more expensive enterprise SSDs.
Note:  For SSDs it is recommended that both commit logs and SSTables are on the same mount
point.
Number of disks - SATA
Ideally Cassandra needs at least two disks, one for the commit log and the other for the data directories.
At a minimum the commit log should be on its own partition.
Commit log disk - SATA
The disk need not be large, but it should be fast enough to receive all of your writes as appends
(sequential I/O).
Data disks
Use one or more disks and make sure they are large enough for the data volume and fast enough to both
satisfy reads that are not cached in memory and to keep up with compaction.
RAID on data disks
It is generally not necessary to use RAID for the following reasons:
• Data is replicated across the cluster based on the replication factor you've chosen.
• Starting in version 1.2, Cassandra takes care of disk management with the JBOD (just a bunch of disks)
support feature. Because Cassandra properly reacts to a disk failure, based on your availability/consistency
requirements, either by stopping the affected node or by blacklisting the failed drive, you can deploy
Cassandra nodes with large disk arrays without the overhead of RAID 10.
RAID on the commit log disk
Generally RAID is not needed for the commit log disk. Replication adequately prevents data loss. If you
need the extra redundancy, use RAID 1.

Extended file systems
DataStax recommends deploying Cassandra on XFS. On ext2 or ext3, the maximum file size is 2TB even
using a 64-bit kernel. On ext4 it is 16TB.
Because Cassandra can use almost half your disk space for a single file, use XFS when using large disks,
particularly if using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and essentially
unlimited on 64-bit.
Number of nodes
Prior to version 1.2, the recommended disk space per node was 300 to 500GB. Improvements in
Cassandra 1.2, such as JBOD support, virtual nodes, off-heap Bloom filters, and parallel leveled
compaction (SSD nodes only), allow you to use fewer machines with multiple terabytes of disk space.
Network
Since Cassandra is a distributed data store, it puts load on the network to handle read/write requests and
replication of data across nodes. Be sure to choose reliable, redundant network interfaces and make sure
that your network can handle traffic between nodes without bottlenecks.
• Recommended bandwidth is 1000 Mbit/s (gigabit) or greater.

• Bind the internode communication interface (listen_address) to a specific NIC (Network Interface Card).
• Bind the Thrift RPC server interface (rpc_address) to another NIC.
Cassandra efficiently routes requests to replicas that are geographically closest to the coordinator node
and chooses a replica in the same rack if possible; it always chooses replicas located in the same data
center over replicas in a remote data center.
Firewall
If using a firewall, make sure that nodes within a cluster can reach each other. See Configuring firewall
port access.
Generally, when you have firewalls between machines, it is difficult to run JMX across the network and
maintain security. This is because JMX connects on port 7199, handshakes, and then uses any port
within the 1024+ range. Instead, use SSH to execute commands remotely and connect to JMX locally, or
use DataStax OpsCenter.

Planning an Amazon EC2 cluster
DataStax provides an Amazon Machine Image (AMI) to allow you to quickly deploy a multi-node
Cassandra cluster on Amazon EC2.
The DataStax AMI initializes all nodes in one availability zone using the SimpleSnitch.
If you want an EC2 cluster that spans multiple regions and availability zones, do not use the DataStax
AMI. Instead, install Cassandra on your EC2 instances as described in Installing Cassandra Debian
packages, and then configure the cluster as a multiple data center cluster.
Use the following guidelines when setting up your cluster:
• For production Cassandra clusters on EC2, use Large or Extra Large instances with local storage.
Amazon Web Services recently reduced the number of default ephemeral disks attached to the image
from four to two. Performance will be slower for new nodes unless you manually attach the additional
two disks; see Amazon EC2 Instance Store.
• RAID 0 the ephemeral disks, and put both the data directory and the commit log on that volume. This
has proved to be better in practice than putting the commit log on the root volume (which is also
a shared resource). For more data redundancy, consider deploying your Cassandra cluster across
multiple availability zones or using EBS volumes to store your Cassandra backup files.
• Cassandra JBOD support allows you to use standard disks, but you may get better throughput with
RAID 0. RAID 0 stripes each block across the devices so that writes proceed in parallel rather than
serially to a single disk.
• EBS volumes are not recommended for Cassandra data volumes for the following reasons:
• EBS volumes contend directly for network throughput with standard packets. This means that EBS
throughput is likely to fail if you saturate a network link.
• EBS volumes have unreliable performance. I/O performance can be exceptionally slow, causing the
system to back load reads and writes until the entire cluster becomes unresponsive.
• Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily
surpass the ability of the system to keep effective buffer caches and concurrently serve requests
for all of the data it is responsible for managing.
For more information and graphs related to ephemeral versus EBS performance, see the blog article
Systematic Look at EC2 I/O.

Calculating usable disk capacity
Determining how much data your Cassandra nodes can hold.

To calculate how much data your Cassandra nodes can hold, calculate the usable disk capacity per node
and then multiply that by the number of nodes in your cluster. Remember that in a production cluster, you
will typically have your commit log and data directories on different disks.
1. Start with the raw capacity of the physical disks:
raw_capacity = disk_size * number_of_data_disks
2. Account for file system formatting overhead (roughly 10 percent):
formatted_disk_space = raw_capacity * 0.9
3. Account for operational headroom during normal use:
usable_disk_space = formatted_disk_space * (0.5 to 0.8)
During normal operations, Cassandra routinely requires disk capacity for compaction and repair
operations. For optimal performance and cluster health, DataStax recommends not filling your disks
to capacity, but running at 50% to 80% capacity depending on the compaction strategy and size of the
compactions.
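The three steps above can be combined into a small calculation. A minimal sketch in Python; the disk sizes, fill ratio, and node count used in the example are hypothetical values, not recommendations.

def usable_disk_space(disk_size_gb, number_of_data_disks, fill_ratio=0.5):
    """Usable space per node after formatting overhead and operational headroom."""
    raw_capacity = disk_size_gb * number_of_data_disks   # step 1: raw capacity
    formatted_disk_space = raw_capacity * 0.9            # step 2: ~10% formatting overhead
    return formatted_disk_space * fill_ratio             # step 3: run at 50% to 80% capacity

# Example: two 2 TB data disks per node, conservative 50% fill, 6-node cluster.
per_node = usable_disk_space(2000, 2, fill_ratio=0.5)    # ~1800 GB per node
cluster_capacity = per_node * 6                          # multiply by the number of nodes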

Calculating user data size
Accounting for storage overhead in determining user data size.
As with all data storage systems, the size of your raw data will be larger once it is loaded into Cassandra
due to storage overhead. On average, raw data is about two times larger on disk after it is loaded into the
database, but could be much smaller or larger depending on the characteristics of your data and tables.
The following calculations account for data persisted to disk, not for data stored in memory.
• Determine column overhead:
regular_total_column_size = column_name_size + column_value_size + 15
counter_or_expiring_total_column_size = column_name_size + column_value_size + 23
Every column in Cassandra incurs 15 bytes of overhead. Since each row in a table can have different
column names as well as differing numbers of columns, metadata is stored for each column. For
counter columns and expiring columns, you should add an additional 8 bytes (23 bytes total).
• Account for row overhead.
Every row in Cassandra incurs 23 bytes of overhead.
• Estimate primary key index size:
primary_key_index = number_of_rows * ( 32 + average_key_size )
Every table also maintains a partition index. This estimation is in bytes.
• Determine replication overhead:
replication_overhead = total_data_size * ( replication_factor - 1 )
The replication factor plays a role in how much disk capacity is used. For a replication factor of 1, there
is no overhead for replicas (as only one copy of data is stored in the cluster). If replication factor is
greater than 1, then your total data storage requirement will include replication overhead.
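Taken together, the formulas above can be used to estimate the on-disk size of a table. A minimal sketch; the row count, column sizes, key size, and replication factor in the example are hypothetical values.

REGULAR_COLUMN_OVERHEAD = 15    # bytes per regular column
COUNTER_EXPIRING_OVERHEAD = 23  # bytes per counter or expiring column
ROW_OVERHEAD = 23               # bytes per row

def column_size(name_size, value_size, counter_or_expiring=False):
    overhead = COUNTER_EXPIRING_OVERHEAD if counter_or_expiring else REGULAR_COLUMN_OVERHEAD
    return name_size + value_size + overhead

def table_size_on_disk(rows, columns_per_row, name_size, value_size, average_key_size):
    row_size = columns_per_row * column_size(name_size, value_size) + ROW_OVERHEAD
    primary_key_index = rows * (32 + average_key_size)   # partition index, in bytes
    return rows * row_size + primary_key_index

def total_with_replication(data_size, replication_factor):
    replication_overhead = data_size * (replication_factor - 1)
    return data_size + replication_overhead

# Example: 10 million rows of 20 regular columns each, replication factor 3.
base = table_size_on_disk(10_000_000, 20, name_size=10, value_size=100, average_key_size=16)
total = total_with_replication(base, replication_factor=3)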

Anti-patterns in Cassandra
Implementation or design patterns that are ineffective and/or counterproductive in Cassandra production
installations. Correct patterns are suggested in most cases.
Network attached storage
Storing SSTables on a network attached storage (NAS) device is of limited use. Using a NAS device often
results in network related bottlenecks resulting from high levels of I/O wait time on both reads and writes.
The causes of these bottlenecks include:
• Router latency.

• The Network Interface Card (NIC) in the node.
• The NIC in the NAS device.
There are exceptions to this pattern. If you use NAS, ensure that each drive is accessed only by one
machine and each drive is physically close to the node.
Shared network file systems
Shared network file systems (NFS) have the same limitations as NAS. The temptation with NFS
implementations is to place all SSTables from a node onto one NFS share. Doing this defeats one of
Cassandra's strongest features: No Single Point of Failure (SPOF). When all SSTables from all nodes are
stored onto a single NFS, the NFS becomes a SPOF. To best use Cassandra, avoid using NFS.
Excessive heap space size
DataStax recommends using the default heap space size for most use cases. Exceeding this size can
impair the Java virtual machine's (JVM) ability to perform fluid garbage collections (GC). The following
table shows a comparison of heap space performances reported by a Cassandra user:

Heap     CPU utilization     Queries per second       Latency
40 GB    50%                 750                      1 second
8 GB     5%                  8500 (not maxed out)     10 ms


For information on heap sizing, see Tuning Java resources.

Cassandra's rack feature
Defining one rack for the entire cluster is the simplest and most common implementation. Multiple racks
should be avoided for the following reasons:
• Most users tend to ignore or forget the rack requirement that racks should be organized in an
alternating order. This order allows the data to be distributed safely and appropriately.
• Many users do not use the rack information effectively; for example, they set up with as many racks
as nodes (or similar non-beneficial scenarios).
• Expanding a cluster when using racks can be tedious. The procedure typically involves several node
moves and must ensure that racks are distributing data correctly and evenly. When clusters need
immediate expansion, racks should be the last concern.
To use racks correctly:
• Use the same number of nodes in each rack.
• Start with one rack, then place the nodes in different racks in an alternating pattern. This allows you to
still get the benefits of Cassandra's rack feature while allowing for quick and fully functional expansions.
Once the cluster is stable, you can swap nodes and make the appropriate moves to ensure that nodes
are placed in the ring in an alternating fashion with respect to the racks.

Multiple-gets
Multiple-gets may cause problems. One sure way to kill a node is to buffer 300MB of data, timeout, and
then try again from 50 different clients.
You should architect your application using many single requests for different rows. This method ensures
that if a read fails on a node, due to a backlog of pending requests, an unmet consistency, or other error,
only the failed request needs to be retried.
Ideally, read an entire row or a slice of it with a single request per key. Be sure to keep row sizes in mind
to prevent out-of-memory (OOM) errors from reading too many entire ultra-wide rows in parallel.
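A minimal sketch of the request-per-key pattern described above. The fetch_row callable is a hypothetical stand-in for whatever single-row read your client library provides, not a real driver API; the point is that only the keys that fail need to be retried.

def fetch_rows(keys, fetch_row, retries=2):
    """fetch_row(key) is assumed to raise an exception on timeout or unavailability."""
    results, failed = {}, []
    for key in keys:
        for attempt in range(retries + 1):
            try:
                results[key] = fetch_row(key)   # one small request per row
                break
            except Exception:
                if attempt == retries:
                    failed.append(key)          # only this key needs further handling
    return results, failed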

Using the Byte Ordered Partitioner
The Byte Ordered Partitioner (BOP) is not recommended.
Use virtual nodes and use either the Murmur3Partitioner (default) or the RandomPartitioner. Virtual
nodes allow each node to own a large number of small ranges distributed throughout the cluster. Using
virtual nodes saves you the effort of generating tokens and assigning tokens to your nodes. If not using
virtual nodes, these partitioners are recommended because all writes occur on the hash of the key and
are therefore spread throughout the ring among the token ranges. These partitioners ensure that your
cluster evenly distributes data by placing the key at the correct token using the key's hash value. Even if
data becomes stale and needs to be deleted, removal also takes place while keeping data evenly
distributed around the cluster.
Reading before writing
Reads take time for every request, as they typically have multiple disk hits for uncached reads. In
workflows requiring reads before writes, this small amount of latency can affect overall throughput. All write I/
O in Cassandra is sequential so there is very little performance difference regardless of data size or key
distribution.
Load balancers
Cassandra was designed to avoid the need for load balancers. Putting load balancers between Cassandra
and Cassandra clients is harmful to performance, cost, availability, debugging, testing, and scaling. All
high-level clients, such as Astyanax and pycassa, implement load balancing directly.
Insufficient testing
Be sure to test at scale and with production loads. This is the best way to ensure your system will function
properly when your application goes live. The information you gather from testing is the best indicator of
what throughput per node is needed for future expansion calculations.
To properly test, set up a small cluster with production loads. There will be a maximum throughput
associated with each node count before the cluster can no longer increase performance. Take the
maximum throughput at this cluster size and extrapolate it linearly to other cluster sizes. Then graph
your results to predict the cluster sizes required for the throughput your production cluster needs, now
and in the future. The Netflix case study shows an excellent example of this kind of testing.
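A minimal sketch of that extrapolation; the measured throughput and target figures below are hypothetical example values from an imagined test cluster.

import math

test_nodes = 3
test_max_throughput = 30_000      # ops/s measured at the small cluster's ceiling
per_node_throughput = test_max_throughput / test_nodes

required_throughput = 200_000     # hypothetical target for the production application
estimated_nodes = math.ceil(required_throughput / per_node_throughput)   # 20 nodes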

Lack of familiarity with Linux
Linux has a great collection of tools. Becoming familiar with the Linux built-in tools will help you greatly
and ease operation and management costs for normal, routine tasks. The essential tools and
techniques to learn are:
• Parallel SSH and Cluster SSH: The pssh and cssh tools allow SSH access to multiple nodes. This is
useful for inspections and cluster wide changes.
• Passwordless SSH: SSH authentication is carried out by using public and private keys. This allows
SSH connections to easily hop from node to node without password access. In cases where more
security is required, you can implement a password Jump Box and/or VPN.
• Useful common command-line tools include:
• top: Provides an ongoing look at processor activity in real time.
• System performance tools: Tools such as iostat, mpstat, iftop, sar, lsof, netstat, htop, vmstat, and
similar can collect and report a variety of metrics about the operation of the system.
• vmstat: Reports information about processes, memory, paging, block I/O, traps, and CPU activity.
• iftop: Shows a list of network connections. Connections are ordered by bandwidth usage, with the
pair of hosts responsible for the most traffic at the top of list. This tool makes it easier to identify
the hosts causing network congestion.

Running without the recommended settings
Be sure to use the recommended settings in the Cassandra documentation.
Also be sure to consult the Planning a Cassandra cluster deployment documentation, which discusses
hardware and other recommendations before making your final hardware purchases.
More anti-patterns
For more about anti-patterns, see Matt Dennis's slideshare presentation.

