Sunday, July 5, 2015

Cassandra DataStax Community AMI on Amazon EC2 - Explained

Hi,
After managing our Cassandra cluster for a while, I'd like to share some information about where the configuration lives and about basic cluster management.
I'm mainly doing this because it took me a while to find all of the components, and I think it will save others a substantial amount of time.



The DataStax Community AMI on AWS comes with several components, discussed below:
  1. OpsCenter - A monitoring tool that shows the metrics of your Cassandra clusters. You can manage multiple clusters with a single OpsCenter installation, and you can choose whether to install OpsCenter when you launch an instance of the AMI.
    It gives you a convenient view of the internal performance of your Cassandra cluster.
    It's a great tool by DataStax, by the way, and the basic actions are free.
    Let’s say we’ve created a 3-node cluster: cassandra-1, cassandra-2, cassandra-3.
    On cassandra-1 we have OpsCenter installed.
    The OpsCenter link will be: http://cassandra-1:8888/opscenter/index.html
  2. Cassandra - The actual process of the Cassandra node.
  3. DataStax Agent - A service that runs on each node and reports Cassandra's metrics to the OpsCenter instance.

You can restart the cluster from OpsCenter (it does so gracefully, node by node) via "Restart" in the upper right corner.
(Note: Do not change the Cassandra node configuration from OpsCenter. It will override your changes and rewrite cassandra.yaml in a bad format.)
This is the place not to change anything :)
When you click a specific node, you will see the "Actions" menu -> "Configure".
You can view the configuration there, but don't save any changes, for your own good!
There are 3 relevant Linux services running on your instance that you should know about:
1) “cassandra” - The Cassandra node's process. (Version 2.1.7, the latest at the time of writing)
2) “datastax-agent” - Responsible for collecting metrics and reporting them to OpsCenter. It also makes the administration actions possible. (Version 5.1.3, the latest at the time of writing)
3) “opscenterd” - Runs only on “cassandra-1” in our case. It serves the OpsCenter web UI. (Version 5.1.3, the latest at the time of writing)
Administration options on each node:
(SERVICE is one of: cassandra, datastax-agent, opscenterd [relevant only to cassandra-1])
Stop: sudo service SERVICE stop
Start: sudo service SERVICE start
Status: sudo service SERVICE status
Restart: sudo service SERVICE restart
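
The four commands above can be wrapped in a small shell helper. This is only a sketch: the function name service_all and the DRY_RUN switch are made up for illustration, and it assumes the services are managed through the "service" command as described above.

```shell
# service_all ACTION [NAME...]: run "sudo service NAME ACTION" for each service.
# With no names given it covers the two services present on every node;
# pass opscenterd explicitly on the node that hosts OpsCenter.
# service_all and DRY_RUN are illustrative names, not part of the AMI.
service_all() {
  if [ "$#" -lt 1 ]; then
    echo "usage: service_all ACTION [NAME...]" >&2
    return 1
  fi
  action="$1"
  shift
  if [ "$#" -eq 0 ]; then
    set -- cassandra datastax-agent
  fi
  for name in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only print the command that would be run.
      echo "sudo service $name $action"
    else
      sudo service "$name" "$action"
    fi
  done
}
```

For example, "service_all status" checks both services on a regular node, and "service_all restart cassandra datastax-agent opscenterd" would restart everything on cassandra-1. Setting DRY_RUN=1 first only prints the commands.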

OpsCenter View: (Popup menu on the right side)

Nodes: Status of the cassandra nodes, and access to each node separately.
Activities: Ongoing actions on the cluster's nodes.
Data: Keyspaces and their column families.
File locations on each node:
Data Directories: /var/lib/cassandra
Log Directories:
  Cassandra: /var/log/cassandra (File: system.log)
  DataStax Agent: /var/log/datastax-agent (File: agent.log)
Runtime Files: /var/run/cassandra
Cassandra Jars:
  /usr/share/cassandra
  /usr/share/cassandra/lib
Bin Files:
  /usr/bin
  /usr/sbin

Configuration Files:

/etc/cassandra - (The important file is: cassandra.yaml)

Important entries in cassandra.yaml:

"cluster_name: 'your-cluster-name'" - Defines which cluster the node belongs to. It is a logical name and should be the same on all nodes of the same cluster.
seed_provider:
   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
     parameters:
         - seeds: "cassandra-1,cassandra-2,cassandra-3"
The seeds should list the nodes that other nodes contact when they start, in order to join and synchronize the cluster's ring.
"listen_address: cassandra-1" - should be set on each node to its own private IP (this example is for the first node).
"broadcast_rpc_address: cassandra-1" - should be set on each node to its own private IP.
"endpoint_snitch: Ec2Snitch" - Selects the Amazon EC2 snitch, which infers the network topology from the EC2 region and availability zone.

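To compare the cassandra.yaml entries described above across nodes, a quick grep can help. This is a rough sketch: the function name show_cassandra_settings is hypothetical, and it assumes each setting sits on its own line as in the stock file.

```shell
# show_cassandra_settings [FILE]: print the cluster-related settings of a node.
# Defaults to the cassandra.yaml location used by the AMI.
# The function name is illustrative only.
show_cassandra_settings() {
  conf="${1:-/etc/cassandra/cassandra.yaml}"
  # Match the keys at the start of a line (optionally as a "- key:" list item),
  # which skips commented-out lines beginning with "#".
  grep -E '^[[:space:]]*(-[[:space:]]*)?(cluster_name|seeds|listen_address|broadcast_rpc_address|endpoint_snitch):' "$conf"
}
```

Running it on every node and comparing the output is a simple way to confirm that cluster_name and the seeds list match everywhere.
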
/etc/datastax-agent/datastax-agent-env.sh - DataStax agent configuration
Service startup script - /etc/init.d/cassandra
Cassandra user limits - /etc/security/limits.d
Cassandra defaults - /etc/default/cassandra
"nodetool" - Cassandra's control and information tool (exists on each node):
"nodetool help" - displays all of the available commands.

Upgrading versions (minor) -
(Assuming we are running on an Ubuntu machine)
After you’ve launched an instance of the AMI, you might need to upgrade the version of one of the services.
To do so, run the following, where SERVICE is the name of the service you want to upgrade:
1) "sudo apt-get update" - updates the repository listing of available versions.
2) "sudo apt-cache policy $SERVICE" - shows the available versions and the currently installed one.
3) "sudo apt-get upgrade $SERVICE" - upgrades the specific service and all of its dependencies.
(Note: If you upgrade opscenterd, its dependency is datastax-agent, whose dependency is cassandra, so all of them will be upgraded together.)
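
The three steps can be put into one shell function. This is a sketch with one deliberate change: for the last step it swaps in "apt-get install --only-upgrade", because "apt-get upgrade" generally upgrades every upgradable package on the machine rather than a single one. The function name upgrade_service is hypothetical.

```shell
# upgrade_service NAME: refresh the package lists, show the installed and
# candidate versions, then upgrade only the named package.
# upgrade_service is an illustrative name, not an existing tool.
upgrade_service() {
  service_name="$1"
  if [ -z "$service_name" ]; then
    echo "usage: upgrade_service NAME" >&2
    return 1
  fi
  sudo apt-get update
  # apt-cache only reads the local package cache, so it needs no sudo.
  apt-cache policy "$service_name"
  # --only-upgrade upgrades the package if installed and never installs it fresh.
  sudo apt-get install --only-upgrade "$service_name"
}
```

For example, "upgrade_service cassandra" would walk through the steps for the cassandra package.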

Another important thing - Configuration changes

When you upgrade the services, after everything is installed, the installer will offer to replace your configuration files with the packaged versions, overriding the currently running settings.
Be careful, and don't accept all of the changes.
Go over them with the "D" option, which shows the differences ("N" is the default, and it keeps your current file), note what the installer wants to change, and then apply those changes yourself if needed.
Check the Cassandra changelog of the version you are installing to know what's needed.
Otherwise, all of your server configuration will be overwritten!

This was a short overview of the file locations and some basic Cassandra handling in the AWS environment.
If you have any further questions, feel free to comment, and I hope I'll be able to assist.
Have fun with the info :)

Thursday, March 19, 2015

Write Batch Size Error - spark-cassandra-connector



The Use Case:
The data currently lives on Amazon’s S3 storage system, and most of it is time series data.
We compute and analyze our data using an “Apache Spark” cluster.
While trying to migrate the data to a more efficient storage model (for both reads and writes), we tried out the Apache Cassandra database. While running the first benchmark, we encountered a lot of failures during the “write” phase on the Spark nodes.
I'm using the DataStax Java API of the spark-cassandra-connector.
(You can find some old but useful examples in the DataStax Blog and in the DataStax JavaAPI Documentation)
I started off with a 2-node Cassandra cluster running on 2 c3.2xlarge machines on AWS EC2. I used the DataStax Community AMI to create the nodes and connected the two of them into a cluster.

I’ve written a Spark Java application that reads the data from S3 and writes out the RDDs to the Cassandra cluster.


During the “runJob at RDDFunctions.scala:24” stage there were a lot of failures like:


“java.io.IOException: Failed to write 273 batches to test.some_cassandra_table.
……”


Eventually the Spark application would fail because of these failures, so I tried to find the cause of the problem. I went to one of the Cassandra nodes and checked the node's log, located at
“/var/log/cassandra/system.log”, and saw many warning messages like the following coming in all the time:
WARN  [SharedPool-Worker-132] 2015-03-19 07:44:43,229 BatchStatement.java:243 - Batch of prepared statements for [test.some_cassandra_table] is of size 5264, exceeding specified threshold of 5120 by 144.


I searched Google for a solution and came up with nothing except variations of the same Stack Overflow answer on many sites, which did not help much, so I started looking into the meaning of the log messages.
I found the definition of the “batch size threshold” on the Cassandra nodes in “/etc/cassandra/cassandra.yaml”:


batch_size_warn_threshold_in_kb: 5


That comes to 5 x 1024 = 5120 bytes, which is the threshold value in the logs.
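
The same arithmetic can be checked on a node with a few lines of shell. This is a sketch: the function name warn_threshold_bytes is hypothetical, and it assumes the setting is present and uncommented in cassandra.yaml.

```shell
# warn_threshold_bytes [FILE]: read batch_size_warn_threshold_in_kb from
# cassandra.yaml and print the value converted to bytes.
# The function name is illustrative only.
warn_threshold_bytes() {
  conf="${1:-/etc/cassandra/cassandra.yaml}"
  kb=$(sed -n 's/^batch_size_warn_threshold_in_kb:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | head -n 1)
  if [ -z "$kb" ]; then
    echo "batch_size_warn_threshold_in_kb not found in $conf" >&2
    return 1
  fi
  echo $(( kb * 1024 ))
}
```

With the value 5 from the file this prints 5120, and the batch of 5264 bytes from the log line exceeds that by 5264 - 5120 = 144, matching the warning.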


The next step was figuring out how to change the write batch size of the DataStax spark-cassandra-connector, and I found the following documentation reference: Link


The “Tuning” section mentions the “spark.cassandra.output.batch.size.rows” parameter, which you can set on the SparkConf when creating the JavaSparkContext. I changed it to 5120, but that did not have the desired effect: it made the batches much larger.
Its default is “auto”, and the outcome with “auto” was much better.


So, frustrated with the outcome, I went on to read the source code of the connector. In the Scala class “WriteConf.scala” I found another parameter that made a real difference: when the rows setting is left at its “auto” default, WriteConf falls back to the value of a second parameter, “spark.cassandra.output.batch.size.bytes”.

I changed the value to 8192 (instead of the default: 1024 * 16 = 16384), making the batches smaller, and things started working fine.
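
One way to try this setting without rebuilding the application is to pass it through spark-submit's "--conf" option. This is only a sketch: the main class and jar name are hypothetical placeholders, and only the property name and the value 8192 come from this post.

```shell
# Property and value from the post; the rows setting stays at "auto"
# so that the byte limit is the one that applies.
BATCH_SIZE_BYTES=8192
BATCH_CONF="spark.cassandra.output.batch.size.bytes=${BATCH_SIZE_BYTES}"

# submit_job: hypothetical wrapper around spark-submit.
# com.example.S3ToCassandra and s3-to-cassandra.jar are placeholder names.
submit_job() {
  spark-submit \
    --conf "$BATCH_CONF" \
    --class com.example.S3ToCassandra \
    s3-to-cassandra.jar
}
```

Keep in mind that a value set directly on the SparkConf in the application code takes precedence over one passed with "--conf".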


Although the log message is only a warning, the writes worked some of the time and failed at other times. After changing the parameter, they did not fail anymore.


And if you are asking why the threshold is 5K, I found the following explanation: Link.
Quote: Key reasoning for the desire comes from Patrick McFadden


"Yes that was in bytes. Just in my own experience, I don't recommend more
than ~100 mutations per batch. Doing some quick math I came up with 5k as
100 x 50 byte mutations.


Totally up for debate."


So, as we can see, these settings can be tuned further, depending on the queries we want to run and on the actual environment.
In the future I’ll play a bit more with these settings, but for now it’s enough for me to continue the benchmark with no errors :)


I felt that someone else must have had the same problem, but I didn’t find anything written about it on the internet.
If anyone has a better solution or some other insights, I would love to hear them. Please comment on this post.

I hope this might help someone else.

Monday, December 15, 2014

Annoyed of retyping - "hdfs..."

Hi to all,
Recently I’ve been annoyed at retyping the same commands over and over while viewing and modifying my HDFS.
Don’t you hate it when you have to type “hadoop fs ….” or, in Hadoop 2, “hdfs dfs…”?

So I’ve created some useful alias commands (that you can use on Linux/Unix/OS X). Just put them in the “.bashrc” file in your home directory (or the equivalent file for whatever shell you are running).
You can understand what each command does because they are named after the Linux commands, but with the prefix "h".
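
A minimal sketch of what such aliases could look like. Only hls is actually used in this post; the other alias names are assumptions that follow the same "h" prefix pattern, and they assume the Hadoop 2 "hdfs dfs" command.

```shell
# "h"-prefixed shortcuts for common HDFS commands (add to ~/.bashrc).
# Apart from hls, the alias names are illustrative assumptions.
alias hls='hdfs dfs -ls'
alias hcat='hdfs dfs -cat'
alias hmkdir='hdfs dfs -mkdir'
alias hrm='hdfs dfs -rm'
alias hcp='hdfs dfs -cp'
alias hmv='hdfs dfs -mv'
alias hdu='hdfs dfs -du -h'
alias hput='hdfs dfs -put'
alias hget='hdfs dfs -get'
```

After reloading the shell (for example with "source ~/.bashrc"), "hls /some/dir" lists a directory and "hput local.txt /some/dir" uploads a file.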



Another perk is that you can use these aliases with Amazon's S3 storage as well. The usual way to work with S3 from the command line is “s3cmd”, which we have installed on the server that is connected to the HDFS, but it is really easy to do the same through Hadoop’s native file system implementation.
The only thing you have to do is add the protocol of the file system you are trying to access, for example:

“hls s3n://some-bucket-name/key/file.txt”


If you have some more useful commands that you use a lot while working, I would love to be enlightened.

Enjoy typing less :)