Monday, December 15, 2014

Annoyed of retyping - "hdfs..."

Hi to all,
Recently I’ve been annoyed by retyping the same commands over and over while viewing and modifying my HDFS.
Don’t you hate it when you have to type “hadoop fs ...” or, in Hadoop 2, “hdfs dfs ...”?

So I’ve created some useful aliases (that you can use on Linux/Unix/OS X); just put them in the “.bashrc” file in your home directory (or the equivalent file for whatever shell you are running).
You can understand what each command does because they are the same as the Linux commands, but with the prefix "h".

Another perk is that you can use these aliases with Amazon’s S3 as well. The usual way to work with S3 from the command line is “s3cmd”, and we have it installed on the server that is connected to the HDFS, but S3 is also really easy to use through Hadoop’s native S3 file system implementation (s3n).
The only thing you have to do is add the protocol of the file system you are trying to access, for example:

“hls s3n://some-bucket-name/key/file.txt”

If you have more useful commands that you use a lot while working, I would love to be enlightened.

Enjoy typing less :)

Friday, December 12, 2014

Spark Configuration Mess Solved

Hi to all,
Our problem was passing configuration from the Spark Driver to the slaves.
A short background on our running environment (before we talk about the problem):

We are running Apache Spark on Amazon’s EMR (Elastic MapReduce).
Everything runs over Apache YARN resource management (this complicates things); the input data is on Amazon S3, and the HDFS is on the Spark cluster.
Our Spark applications are written with the Java API, and we run them via the “spark-submit” script on the cluster in client mode.
The mapper functions that run on the RDDs need to access a MongoDB (a different one in each environment: development, testing and production).

The last requirement brought up the problem:

How can we pass the parameters of the relevant MongoDB details, without actually passing the parameters in the constructor to each and every function we write in the code?
I would like the code to run seamlessly without caring whether it is running on the Driver node or was serialized and sent to a slave.

We were using hard-coded values at first, but that is not good enough to support different deployment and running environments.
We are using the “--properties-file” parameter of “spark-submit”, because we need to pass parameters to the various applications (different ones for each, of course), for example:
  • Log4j Configuration file for the application.
  • MongoDB details.
  • Application running status
  • Etc.
In that manner we were reading all of the parameters passed via the properties file as Java system properties, simply with “System.getProperty()”.
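
A minimal sketch of that kind of read, assuming a hypothetical property name “spark.mongodb.host” (the real names are whatever you put in your own properties file). The fallback value is only there to show what happens in a JVM that never received the property:

```java
// Sketch: reading a value that was exposed as a Java system property.
// "spark.mongodb.host" is a hypothetical name used only for illustration.
final class DriverSettings {
    static final String MONGO_HOST_KEY = "spark.mongodb.host";

    // Returns the property value, or the fallback when the property is not set
    // in the current JVM.
    static String mongoHost(String fallback) {
        return System.getProperty(MONGO_HOST_KEY, fallback);
    }

    public static void main(String[] args) {
        System.out.println("MongoDB host: " + mongoHost("localhost"));
    }
}
```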


But the problem was that some of the code was sent to the Spark slaves, and they didn’t have these system properties because they run in a different JVM.
We wanted to find a way to pass these configurations to the slaves without producing ugly code.
But we didn’t know where to read the configurations from, because we can’t know on which slave the code will run and in what context. There is no way to use the SparkContext because it’s not Serializable.

The solution is: (Trying to keep it simple and modular)
  • We are keeping all of our configurations on HDFS. (A place that all the participants in the cluster can access)
  • Every application has its own properties files in its own place on the HDFS.
  • Using Apache Commons Configuration (to read the configurations and to use “include” statements in the properties files) we are reading all the relevant configurations, but the hard part is passing the link to the “properties file location”.
  • A singleton object (I really hate using Singleton, but the situation forced us) that we will call ConfigProvider returns a Configuration object (Map) of properties that are loaded from that location, which is passed as a system property (both in the master and in the slaves). This way the ConfigProvider is loaded only once in every JVM.
I’ve tried all the documented ways to pass the property, and the conclusion was that a lot of them just don’t work (all of them were written in the properties file passed to the spark-submit script):
    1. spark.yarn.appMasterEnv.[var-name] – returns null in the slave, both with the prefix or without.
    2. spark.executorEnv.[var-name]  – returns null in the slave, both with the prefix or without.
    3. spark.[var-name] – returns null in the slave, both with the prefix or without.
By the way, all of them return the correct value on the Driver running the application if you access them by their full name, including the “spark.” prefix.

The right way to pass the parameter is through the properties “spark.driver.extraJavaOptions” and “spark.executor.extraJavaOptions”:
I’ve passed both the log4j configuration property and the parameter that I needed for the configurations. (To the Driver I was able to pass only the log4j configuration.)
For example (written in the properties file passed to spark-submit with --properties-file):

spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/ -
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/ hdfs:///some/path/on/hdfs/
And to list the parameters and where they are passed to:
  1. Passes the log4j configuration file to the Driver. I’ve tried to pass another parameter there, but it didn’t accept it very well.
  2. Passes the log4j configuration file to every executor. Here I’ve also passed another parameter as a Java system property; this was successful, and the executors’ code managed to read it with “System.getProperty()”.
  3. To read the location of the properties file in the Driver, I’ve added a property with the same name plus the “spark.” prefix. In the Driver, where the application starts, I was able to read it because it is passed from the properties file to the driver code.
The ConfigProvider class needs to check for the existence of these parameters (the plain property or the one with the “spark.” prefix) and, if one of them is found, point Apache Commons Configuration at that location and then instantiate the properties object.
Another defensive measure that I’ve taken: once I’ve read the configurations from the properties file, I copy them into a Map and keep that object as a member of the ConfigProvider class. (That way I kept it immutable, and it became an object that can be serialized in case you need it to be.)

What about unit testing, you ask: do I need configuration files for all of my tests?

I’ve added an injectConfiguration(Map properties) method to the ConfigProvider, and because it’s a Singleton in the JVM, you can inject your own properties into the code and test your inner classes that use the ConfigProvider to get the configurations they need to run.
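
To make the last few paragraphs more concrete, here is a minimal sketch of what such a ConfigProvider could look like. It is not the original class: the property names “config.location” and “spark.config.location” are hypothetical, and the loading step is a stub that only records the location instead of reading a file from HDFS with Apache Commons Configuration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of a per-JVM configuration holder with a test hook.
final class ConfigProvider {
    // Hypothetical system property names that hold the properties file location.
    static final String LOCATION_KEY = "config.location";
    static final String SPARK_LOCATION_KEY = "spark." + LOCATION_KEY;

    private static volatile Map<String, String> properties;

    private ConfigProvider() { }

    // Returns the read-only configuration map, loading it at most once per JVM.
    static Map<String, String> getConfiguration() {
        Map<String, String> current = properties;
        if (current == null) {
            synchronized (ConfigProvider.class) {
                if (properties == null) {
                    properties = Collections.unmodifiableMap(load(resolveLocation()));
                }
                current = properties;
            }
        }
        return current;
    }

    // Checks the plain name first, then the one with the "spark." prefix.
    static String resolveLocation() {
        String location = System.getProperty(LOCATION_KEY);
        return location != null ? location : System.getProperty(SPARK_LOCATION_KEY);
    }

    // Stub loader: a real version would read the properties file at 'location'.
    private static Map<String, String> load(String location) {
        if (location == null) {
            throw new IllegalStateException("No configuration location was provided");
        }
        Map<String, String> loaded = new HashMap<>();
        loaded.put("config.source", location);
        return loaded;
    }

    // Test hook: replaces the configuration with a defensive, read-only copy.
    static void injectConfiguration(Map<String, String> injected) {
        synchronized (ConfigProvider.class) {
            properties = Collections.unmodifiableMap(new HashMap<>(injected));
        }
    }
}
```

In a unit test you would call ConfigProvider.injectConfiguration(...) with a small map before exercising the class under test, so no properties file is needed.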

If you need more elaboration about other Spark configurations,
I’ll be happy to answer; just comment below.

Enjoy configuring your Applications.

Friday, September 26, 2014

What are you Logging about?

Every application today needs some kind of way to report its actions or current state. The main reasons that we would need these reports are:
  1. Debugging
  2. Monitoring the Applications (components) Actions
  3. Offline and Backwards Analytics
In this case we are talking about all of these actions in a JVM based environment.

Mostly, when we use the phrase “to log” in programming, it means that we want to output information from our application to a user / administrator / log monitor.
Two things actually define a “log file”, which is a lot of rows of:
  1. Timestamp - some kind of representation of it.
  2. Message - some kind of information about the event that occurred.

What are we going to do?

We are going to go over some of the most famous and commonly used logging frameworks in Java, get to know the pros and cons of some of them, show some code samples and examples (you can find the full example project on GitHub), and finally get to a conclusion: which is the best one to use?

What will we be covering:
  1. Logging Common Phrases
  2. Console logging
  3. Log4J
  4. Logback
  5. Log4J 2
  6. Log4J2 vs Logback - Performance
  7. SLF4J
  8. Conclusion
(We are not going to cover the “java.util.logging”)
Logging Common Phrases
  1. Logger -
Probably this will be the object (I do hope it’s an interface) through which you’ll do all of your “logging”.
  2. Logging level -
The logging frameworks have almost the same names for the logging levels; in some you’ll find them as method names and in some you’ll pass them as a parameter to the “log()” method along with the message to be printed, but eventually the list is pretty much the same (ordered from the weakest to the strongest): Trace, Debug, Info, Warn, Error, Fatal.
  3. Root Logger -
This is the definition of the topmost logger; the settings applied to it will, most of the time, propagate to its child loggers.
  4. Appender -
Appenders are mostly definitions of pipes from the logging framework to different kinds of outputs. Some of the most common are: stdout, stderr, FileAppender, RollingFileAppender, SocketAppender and more; you can understand their uses from their names.
  5. Log File -
Put simply, the actual file that you are writing all of your data to. Each row has the characteristics we’ve mentioned above, “time stamp + message”, and sometimes people name the log file after the date or timestamp (like “2014-09-26-06:45-Application.log”) to make it easy to parse and search the log files later.
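To illustrate the level ordering described above, here is a small sketch of levels as an ordered enum with a threshold check. The names LogLevel and isAtLeast are made up for this example and are not taken from any of the frameworks:

```java
// Sketch: the usual level order, weakest first, and a threshold check.
enum LogLevel {
    TRACE, DEBUG, INFO, WARN, ERROR, FATAL;

    // A message is emitted only if its level is at least the configured threshold.
    boolean isAtLeast(LogLevel threshold) {
        return this.compareTo(threshold) >= 0;
    }
}
```

With a threshold of INFO, for example, WARN and ERROR messages pass while DEBUG and TRACE messages are dropped.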
Console logging
The “Hardcore” or “Old School” way! Or, as I like to call it: “The Worst Way”.
This method is used mostly for small programs or at development time. You shouldn’t use it extensively because it makes your code a nightmare to maintain, and in a large-scale application, when you are watching the console where you ran the application, you suddenly see a line like the following, which someone wrote, popping up every second:
“$$$  ### I’m Here, This is not suppose to Happen ### $$$”.
or the uninformative one:
%%%% ------ Yeah -------- %%%%
You probably want to kill the person who did it. (If you say you haven’t seen something like this in the code, you are lying!!!)
And eventually, if you want to change something afterwards, you find yourself doing search and replace over “System.out”... and changing a lot of code.
A bit about threading: there is a difference between System.out and System.err that you would probably not notice in the documentation. Output written to System.err is usually shown to the user immediately, so messages printed from multiple threads tend to appear in FIFO order, while System.out may be buffered and does not ensure the order. (Both of them go through the write() method, which has a synchronized block inside, but what matters is the implementation of the JVM and the operating system behind it.)
In short (it’s a joke, I didn’t make it short at all),
try to avoid this method.
Log4J (Versions 1.2.X)
The veteran: it has been around for a long time, since 2001, and it is very widely used.
It is very flexible and convenient because you use one interface in your code to log all of your messages, and you can control the logging level and the output methods via a configuration file.
Let’s look at a small Java application that I wrote to see some of the usages, advantages and problems.

The explanation is in the code, so I’ll just list some points:
  • Double checking (testing whether the level is enabled before logging) because of String concatenation.
  • The Logger is obtained via a factory, by the class name.
  • The log level is controlled by the logging method that you call.
  • Log4J is a thread-safe framework.
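
The reason for the double check in the first point can be shown with a toy example. This is a stand-in written only with the JDK, not the Log4J API; the names GuardDemo, Order, debug and isDebugEnabled are assumptions made for the illustration. The point is that the message string is built before the logging method is called, even when the level is disabled:

```java
// Toy stand-in for a logger (NOT the Log4J API) to show why the guard matters.
final class GuardDemo {
    static boolean debugEnabled = false;
    static int toStringCalls = 0;

    // An object whose toString() we can count, standing in for something costly.
    static final class Order {
        @Override
        public String toString() {
            toStringCalls++;
            return "Order#42";
        }
    }

    static boolean isDebugEnabled() {
        return debugEnabled;
    }

    // Drops the message when debug is off, but the caller has already built it.
    static void debug(String message) {
        if (debugEnabled) {
            System.out.println("DEBUG " + message);
        }
    }

    public static void main(String[] args) {
        Order order = new Order();
        debug("Processing " + order);       // concatenation runs regardless of level
        if (isDebugEnabled()) {             // the "double check"
            debug("Processing " + order);   // skipped entirely when debug is off
        }
        System.out.println("toString() calls: " + toStringCalls); // prints 1
    }
}
```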

So as not to lengthen this post any further, some more important configuration settings are explained in the next post.

So, how do we watch all of these Logs?

There is an open source project called Chainsaw, written in Java, that lets you see the logs in a graphical user interface.
The thing is that the last commit to this project was around 6 years ago.

Logback
As they say on their website, “Logback is intended as a successor to the popular log4j project, picking up where log4j leaves off”. (You can read their article about the reasons to prefer logback over log4j.)
I agree with some parts, and I’ll point out some nice features that I liked about “Logback” that were missing in Log4J:
  • Automatic reloading of configuration files.
  • Automatic compression of old archived log files.
  • Conditional processing of configuration files in the XML - used for the same configuration file for different environments.
  • Implementation of Logback for the Android environment.
  • logback-classic speaks SLF4J natively (We will talk about that a bit later).
  • Log4J Translator - for configurations in case of migration

A small snippet of the logback.xml configuration file:

For more elaboration please read the Logback Manual.

Viewing logs:
Just like “Chainsaw” for log4j, we have “Lilith”; one of the differences is that this project is still active today.
(They mention that Mac users might have problems with the installation, but they have an explanation for fixing it on their website.)

Configuration of the path to the logback file:

Log4J 2
It is the contender to Logback, and the newer version of “Log4j 1.2.X”.
On their website they have a performance benchmark saying that it is much better (faster) than Logback.

I won’t talk much about the features of Log4J 2 because it’s mostly on par with Logback.

For more elaboration about usage and configuration, you can find the “Log4J 2” user manual here, and the Wiki pages as well.

Logback vs Log4J 2

And now for the interesting part that will convince us (or maybe not) about “Who is the winner?”
I didn’t do the comparison myself, but I’ll let you read a great post by Tal Weiss from “Takipi” (I really recommend reading their blog). He benchmarked the performance of the two, along with a couple more frameworks, and the conclusion was that Log4J 2 is the winner regarding performance.

SLF4J
After talking about implementations and benchmarks, eventually we get to the point (almost).
SLF4J, Simple Logging Facade for Java, is a logging facade that allows the user (the programmer) to plug in the desired logging system at deployment time (and at development time, of course).

The bindings are made through the connectors loaded onto the Java classpath.
SLF4J has a connector Jar for every major Logging framework.
The benchmark made by Takipi was implemented via SLF4J as well with all of the different implementations.

In the “logging-example” project I’ve implemented the same code as the Log4JWoker; you can view it here, and download the project and run it yourself.

One of the coolest things you can notice is that you can avoid the double check in the case of string concatenation.

Notice that the “{}” inside the message is a placeholder for the parameters you pass to the logging method, and is evaluated only if the proper logging level is enabled.
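
To show the idea behind the “{}” placeholder without pulling in any library, here is a simplified toy version written only with the JDK. It is not SLF4J’s implementation (the real one handles escaping, arrays, exceptions and more), and the names PlaceholderDemo, format and debug are assumptions made for the illustration. The message is formatted, and the arguments’ toString() methods are called, only when the level is enabled:

```java
// Toy sketch of "{}" placeholders (a simplification, not SLF4J's implementation).
final class PlaceholderDemo {
    static boolean debugEnabled = false;

    // Replaces each "{}" with the next argument, left to right.
    static String format(String pattern, Object... args) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        for (Object arg : args) {
            int at = pattern.indexOf("{}", from);
            if (at < 0) {
                break; // more arguments than placeholders
            }
            out.append(pattern, from, at).append(arg);
            from = at + 2;
        }
        return out.append(pattern.substring(from)).toString();
    }

    // Formats (and therefore calls toString() on the arguments) only when enabled.
    static void debug(String pattern, Object... args) {
        if (debugEnabled) {
            System.out.println("DEBUG " + format(pattern, args));
        }
    }
}
```

Because the caller passes the pattern and the arguments separately, no string is built at the call site, which is why the explicit guard becomes unnecessary.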

If you run it, you’ll see that there is a clash of multiple bindings, and the log output is:

And you can see that after the clash SLF4J runs with its default implementation, “Logback”.

In logback and other frameworks the possibility for logging via the “log()” method, and providing the “Level” as a parameter was deprecated. Lidalia supplies a decorator for SLF4J, that allows you to access exactly that.

The list of the supported implementations of connectors is:
(and you can implement one of your own if you would like)

Conclusion
We’ve presented and compared the different logging frameworks available in Java, and my recommendation is to use SLF4J as the facade for logging, leaving the decision of which underlying framework to use to your personal taste or performance constraints.
This will leave your code clean and maintainable, and you’ll be able to use a general logging interface over any implementation you choose.
I would recommend using either “Log4J 2” or the default “Logback” implementation; it really depends on your specific needs and tastes.

Some useful SLF4J and Log4J2 connector links:

Thank you very much for reading,

If any more elaboration is needed please write a comment :)