Monday, December 15, 2014

Annoyed by retyping - "hdfs..."

Hi to all,
Recently I’ve been annoyed by retyping the same command over and over while viewing and modifying files on my HDFS.
Don’t you hate it when you have to type “hadoop fs ….” or, in Hadoop 2, “hdfs dfs…”?

So I’ve created some useful alias commands (that you can use on Linux/Unix/OS X). Just put them in the “.bashrc” file in your home directory (or the equivalent file for whatever shell you are running).
You can tell what each command does because they are named after the Linux commands, with an "h" prefix.

Another perk is that these aliases work with Amazon’s S3 as well. The usual way to work with S3 is the “s3cmd” tool, which we have installed on the server that is connected to the HDFS, but S3 is also really easy to use through Hadoop’s native file system implementation.
The only thing you have to do is add the protocol of the file system you are trying to access, for example:

“hls s3n://some-bucket-name/key/file.txt”

If you have some more useful commands that you use a lot while working, I would love to be enlightened.

Enjoy typing less :)

Friday, December 12, 2014

Spark Configuration Mess Solved

Hi to all,
Our problem was passing configuration from Spark Driver to the Slaves.
A short background on our running environment (Before we talk about the problem):

We are running Apache Spark on Amazon’s EMR (Elastic MapReduce).
Everything runs under Apache YARN resource management (this complicates things). The input data is on Amazon’s S3 file system, and the HDFS is on the Spark cluster.
Our Spark applications are written with the Java API, and we run them via the “spark-submit” script on the cluster in client mode.
The mapper functions that run on the RDDs need to access a MongoDB (a different one on each environment – development, testing and production).

The last requirement brought up the problem:

How can we pass the relevant MongoDB details without actually passing them in the constructor of each and every function we write in the code?
I would like to run the code seamlessly, without caring whether it is currently running on the Driver node or was serialized and sent to a slave.

We were using hard-coded values at first, but that is not good enough to support different deployment and running environments.
We are using the --properties-file parameter of “spark-submit”, because we need to pass parameters to the various applications, and different ones for each of course, for example:
  • Log4j Configuration file for the application.
  • MongoDB details.
  • Application running status
  • Etc.
In that manner we were reading all of the parameters passed via the properties file as Java system properties, simply with “System.getProperty()”.
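
A minimal sketch of that kind of read, assuming hypothetical property names (“spark.myapp.mongodb.host” and “spark.myapp.mongodb.port” are made up for illustration, as is the MongoSettings class). The “spark.” prefix is used because, as described later in this post, those are the entries the Driver could see:

```java
// Sketch only: the class name and both property names are hypothetical.
final class MongoSettings {
    static final String HOST_KEY = "spark.myapp.mongodb.host";
    static final String PORT_KEY = "spark.myapp.mongodb.port";

    private MongoSettings() {}

    // Fails loudly when the property is missing from this JVM.
    static String host() {
        String host = System.getProperty(HOST_KEY);
        if (host == null) {
            throw new IllegalStateException("Missing system property: " + HOST_KEY);
        }
        return host;
    }

    // Falls back to MongoDB's default port when the property is not set.
    static int port() {
        return Integer.parseInt(System.getProperty(PORT_KEY, "27017"));
    }

    public static void main(String[] args) {
        // Stand-in for the value that would normally come from the properties file.
        System.setProperty(HOST_KEY, "mongo.example.test");
        System.out.println(host() + ":" + port());
    }
}
```

This works only in the JVM where the property was actually set, which is exactly the limitation described next.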


But the problem was that some code was sent to the Spark slaves, and they didn’t have these system properties because they run in a different JVM.
We wanted to find a way to pass these configurations to the slaves without producing ugly code.
But we didn’t know where to read the configurations from, because we can’t know on which slave the code will run and in what context. We also can’t use the SparkContext for this, because it’s not Serializable.

The solution (trying to keep it simple and modular):
  • We are keeping all of our configurations on HDFS. (A place that all the participants in the cluster can access)
  • Every application has its own properties files in its own place on the HDFS.
  • Using Apache Commons Configuration (to read the configurations and to use “include” statements in the properties files) we are reading all the relevant configurations. The hard part is passing the link to the properties file location.
  • A singleton object (I really hate using singletons, but the situation forced us) that we will call ConfigProvider returns a Configuration object (Map) of properties loaded from that location, which is passed as a system property (both in the master and in the slaves). This way the ConfigProvider is loaded only once in every JVM.
I’ve tried all the documented ways to pass the property, and the conclusion was that a lot of them just don’t work (all of them were written in the properties file passed to the spark-submit script):
    1. spark.yarn.appMasterEnv.[var-name] – returns null in the slave, both with and without the prefix.
    2. spark.executorEnv.[var-name] – returns null in the slave, both with and without the prefix.
    3. spark.[var-name] – returns null in the slave, both with and without the prefix.
By the way, all of them return the correct value on the Driver running the application if you access them with the full name, because of the “spark.” prefix.

The right way to pass the parameter is through the properties “spark.driver.extraJavaOptions” and “spark.executor.extraJavaOptions”.
I’ve passed both the log4j configuration property and the parameter that I needed for the configurations. (To the Driver I was able to pass only the log4j configuration.)
For example (written in the properties file passed to spark-submit with --properties-file):

spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/ -
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/ hdfs:///some/path/on/hdfs/
And to list out the parameters and where they are passed to:
  1. Passes the log4j configuration file to the Driver. I’ve tried to pass another parameter there, but it didn’t accept it very well.
  2. Passes the log4j configuration file to every executor. I’ve also passed another parameter as a Java system property; this was successful, and the executors’ code managed to read it with “System.getProperty()”.
  3. To read the location of the properties file in the Driver, I’ve added a property with the same name plus the “spark.” prefix. In the Driver, where the application starts, I was able to read it because it’s passed from the properties file to the driver code.
The ConfigProvider class needs to check for the existence of the parameters (the plain name or the one with the “spark.” prefix) and, if one of them is found, point Apache Commons Configuration at that location and then instantiate the properties object.
Another defensive measure that I’ve taken: once I’ve read the configurations from the properties file, I copy them into a Map and keep that object as a member of the ConfigProvider class. (I kept it immutable that way, and it became an object that can be serialized in case you need it to be.)
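
To make the idea concrete, here is a rough sketch of such a provider, not our actual class. Several things in it are assumptions: the property name “myapp.config.location” is hypothetical, plain java.util.Properties stands in for Apache Commons Configuration (so there is no “include” support), and the file is opened with java.nio from a “file:” URI instead of from HDFS, which in real life would go through Hadoop’s FileSystem API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Sketch only: names are hypothetical and java.util.Properties replaces
// Apache Commons Configuration to keep the example self-contained.
final class ConfigProvider {
    // Plain name: what an executor would get through -D in extraJavaOptions.
    static final String LOCATION_KEY = "myapp.config.location";
    // Prefixed name: what the Driver would see from the properties file.
    static final String DRIVER_LOCATION_KEY = "spark." + LOCATION_KEY;

    private static Map<String, String> config;

    private ConfigProvider() {}

    // Loads the configuration once per JVM and returns the same map afterwards.
    static synchronized Map<String, String> get() {
        if (config == null) {
            config = load(resolveLocation());
        }
        return config;
    }

    // Checks the plain name first, then the "spark."-prefixed one.
    static String resolveLocation() {
        String location = System.getProperty(LOCATION_KEY);
        if (location == null) {
            location = System.getProperty(DRIVER_LOCATION_KEY);
        }
        if (location == null) {
            throw new IllegalStateException("No configuration location was passed to this JVM");
        }
        return location;
    }

    // Reads the file and copies it into an unmodifiable (and serializable) map.
    private static Map<String, String> load(String location) {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(URI.create(location)))) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> copy = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            copy.put(name, props.getProperty(name));
        }
        return Collections.unmodifiableMap(copy);
    }
}
```

Code running on the Driver or on an executor would then just call ConfigProvider.get().get("some.key"), without caring which JVM it is in.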

What about unit testing, you ask – do I need configuration files for all of my tests?

I’ve added an injectConfiguration(Map properties) method to the ConfigProvider, and because it’s a singleton in the JVM, you can inject your own properties into the code and test your inner classes that use the ConfigProvider to get the configurations they need to run.
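
A small sketch of how that injection could look in a test, under some assumptions: the provider here is a stripped-down stand-in (named TestableConfigProvider to make clear it is not our real class), and MongoAddress and the “mongodb.*” keys are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Stripped-down stand-in for the provider; only the injection path is shown.
final class TestableConfigProvider {
    private static volatile Map<String, String> config;

    private TestableConfigProvider() {}

    // Replaces whatever configuration this JVM holds with a defensive, read-only copy.
    static void injectConfiguration(Map<String, String> properties) {
        config = Collections.unmodifiableMap(new HashMap<>(properties));
    }

    static String get(String key) {
        Map<String, String> current = config;
        if (current == null) {
            // A full provider would load the properties file here instead.
            throw new IllegalStateException("Configuration was not loaded or injected");
        }
        return current.get(key);
    }
}

// Hypothetical inner class that depends on the provider.
final class MongoAddress {
    private MongoAddress() {}

    static String describe() {
        return TestableConfigProvider.get("mongodb.host") + ":" + TestableConfigProvider.get("mongodb.port");
    }
}
```

In a test you would build a small Map, call injectConfiguration with it, and then exercise MongoAddress.describe() with no properties file anywhere in sight.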

If you need more elaboration about other Spark configurations,
I’ll be happy to answer, just comment below.

Enjoy configuring your Applications.