Programing Excavation: December 2014

Hi to all,

Our problem was passing configuration from the Spark driver to the slaves.
First, a short background on our running environment:

We are running Apache Spark on Amazon’s EMR (Elastic MapReduce).
Everything runs on Apache Hadoop YARN resource management (this complicates things). The input data is on Amazon S3, and HDFS is on the Spark cluster.
Our Spark applications are written with the Java API, and we run them on the cluster with the “spark-submit” script in client mode.
The mapper functions that run on the RDDs need to access a MongoDB instance (a different one in each environment: development, testing and production).

The last requirement raised the problem:

How can we pass the relevant MongoDB details without passing them as constructor parameters to each and every function we write?
I would like the code to run seamlessly, without caring whether it is running on the driver node or was serialized and sent to a slave.

We used hard-coded values at first, but that is not good enough to support different deployment and running environments.
We use the “--properties-file” parameter of “spark-submit” because we need to pass parameters to the various applications, and different ones for each, for example:

Log4j Configuration file for the application.
MongoDB details.
Application running status
Etc.

This way, we read all of the parameters passed via the properties file as Java system properties, simply with:

System.getProperty("Property-Name");

The problem was that some of the code was sent to the Spark slaves, and they didn’t have these system properties because they run in a different JVM.
We wanted a way to pass these configurations to the slaves without producing ugly code.
But we didn’t know where to read the configurations from, because we can’t know on which slave the code will run or in what context. We can’t use the SparkContext either, because it is not Serializable.
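To make the failure mode concrete, here is a small sketch that simulates it inside a single JVM. It is not Spark code: the property name “mongo.host” and the classes are hypothetical, and clearing the property stands in for the function arriving in an executor JVM that was started without it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical mapper-like function: it looks the property up only when called.
class LazyLookup implements Serializable {
    private static final long serialVersionUID = 1L;

    String call() {
        // Resolved in whichever JVM executes this method.
        return System.getProperty("mongo.host");
    }
}

class SystemPropertyDemo {
    // Round-trips an object through Java serialization, similar to what
    // happens when a function is shipped from the driver to an executor.
    static <T extends Serializable> T roundTrip(T value)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        // The "driver" JVM has the property set.
        System.setProperty("mongo.host", "dev-mongo");
        LazyLookup fn = new LazyLookup();
        System.out.println(fn.call()); // prints dev-mongo

        LazyLookup shipped = roundTrip(fn);

        // ASSUMPTION: clearing the property simulates an executor JVM
        // that was never given it.
        System.clearProperty("mongo.host");
        System.out.println(shipped.call()); // prints null
    }
}
```

System properties are state of the JVM, not of the serialized object, so nothing in the shipped function carries them along.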

The solution (trying to keep it simple and modular):

We are keeping all of our configurations on HDFS. (A place that all the participants in the cluster can access)
Every application has its own properties files in its own place on the HDFS.
Using Apache Commons Configuration (to read the configurations and to use “include” statements in the properties files), we read all the relevant configurations. The hard part is passing the link to the properties file location.

A singleton object (I really hate using singletons, but the situation forced us) that we call “ConfigProvider” returns a Configuration object (a Map) of properties. They are loaded from the location given by the “application.properties.file” value, which is a system property on both the master and the slaves. This way the ConfigProvider is loaded only once in every JVM.

I tried all the documented ways to pass the “application.properties.file” system property, and the conclusion was that a lot of them just don’t work (all of them were written in the properties file passed to the spark-submit script):

spark.yarn.appMasterEnv.[var-name] – returns null on the slave, with or without the prefix.
spark.executorEnv.[var-name] – returns null on the slave, with or without the prefix.
spark.[var-name] – returns null on the slave, with or without the prefix.

By the way, all of them return the correct value on the driver running the application if you access them by their full name, because of the “spark.” prefix.

The right way to pass the parameter is through these properties:
“spark.driver.extraJavaOptions” and “spark.executor.extraJavaOptions”.

I passed both the log4j configuration property and the parameter that I needed for the configurations. (To the driver I was able to pass only the log4j configuration.)

For example (written in the properties file passed to spark-submit with “--properties-file”):

“
spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties

spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties

spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
”

To list the parameters and where they are passed to:

spark.driver.extraJavaOptions: passes the log4j configuration file to the driver. I tried to pass another parameter there, but it didn’t accept it very well.
spark.executor.extraJavaOptions: passes the log4j configuration file to every executor. I also passed another parameter as a Java system property; this was successful, and the executor code managed to read it with “System.getProperty()”.
spark.application.properties.file: to read the location of the properties file on the driver, I added the same property name with the “spark.” prefix. The driver, where the application starts, can read it because it is passed from the properties file to the driver code.

The ConfigProvider class needs to check for the existence of either parameter (spark.application.properties.file or application.properties.file). If one of them is found, it points Apache Commons Configuration at that location and then instantiates the properties object.
As another defensive measure, once I had read the configurations from the properties file, I copied them into a Map and kept that object as a member of the ConfigProvider class. (That way it stays immutable, and it is an object that can be serialized if you need it to be.)
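A minimal sketch of what such a ConfigProvider could look like follows. It is not our actual class. To stay self-contained it makes two loud assumptions: it reads a local file with java.util.Properties instead of reading from HDFS through Apache Commons Configuration, and the method names (get, resolveLocation, load) are made up for illustration. The two property names are the ones described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class ConfigProvider {
    // Executors receive the plain name via spark.executor.extraJavaOptions;
    // the driver sees the "spark."-prefixed name from the properties file.
    static final String EXECUTOR_KEY = "application.properties.file";
    static final String DRIVER_KEY = "spark.application.properties.file";

    // Loaded at most once per JVM.
    private static Map<String, String> config;

    private ConfigProvider() {
    }

    // Checks both system properties and returns whichever one is set.
    static String resolveLocation() {
        String location = System.getProperty(EXECUTOR_KEY);
        if (location == null) {
            location = System.getProperty(DRIVER_KEY);
        }
        if (location == null) {
            throw new IllegalStateException(
                    "Neither " + EXECUTOR_KEY + " nor " + DRIVER_KEY + " is set");
        }
        return location;
    }

    static synchronized Map<String, String> get() {
        if (config == null) {
            config = load(resolveLocation());
        }
        return config;
    }

    // ASSUMPTION: a local file stands in for the HDFS location, and
    // java.util.Properties stands in for Apache Commons Configuration.
    private static Map<String, String> load(String location) {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(location))) {
            props.load(in);
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read " + location, e);
        }
        // Copy into a plain map and wrap it so callers cannot modify it.
        Map<String, String> copy = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            copy.put(name, props.getProperty(name));
        }
        return Collections.unmodifiableMap(copy);
    }
}
```

Code running on either side would then call something like ConfigProvider.get().get("mongo.host") (a hypothetical key) without knowing whether it is on the driver or on an executor. The unmodifiable wrapper around a HashMap is itself Serializable, which matches the defensive copy described above.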

What about unit testing, you ask? Do I need configuration files for all of my tests?

I added an “injectConfiguration(Map properties)” method to the ConfigProvider. Because it is a singleton in the JVM, you can inject your own properties and test the inner classes that use the ConfigProvider to get the configurations they need.
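Here is a rough sketch of how the injection hook can be used in a test. The ConfigProvider below is reduced to the injection path only, and MongoSettings with its “mongo.host” and “mongo.port” keys is a hypothetical class under test, not something from our code base.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ConfigProvider {
    private static Map<String, String> config;

    private ConfigProvider() {
    }

    // Test hook: replaces whatever configuration this JVM currently holds.
    static synchronized void injectConfiguration(Map<String, String> properties) {
        config = Collections.unmodifiableMap(new HashMap<>(properties));
    }

    static synchronized Map<String, String> get() {
        if (config == null) {
            throw new IllegalStateException("No configuration loaded or injected");
        }
        return config;
    }
}

// Hypothetical class under test: it never receives the settings through
// its constructor, it asks the ConfigProvider for them.
class MongoSettings {
    String connectionString() {
        Map<String, String> cfg = ConfigProvider.get();
        return "mongodb://" + cfg.get("mongo.host") + ":" + cfg.get("mongo.port");
    }
}

class ConfigInjectionDemo {
    public static void main(String[] args) {
        Map<String, String> testProps = new HashMap<>();
        testProps.put("mongo.host", "localhost");
        testProps.put("mongo.port", "27017");
        ConfigProvider.injectConfiguration(testProps);

        System.out.println(new MongoSettings().connectionString());
        // prints mongodb://localhost:27017
    }
}
```

Because the injected map is copied, later changes to the test's own map do not leak into the provider.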

If you need more detail about other Spark configurations,
I’ll be happy to answer; just comment below.

Enjoy configuring your Applications.

Friday, December 12, 2014

Spark Configuration Mess Solved