Hi to all,
Our problem was passing configuration from the Spark Driver to the slaves.
A short background on our running environment (before we get to the problem):
We are running Apache Spark on Amazon's EMR (Elastic MapReduce).
Everything runs on top of the Apache YARN resource manager (which complicates things), the input data is on Amazon S3, and HDFS lives on the Spark cluster itself.
Our Spark applications are written with the Java API, and we run them via the "spark-submit" script on the cluster in "client-mode".
The mapper functions that run on the RDDs need to access a MongoDB instance (a different one in each environment: development, testing and production).
That last requirement is what raised the problem: how can we pass the relevant MongoDB details without passing them as constructor parameters to each and every function we write in the code?
I would like the code to run seamlessly, without caring whether it is currently running on the Driver node or was serialized and sent to a slave.
We were using hard-coded values at first, but to support different deployment and running environments that is not good enough.
We are using the "--properties-file" parameter of "spark-submit", because we need to pass parameters to the various applications (different ones for each, of course), for example:
- The Log4j configuration file for the application.
- MongoDB details.
- Application running status.
- Etc.
In that manner we were reading all of the parameters passed via the properties file as Java system properties, simply with:
System.getProperty("Property-Name");
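For example, on the Driver a service class could build its MongoDB connection string straight from those system properties. This is only a minimal sketch: the property names "spark.mongodb.host" and "spark.mongodb.port" are hypothetical (and, as explained further down, only the "spark."-prefixed entries of the properties file actually end up as system properties on the Driver).

public final class MongoSettings {

    public static String mongoUri() {
        // In client-mode this works on the Driver, because spark-submit turns the
        // "spark."-prefixed entries of the properties file into Java system properties.
        String host = System.getProperty("spark.mongodb.host");
        String port = System.getProperty("spark.mongodb.port", "27017");
        return "mongodb://" + host + ":" + port;
    }
}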
But the problem was that some of the code is sent to the Spark slaves, and they don't have all of these system properties, because they run in a different JVM.
We wanted to find a way to pass these configurations to the slaves without producing ugly code, but we don't know where to read the configurations from, because we can't know on which slave the code will run and in what context. There is also no way to use the SparkContext for this, because it's not serializable.
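To make the failure concrete, here is a sketch (the RDD and the "spark.mongodb.host" property are only illustrative): the lambda passed to map() is serialized and shipped to the executors, whose JVMs never received the system properties that spark-submit set on the Driver.

import org.apache.spark.api.java.JavaRDD;

public final class NaiveMapper {

    // The lambda below runs on the executor JVMs, which never saw the system
    // properties set on the Driver, so the lookup returns null there.
    public static JavaRDD<String> tagWithMongoHost(JavaRDD<String> lines) {
        return lines.map(line -> {
            String host = System.getProperty("spark.mongodb.host"); // fine on the Driver, null on a slave
            return line + " -> " + host;
        });
    }
}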
The solution (trying to keep it simple and modular):
- We keep all of our configurations on HDFS (a place that every participant in the cluster can access).
- Every application has its own properties files in its own place on HDFS.
- Using Apache Commons Configuration (to read the configurations and to use "include" statements in the properties files) we read all the relevant configurations. The hard part is passing the location of the properties file.
- A singleton object (I really hate using singletons, but the situation forced us), which we call "ConfigProvider", returns a configuration object (a Map) of the properties loaded from the "application.properties.file" value, which has to be a system property (both on the master and on the slaves). This way the ConfigProvider is loaded only once in every JVM.
I've tried all the documented ways to pass the "application.properties.file" system property, and the conclusion was that a lot of them just don't work (all of them were written in the properties file passed to the spark-submit script):
- spark.yarn.appMasterEnv.[var-name]: returns null on the slave, with or without the prefix.
- spark.executorEnv.[var-name]: returns null on the slave, with or without the prefix.
- spark.[var-name]: returns null on the slave, with or without the prefix.
By the way, all of them return the correct value on the Driver running the application if you access them by their full name, including the "spark." prefix.
The right way to pass the parameter is through the properties "spark.driver.extraJavaOptions" and "spark.executor.extraJavaOptions":
I've passed both the Log4j configuration property and the parameter that I needed for my configurations (to the Driver I was able to pass only the Log4j configuration).
For example (written in the properties file passed to spark-submit with "--properties-file"):
"
spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
"
And to list out the parameters and where each of them is passed to:
- spark.driver.extraJavaOptions passes the Log4j configuration file to the Driver. I've tried to pass another parameter there as well, but it didn't accept it very well.
- spark.executor.extraJavaOptions passes the Log4j configuration file to every executor, and I've also passed my extra parameter as a Java system property; this worked, and the executor code managed to read it with System.getProperty().
- spark.application.properties.file is the same property name with the "spark." prefix added; in the Driver, where the application starts, I was able to read the properties file location from it, because it is passed from the properties file straight to the driver code (see the lookup sketch below).
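Put together, the lookup this arrangement gives you looks roughly like this (a sketch; the class name is mine): the executors see the plain name set with -D, while the Driver sees the "spark."-prefixed name from the properties file.

public final class PropertiesFileLocator {

    // Returns the HDFS location of the application properties file in whichever
    // JVM we happen to be running in.
    public static String locate() {
        // Set on the executors through -Dapplication.properties.file=... in
        // spark.executor.extraJavaOptions.
        String fromExecutor = System.getProperty("application.properties.file");
        if (fromExecutor != null) {
            return fromExecutor;
        }
        // Visible on the Driver because "spark.application.properties.file" came
        // from the properties file given to spark-submit.
        return System.getProperty("spark.application.properties.file");
    }
}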
The ConfigProvider class needs to check for the existence of either parameter (spark.application.properties.file or application.properties.file), and if one of them is found, point Apache Commons Configuration at that location and instantiate the properties object.
Another defensive measure I've taken: once I've read the configurations from the properties file, I copy them into a Map and keep that object as a member of the ConfigProvider class. (That way I keep it immutable, and it becomes an object that can be serialized in case you need it to be.)
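A minimal sketch of such a ConfigProvider, assuming Commons Configuration 1.x and the two property names used above; the method names and error handling are mine and meant only as an illustration.

import java.io.InputStream;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class ConfigProvider {

    private static volatile Map<String, String> properties;

    private ConfigProvider() { }

    /** Loads the configuration once per JVM and returns it as an immutable map. */
    public static Map<String, String> get() {
        if (properties == null) {
            synchronized (ConfigProvider.class) {
                if (properties == null) {
                    properties = load(resolveLocation());
                }
            }
        }
        return properties;
    }

    /** Test hook: lets unit tests inject properties without any file or HDFS. */
    public static synchronized void injectConfiguration(Map<String, String> testProperties) {
        properties = Collections.unmodifiableMap(new HashMap<>(testProperties));
    }

    // Same dual lookup as in the sketch above: executors see the -D name,
    // the Driver sees the "spark."-prefixed one.
    private static String resolveLocation() {
        String location = System.getProperty("application.properties.file");
        if (location == null) {
            location = System.getProperty("spark.application.properties.file");
        }
        if (location == null) {
            throw new IllegalStateException("application.properties.file was not passed to this JVM");
        }
        return location;
    }

    private static Map<String, String> load(String location) {
        try {
            // The file lives on HDFS, so open it through the Hadoop FileSystem API
            // and hand the stream to Commons Configuration.
            Path path = new Path(location);
            FileSystem fs = path.getFileSystem(new org.apache.hadoop.conf.Configuration());
            PropertiesConfiguration config = new PropertiesConfiguration();
            try (InputStream in = fs.open(path)) {
                config.load(in);
            }
            // Copy everything into a plain map so the result is immutable and serializable.
            Map<String, String> copy = new HashMap<>();
            for (Iterator<?> keys = config.getKeys(); keys.hasNext(); ) {
                String key = String.valueOf(keys.next());
                copy.put(key, config.getString(key));
            }
            return Collections.unmodifiableMap(copy);
        } catch (Exception e) {
            throw new RuntimeException("Could not load configuration from " + location, e);
        }
    }
}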
What about unit testing, you ask? Do I need configuration files for all of my tests?
I've added an "injectConfiguration(Map properties)" method to the ConfigProvider, and because it's a singleton in the JVM, you can inject your own properties and test the classes that use the ConfigProvider to obtain the configuration they need.
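For illustration only (JUnit 4 here; the property names are hypothetical), a test can seed the ConfigProvider before exercising the code under test:

import java.util.HashMap;
import java.util.Map;

import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class ConfigProviderInjectionTest {

    @Before
    public void injectTestConfiguration() {
        // No properties file and no HDFS needed: just inject a map.
        Map<String, String> testProps = new HashMap<>();
        testProps.put("mongodb.host", "localhost");
        testProps.put("mongodb.port", "27017");
        ConfigProvider.injectConfiguration(testProps);
    }

    @Test
    public void classesUnderTestSeeTheInjectedValues() {
        assertEquals("localhost", ConfigProvider.get().get("mongodb.host"));
    }
}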
If you need more elaboration about other Spark configurations, I'll be happy to answer; just comment below.
Enjoy configuring your applications.