Hi to all,
Our problem was passing configuration from the Spark Driver to the slaves. A short background on our running environment (before we talk about the problem):
We are running Apache Spark on Amazon's EMR (Elastic MapReduce).
Everything runs over the Apache YARN resource manager (this complicates things), the input data is on Amazon's S3 file system, and HDFS is on the Spark cluster.
Our Spark applications are written with the Java API, and we run them via the "spark-submit" script on the cluster in "client" mode.
The mapper functions that run on the RDDs need to access a MongoDB (a different one in each environment: development, testing and production).
The last requirement brought up the problem: how can we pass the details of the relevant MongoDB without actually passing the parameters in the constructor to each and every function we write in the code?
I would like the code to run seamlessly, without caring whether it is currently running on the Driver node or was serialized and sent to another slave.
We were using hard-coded values at first, but to support different deployment and running environments, that's not good enough.
We are using the "--properties-file" parameter of "spark-submit" because we need to pass parameters to the various applications (different ones for each, of course), for example:
- Log4j Configuration file for the application.
- MongoDB details.
- Application running status
- Etc.
In that manner we were reading all of the parameters passed via the properties file as Java system properties, simply with:
System.getProperty("Property-Name");
But the problem was that some code was sent to the Spark slaves, and they didn't have all of these system properties because they run in a different JVM.
We wanted to find a way to pass these configurations to the slaves without producing ugly code.
But we didn't know where to read the configurations from, because we can't know on which slave the code will run and in what context. There is no way to use the SparkContext for this, because it's not Serializable.
The solution is (trying to keep it simple and modular):
- We are keeping all of our configurations on HDFS. (A place that all the participants in the cluster can access)
- Every application has its own properties files in its own place on the HDFS.
- Using Apache Commons Configuration (to read the configurations and to use "include" statements in the properties files) we read all the relevant configurations. The hard part is passing the link to the properties file location.
A singleton object that we call "ConfigProvider" (I really hate using singletons, but the situation forced us) returns a Configuration object (a Map) of properties loaded from the "application.properties.file" value, which is a system property both on the master and on the slaves; this way the ConfigProvider is loaded only once per JVM. A rough sketch of this ConfigProvider follows the list below.
I've tried all the documented ways to pass the "application.properties.file" system property, and the conclusion was that a lot of them just don't work (all of them were written in the properties file passed to the spark-submit script):
- spark.yarn.appMasterEnv.[var-name] – returns null on the slave, both with and without the prefix.
- spark.executorEnv.[var-name] – returns null on the slave, both with and without the prefix.
- spark.[var-name] – returns null on the slave, both with and without the prefix.
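As a rough illustration of the ConfigProvider idea (not our exact code: the class name, method names, and the use of plain java.util.Properties instead of Apache Commons Configuration are my own simplifications for the sketch), it could look something like this:

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: loaded once per JVM, reads the properties file whose
// HDFS location arrives in the "application.properties.file" system property.
// (Plain java.util.Properties is used here, so "include" statements are not
// handled in this sketch.)
public final class ConfigProvider {

    private static Map<String, String> config;

    private ConfigProvider() { }

    public static synchronized Map<String, String> get() {
        if (config == null) {
            config = load(System.getProperty("application.properties.file"));
        }
        return config;
    }

    // Test hook described later in the post: override the loaded configuration.
    public static synchronized void injectConfiguration(Map<String, String> properties) {
        config = properties;
    }

    private static Map<String, String> load(String location) {
        Properties props = new Properties();
        try {
            Path path = new Path(location);
            FileSystem fs = path.getFileSystem(new Configuration());
            try (InputStream in = fs.open(path)) {
                props.load(in);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not load configuration from " + location, e);
        }
        // Copy into a plain Map so callers are decoupled from java.util.Properties.
        Map<String, String> result = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            result.put(name, props.getProperty(name));
        }
        return result;
    }
}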
The right way to pass the parameter is through the properties "spark.driver.extraJavaOptions" and "spark.executor.extraJavaOptions". I've passed both the log4j configuration property and the parameter that I needed for my configurations (to the Driver I was able to pass only the log4j configuration).
For example (written in the properties file passed to spark-submit with "--properties-file"):

spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
And to list out the parameters and where each one is passed to:
- spark.driver.extraJavaOptions – passes the log4j configuration file to the Driver. I've tried to pass another parameter there as well, but it didn't accept it very well.
- spark.executor.extraJavaOptions – passes the log4j configuration file to every executor, plus the extra parameter as a Java system property; this was successful, and the executor code managed to read it with "System.getProperty()".
- spark.application.properties.file – to read the location of the properties file in the Driver, I've added the same property name with the "spark." prefix; in the Driver, where the application starts, I was able to read it because it's passed from the properties file to the driver code.
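As a hedged sketch of that driver-side step (the class name, the app name, and the System.setProperty bridging are illustrative assumptions, not necessarily our exact code), the value could be picked up from the SparkConf and republished so the ConfigProvider behaves the same way on the Driver as on the executors:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical driver-side bootstrap: read the HDFS location of the
// application properties from the Spark configuration (it arrived through
// the file given to spark-submit with --properties-file) and republish it
// as a plain JVM system property for ConfigProvider.
public class DriverBootstrap {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("config-example");
        String propsLocation = conf.get("spark.application.properties.file");
        System.setProperty("application.properties.file", propsLocation);

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build and run RDDs; executor code finds the same property via the
        //     -Dapplication.properties.file set in spark.executor.extraJavaOptions.
        sc.stop();
    }
}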
Another defensive measure I've taken: once I've read the configurations from the properties file, I copy them into a Map.
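To show where this pays off, here is a hypothetical executor-side usage (the "mongo.host" key and the enrichment logic are made up for illustration): the same code works whether it runs on the Driver or inside a task on a slave.

import org.apache.spark.api.java.JavaRDD;

// Hypothetical executor-side usage: the lambda below is serialized and sent
// to the slaves, where it resolves its MongoDB host through ConfigProvider,
// exactly as it would on the Driver.
public class EnrichWithMongo {
    public static JavaRDD<String> enrich(JavaRDD<String> lines) {
        return lines.map(line -> {
            String mongoHost = ConfigProvider.get().get("mongo.host");
            // ... look the record up in the MongoDB at mongoHost and enrich the line
            return line + " [" + mongoHost + "]";
        });
    }
}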
What about unit testing, you ask? Do I need configuration files for all of my tests?
I've added an "injectConfiguration(Map properties)" method to the ConfigProvider, and because it's a singleton in the JVM, you can inject your own properties into the code and test the inner classes that use the ConfigProvider to get the configurations they need to run.
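A minimal sketch of such a test, assuming JUnit and an injectConfiguration method like the one described above (the property names here are made up):

import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;

import org.junit.Before;
import org.junit.Test;

// Hypothetical unit test: inject fake properties into the ConfigProvider
// singleton so the classes under test get their configuration without any
// properties file on HDFS.
public class MongoSettingsTest {

    @Before
    public void injectTestConfig() {
        Map<String, String> fake = new HashMap<>();
        fake.put("mongo.host", "localhost");
        fake.put("mongo.port", "27017");
        ConfigProvider.injectConfiguration(fake);
    }

    @Test
    public void readsMongoHostFromInjectedConfig() {
        assertEquals("localhost", ConfigProvider.get().get("mongo.host"));
    }
}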
If you need more elaboration about other Spark configurations, I'll be happy to answer; just comment below.
Enjoy configuring your Applications.