Controlling Package Versions with Maven on Spark
When you submit a job to Spark, you typically build an uber-jar containing all the packages your application needs. Unfortunately, there is no guarantee that these same classes will actually be used to run your application: if Spark finds classes with the same names in its own directory of jars, it gives those precedence. Often this causes no problems, but when the class API differs between the two versions, runtime errors such as NoClassDefFoundError or NoSuchMethodError can occur.
This problem of incompatibility between application and container versions is not specific to Spark; it can occur anywhere an application uber-jar runs inside a container, for example with Hadoop.
By the end of this article you should know how to prevent the Java packages you specify from being swapped out for incompatible versions at runtime. Below we can see that after submitting our example program to Spark we get a NoClassDefFoundError, caused by an incompatible version of the com.google.cloud package on the runtime classpath.
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java -cp /usr/local/Cellar/apache-spark/2.4.4/libexec/conf/:/usr/local/Cellar/apache-spark/2.4.4/libexec/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --master local --class jp.gmo.sparktest.App target/spark-test-1.0-SNAPSHOT.jar input/table.csv output gmo-blog-spark-test
========================================
20/01/04 11:59:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/cloud/storage/Acl$Entity
	at jp.gmo.sparktest.App.main(App.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.google.cloud.storage.Acl$Entity
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 13 more
If we look closely at the submit command at the top of the listing above, we see that the classpath includes every jar under /usr/local/Cellar/apache-spark/2.4.4/libexec/jars, and this Spark jar directory contains a lot of jars: 226 on my machine. The chances of your application's dependencies also appearing there are therefore significant. If Spark uses its own version instead of yours and the two versions are incompatible, errors like the one above become likely. Worse, because runtime errors depend on the logic of your program, they are not always obvious.
You may be thinking that we can simply specify which versions to use in our pom.xml file. For example, if your application's dependencies pull in Google Guava, chances are high that several of them depend on different Guava versions, and in that case you can declare an explicit dependency on the version you want. But this does not help in our situation, because a container such as Spark ignores the version Maven placed in the uber-jar and uses its own instead.
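For reference, pinning a version in the pom.xml looks like the following sketch. The Guava version is taken from the build output later in this article and is illustrative; as noted above, this alone does not stop Spark from preferring its own copy at runtime.

```xml
<!-- Sketch: force Maven to resolve a single Guava version at build time.
     This controls only what goes into the uber-jar, not what Spark
     loads from its own jars directory. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>28.1-jre</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```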
What we need is a way to ensure the runtime dependencies use exactly the same versions as the build-time dependencies. This is where the Maven Shade Plugin is useful. To quote from the Shade Plugin docs:
This plugin provides the capability to package the artifact in an uber-jar, including its dependencies and to shade – i.e. rename – the packages of some of the dependencies.
By renaming packages, we ensure Spark cannot substitute other versions of them: once shaded, the packages are untouched by the dependency resolution of Spark, Hadoop, or any other container. Let's look at how the plugin is configured in a pom.xml:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>jp.gmo.sparktest.App</mainClass>
          </transformer>
        </transformers>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/maven/**</exclude>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <relocations>
          <relocation>
            <pattern>com</pattern>
            <shadedPattern>repackaged.com.google.common</shadedPattern>
            <includes>
              <include>com.google.common.**</include>
            </includes>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
We can see our main class specified in the ManifestResourceTransformer. There is also a set of exclusions; these match signature files, and if we do not exclude them we may see
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at runtime. The relocations block controls which packages get shaded: the includes element selects the packages to shade, and shadedPattern sets the new package naming for them. Running mvn package builds a shaded uber-jar with this plugin, and among the output we can see which packages are shaded:
[INFO] --- maven-shade-plugin:3.2.1:shade (default) @ spark-test ---
[INFO] Including commons-codec:commons-codec:jar:1.9 in the shaded jar.
[INFO] Including com.google.code.gson:gson:jar:2.2.4 in the shaded jar.
[INFO] Including com.fasterxml.jackson.core:jackson-core:jar:2.6.7 in the shaded jar.
[INFO] Including com.google.code.findbugs:jsr305:jar:1.3.9 in the shaded jar.
[INFO] Including javax.annotation:javax.annotation-api:jar:1.2 in the shaded jar.
[INFO] Including com.google.protobuf:protobuf-java:jar:2.5.0 in the shaded jar.
[INFO] Including com.google.cloud:google-cloud-storage:jar:1.102.0 in the shaded jar.
[INFO] Including com.google.cloud:google-cloud-core-http:jar:1.91.3 in the shaded jar.
[INFO] Including com.google.cloud:google-cloud-core:jar:1.91.3 in the shaded jar.
...
[INFO] Including com.google.guava:guava:jar:28.1-jre in the shaded jar.
[INFO] Including com.google.guava:failureaccess:jar:1.0.1 in the shaded jar.
[INFO] Including com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava in the shaded jar.
[INFO] Including org.checkerframework:checker-qual:jar:2.8.1 in the shaded jar.
[INFO] Including com.google.errorprone:error_prone_annotations:jar:2.3.2 in the shaded jar.
[INFO] Including com.google.j2objc:j2objc-annotations:jar:1.3 in the shaded jar.
[INFO] Including org.codehaus.mojo:animal-sniffer-annotations:jar:1.18 in the shaded jar.
[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing /Users/me/projects/spark-test-project/target/spark-test-1.0-SNAPSHOT.jar with /Users/me/projects/spark-test-project/target/spark-test-1.0-SNAPSHOT-shaded.jar
...
We can see that a new shaded uber-jar has been built and has replaced the original artifact at target/spark-test-1.0-SNAPSHOT.jar. This jar may now be submitted to Spark, and we can be confident that the package versions specified in our pom.xml will be used at runtime. Now let's look at the contents of our uber-jar.
➜ spark-test-project jar tvf target/spark-test-1.0-SNAPSHOT.jar | grep repackaged
...
 1938 Sat Jan 04 13:39:32 JST 2020 repackaged/com/google/common/google/common/util/concurrent/Service.class
 1530 Sat Jan 04 13:39:32 JST 2020 repackaged/com/google/common/google/common/util/concurrent/ServiceManager$1.class
 1530 Sat Jan 04 13:39:32 JST 2020 repackaged/com/google/common/google/common/util/concurrent/ServiceManager$2.class
  914 Sat Jan 04 13:39:32 JST 2020 repackaged/com/google/common/google/common/util/concurrent/ServiceManager$EmptyServiceManagerWarning.class
  965 Sat Jan 04 13:39:32 JST 2020 repackaged/com/google/common/google/common/util/concurrent/ServiceManager$FailedService.class
  996 Sat Jan 04 13:39:32 JST 2020 repackaged/com/google/common/google/common/util/concurrent/ServiceManager$Listener.class
...
We see that classes which previously lived in the com.google.common package have been renamed into repackaged.com.google.common.google.common. This new package is unique to our spark-test-1.0-SNAPSHOT.jar; its classes are not present anywhere else on Spark's runtime classpath, so they cannot be swapped at runtime for classes with the same fully qualified names. The application can now be submitted to Spark without the error we saw at the beginning of this article.
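As an aside, the doubled google.common segment in the relocated path comes from using the bare com prefix as the pattern while the shadedPattern already ends in com.google.common: the Shade plugin replaces the pattern prefix with the shadedPattern and keeps the rest of the class name. A sketch of a relocation that avoids the doubled segment, under those prefix-replacement semantics, might be:

```xml
<!-- Sketch: relocate the full com.google.common prefix directly, so
     com.google.common.util.concurrent.Service becomes
     repackaged.com.google.common.util.concurrent.Service. -->
<relocations>
  <relocation>
    <pattern>com.google.common</pattern>
    <shadedPattern>repackaged.com.google.common</shadedPattern>
  </relocation>
</relocations>
```

The doubled name is harmless, since shaded references are rewritten consistently, but the shorter form is easier to read in jar listings.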