In many contexts, micro-benchmarking and performance are not the main issue, and other considerations should drive a developer's implementation choices. That's not always the case, though. This post shares and illustrates a sort of Scala project template that I put together and found quite useful in a few situations where micro-benchmarking and performance might be a concern (or simply a legitimate curiosity). Broadly speaking, in Scala there are two viable approaches: JMH (Java Microbenchmark Harness) and ScalaMeter; if you are interested, a discussion about their pros and cons and their place in the Scala ecosystem is available here. Note that JMH is part of the OpenJDK project, but that doesn't mean benchmarks have to run on OpenJDK. As a matter of fact, all the examples presented in this post were executed on an Oracle JDK. NOTE: all the code is available here.
Settings and dependencies
For JMH, we can use sbt-jmh, an sbt plugin for JMH, so I added the following line to project/plugins.sbt:
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.4")
And then we set the dependencies in build.sbt:
import sbt.Keys._

lazy val root = (project in file(".")).
  enablePlugins(JmhPlugin).
  settings(
    name := "scala-jmh-performance-testing",
    organization := "LansaloLtd",
    version := "1.0",
    scalaVersion := "2.12.3",
    libraryDependencies ++= Seq(
      "org.openjdk.jmh" % "jmh-generator-annprocess" % "1.21",
      "commons-codec" % "commons-codec" % "1.9",
      "com.storm-enroute" %% "scalameter" % "0.8.2",
      "org.scalatest" %% "scalatest" % "3.0.1" % Test
    ),
    testFrameworks += new TestFramework("org.scalameter.ScalaMeterFramework"),
    parallelExecution in Test := false
  )
Personally, I like to use both ScalaMeter and JMH (which is why both are present as dependencies above).
This is far from an exhaustive list of pros and cons, but I would like to mention at least that the former makes it really easy to create graphs and follows the same approach as ScalaCheck, with generators that can be composed via for-comprehension; on the other hand, it is not actively maintained. The latter seems to be the preferred choice for Scala, and some measurements are really straightforward with it. At the time of writing, the only documentation available for JMH is via examples. Note also that we have introduced a dependency on OpenJDK (org.openjdk.jmh), but that is independent of the runtime JDK.
Targets
Purely for demonstration purposes, we need two candidates, or targets, for our benchmark. For no particular reason I chose:
// candidate 1
def mapOnFunctionsAndList(input: String, slice: Int, funcs: List[String => Int]): List[Int] = {
  funcs.map(f => slide(input, slice).map(f).min)
}
and:
// candidate 2
def foldOnFunctionsList(input: String, slice: Int, funcs: List[String => Int]): List[Int] = {
  funcs.foldLeft(List.empty[Int])((acc, func) => {
    slide(input, slice).map(str => func(str)).min :: acc
  })
}
Basically, we start from a certain number n of functions (String => Int), then we slide a given String, obtaining z substrings from it. To each function, 1 to n, we pass all the substrings, obtaining z Int values, of which we keep only the minimum. Therefore, if we start with a list of n functions, we end up with a list of n Int. The two implementations are equivalent and return the same results given the same inputs, but are there differences when it comes to performance?
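The slide and hashingFunctions helpers used here and below are not shown in this post (they live in the linked repository); a minimal sketch of what they could look like, under the assumption that slide simply yields all substrings of a given length and hashingFunctions builds n distinct String => Int functions:

object Target {
  // Assumption: all substrings of length `slice`,
  // e.g. slide("abcd", 2) == List("ab", "bc", "cd").
  def slide(input: String, slice: Int): List[String] =
    input.sliding(slice).toList
}

object HashingFunctions {
  // Assumption: n different hashing functions, here derived by salting hashCode.
  def hashingFunctions(n: Int): List[String => Int] =
    (1 to n).toList.map(seed => (s: String) => (seed.toString + s).hashCode)
}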
JMH
With JMH we can work out the throughput for each of our two candidates pretty easily. First, we need to define the targets of our benchmark and a State for them:
package com.lansalo.jmh

import com.lansalo.Target._
import org.openjdk.jmh.annotations._

class TargetPerformance {

  import Scopes._

  def testMapOnFunctionsAndList(state: BenchmarkState): Unit = {
    mapOnFunctionsAndList(state.title, state.slice, state.funcs)
  }

  def testFoldOnFunctionsList(state: BenchmarkState): Unit = {
    foldOnFunctionsList(state.title, state.slice, state.funcs)
  }
}

object Scopes {

  import com.lansalo.HashingFunctions.hashingFunctions

  val novelTitle = "The Persecution and Assassination of Jean-Paul Marat as Performed by the Inmates of the Asylum of Charenton Under the Direction of the Marquis de Sade"

  @State(Scope.Benchmark)
  class BenchmarkState {
    val funcs: List[String => Int] = hashingFunctions(200)
    val title: String = novelTitle
    val slice: Int = 4
  }
}
And then the proper benchmark (that can also be executed within an IDE like IntelliJ):
package com.lansalo.jmh.benchmark

import java.util.concurrent.TimeUnit

import com.lansalo.jmh.{Scopes, TargetPerformance}
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.results.format.ResultFormatType
import org.openjdk.jmh.runner.Runner
import org.openjdk.jmh.runner.options.{Options, OptionsBuilder}

object BenchmarkRunner_ThroughputBenchmark { // run sbt clean jmh:compile from terminal first.

  def main(args: Array[String]): Unit = {
    val opt: Options = new OptionsBuilder().include(classOf[ThroughputBenchmark].getSimpleName)
      .resultFormat(ResultFormatType.TEXT).build
    new Runner(opt).run
  }
}

@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 30)
@Measurement(iterations = 30)
@State(Scope.Benchmark)
private class ThroughputBenchmark extends TargetPerformance {

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput, Mode.AverageTime))
  @Fork(value = 1)
  override def testFoldOnFunctionsList(state: Scopes.BenchmarkState): Unit = super.testFoldOnFunctionsList(state)

  @Benchmark
  @BenchmarkMode(Array(Mode.Throughput, Mode.AverageTime))
  @Fork(value = 1)
  override def testMapOnFunctionsAndList(state: Scopes.BenchmarkState): Unit = super.testMapOnFunctionsAndList(state)
}
In this example, we measure the throughput and average time:
@BenchmarkMode(Array(Mode.Throughput, Mode.AverageTime))
in a separate JVM:
@Fork(value = 1)
We also run 30 warmup iterations (ignored in the final results) and 30 proper iterations, where the actual measurements are taken:
@Warmup(iterations = 30)
@Measurement(iterations = 30)
Warmup is needed mainly because of JVM features such as just-in-time compilation, but that would lead us into all the background behind microbenchmarking, which would be an entire chapter on its own. I can recommend a few readings though: Nanotrusting the Nanotime, Avoiding Benchmarking Pitfalls on the JVM, Java JVM Warmup.1
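One such pitfall is worth flagging here: benchmark methods that discard their results, as ours do, are in principle exposed to dead-code elimination by the JIT. The standard JMH remedy is to return the computed value or sink it into a Blackhole. A minimal sketch of the latter (not how the code in this post is written; TargetPerformanceSafe is a hypothetical name):

import com.lansalo.Target._
import com.lansalo.jmh.Scopes
import org.openjdk.jmh.infra.Blackhole

class TargetPerformanceSafe {
  // When a @Benchmark method (here, an override as in the pattern above)
  // declares a Blackhole parameter, JMH injects it automatically; consuming
  // the result stops the JIT from eliminating the computation as dead code.
  def testMapOnFunctionsAndList(state: Scopes.BenchmarkState, bh: Blackhole): Unit =
    bh.consume(mapOnFunctionsAndList(state.title, state.slice, state.funcs))
}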
We are almost ready: run sbt clean jmh:compile from the terminal, and then we can run ThroughputBenchmark straight from IntelliJ and obtain something like:
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt   Score   Error  Units
ThroughputBenchmark.testFoldOnFunctionsList    thrpt   30  95.529 ± 0.940  ops/s
ThroughputBenchmark.testMapOnFunctionsAndList  thrpt   30  71.549 ± 0.734  ops/s
ThroughputBenchmark.testFoldOnFunctionsList     avgt   30   0.010 ± 0.001   s/op
ThroughputBenchmark.testMapOnFunctionsAndList   avgt   30   0.011 ± 0.001   s/op
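As an aside, sbt-jmh can also run benchmarks straight from sbt without a runner object, passing iterations, warmups and forks as standard JMH command-line flags, along these lines:

sbt "jmh:run -i 30 -wi 30 -f 1 .*ThroughputBenchmark.*"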
We are not really interested in the numbers themselves (which also depend on the hardware), but rather in the sort of intelligence we can gather this way. ScalaMeter offers different possibilities and customizations as well. I quite like the graph bits, and with the following code:
package com.lansalo.scalameter.benchmark

import java.io.File

import com.lansalo.HashingFunctions.hashingFunctions
import com.lansalo.Target._
import org.scalameter.persistence.SerializationPersistor
import org.scalameter.{Bench, Gen}

import scala.util.Random

object PerformanceBenchmark extends Bench.OfflineReport {

  override lazy val persistor = SerializationPersistor(new File("target/scalameter/performance/results"))

  val functionsGen: Gen[Int] = Gen.range("functionsNumber")(100, 1000, 100)
  val inputGen: Gen[Int] = Gen.range("inputLength")(20, 200, 20)

  // Generators compose via for-comprehension, much like ScalaCheck generators.
  val inputs: Gen[(String, List[String => Int])] = for {
    functionsNumber <- functionsGen
    inputLength <- inputGen
  } yield (Random.nextString(inputLength), hashingFunctions(functionsNumber))

  performance of "map vs fold" in {
    measure method "mapOnFunctionsAndList" in {
      using(inputs) in { param =>
        mapOnFunctionsAndList(param._1, 4, param._2)
      }
    }
    measure method "foldOnFunctionsList" in {
      using(inputs) in { param =>
        foldOnFunctionsList(param._1, 4, param._2)
      }
    }
  }
}
we can obtain the graph below:
Where orange is for fold and blue for map.
We can also investigate the two implementations from a memory-usage point of view. JMH provides a series of profilers that can be quite handy in these cases (in the following example we choose a HotspotMemoryProfiler):
import org.openjdk.jmh.profile.HotspotMemoryProfiler

object BenchmarkRunner_MemoryFootprint { // run sbt clean jmh:compile from terminal first.

  def main(args: Array[String]): Unit = {
    val opt: Options = new OptionsBuilder().include(classOf[MemoryFootprint].getSimpleName)
      .addProfiler(classOf[HotspotMemoryProfiler])
      .resultFormat(ResultFormatType.TEXT).build
    new Runner(opt).run
  }
}
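The MemoryFootprint benchmark class itself is not listed in this post (it is in the linked repository); presumably it mirrors ThroughputBenchmark minus the mode annotations, along these lines:

// Sketch, assuming MemoryFootprint follows the same pattern as ThroughputBenchmark.
@Warmup(iterations = 30)
@Measurement(iterations = 30)
@State(Scope.Benchmark)
private class MemoryFootprint extends TargetPerformance {

  @Benchmark
  @Fork(value = 1)
  override def testFoldOnFunctionsList(state: Scopes.BenchmarkState): Unit = super.testFoldOnFunctionsList(state)

  @Benchmark
  @Fork(value = 1)
  override def testMapOnFunctionsAndList(state: Scopes.BenchmarkState): Unit = super.testMapOnFunctionsAndList(state)
}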
Running the benchmark results in a long list of counters (below, a partial output):
Benchmark                                                                   Mode  Cnt            Score   Error  Units
MemoryFootprint.testFoldOnFunctionsList                                    thrpt            71.335           ops/s
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.invocations    thrpt           108.000               ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.lastEntryTime  thrpt   10179095151.000               ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.lastExitTime   thrpt   10178954989.000               ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.0.time           thrpt      75456013.000               ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.invocations    thrpt               ≈ 0               ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.lastEntryTime  thrpt               ≈ 0               ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.lastExitTime   thrpt               ≈ 0               ?
MemoryFootprint.testFoldOnFunctionsList:·sun.gc.collector.1.time           thrpt               ≈ 0               ?
You can take a look here to see how to make sense of them. The following ScalaMeter example instead provides a more basic picture of the heap usage:
object MemoryBenchmark extends Bench.OfflineReport {

  override lazy val persistor = SerializationPersistor(new File("target/scalameter/memoryusage/results"))

  override def measurer = new Measurer.MemoryFootprint

  // rest of the code is the same as above
}
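Since build.sbt registers the ScalaMeterFramework as a test framework, such benchmarks can also be driven by sbt's test tasks, assuming the benchmark objects live under src/test/scala, e.g.:

sbt "testOnly com.lansalo.scalameter.benchmark.MemoryBenchmark"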
For the two fold/map implementations the graph wouldn't be very interesting (as they overlap), but you can take a look at another post I wrote about value classes to get an idea.
Finally, when running the benchmark on an Oracle JDK, we can use JMH to exploit the Flight Recorder:
object BenchmarkRunner_FlightRecorder { // run sbt clean jmh:compile from terminal first.

  def main(args: Array[String]): Unit = {
    val opt: Options = new OptionsBuilder().include(classOf[FlightRecorderDump].getSimpleName)
      .resultFormat(ResultFormatType.TEXT).build
    new Runner(opt).run
  }
}

@Warmup(iterations = 30)
@Measurement(iterations = 30)
@State(Scope.Benchmark)
private class FlightRecorderDump extends TargetPerformance {

  @Benchmark
  @Fork(value = 1, jvmArgsAppend = Array("-XX:+UnlockCommercialFeatures", "-XX:+FlightRecorder",
    "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
    "-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/foldOnFunctionsList.jfr"))
  override def testFoldOnFunctionsList(state: Scopes.BenchmarkState): Unit = super.testFoldOnFunctionsList(state)

  @Benchmark
  @Fork(value = 1, jvmArgsAppend = Array("-XX:+UnlockCommercialFeatures", "-XX:+FlightRecorder",
    "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
    "-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/mapOnFunctionsAndList.jfr"))
  override def testMapOnFunctionsAndList(state: Scopes.BenchmarkState): Unit = super.testMapOnFunctionsAndList(state)
}
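A side note: these JVM flags match the Oracle JDK 8 era. On JDK 11 and later, Flight Recorder is open source, the commercial-features flags are gone, and a recording would be started with -XX:StartFlightRecording instead; a sketch of the equivalent fork arguments:

// Sketch for JDK 11+: no -XX:+UnlockCommercialFeatures / FlightRecorderOptions needed.
@Fork(value = 1, jvmArgsAppend = Array(
  "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
  "-XX:StartFlightRecording=dumponexit=true,filename=/tmp/foldOnFunctionsList.jfr"))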
Running FlightRecorderDump will generate two separate recordings: /tmp/foldOnFunctionsList.jfr for the fold version and /tmp/mapOnFunctionsAndList.jfr for the version with map. With JMC (Java Mission Control) we can compare a rich collection of parameters and different aspects of the two implementations. Below, just as an example, a general overview of the heap usage for the map implementation:
—————————————-
1. In this post a certain familiarity with basic concepts of benchmarking is somewhat taken for granted, as the main focus is to illustrate how JMH and ScalaMeter can be used to produce some results. Still, the interpretation of those results requires an understanding of the mentioned background.↩