
In this two-part post, you’ll learn the basics of how to benchmark and optimize Go code. You’ll do this by trying out a few different implementations for calculating a specific element of Pascal’s Triangle (a.k.a. the binomial coefficient). In Part 1 we set up Go, covered the structure of our example, created an interface, and conducted some testing.

Here in Part 2, we’ll cover benchmarking our example for performance and look at a recursive and a built-in implementation.

All of the code for these two posts is available on GitHub.

Benchmarking for Performance

Since we’ve verified that our implementation works in Part 1, let’s benchmark it to see how fast it is. Benchmarking one implementation by itself is not incredibly useful, but soon we will benchmark other implementations too and compare the results.

In Go, benchmarks are simply functions that start with the word “Benchmark” and accept *testing.B as an argument. Like testing functions, benchmark functions also need to exist in a file that ends in “_test.go”. As we did for our tests, we’ll write a utility function that we can use to benchmark any implementation.

Add the following to test/benchmark_test.go

package test

import (
     "github.com/albrow/go-benchmark-example/common"
     "github.com/albrow/go-benchmark-example/implementations"
     "testing"
)

func BenchmarkNaive(b *testing.B) {
     benchmarkPascaler(b, implementations.Naive)
}

func benchmarkPascaler(b *testing.B, p common.Pascaler) {
     for i := 0; i < b.N; i++ {
          p.Pascal(32, 16)
     }
}

We’ll use the parameters n=32 and m=16 for our benchmarks. You could choose almost anything, but if n gets too big, you’ll overflow int values (which are 64 bits wide on 64-bit platforms). If that happens, any implementation that relies on storing numbers as ints will return incorrect results (and may even panic, depending on what math you do). When I tested the limit, overflow first started happening around n=67, m=33.
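If you’d like to verify where the overflow begins for yourself, here is a quick sketch (a standalone helper, not part of the example repository) that uses math/big to find the first row of Pascal’s Triangle whose middle element no longer fits in an int64:

```go
package main

import (
	"fmt"
	"math/big"
)

// firstOverflowRow scans rows of Pascal's Triangle and returns the first
// n (and its middle index m) whose middle element no longer fits in an
// int64. It uses math/big, so the computation itself is exact.
func firstOverflowRow() (n, m int64) {
	for n = 2; ; n++ {
		m = n / 2
		if !new(big.Int).Binomial(n, m).IsInt64() {
			return n, m
		}
	}
}

func main() {
	n, m := firstOverflowRow()
	fmt.Printf("C(%d, %d) overflows int64\n", n, m)
}
```

On any platform this should report n=67, m=33, matching the limit observed above.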

You can run the benchmarks by adding the -bench flag to the test command: go test ./test -bench . (the trailing dot is a regular expression that matches all benchmark names). If it works, your output should look like this:

?        github.com/albrow/go-benchmark-example/common     [no test files]
?        github.com/albrow/go-benchmark-example/implementations     [no test files]
PASS
BenchmarkNaive       100000          15246 ns/op
ok       github.com/albrow/go-benchmark-example/test     3.970s

The important bit is the line that starts with BenchmarkNaive. The first number tells us that this particular benchmark was run 100,000 times. Go will automatically run the benchmark as many times as it needs to until it gets a consistent result. Typically, faster benchmarks will be run more times. The second number tells us that it took an average of 15,246 ns per operation. The exact number can vary widely from machine to machine, which is why it doesn’t mean much on its own. Let’s write another implementation so we have something to compare it to.

Recursive Implementation

You may have noticed that our naive implementation was doing more work than it needed to. We don’t actually need to construct the entire triangle to figure out a specific element; we only need to calculate the elements directly above the one we want (along with the elements directly above those, and so on). What if, instead of iterating from top to bottom, we started at the bottom and worked upwards, calculating only the elements we needed? It should perform better because we’re doing fewer calculations, right? Let’s write an implementation that does exactly that and benchmark it to find out for sure.

The idea of moving from bottom to top leads pretty naturally to a recursive implementation. For the base case, we know that the top element is 1. We also know that any element on the edges, where m == 0 or m == n is 1. For all other cases, the element is equal to the sum of the elements directly above it.

We’ll write the implementation in implementations/recursive.go, and it looks like this:

package implementations

type recursiveType struct{}

var Recursive = &recursiveType{}

func (p *recursiveType) Pascal(n, m int) int {
     if n == 0 || m == 0 || n == m {
          return 1
     }
     return p.Pascal(n-1, m-1) + p.Pascal(n-1, m)
}

func (p *recursiveType) String() string {
     return "Recursive Implementation"
}

Personally, I think this implementation is much cleaner and easier to understand. Let’s test it for correctness by adding a few lines to test/pascal_test.go:

func TestRecursive(t *testing.T) {
     testPascaler(t, implementations.Recursive)
}

Then run the tests again with go test ./... If it works, you should see the same output as before.

Now, let’s benchmark this implementation to see how fast it is compared to our first one. Add the following lines to test/benchmark_test.go:

func BenchmarkRecursive(b *testing.B) {
     benchmarkPascaler(b, implementations.Recursive)
}

Then run the benchmarks with go test ./test -bench . again. You might be surprised to find that with the parameters n=32, m=16, the recursive implementation is much, much slower! In fact, if you increase n much more, the benchmark will time out because the recursive implementation takes too long. These were the results on my machine:

BenchmarkNaive         100000         15246 ns/op
BenchmarkRecursive       1    4005199128 ns/op

This shows the recursive implementation took just over 4 seconds per operation, whereas the “naive” implementation took only about 15 microseconds!

The reason this happens is that the recursive implementation solves the same subproblems over and over. The two recursive calls overlap heavily, so the total number of calls grows exponentially with n, while the naive implementation only does work proportional to the size of the triangle. The overhead of each function call (including the stack frame Go allocates for it) makes every call more expensive, but it’s the sheer number of calls that dominates; no compiler optimization would rescue an algorithm that repeats this much work, though memoization would.

A casual runtime analysis that only noticed we were computing fewer elements, without accounting for the repeated subproblems, would incorrectly predict that the recursive implementation would be faster. This example illustrates why benchmarking is important whenever you care about performance!
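One way to see the cost is to count the calls directly. This sketch (a hypothetical helper, not in the repository) mirrors the recursive structure but counts invocations instead of summing elements; since each leaf of the call tree contributes 1 to the final answer, the total is 2*C(n, m) - 1:

```go
package main

import "fmt"

// calls returns how many times the recursive Pascal implementation
// would be invoked for the given n and m. Each leaf of the call tree
// contributes exactly 1 to Pascal's result, so there are C(n, m)
// leaves and the total call count is 2*C(n, m) - 1.
func calls(n, m int) int {
	if m == 0 || n == m {
		return 1
	}
	return 1 + calls(n-1, m-1) + calls(n-1, m)
}

func main() {
	fmt.Println(calls(10, 5)) // 503, i.e. 2*C(10, 5) - 1
}
```

For n=32, m=16, the formula gives 2*601080390 - 1 = 1,202,160,779 calls, which is why the benchmark takes seconds per operation.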

Built-in Implementation

There are a lot of techniques you could use to optimize the function further. You could try writing an implementation that moves from bottom to top without using recursion. You could take advantage of the symmetry of Pascal’s Triangle and cut the number of calculations in half. You could even use memoization or a lookup table to prevent repeated calculations. But for the sake of time, in this two-part series we’re just going to try one more implementation.

If you are familiar with combinatorics or statistics, you might recognize that what we’ve been trying to calculate with our Pascal function is equivalent to the binomial coefficient. There are formulas for calculating the binomial coefficient directly, without building up any data structure at all.

The most common formula involves factorials and is as follows:

binomial(n, k) = n! / (k! * (n-k)!)

One unexpected problem with this formula is that the intermediate values in the numerator and denominator can get too large and overflow 64-bit integers. This can happen even when the final result can be represented perfectly with 64 bits.
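One common way to delay that overflow (shown here as a sketch, not the implementation we’ll use below) is the multiplicative form binomial(n, k) = ((n-k+1)/1) * ((n-k+2)/2) * ... * (n/k), multiplying and dividing one factor at a time so the running value never grows to full-factorial size:

```go
package main

import "fmt"

// binomialIter computes binomial(n, k) with the multiplicative formula.
// After step i the running result equals binomial(n-k+i, i), which is an
// integer, so the division at each step is always exact. Intermediates
// can still overflow int64 for large n, just much later than factorials.
func binomialIter(n, k int) int {
	if k > n-k {
		k = n - k // exploit symmetry: binomial(n, k) == binomial(n, n-k)
	}
	result := 1
	for i := 1; i <= k; i++ {
		result = result * (n - k + i) / i
	}
	return result
}

func main() {
	fmt.Println(binomialIter(32, 16)) // 601080390
}
```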

Luckily, the Go standard library includes the math/big package for dealing with numbers that are too big to fit in a 64-bit int. That package also has a function for calculating the binomial coefficient, which is exactly what we’re looking for.

Here’s how we can use that built-in function to create our own implementation in implementations/builtin.go:

package implementations

import (
     "math/big"
)

type builtinType struct{}

var Builtin = builtinType{}

func (p builtinType) Pascal(n, m int) int {
     z := big.NewInt(0)
     z.Binomial(int64(n), int64(m))
     return int(z.Int64())
}

func (p builtinType) String() string {
     return "Builtin Implementation"
}

Test it for correctness by adding the following lines to test/pascal_test.go and running the tests again:

func TestBuiltin(t *testing.T) {
     testPascaler(t, implementations.Builtin)
}

Then, we can compare the performance by adding the following lines to test/benchmark_test.go and running the benchmarks:

func BenchmarkBuiltin(b *testing.B) {
     benchmarkPascaler(b, implementations.Builtin)
}

On my machine, the builtin implementation was only slightly faster for the parameters n=32 and m=16.

Let’s see what happens if we increase n and m. You will probably need to skip the recursive implementation by adding the following line inside of the BenchmarkRecursive function:

 b.Skip("Skipping slower recursive implementation")

Here are the results on my machine for n=64, m=32:

BenchmarkNaive      50000     40234 ns/op
BenchmarkBuiltin    50000     25233 ns/op

As you can see, when we increase the difficulty, the builtin implementation really starts to outperform the others. We could go even higher, but keep in mind that doing so would overflow 64-bit ints. Since the builtin implementation uses the big package internally, it could be adapted to work around this restriction. However, the benchmark results for the naive implementation wouldn’t mean much at that scale, since it can’t return correct values.
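For example, a hypothetical variant (not part of the repository) that returns *big.Int instead of int would sidestep the truncation entirely:

```go
package main

import (
	"fmt"
	"math/big"
)

// PascalBig returns the exact element of Pascal's Triangle as a
// *big.Int, so results that don't fit in 64 bits are still correct.
// Note this no longer satisfies the Pascaler interface from Part 1,
// which requires returning an int.
func PascalBig(n, m int) *big.Int {
	return new(big.Int).Binomial(int64(n), int64(m))
}

func main() {
	// Exact even though C(67, 33) overflows int64.
	fmt.Println(PascalBig(67, 33))
}
```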

To demonstrate how the builtin implementation performs with higher values of n, here are the results for n=10,000 and m=5,000 on my machine (skipping the other implementations):

BenchmarkBuiltin    300    4978329 ns/op

That’s only about 5 ms to compute a really large coefficient. For high values of n and m, the builtin implementation is by far the best; it is also the only one of the three that computes the correct value internally at that scale (though it would need to return a *big.Int rather than an int to report it without truncation).

Conclusion

In this two-part series, we tested and benchmarked three different implementations of the Pascal function, learned how to write benchmarking functions, and found that the fastest implementation is not always the one you would expect. Benchmarking is critical whenever you are writing performance-sensitive code.

Looking for a challenge? See if you can come up with an implementation that beats the builtin one. Some ideas: parallelization, memoization, GPU acceleration, or binding to an optimized C library with cgo. Don’t forget to test each implementation for correctness. Feel free to send a pull request here if you are able to beat it. Good luck!

About the Author

Alex Browne is a programmer and entrepreneur with about 4 years of product development experience and 3 years of experience working on small startups. He has worked with a wide variety of technologies and has single-handedly built many production-grade applications. He is currently working with two co-founders on an early-stage startup exploring ways of applying machine learning and computer vision to the manufacturing industry. His favorite programming language is Go.
