
In this article by Gabriele Lanaro, author of the book Python High Performance – Second Edition, we will see that Python is a mature and widely used language, and there is considerable interest in improving its performance by compiling functions and methods directly to machine code rather than executing instructions in the interpreter.

In this article, we will explore two projects, Numba and PyPy, which approach compilation in slightly different ways. Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code. PyPy is a replacement interpreter that works by analyzing the code at runtime and automatically optimizing slow loops.


Numba

Numba was started in 2012 by Travis Oliphant, the original author of NumPy, as a library for compiling individual Python functions at runtime using the Low Level Virtual Machine (LLVM) toolchain.

LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly that can be compiled to machine code for a specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last article, the speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution.

Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays.

Numba is evolving quickly and can show substantial improvements between releases as well as, occasionally, backward-incompatible changes. To keep up, ensure that you refer to the release notes for each version. In the rest of this article, we will use Numba version 0.30.1; ensure that you install the correct version to avoid errors.

The complete code examples in this article can be found in the Numba.ipynb notebook.

First steps with Numba

Getting started with Numba is fairly straightforward. As a first example, we will implement a function that calculates the sum of squares of an array. The function definition is as follows:

    def sum_sq(a):
        result = 0
        N = len(a)
        for i in range(N):
            result += a[i]**2
        return result

To compile this function with Numba, it is sufficient to apply the nb.jit decorator:

    import numba as nb


    @nb.jit
    def sum_sq(a):
        ...

The nb.jit decorator won’t do much when it is applied. However, when the function is invoked for the first time, Numba will detect the type of the input argument, a, and compile a specialized, high-performance version of the original function.
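
For reference, putting the two snippets above together, the fully decorated function reads as follows:

    import numba as nb


    @nb.jit
    def sum_sq(a):
        result = 0
        N = len(a)
        for i in range(N):
            result += a[i]**2
        return result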

To measure the performance gain obtained by the Numba compiler, we can compare the timings of the original and the specialized functions. The original, undecorated function can be easily accessed through the py_func attribute. The timings for the two functions are as follows:

    import numpy as np


    x = np.random.rand(10000)


    # Original
    %timeit sum_sq.py_func(x)
    100 loops, best of 3: 6.11 ms per loop


    # Numba
    %timeit sum_sq(x)
    100000 loops, best of 3: 11.7 µs per loop

You can see how the Numba version is orders of magnitude faster than the Python version. We can also compare how this implementation stacks up against NumPy standard operators:

    %timeit (x**2).sum()
    10000 loops, best of 3: 14.8 µs per loop

In this case, the Numba-compiled function is marginally faster than NumPy vectorized operations. The extra speed of the Numba version is likely due to the fact that the NumPy version allocates an extra array before performing the sum, whereas our sum_sq function accumulates the result in place.

As we didn’t use array-specific methods in sum_sq, we can also try to apply the same function on a regular Python list of floating point numbers. Interestingly, Numba is able to obtain a substantial speed up even in this case, as compared to a list comprehension:

    x_list = x.tolist()
    %timeit sum_sq(x_list)
    1000 loops, best of 3: 199 µs per loop


    %timeit sum([x**2 for x in x_list])
    1000 loops, best of 3: 1.28 ms per loop

Considering that all we needed to do was apply a simple decorator to obtain an incredible speed up over different data types, it’s no wonder that what Numba does looks like magic. In the following sections, we will dig deeper and understand how Numba works and evaluate the benefits and limitations of the Numba compiler.

Type specializations

As shown earlier, the nb.jit decorator works by compiling a specialized version of the function once it encounters a new argument type. To better understand how this works, we can inspect the decorated function in the sum_sq example.

Numba exposes the specialized types through the signatures attribute. Right after the sum_sq definition, we can inspect the available specializations by accessing sum_sq.signatures, as follows:

    sum_sq.signatures
    # Output:
    # []

If we call this function with a specific argument, for instance, an array of float64 numbers, we can see how Numba compiles a specialized version on the fly. If we then apply the function to an array of float32, we can see how a new entry is added to the sum_sq.signatures list:

    x = np.random.rand(1000).astype('float64')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),)]


    x = np.random.rand(1000).astype('float32')
    sum_sq(x)
    sum_sq.signatures
    # Result:
    # [(array(float64, 1d, C),), (array(float32, 1d, C),)]

It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function.

An individual signature can be passed as a tuple that contains the type we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself. In the following example, we demonstrate how to declare a function that takes an array of float64 as its only argument:

    @nb.jit((nb.float64[:],))
    def sum_sq(a):
        ...

Note that when we explicitly declare a signature, we are prevented from using other types, as demonstrated in the following example. If we try to pass an array of float32 values, Numba will raise a TypeError:

    sum_sq(x.astype('float32'))
    # TypeError: No matching definition for argument type(s)
    # array(float32, 1d, C)

Another way to declare signatures is through type strings. For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using a [:] suffix. To put this together, we can declare a signature for our sum_sq function, as follows:

    @nb.jit("float64(float64[:])")
    def sum_sq(a):

You can also pass multiple signatures by passing a list:

    @nb.jit(["float64(float64[:])",
             "float64(float32[:])"])
    def sum_sq(a):

Object mode versus native mode

So far, we have seen how Numba behaves when handling a fairly simple function. In this case, Numba worked exceptionally well, and we obtained great performance on both arrays and lists. The degree of optimization obtainable from Numba depends on how well it is able to infer the variable types and how well it can translate standard Python operations to fast, type-specific versions. If this happens, the interpreter is side-stepped, and we can obtain performance gains similar to those of Cython.

When Numba cannot infer variable types, it will still try to compile the code, falling back to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called object mode, in contrast to the interpreter-free scenario, called native mode.

Numba provides a function, called inspect_types, that helps us understand how effective the type inference was and which operations were optimized. As an example, we can take a look at the types inferred for our sum_sq function:

    sum_sq.inspect_types()

When this function is called, Numba will print the type inferred for each specialized version of the function. The output consists of blocks that contain information about variables and types associated with them. For example, we can examine the N = len(a) line:

    # --- LINE 4 ---
    #   a = arg(0, name=a)  :: array(float64, 1d, A)
    #   $0.1 = global(len: <built-in function len>)  :: Function(<built-in function len>)
    #   $0.3 = call $0.1(a)  :: (array(float64, 1d, A),) -> int64
    #   N = $0.3  :: int64


    N = len(a)

For each line, Numba prints a thorough description of variables, functions, and intermediate results. In the preceding example, you can see (second line) that the argument a is correctly identified as an array of float64 numbers. At LINE 4, the input and return types of the len function are also correctly identified (and likely optimized): it takes an array of float64 numbers and returns an int64.

If you scroll through the output, you can see how all the variables have a well-defined type. Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called native mode.

As a counter example, we can see what happens if we write a function with unsupported operations. For example, as of version 0.30.1, Numba has limited support for string operations.

We can implement a function that concatenates a series of strings and compile it as follows:

    @nb.jit
    def concatenate(strings):
        result = ''
        for s in strings:
            result += s
        return result

Now, we can invoke this function with a list of strings and inspect the types:

    concatenate(['hello', 'world'])
    concatenate.signatures
    # Output: [(reflected list(str),)]
    concatenate.inspect_types()

Numba will print the annotated output for the reflected list(str) type. We can, for instance, examine how line 3 gets inferred. The relevant part of the concatenate.inspect_types() output is reproduced here:

    # --- LINE 3 --- 
    #   strings = arg(0, name=strings)  :: pyobject
    #   $const0.1 = const(str, )  :: pyobject
    #   result = $const0.1  :: pyobject
    #   jump 6
    # label 6


    result = ''

You can see how this time, each variable or function is of the generic pyobject type rather than a specific one. This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter. Most importantly, if we time the original and compiled function, we note that the compiled function is about three times slower than the pure Python counterpart:

    x = ['hello'] * 1000
    %timeit concatenate.py_func(x)
    10000 loops, best of 3: 111 µs per loop


    %timeit concatenate(x)
    1000 loops, best of 3: 317 µs per loop

This is because the Numba compiler is not able to optimize the code and only adds extra overhead to the function call. As you may have noted, Numba compiled the code without complaint even though it is inefficient. The main reason for this is that Numba can still compile other sections of the code efficiently while falling back to the Python interpreter for the rest. This compilation strategy is called object mode.

It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator. If, for example, we apply this decorator to our concatenate function, we observe that Numba throws an error on first invocation:

    @nb.jit(nopython=True)
    def concatenate(strings):
        result = ''
        for s in strings:
            result += s
        return result


    concatenate(x)


    # Exception:
    # TypingError: Failed at nopython (nopython frontend)

This feature is quite useful for debugging and ensuring that all the code is fast and correctly typed.
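
As a related convenience, Numba also provides an nb.njit shorthand that is, to the best of our knowledge, equivalent to nb.jit(nopython=True); treat the following sketch as an assumption to verify against your Numba version:

    # Assumed shorthand: nb.njit should behave like nb.jit(nopython=True)
    @nb.njit
    def sum_sq(a):
        result = 0
        for i in range(len(a)):
            result += a[i]**2
        return result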

Numba and NumPy

Numba was originally developed to easily increase performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler.

Universal functions with Numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is the implementation of fast ufuncs.

We have already seen some ufunc examples in article 3, Fast Array Operations with NumPy and Pandas. For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Also, universal functions that take multiple arguments still work according to the broadcasting rules. Examples of universal functions that take multiple arguments are np.add and np.subtract.
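
For instance, a two-argument ufunc such as np.add combines a scalar with every element of an array according to the broadcasting rules:

    import numpy as np

    np.add(np.array([1, 2, 3]), 10)
    # Result:
    # array([11, 12, 13])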

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance the function with the broadcasting feature. As an example, we will see how to write the Cantor pairing function.

A pairing function is a function that encodes two natural numbers into a single natural number so that you can easily interconvert between the two representations. The Cantor pairing function can be written as follows:

    import numpy as np


    def cantor(a, b):
        return int(0.5 * (a + b)*(a + b + 1) + b)
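
To back up the claim that the two representations are easy to interconvert, here is an illustrative inverse (this helper is not part of the original text):

    import math


    # Recover (a, b) from the single number z produced by the Cantor pairing
    def cantor_inverse(z):
        w = int((math.sqrt(8 * z + 1) - 1) // 2)
        t = w * (w + 1) // 2
        b = z - t
        a = w - b
        return a, b


    cantor_inverse(8)
    # Result:
    # (1, 2)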

As already mentioned, it is possible to create a ufunc in pure Python using the np.vectorize decorator:

    @np.vectorize
    def cantor(a, b):
        return int(0.5 * (a + b)*(a + b + 1) + b)


    cantor(np.array([1, 2]), 2)
    # Result:
    # array([ 8, 12])

Convenience aside, defining universal functions in pure Python is not very useful, as it requires a lot of function calls affected by interpreter overhead. For this reason, ufuncs are usually implemented in C or Cython, but Numba beats both of these options with its convenience.

All that is needed to perform the conversion is to replace np.vectorize with the equivalent decorator, nb.vectorize.
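
A minimal sketch of the conversion is shown below; the x1 and x2 test arrays are assumptions made here for the timings, and may differ from those used in the original notebook:

    @nb.vectorize
    def cantor(a, b):
        return int(0.5 * (a + b)*(a + b + 1) + b)


    # Test arrays (sizes are illustrative assumptions)
    x1 = np.random.rand(10000)
    x2 = np.random.rand(10000)

In the following timings, cantor_py refers to the np.vectorize version defined earlier, cantor is the Numba-compiled ufunc, and the last measurement uses standard NumPy operations: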

    # Pure Python
    %timeit cantor_py(x1, x2)
    100 loops, best of 3: 6.06 ms per loop
    # Numba
    %timeit cantor(x1, x2)
    100000 loops, best of 3: 15 µs per loop
    # NumPy
    %timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2).astype(int)
    10000 loops, best of 3: 57.1 µs per loop

You can see how the Numba version beats all the other options by a large margin! Numba works extremely well because the function is simple and type inference is possible.

An additional advantage of universal functions is that, since each output element depends only on individual input values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions by passing the target="parallel" or target="cuda" keyword argument to the nb.vectorize decorator.
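
As a hedged sketch (the signature and function name here are our own assumptions, not taken from the original text), a multithreaded version of the pairing function could be declared as follows; note that the "parallel" target requires explicit signatures:

    # Assumed example: an explicit signature is required for target='parallel'
    @nb.vectorize(['int64(int64, int64)'], target='parallel')
    def cantor_parallel(a, b):
        return (a + b) * (a + b + 1) // 2 + b

Calling cantor_parallel on two large integer arrays would then distribute the element-wise work across CPU threads.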

Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated gufunc, is an extension of universal functions to procedures that take arrays.

A classic example is the matrix multiplication. In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array. An example usage of np.matmul is as follows:

    a = np.random.rand(3, 3)
    b = np.random.rand(3, 3)


    c = np.matmul(a, b)
    c.shape
    # Result:
    # (3, 3)

As we saw in the previous subsection, a ufunc broadcasts the operation over arrays of scalars; its natural generalization is to broadcast over arrays of arrays. If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match the matrices pairwise and take their products. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

    a = np.random.rand(10, 3, 3)
    b = np.random.rand(10, 3, 3)


    c = np.matmul(a, b)
    c.shape
    # Output
    # (10, 3, 3)

The usual broadcasting rules work in a similar way. For example, if we have an array of ten (3, 3) matrices (that is, an array of shape (10, 3, 3)), we can use np.matmul to compute the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix is repeated to match the (10, 3, 3) shape:

    a = np.random.rand(10, 3, 3)
    b = np.random.rand(3, 3) # Broadcasted to shape (10, 3, 3)
    c = np.matmul(a, b)
    c.shape
    # Result:
    # (10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the Euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.

The nb.guvectorize decorator requires two arguments:

  • The types of the input and output: two 1D arrays as input and a scalar as output
  • The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n) and output a scalar

In the following example, we show the implementation of the euclidean function using the nb.guvectorize decorator:

    @nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n), (n) -> ()')
    def euclidean(a, b, out):
        N = a.shape[0]
        out[0] = 0.0
        for i in range(N):
            out[0] += (a[i] - b[i])**2

There are a few very important points to be made here. Predictably, we declared the types of the inputs a and b as float64[:] because they are 1D arrays. However, what about the output argument? Wasn't it supposed to be a scalar? Yes; however, Numba treats scalar arguments as arrays of size 1. That's why it was declared as float64[:].

Similarly, the layout string indicates that we have two arrays of size (n) and that the output is a scalar, denoted by the empty brackets, (). However, the array out will be passed as an array of size 1.

Also, note that we don’t return anything from the function; all the output has to be written to the out array. (Strictly speaking, the function computes the squared Euclidean distance, as it omits the final square root; the NumPy expression we benchmark against below does the same.)

The letter n in the layout string is completely arbitrary; you may choose to use k or any other letter of your liking. Also, if you want to combine arrays of uneven sizes, you can use layout strings such as (n, m), as shown in the sketch below.
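
The following sketch (not from the original text) illustrates such a layout: a length-n vector is multiplied by an (n, m) matrix to produce a length-m vector, and the usual broadcasting applies over any extra leading dimensions:

    @nb.guvectorize(['float64[:], float64[:,:], float64[:]'], '(n),(n,m)->(m)')
    def vec_mat(v, mat, out):
        # out[j] = sum_i v[i] * mat[i, j]
        for j in range(mat.shape[1]):
            out[j] = 0.0
            for i in range(v.shape[0]):
                out[j] += v[i] * mat[i, j]


    v = np.random.rand(3)
    mats = np.random.rand(10, 3, 4)
    vec_mat(v, mats).shape
    # Result:
    # (10, 4)

Here, the core dimensions n and m are taken from the inputs, and the extra leading dimension of mats is looped over automatically.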

Our brand new euclidean function can be conveniently used on arrays of different shapes, as shown in the following example:

    a = np.random.rand(2)
    b = np.random.rand(2)
    c = euclidean(a, b) # Shape: (1,)


    a = np.random.rand(10, 2)
    b = np.random.rand(10, 2)
    c = euclidean(a, b) # Shape: (10,)


    a = np.random.rand(10, 2)
    b = np.random.rand(2)
    c = euclidean(a, b) # Shape: (10,)

How does the speed of euclidean compare to standard NumPy? In the following code, we benchmark a NumPy vectorized version against our previously defined euclidean function:

    a = np.random.rand(10000, 2)
    b = np.random.rand(10000, 2)


    %timeit ((a - b)**2).sum(axis=1)
    1000 loops, best of 3: 288 µs per loop


    %timeit euclidean(a, b)
    10000 loops, best of 3: 35.6 µs per loop

The Numba version, again, beats the NumPy version by a large margin!

Summary

Numba is a tool that compiles fast, specialized versions of Python functions at runtime. In this article, we learned how to compile, inspect, and analyze functions compiled by Numba. We also learned how to implement fast NumPy universal functions that are useful in a wide array of numerical applications. 

Tools such as PyPy allow us to run Python programs unchanged to obtain significant speed improvements. We demonstrated how to set up PyPy, and we assessed the performance improvements on our particle simulator application.
