In this article by **Samuli Thomasson**, the author of the book, **Haskell High Performance Programming**, we will know how to choose and design optimal data structures in applications. You will be able to drop the level of abstraction in slow parts of code, all the way to mutable data structures if necessary.

*(For more resources related to this topic, see here.)*

# Annotating strictness and unpacking datatype fields

We used the *BangPatterns* extension to make function arguments strict:

```
{-# LANGUAGE BangPatterns #-}
f !s (x:xs) = f (s + 1) xs
f !s _ = s
```

Using bangs for annotating strictness in fact predates the *BangPatterns* extension (and the older compiler flag *-fbang-patterns* in GHC6.x). With just plain Haskell98, we are allowed to use bangs to make datatype fields strict:

`> data T = T !Int`

A bang in front of a field ensures that whenever the outer constructor (*T*) is in WHNF, the inner field is as well in WHNF. We can check this:

```
> T undefined `seq` ()
*** Exception: Prelude.undefined
```

There are no restrictions to which fields can be strict, be it recursive or polymorphic fields, although, it rarely makes sense to make recursive fields strict. Consider the fully strict linked list:

```
data List a = List !a !(List a)
| ListEnd
```

With this much strictness, you cannot represent parts of infinite lists without always requiring infinite space. Moreover, before accessing the head of a finite strict list you must evaluate the list all the way to the last element. Strict lists don’t have the streaming property of lazy lists.

By default, all data constructor fields are pointers to other data constructors or primitives, regardless of their strictness. This applies to basic data types such asInt, Double, Char, and so on, which are not primitive in Haskell. They are data constructors over their primitive counterparts *Int#*, *Double#*, and *Char#*:

```
> :info Int
data Int = GHC.Types.I# GHC.Prim.Int#
```

There is a performance overhead, the size of pointer dereference between types, say, Int and *Int#*, but an Int can represent lazy values (called thunks), whereas primitives cannot. Without thunks, we couldn’t have lazy evaluation. Luckily,GHC is intelligent enough to unroll wrapper types as primitives in many situations, completely eliminating indirect references.

The hash suffix is specific to GHC and always denotes a primitive type. The GHC modules do expose the primitive interface. Programming with primitives you can further micro-optimize code and get C-like performance. However, several limitations and drawbacks apply.

### Using anonymous tuples

Tuples may seem harmless at first; they just lump a bunch of values together. But note that the fields in a tuple aren’t strict, so a twotuple corresponds to the slowest *PairP* data type from our previous benchmark.

If you need a strict *Tuple* type, you need to define one yourself. This is also one more reason to prefer custom types over nameless tuples in many situations. These two structurally similar tuple types have widely different performance semantics:

```
data Tuple = Tuple {-# UNPACK #-} !Int {-# UNPACK #-} !Int
data Tuple2 = Tuple2 {-# UNPACK #-} !(Int, Int)
```

If you really want unboxed anonymous tuples, you can enable the *UnboxedTuples* extension and write things with types like *(# Int#, Char# #)*. But note that a number of restrictions apply to unboxed tuples like to all primitives. The most important restriction is that unboxed types may not occur where polymorphic types or values are expected, because polymorphic values are always considered as pointers.

# Representing bit arrays

One way to define a bitarray in Haskell that still retains the convenience of *Bool* is:

```
import Data.Array.Unboxed
type BitArray = UArrayInt Bool
```

This representation packs 8 bits per byte, so it’s space efficient. See the following section on arrays in general to learn about time efficiency – for now we only note that *BitArray* is an immutable data structure, like *BitStruct*, and that copying small BitStructs is cheaper than copying BitArrays due to overheads in *UArray*.

Consider a program that processes a list of integers and tells whether they are even or odd counts of numbers divisible by 2, 3, and 5. We can implement this with simple recursion and a three-bit accumulator. Here are three alternative representations for the accumulator:

```
type BitTuple = (Bool, Bool, Bool)
data BitStruct = BitStruct !Bool !Bool !Bool deriving Show
type BitArray = UArrayInt Bool
```

And the program itself is defined along these lines:

```
go :: acc -> [Int] ->acc
go acc [] = acc
go (two three five) (x:xs) = go ((test 2 x `xor` two)
(test 3 x `xor` three)
(test 5 x `xor` five)) xs
test n x = x `mod` n == 0
```

I’ve omitted the details here. They can be found in the *bitstore.hs* file.

The fastest variant is BitStruct, then comes BitTuple (30% slower), and BitArray is the slowest (130% slower than BitStruct). Although BitArray is the slowest (due to making a copy of the array on every iteration), it would be easy to scale the array in size or make it dynamic. Note also that this benchmark is really on the extreme side; normally programs do a bunch of other stuff besides updating an array in a tight loop.

If you need fast array updates, you can resort to mutable arrays discussed later on. It might also be tempting to use *Data.Vector.Unboxed.VectorBool* from the *vector* package, due to its nice interface. But beware that that representation uses one byte for every bit, wasting 7 bits for every bit.

# Mutable references are slow

*Data.IORef* and *Data.STRef* are the smallest bits of mutable state, references to mutable variables, one for IO and other for ST. There is also a *Data.STRef.Lazy* module, which provides a wrapper over strict STRef for lazy ST.

However, because IORef and STRef are references, they imply a level of indirection. GHC intentionally does not optimize it away, as that would cause problems in concurrent settings. For this reason, IORef or STRef shouldn’t be used like variables in C, for example. Performance will for sure be very bad.

Let’s verify the performance hit by considering the following ST-based sum-of-range implementation:

```
-- file: sum_mutable.hs
import Control.Monad.ST
import Data.STRef
count_st :: Int ->Int
count_st n = runST $ do
ref <- newSTRef 0
let go 0 = readSTRef ref
go i = modifySTRef' ref (+ i) >> go (i - 1)
go n
```

And compare it to this pure recursive implementation:

```
count_pure :: Int ->Int
count_pure n = go n 0 where
go 0 s = s
go i s = go (i - 1) $! (s + i)
```

The *ST* implementation is many times slower when at least *-O* is enabled. Without optimizations, the two functions are more or less equivalent in performance;there is similar amount of indirection from not unboxing arguments in the latter version. This is one example of the wonders that can be done to optimize referentially transparent code.

# Bubble sort with vectors

Bubble sort is not an efficient sort algorithm, but because it’s an in-place algorithm and simple, we will implement it as a demonstration of mutable vectors:

```
-- file: bubblesort.hs
import Control.Monad.ST
import Data.Vector as V
import Data.Vector.Mutable as MV
import System.Random (randomIO) -- for testing
```

The (naive) bubble sort compares values of all adjacent indices in order, and swaps the values if necessary. After reaching the last element, it starts from the beginning or, if no swaps were made, the list is sorted and the algorithm is done:

```
bubblesortM :: (Ord a, PrimMonad m)
=>MVector (PrimState m) a -> m ()
bubblesortM v = loop where
indices = V.fromList [1 .. MV.length v - 1]
loop = do swapped <- V.foldM' f False indices – (1)
if swapped then loop else return () – (2)
f swapped i = do – (3)
a <- MV.read v (i-1)
b <- MV.read v i
if a > b then MV.swap v (i-1) i>> return True
else return swapped
```

At *(1)*, we fold monadically over all but the last index, keeping state about whether or not we have performed a swap in this iteration. If we had, at *(2)* we rerun the fold or, if not, we can return. At *(3)* we compare an index and possibly swap values.

We can write a pure function that wraps the stateful algorithm:

```
bubblesort :: Ord a => Vector a -> Vector a
bubblesort v = runST $ do
mv <- V.thaw v
bubblesortM mv
V.freeze mv
```

*V.thaw* and *V.freeze (both O(n))* can be used to go back and forth with mutable and immutable vectors.

Now, there are multiple code optimization opportunities in our implementation of bubble sort. But before tackling those, let’s see how well our straightforward implementation fares using the following *main*:

```
main = do
v <- V.generateM 10000 $ _ ->randomIO :: IO Double
let v_sorted = bubblesort v
median = v_sorted ! 5000
print median
```

We should remember to compile with *-O2*. On my machine, this program takes about 1.55s, and Runtime System reports 99.9% productivity, 18.7 megabytes allocated heap and 570 Kilobytes copied during GC.

So now with a baseline, let’s see if we can squeeze out more performance from vectors. This is a non-exhaustive list:

- Use unboxed vectors instead. This restricts the types of elements we can store, but it saves us a level of indirection. Down to 960ms and approximately halved GC traffic.
- Large lists are inefficient, and they don’t compose with vectors stream fusion. We should change indices so that it uses
*V.enumFromTo*instead (alternatively turn on*OverloadedLists*extension and drop*V.fromList*). Down to 360ms and 94% less GC traffic. - Conversion functions
*V.thaw*and*V.freeze*are O(n), that is, they modify copies. Using in-place*V.unsafeThaw*and*V.unsafeFreeze*instead is sometimes useful.*V.unsafeFreeze*in the*bubblesort*wrapper is completely safe, but*V.unsafeThaw*is not. In our example, however, with*-O2*, the program is optimized into a single loop and all those conversions get eliminated. - Vector operations (
*V.read*,*V.swap*) in*bubblesortM*are guaranteed to never be out of bounds, so it’s perfectly safe to replace these with unsafe variants (*V.unsafeRead*,*V.unsafeSwap*) that don’t check bounds. Speed-up of about 25 milliseconds, or 5%.

To summarize, applying good practices and safe usage of unsafe functions, our Bubble sort just got 80% faster. These optimizations are applied in the*bubblesort-optimized.hs*file (omitted here).

We noticed that almost all GC traffic came from a linked list, which was constructed and immediately consumed. Lists are bad for performance in that they don’t fuse like vectors. To ensure good vector performance, ensure that the fusion framework can work effectively. Anything that can be done with a vector should be done.

As final note, when working with vectors(and other libraries) it’s a good idea to keep the Haddock documentation handy. There are several big and small performance choices to be made. Often the difference is that of between O(n) and O(1).

# Speedup via continuation-passing style

Implementing monads in **continuation-passing style** (**CPS**) can have very good results. Unfortunately, no widely-used or supported library I’m aware of would provide drop-in replacements for ubiquitous **Maybe**, **List**, **Reader**, **Writer**, and **State** monads.

It’s not that hard to implement the standard monads in CPS from scratch. For example, the *State* monad can be implemented using the *Cont* monad from *mtl* as follows:

```
-- file: cont-state-writer.hs
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.State.Strict
import Control.Monad.Cont
newtypeStateCPS s r a = StateCPS (Cont (s -> r) a)
deriving (Functor, Applicative, Monad, MonadCont)
instance MonadState s (StateCPS s r) where
get = StateCPS $ cont $
next curState→ next curStatecurState
put newState = StateCPS $ cont $
next curState→ next () newState
runStateCPS :: StateCPS s s () -> s -> s
runStateCPS (StateCPS m) = runCont m (_ -> id)
```

In case you’re not familiar with the continuation-passing style and the *Cont* monad, the details might not make much sense, instead of just returning results from a function, a function in CPS applies its results to a continuation. So in short, to “get” the state in continuation-passing style, we pass the current state to the “next” continuation (first argument) and don’t change the state (second argument). To “put”, we call the continuation with unit (no return value) and change the state to new state (second argument to next).

*StateCPS* is used just like the *State* monad:

```
action :: MonadStateInt m => m ()
action = replicateM_ 1000000 $ do
i<- get
put $! i + 1
main = do
print $ (runStateCPS action 0 :: Int)
print $ (snd $ runState action 0 :: Int)
```

That *action* operation is, in the CPS version of the state monad, about 5% faster and performs 30% less heap allocation than the state monad from *mtl*. This program is limited pretty much only by the speed of monadic composition, so these numbers are at least very close to maximum speedup we can have from CPSing the state monad. Speedups of the writer monad are probably near these results.

Other standard monads can be implemented similarly to StateCPS. The definitions can also be generalized to monad transformers over an arbitrary monad (*a la ContT*). For extra speed, you might wish to combine many monads in a single CPS monad, similarly to what RWST does.

# Summary

We witnessed the performance of the *bytestring*, *text*, and *vector* libraries, all of which get their speed from fusion optimizations, in contrast to linked lists, which have a huge overhead despite also being subject to fusion to some degree. However, linked lists give rise to simple difference lists and zippers. The builder patterns for *lists*, *bytestring*, and *text*were introduced. We discovered that the *array* package is low-level and clumsy compared to the superior *vector* package, unless you must support Haskell 98.

We also saw how to implement Bubble sort using vectors and how to speedup via continuation-passing style.

# Resources for Article:

**Further resources on this subject:**

- Data Tables and DataTables Plugin in jQuery 1.3 with PHP [article]
- Data Extracting, Transforming, and Loading [article]
- Linking Data to Shapes [article]